Package triegen
Package triegen implements a code generator for a trie for associating
unsigned integer values with UTF-8 encoded runes.
Many of the go.text packages use tries for storing per-rune information. A
trie is especially useful if many of the runes have the same value. If this
is the case, many blocks can be expected to be shared allowing for
information on many runes to be stored in little space.
As most of the lookups are done directly on []byte slices, the tries use the
UTF-8 bytes directly for the lookup. This saves a conversion from UTF-8 to
runes and contributes a little bit to better performance. It also naturally
provides a fast path for ASCII.
Space is also an issue. There are many code points defined in Unicode and as
a result tables can get quite large. So every byte counts. The triegen
package automatically chooses the smallest integer values to represent the
tables. Compacters allow further compression of the trie by allowing for
alternative representations of individual trie blocks.
triegen allows generating multiple tries as a single structure. This is
useful when, for example, one wants to generate tries for several languages
that have a lot of values in common. Some existing libraries for
internationalization store all per-language data as a dynamically loadable
chunk. The go.text packages are designed with the assumption that the user
typically wants to compile in support for all supported languages, in line
with the approach common to Go to create a single standalone binary. The
multi-root trie approach can give significant storage savings in this
scenario.
triegen generates both tables and code. The code is optimized to use the
automatically chosen data types. The following code is generated for a Trie
or multiple Tries named "foo":
type fooTrie
The trie type.
func newFooTrie(x int) *fooTrie
Trie constructor, where x is the index of the trie passed to Gen.
func (t *fooTrie) lookup(s []byte) (v uintX, sz int)
The lookup method, where uintX is automatically chosen.
func lookupString, lookupUnsafe and lookupStringUnsafe
Variants of the above.
var fooValues and fooIndex and any tables generated by Compacters.
The core trie data.
var fooTrieHandles
Indexes of starter blocks in case of multiple trie roots.
It is recommended that users test the generated trie by checking the returned
value for every rune. Such exhaustive tests are possible as the number of
runes in Unicode is limited.
▾ Example (Build)
Example_build shows how to build a simple trie. It assigns the value 1 to
100 random runes generated by randomRunes.
Code:
t := triegen.NewTrie("rand")
for r, _ := range randomRunes() {
t.Insert(r, 1)
}
sz, err := t.Gen(genWriter)
fmt.Printf("Trie size: %d bytes\n", sz)
fmt.Printf("Error: %v\n", err)
Output:
Trie size: 9280 bytes
Error: <nil>
▾ Example (Lookup)
Example_lookup demonstrates how to use the trie generated by Example_build.
Code:
trie := newRandTrie(0)
runes := randomRunes()
for r := rune(0); r <= unicode.MaxRune; r++ {
if v, _ := trie.lookupString(string(r)); v != runes[r] {
fmt.Println("FAILURE")
return
}
}
fmt.Println("SUCCESS")
Output:
SUCCESS
func Gen(w io.Writer, name string, tries []*Trie, opts ...Option) (sz int, err error)
Gen writes Go code for a shared trie lookup structure to w for the given
Tries. The generated trie type will be called nameTrie. newNameTrie(x) will
return the *nameTrie for tries[x]. A value can be looked up by using one of
the various lookup methods defined on nameTrie. It returns the table size of
the generated trie.
▾ Example (Build)
ExampleGen_build demonstrates the creation of multiple tries sharing common
blocks. ExampleGen_lookup demonstrates how to use the generated tries.
Code:
var tries []*triegen.Trie
rv := runeValues()
for _, c := range []struct {
include func(rune) bool
name string
}{
{func(r rune) bool { return true }, "all"},
{func(r rune) bool { return r < 0x80 }, "ASCII only"},
{func(r rune) bool { return r < 0x80 }, "ASCII only 2"},
{func(r rune) bool { return r <= 0xFFFF }, "BMP only"},
{func(r rune) bool { return r > 0xFFFF }, "No BMP"},
} {
t := triegen.NewTrie(c.name)
tries = append(tries, t)
for r, v := range rv {
if c.include(r) {
t.Insert(r, v)
}
}
}
sz, err := triegen.Gen(genWriter, "multi", tries)
fmt.Printf("Trie size: %d bytes\n", sz)
fmt.Printf("Error: %v\n", err)
Output:
Trie size: 18250 bytes
Error: <nil>
▾ Example (Lookup)
ExampleGen_lookup shows how to look up values in the trie generated by
ExampleGen_build.
Code:
rv := runeValues()
for i, include := range []func(rune) bool{
func(r rune) bool { return true },
func(r rune) bool { return r < 0x80 },
func(r rune) bool { return r < 0x80 },
func(r rune) bool { return r <= 0xFFFF },
func(r rune) bool { return r > 0xFFFF },
} {
t := newMultiTrie(i)
for r := rune(0); r <= unicode.MaxRune; r++ {
x := uint64(0)
if include(r) {
x = rv[r]
}
if v := t.lookupStringUnsafe(string(r)); x != v {
fmt.Println("FAILURE")
return
}
}
}
fmt.Println("SUCCESS")
Output:
SUCCESS
A Compacter generates an alternative, more space-efficient way to store a
trie value block. A trie value block holds all possible values for the last
byte of a UTF-8 encoded rune. Excluding ASCII characters, a trie value block
always has 64 values, as a UTF-8 encoding ends with a byte in [0x80, 0xC0).
type Compacter interface {
Size(v []uint64) (sz int, ok bool)
Store(v []uint64) uint32
Print(w io.Writer) error
Handler() string
}
▾ Example
Code:
package triegen_test
import (
"fmt"
"io"
"golang.org/x/text/internal/triegen"
)
func ExampleCompacter() {
t := triegen.NewTrie("root")
for r := rune(0); r < 10000; r += 64 {
t.Insert(r, 0x9015BADA55^uint64(r))
}
sz, _ := t.Gen(io.Discard)
fmt.Printf("Size normal: %5d\n", sz)
var c myCompacter
sz, _ = t.Gen(io.Discard, triegen.Compact(&c))
fmt.Printf("Size compacted: %5d\n", sz)
}
type myCompacter []uint64
func (c *myCompacter) Size(values []uint64) (sz int, ok bool) {
for _, v := range values[1:] {
if v != 0 {
return 0, false
}
}
return 8, true
}
func (c *myCompacter) Store(v []uint64) uint32 {
x := uint32(len(*c))
*c = append(*c, v[0])
return x
}
func (c *myCompacter) Print(w io.Writer) error {
fmt.Fprintln(w, "var firstValue = []uint64{")
for _, v := range *c {
fmt.Fprintf(w, "\t%#x,\n", v)
}
fmt.Fprintln(w, "}")
return nil
}
func (c *myCompacter) Handler() string {
return "getFirstValue"
}
An Option can be passed to Gen.
type Option func(b *builder) error
func Compact(c Compacter) Option
Compact configures the trie generator to use the given Compacter.
A Trie represents a single root node of a trie. A builder may build several
overlapping tries at once.
type Trie struct {
}
func NewTrie(name string) *Trie
NewTrie returns a new trie root.
func (*Trie) Gen
¶
func (t *Trie) Gen(w io.Writer, opts ...Option) (sz int, err error)
Gen is a convenience wrapper around the Gen func passing t as the only trie
and uses the name passed to NewTrie. It returns the size of the generated
tables.
func (t *Trie) Insert(r rune, value uint64)
Insert associates value with the given rune. Insert will panic if a non-zero
value is passed for an invalid rune.