...

Package norm

import "golang.org/x/text/unicode/norm"

Overview

Package norm contains types and functions for normalizing Unicode strings.

Index

Constants
type Form
    func (f Form) Append(out []byte, src ...byte) []byte
    func (f Form) AppendString(out []byte, src string) []byte
    func (f Form) Bytes(b []byte) []byte
    func (f Form) FirstBoundary(b []byte) int
    func (f Form) FirstBoundaryInString(s string) int
    func (f Form) IsNormal(b []byte) bool
    func (f Form) IsNormalString(s string) bool
    func (f Form) LastBoundary(b []byte) int
    func (f Form) NextBoundary(b []byte, atEOF bool) int
    func (f Form) NextBoundaryInString(s string, atEOF bool) int
    func (f Form) Properties(s []byte) Properties
    func (f Form) PropertiesString(s string) Properties
    func (f Form) QuickSpan(b []byte) int
    func (f Form) QuickSpanString(s string) int
    func (f Form) Reader(r io.Reader) io.Reader
    func (Form) Reset()
    func (f Form) Span(b []byte, atEOF bool) (n int, err error)
    func (f Form) SpanString(s string, atEOF bool) (n int, err error)
    func (f Form) String(s string) string
    func (f Form) Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error)
    func (f Form) Writer(w io.Writer) io.WriteCloser
type Iter
    func (i *Iter) Done() bool
    func (i *Iter) Init(f Form, src []byte)
    func (i *Iter) InitString(f Form, src string)
    func (i *Iter) Next() []byte
    func (i *Iter) Pos() int
    func (i *Iter) Seek(offset int64, whence int) (int64, error)
type Properties
    func (p Properties) BoundaryAfter() bool
    func (p Properties) BoundaryBefore() bool
    func (p Properties) CCC() uint8
    func (p Properties) Decomposition() []byte
    func (p Properties) LeadCCC() uint8
    func (p Properties) Size() int
    func (p Properties) TrailCCC() uint8

Examples

Form.NextBoundary
Iter

Package files

composition.go forminfo.go input.go iter.go normalize.go readwriter.go tables15.0.0.go transform.go trie.go

Constants

const (
    // Version is the Unicode edition from which the tables are derived.
    Version = "15.0.0"

    // MaxTransformChunkSize indicates the maximum number of bytes that Transform
    // may need to write atomically for any Form. Making a destination buffer at
    // least this size ensures that Transform can always make progress and that
    // the user does not need to grow the buffer on an ErrShortDst.
    MaxTransformChunkSize = 35 + maxNonStarters*4
)

GraphemeJoiner is inserted after maxNonStarters non-starter runes.

const GraphemeJoiner = "\u034F"

MaxSegmentSize is the maximum size of a byte buffer needed to consider any sequence of starter and non-starter runes for the purpose of normalization.

const MaxSegmentSize = maxByteBufferSize

type Form

A Form denotes a canonical representation of Unicode code points. The Unicode-defined normalization and equivalence forms are:

NFC   Unicode Normalization Form C
NFD   Unicode Normalization Form D
NFKC  Unicode Normalization Form KC
NFKD  Unicode Normalization Form KD

For a Form f, this documentation uses the notation f(x) to mean the bytes or string x converted to the given form. A position n in x is called a boundary if conversion to the form can proceed independently on both sides:

f(x) == append(f(x[0:n]), f(x[n:])...)

References: https://unicode.org/reports/tr15/ and https://unicode.org/notes/tn5/.

type Form int
const (
    NFC Form = iota
    NFD
    NFKC
    NFKD
)

func (Form) Append

func (f Form) Append(out []byte, src ...byte) []byte

Append returns f(append(out, src...)). The buffer out must be nil, empty, or equal to f(out).

func (Form) AppendString

func (f Form) AppendString(out []byte, src string) []byte

AppendString returns f(append(out, src...)). The buffer out must be nil, empty, or equal to f(out).

func (Form) Bytes

func (f Form) Bytes(b []byte) []byte

Bytes returns f(b). It may return b itself if f(b) == b.

func (Form) FirstBoundary

func (f Form) FirstBoundary(b []byte) int

FirstBoundary returns the position i of the first boundary in b or -1 if b contains no boundary.

func (Form) FirstBoundaryInString

func (f Form) FirstBoundaryInString(s string) int

FirstBoundaryInString returns the position i of the first boundary in s or -1 if s contains no boundary.

func (Form) IsNormal

func (f Form) IsNormal(b []byte) bool

IsNormal returns true if b == f(b).

func (Form) IsNormalString

func (f Form) IsNormalString(s string) bool

IsNormalString returns true if s == f(s).

func (Form) LastBoundary

func (f Form) LastBoundary(b []byte) int

LastBoundary returns the position i of the last boundary in b or -1 if b contains no boundary.

func (Form) NextBoundary

func (f Form) NextBoundary(b []byte, atEOF bool) int

NextBoundary reports the index of the boundary between the first and next segment in b or -1 if atEOF is false and there are not enough bytes to determine this boundary.

Example

Code:

s := norm.NFD.String("Mêlée")

for i := 0; i < len(s); {
    d := norm.NFC.NextBoundaryInString(s[i:], true)
    fmt.Printf("%[1]s: %+[1]q\n", s[i:i+d])
    i += d
}

Output:

M: "M"
ê: "e\u0302"
l: "l"
é: "e\u0301"
e: "e"

func (Form) NextBoundaryInString

func (f Form) NextBoundaryInString(s string, atEOF bool) int

NextBoundaryInString reports the index of the boundary between the first and next segment in s or -1 if atEOF is false and there are not enough bytes to determine this boundary.

func (Form) Properties

func (f Form) Properties(s []byte) Properties

Properties returns properties for the first rune in s.

func (Form) PropertiesString

func (f Form) PropertiesString(s string) Properties

PropertiesString returns properties for the first rune in s.

func (Form) QuickSpan

func (f Form) QuickSpan(b []byte) int

QuickSpan returns a boundary n such that b[0:n] == f(b[0:n]). It is not guaranteed to return the largest such n.

func (Form) QuickSpanString

func (f Form) QuickSpanString(s string) int

QuickSpanString returns a boundary n such that s[0:n] == f(s[0:n]). It is not guaranteed to return the largest such n.

func (Form) Reader

func (f Form) Reader(r io.Reader) io.Reader

Reader returns a new reader that implements Read by reading data from r and returning f(data).

func (Form) Reset

func (Form) Reset()

Reset implements the Reset method of the transform.Transformer interface.

func (Form) Span

func (f Form) Span(b []byte, atEOF bool) (n int, err error)

Span implements transform.SpanningTransformer. It returns a boundary n such that b[0:n] == f(b[0:n]). It is not guaranteed to return the largest such n.

func (Form) SpanString

func (f Form) SpanString(s string, atEOF bool) (n int, err error)

SpanString returns a boundary n such that s[0:n] == f(s[0:n]). It is not guaranteed to return the largest such n.

func (Form) String

func (f Form) String(s string) string

String returns f(s).

func (Form) Transform

func (f Form) Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error)

Transform implements the Transform method of the transform.Transformer interface. It may need to write segments of up to MaxSegmentSize at once. Users should either catch ErrShortDst and allow dst to grow or have dst be at least of size MaxTransformChunkSize to be guaranteed of progress.

func (Form) Writer

func (f Form) Writer(w io.Writer) io.WriteCloser

Writer returns a new writer that implements Write(b) by writing f(b) to w. The returned writer may use an internal buffer to maintain state across Write calls. Calling its Close method writes any buffered data to w.

type Iter

An Iter iterates over a string or byte slice, while normalizing it to a given Form.

type Iter struct {
    // contains filtered or unexported fields
}

Example

Code:

package norm_test

import (
    "bytes"
    "fmt"
    "io"
    "unicode/utf8"

    "golang.org/x/text/unicode/norm"
)

// EqualSimple uses a norm.Iter to compare two non-normalized
// strings for equivalence.
func EqualSimple(a, b string) bool {
    var ia, ib norm.Iter
    ia.InitString(norm.NFKD, a)
    ib.InitString(norm.NFKD, b)
    for !ia.Done() && !ib.Done() {
        if !bytes.Equal(ia.Next(), ib.Next()) {
            return false
        }
    }
    return ia.Done() && ib.Done()
}

// FindPrefix finds the longest common prefix of ASCII characters
// of a and b.
func FindPrefix(a, b string) int {
    i := 0
    for ; i < len(a) && i < len(b) && a[i] < utf8.RuneSelf && a[i] == b[i]; i++ {
    }
    return i
}

// EqualOpt is like EqualSimple, but optimizes the special
// case for ASCII characters.
func EqualOpt(a, b string) bool {
    n := FindPrefix(a, b)
    a, b = a[n:], b[n:]
    var ia, ib norm.Iter
    ia.InitString(norm.NFKD, a)
    ib.InitString(norm.NFKD, b)
    for !ia.Done() && !ib.Done() {
        if !bytes.Equal(ia.Next(), ib.Next()) {
            return false
        }
        if n := int64(FindPrefix(a[ia.Pos():], b[ib.Pos():])); n != 0 {
            ia.Seek(n, io.SeekCurrent)
            ib.Seek(n, io.SeekCurrent)
        }
    }
    return ia.Done() && ib.Done()
}

var compareTests = []struct{ a, b string }{
    {"aaa", "aaa"},
    {"aaa", "aab"},
    {"a\u0300a", "\u00E0a"},
    {"a\u0300\u0320b", "a\u0320\u0300b"},
    {"\u1E0A\u0323", "\x44\u0323\u0307"},
    // A character that decomposes into multiple segments
    // spans several iterations.
    {"\u3304", "\u30A4\u30CB\u30F3\u30AF\u3099"},
}

func ExampleIter() {
    for i, t := range compareTests {
        r0 := EqualSimple(t.a, t.b)
        r1 := EqualOpt(t.a, t.b)
        fmt.Printf("%d: %v %v\n", i, r0, r1)
    }
    // Output:
    // 0: true true
    // 1: false false
    // 2: true true
    // 3: true true
    // 4: true true
    // 5: true true
}

func (*Iter) Done

func (i *Iter) Done() bool

Done returns true if there is no more input to process.

func (*Iter) Init

func (i *Iter) Init(f Form, src []byte)

Init initializes i to iterate over src after normalizing it to Form f.

func (*Iter) InitString

func (i *Iter) InitString(f Form, src string)

InitString initializes i to iterate over src after normalizing it to Form f.

func (*Iter) Next

func (i *Iter) Next() []byte

Next returns f(i.input[i.Pos():n]), where n is a boundary of i.input. For any input a and b for which f(a) == f(b), subsequent calls to Next will return the same segments. Modifying runes are grouped together with the preceding starter, if such a starter exists. Although not guaranteed, n will typically be the smallest possible n.

func (*Iter) Pos

func (i *Iter) Pos() int

Pos returns the byte position at which the next call to Next will commence processing.

func (*Iter) Seek

func (i *Iter) Seek(offset int64, whence int) (int64, error)

Seek sets the segment to be returned by the next call to Next to start at position p, where p is computed from offset and whence in the same way as for io.Seeker. It is the responsibility of the caller to set p to the start of a segment.

type Properties

Properties provides access to normalization properties of a rune.

type Properties struct {
    // contains filtered or unexported fields
}

func (Properties) BoundaryAfter

func (p Properties) BoundaryAfter() bool

BoundaryAfter returns true if subsequent runes cannot combine with or otherwise interact with this or previous runes.

func (Properties) BoundaryBefore

func (p Properties) BoundaryBefore() bool

BoundaryBefore returns true if this rune starts a new segment and cannot combine with any rune on the left.

func (Properties) CCC

func (p Properties) CCC() uint8

CCC returns the canonical combining class of the underlying rune.

func (Properties) Decomposition

func (p Properties) Decomposition() []byte

Decomposition returns the decomposition for the underlying rune or nil if there is none.

func (Properties) LeadCCC

func (p Properties) LeadCCC() uint8

LeadCCC returns the CCC of the first rune in the decomposition. If there is no decomposition, LeadCCC equals CCC.

func (Properties) Size

func (p Properties) Size() int

Size returns the length of the UTF-8 encoding of the rune.

func (Properties) TrailCCC

func (p Properties) TrailCCC() uint8

TrailCCC returns the CCC of the last rune in the decomposition. If there is no decomposition, TrailCCC equals CCC.