Releases: clipperhouse/uax29
Transforms
Adds the ability to transform tokens, such as lowercasing, based on golang.org/x/text/transform.
The signature of `SegmentAll()` has changed, removing the variadic parameter at the end, which is a breaking change if you used that param. I’m not willing to go to v2 over that; hopefully it’s not a problem. :)
Filters
Adds the ability to filter tokens, using the `Filter()` method on `Segmenter` and `Scanner`.
Any `func([]byte) bool` can serve as a filter, with arbitrary logic. One included filter is `Wordlike`, which removes whitespace and punctuation tokens, returning only ‘words’ in the common sense.
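For instance, a hand-rolled filter might keep only tokens that contain at least one letter or digit. This is merely an approximation of what `Wordlike` does; the exact categories it checks are an assumption here.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// wordlike reports whether a token contains at least one letter or
// digit. A hand-rolled approximation of the package's Wordlike
// filter; the library's exact category logic may differ.
func wordlike(token []byte) bool {
	for len(token) > 0 {
		r, size := utf8.DecodeRune(token)
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			return true
		}
		token = token[size:]
	}
	return false
}

func main() {
	for _, tok := range []string{"Hello", " ", ",", "42"} {
		fmt.Printf("%q -> %v\n", tok, wordlike([]byte(tok)))
	}
	// "Hello" -> true
	// " " -> false
	// "," -> false
	// "42" -> true
}
```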
See also the `Contains()` and `Entirely()` methods, which allow creation of filters based on Unicode categories.
New Segmenter
Introduces a new `Segmenter` type to tokenize `[]byte`. It is analogous to the existing `Scanner`, but does not require an `io.Reader`. `Segmenter` is perhaps 10% faster, assuming you are operating on an existing `[]byte`.
Also adds a `SegmentAll([]byte)` convenience function, if you’d like to do this as a one-liner and are not too concerned about allocations.
Handle more invalid UTF-8
Invalid UTF-8 input is considered undefined behavior. That said, we’ve worked to ensure that such inputs will not cause pathological outcomes, such as a panic or infinite loop. Callers should expect “garbage-in, garbage-out”.
Each package includes two basic tests, `TestInvalidUTF8` and `TestRandomBytes`. Those tests pass: the invalid bytes are returned verbatim, though without a guarantee as to how they will be segmented.
v1.6.5
Many performance improvements, over 2× faster than the v1.6.0 release.
v1.6.0
Replaces range tables with tries, using the triegen package from x/text. The performance improvement looks to be 30-40%.
This is a breaking change for consumers that depended on those range tables, which were exported. For normal usage of calling the scanner, the API is unchanged.
This should be a new major version, but I can’t be bothered with the v2 renaming, in terms of discoverability, etc. Hopefully OK for most users.
v1.5.0
The implementation is now purely based on `bufio.SplitFunc` and `bufio.Scanner`. Our custom scanner and rune buffer are gone.
Perf improvements are substantial for words and graphemes; sentences appear to have regressed. Allocations are down to single digits, and are independent of input size.
This is a breaking change: the API is smaller. However, the main use case of calling (e.g.) `words.NewScanner` and iterating over it should be unchanged for most consumers. The methods are the same.
So, I should increment the major version, but the recommended v2 business is more than I want to deal with. Hopefully it’s OK for most.
v1.2.1
Performance and docs tweaks.
v1.2.0
New `SplitFunc`s for words, sentences and graphemes, for use with `bufio.Scanner`. These are provided for convenience if you are already using `bufio.Scanner` and want a drop-in replacement `SplitFunc`.
The ‘native’ `(words|sentences|graphemes).NewScanner` is ~10% faster; prefer it if you are starting from scratch.
v1.1.0
New unified `uax29.Scanner`, used by the subpackages, instead of each having its own `Scanner` type.