Commit

Update README regarding invalid UTF-8
clipperhouse committed Jun 29, 2020
1 parent b1dbb10 commit ac0dcd5
Showing 4 changed files with 30 additions and 6 deletions.
6 changes: 6 additions & 0 deletions README.md
@@ -55,6 +55,12 @@ In my local benchmarking (Mac laptop), `uax29/words` processes around 16MM token

- There is [discussion](https://groups.google.com/d/msg/golang-nuts/_79vJ65KuXc/B_QgeU6rAgAJ) of implementing the above in Go’s [`x/text`](https://godoc.org/golang.org/x/text) package

### Invalid inputs

Invalid UTF-8 input is considered undefined behavior. That said, we’ve worked to ensure that such inputs will not cause pathological outcomes, such as a panic or infinite loop. Callers should expect “garbage-in, garbage-out”.

There are two basic tests in each package, called `TestInvalidUTF8` and `TestRandomBytes`. Those tests pass, returning the invalid bytes verbatim, without a guarantee as to how they will be segmented.

### See also

[jargon](https://github.com/clipperhouse/jargon), a text pipelines package for CLI and Go, which consumes this package.
12 changes: 9 additions & 3 deletions graphemes/README.md
@@ -23,14 +23,20 @@ if err := scanner.Err(); err != nil {

[GoDoc](https://godoc.org/github.com/clipperhouse/uax29/graphemes)

### Conformance

We use the official [test suite](https://unicode.org/reports/tr41/tr41-26.html#Tests29), thanks to [bleve](https://github.com/blevesearch/segment/blob/master/tables_test.go). Status:

![Go](https://github.com/clipperhouse/uax29/workflows/Go/badge.svg)

### Performance

`uax29` is designed to work in constant memory, regardless of input size. It buffers input and streams tokens. (For example, I see a maximum resident set size of 8MB when processing a 300MB file on my laptop.)

Execution time is `O(n)` on input size. It can be I/O bound; you can control I/O and performance implications by the `io.Reader` you pass to `NewScanner`.

### Invalid inputs

Invalid UTF-8 input is considered undefined behavior. That said, we’ve worked to ensure that such inputs will not cause pathological outcomes, such as a panic or infinite loop. Callers should expect “garbage-in, garbage-out”.

There are two basic tests in each package, called `TestInvalidUTF8` and `TestRandomBytes`. Those tests pass, returning the invalid bytes verbatim, without a guarantee as to how they will be segmented.
12 changes: 9 additions & 3 deletions sentences/README.md
@@ -23,14 +23,20 @@ if err := scanner.Err(); err != nil {

[GoDoc](https://godoc.org/github.com/clipperhouse/uax29/sentences)

### Conformance

We use the official [test suite](https://unicode.org/reports/tr41/tr41-26.html#Tests29), thanks to [bleve](https://github.com/blevesearch/segment/blob/master/tables_test.go). Status:

![Go](https://github.com/clipperhouse/uax29/workflows/Go/badge.svg)

### Performance

`uax29` is designed to work in constant memory, regardless of input size. It buffers input and streams tokens.

Execution time is `O(n)` on input size. It can be I/O bound; you can control I/O and performance implications by the `io.Reader` you pass to `NewScanner`.

### Invalid inputs

Invalid UTF-8 input is considered undefined behavior. That said, we’ve worked to ensure that such inputs will not cause pathological outcomes, such as a panic or infinite loop. Callers should expect “garbage-in, garbage-out”.

There are two basic tests in each package, called `TestInvalidUTF8` and `TestRandomBytes`. Those tests pass, returning the invalid bytes verbatim, without a guarantee as to how they will be segmented.
6 changes: 6 additions & 0 deletions words/README.md
@@ -36,3 +36,9 @@ We use the official [test suite](https://unicode.org/reports/tr41/tr41-26.html#T
Execution time is `O(n)` on input size. It can be I/O bound; I/O performance is determined by the `io.Reader` you pass to `NewScanner`.

In my local benchmarking (Mac laptop), `uax29/words` processes around 16MM tokens per second, or 40MB/s, of [wiki text](https://en.wikipedia.org/w/index.php?title=New_York_City&action=edit).

### Invalid inputs

Invalid UTF-8 input is considered undefined behavior. That said, we’ve worked to ensure that such inputs will not cause pathological outcomes, such as a panic or infinite loop. Callers should expect “garbage-in, garbage-out”.

There are two basic tests in each package, called `TestInvalidUTF8` and `TestRandomBytes`. Those tests pass, returning the invalid bytes verbatim, without a guarantee as to how they will be segmented.
