Standard-compliant `wc` implementation #97

ashvardanian · 2024-02-20T05:54:32Z

The 4c738ea commit introduces a prototype for StringZilla-based Command-Line toolkit, including the wc utility replacement. The original prototype suggests a 3x performance improvement opportunity, but it can't currently handle multiple inputs and flags. Those must be easy to add in cli/wc.py purely in the Python layer. We are aiming to match the following specification:

$ wc --help
Usage: wc [OPTION]... [FILE]...
  or:  wc [OPTION]... --files0-from=F
Print newline, word, and byte counts for each FILE, and a total line if
more than one FILE is specified.  A word is a non-zero-length sequence of
characters delimited by white space.

With no FILE, or when FILE is -, read standard input.

The options below may be used to select which counts are printed, always in
the following order: newline, word, character, byte, maximum line length.
  -c, --bytes            print the byte counts
  -m, --chars            print the character counts
  -l, --lines            print the newline counts
      --files0-from=F    read input from the files specified by
                           NUL-terminated names in file F;
                           If F is - then read names from standard input
  -L, --max-line-length  print the maximum display width
  -w, --words            print the word counts
      --help     display this help and exit
      --version  output version information and exit

GNU coreutils online help: <https://www.gnu.org/software/coreutils/>
Report any translation bugs to <https://translationproject.org/team/>
Full documentation <https://www.gnu.org/software/coreutils/wc>
or available locally via: info '(coreutils) wc invocation'

The text was updated successfully, but these errors were encountered:

Added flags and file input functionalities to split. Updated usage commands in the README. First progress on #97 and #98, but still WIP.

ashvardanian · 2024-02-21T17:16:42Z

I've merged some intermediate patches by @ghazariann, but some parts have to be reimplemented. Like this:

    if args.max_line_length:
        max_line_length = max(len(line) for line in str(mapped_bytes).split("\n"))
        counts["max_line_length"] = max_line_length

It is expensive to convert to str and even more expensive to split it.

MarkReedZ · 2024-05-15T05:00:34Z

We're missing tests and don't handle locale.

Some thoughts on test

Stdin

Redirection - Note that --files0-from needs to pull a nul delimited list of filenames

find . -name '*.[ch]' -print0 |   wc -L --files0-from=-
cat xxx | wc -l

Word Count

We only count spaces so add tests for adjacent and other whitespace.

Line Count

If a file ends in a non-newline character, its trailing partial line is not counted.

Max Line Length

Tabs are set at every 8th column. Display widths of wide characters are considered. Non-printable characters are given 0 width.

Locale

-m --chars Print only the character counts, as per the current locale. ( utf-8, and utf-16 support needed ) Encoding errors are not counted. locale.getencoding / setencoding

We'd need to scan for non ascii codepoints in the input.

-w --words Uses locale specific whitespace.

wc likely doesn't really do this per locale. We'd need to test a few locales. To do this we'd have to scan for non ascii and have a list of unicode whitespace to compare.

References

https://www.gnu.org/software/coreutils/manual/html_node/wc-invocation.html#wc-invocation
https://www.mkssoftware.com/docs/man1/wc.1.asp#:~:text=wc%20counts%20the%20number%20of,16%2Dbit%20wide%20Unicode%20files.

ashvardanian · 2024-05-15T05:03:38Z

Can we detect those locale-based settings in the Python implementation of wc, without changing the core C implementation and the Python binding?

MarkReedZ · 2024-05-15T06:28:36Z

For counting characters we can locale.getencoding() in python then a naive approach would be len(bytes.decode('utf-8')) which would not be performant. Ultimately we'd want to be able to scan for unicode characters ( & 0x80 ) and consume them as the character could be 2-4 bytes. If the library does not have a way to find bytes with the first bit set (& 0x80) we'd have to add it.

For counting words I believe we want a find_charset function that we can use with the whitespace character set.

ashvardanian · 2024-05-15T06:43:21Z

For the first part we can temporarily compensate that by performing several runs over data - one for each multi-byte rune.

MarkReedZ · 2024-05-15T14:28:15Z

UTF-8 looks like this - you can count bits for the character size once you see the left most bit set. Languages like Chinese will be all unicode characters. I speak Chinese so optimized this in mrjson. I'll setup tests next.

    110xxxxx 10xxxxxx
    1110xxxx 10xxxxxx 10xxxxxx
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

ashvardanian added good first issue Good for newcomers python labels Feb 20, 2024

ashvardanian pushed a commit that referenced this issue Feb 21, 2024

Improve ws and split CLI (#99)

c878caf

Added flags and file input functionalities to split. Updated usage commands in the README. First progress on #97 and #98, but still WIP.

ashvardanian mentioned this issue Feb 21, 2024

Standard-compliant ws and split implementation (Issue 97 98) #99

Merged

MarkReedZ mentioned this issue May 15, 2024

The wc cli has been fixed up #152

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Standard-compliant `wc` implementation #97

Standard-compliant `wc` implementation #97

ashvardanian commented Feb 20, 2024

ashvardanian commented Feb 21, 2024

MarkReedZ commented May 15, 2024

ashvardanian commented May 15, 2024

MarkReedZ commented May 15, 2024

ashvardanian commented May 15, 2024

MarkReedZ commented May 15, 2024

Standard-compliant wc implementation #97

Standard-compliant wc implementation #97

Comments

ashvardanian commented Feb 20, 2024

ashvardanian commented Feb 21, 2024

MarkReedZ commented May 15, 2024

ashvardanian commented May 15, 2024

MarkReedZ commented May 15, 2024

ashvardanian commented May 15, 2024

MarkReedZ commented May 15, 2024

Standard-compliant `wc` implementation #97

Standard-compliant `wc` implementation #97