Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Standard-compliant wc implementation #97

Open
ashvardanian opened this issue Feb 20, 2024 · 6 comments
Open

Standard-compliant wc implementation #97

ashvardanian opened this issue Feb 20, 2024 · 6 comments
Labels
good first issue Good for newcomers python

Comments

@ashvardanian
Copy link
Owner

The 4c738ea commit introduces a prototype for StringZilla-based Command-Line toolkit, including the wc utility replacement. The original prototype suggests a 3x performance improvement opportunity, but it can't currently handle multiple inputs and flags. Those must be easy to add in cli/wc.py purely in the Python layer. We are aiming to match the following specification:

$ wc --help
Usage: wc [OPTION]... [FILE]...
  or:  wc [OPTION]... --files0-from=F
Print newline, word, and byte counts for each FILE, and a total line if
more than one FILE is specified.  A word is a non-zero-length sequence of
characters delimited by white space.

With no FILE, or when FILE is -, read standard input.

The options below may be used to select which counts are printed, always in
the following order: newline, word, character, byte, maximum line length.
  -c, --bytes            print the byte counts
  -m, --chars            print the character counts
  -l, --lines            print the newline counts
      --files0-from=F    read input from the files specified by
                           NUL-terminated names in file F;
                           If F is - then read names from standard input
  -L, --max-line-length  print the maximum display width
  -w, --words            print the word counts
      --help     display this help and exit
      --version  output version information and exit

GNU coreutils online help: <https://www.gnu.org/software/coreutils/>
Report any translation bugs to <https://translationproject.org/team/>
Full documentation <https://www.gnu.org/software/coreutils/wc>
or available locally via: info '(coreutils) wc invocation'
ashvardanian pushed a commit that referenced this issue Feb 21, 2024
Added flags and file input functionalities to split.
Updated usage commands in the README.
First progress on #97 and #98, but still WIP.
@ashvardanian
Copy link
Owner Author

I've merged some intermediate patches by @ghazariann, but some parts have to be reimplemented. Like this:

    if args.max_line_length:
        max_line_length = max(len(line) for line in str(mapped_bytes).split("\n"))
        counts["max_line_length"] = max_line_length

It is expensive to convert to str and even more expensive to split it.

@MarkReedZ
Copy link

We're missing tests and don't handle locale.

Some thoughts on test

Stdin

Redirection - Note that --files0-from needs to pull a nul delimited list of filenames

find . -name '*.[ch]' -print0 |   wc -L --files0-from=-
cat xxx | wc -l

Word Count

We only count spaces so add tests for adjacent and other whitespace.

Line Count

If a file ends in a non-newline character, its trailing partial line is not counted.

Max Line Length

Tabs are set at every 8th column. Display widths of wide characters are considered. Non-printable characters are given 0 width.

Locale

-m --chars Print only the character counts, as per the current locale. ( utf-8, and utf-16 support needed ) Encoding errors are not counted. locale.getencoding / setencoding

  • We'd need to scan for non ascii codepoints in the input.

-w --words Uses locale specific whitespace.

  • wc likely doesn't really do this per locale. We'd need to test a few locales. To do this we'd have to scan for non ascii and have a list of unicode whitespace to compare.

References

https://www.gnu.org/software/coreutils/manual/html_node/wc-invocation.html#wc-invocation
https://www.mkssoftware.com/docs/man1/wc.1.asp#:~:text=wc%20counts%20the%20number%20of,16%2Dbit%20wide%20Unicode%20files.

@ashvardanian
Copy link
Owner Author

Can we detect those locale-based settings in the Python implementation of wc, without changing the core C implementation and the Python binding?

@MarkReedZ
Copy link

For counting characters we can locale.getencoding() in python then a naive approach would be len(bytes.decode('utf-8')) which would not be performant. Ultimately we'd want to be able to scan for unicode characters ( & 0x80 ) and consume them as the character could be 2-4 bytes. If the library does not have a way to find bytes with the first bit set (& 0x80) we'd have to add it.

For counting words I believe we want a find_charset function that we can use with the whitespace character set.

@ashvardanian
Copy link
Owner Author

For the first part we can temporarily compensate that by performing several runs over data - one for each multi-byte rune.

@MarkReedZ
Copy link

UTF-8 looks like this - you can count bits for the character size once you see the left most bit set. Languages like Chinese will be all unicode characters. I speak Chinese so optimized this in mrjson. I'll setup tests next.

    110xxxxx 10xxxxxx
    1110xxxx 10xxxxxx 10xxxxxx
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers python
Projects
None yet
Development

No branches or pull requests

2 participants