Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

grammar clarifications regarding tabs and carriage returns #38

Open
andrewrk opened this issue Jul 8, 2021 · 6 comments
Open

grammar clarifications regarding tabs and carriage returns #38

andrewrk opened this issue Jul 8, 2021 · 6 comments

Comments

@andrewrk
Copy link
Member

andrewrk commented Jul 8, 2021

NL = New Line (0x0a)
CR = Carriage Return (0x0d)
TAB = Tab (0x09)

Inside Line Comments and Documentation Comments

  • Any TAB is rejected by the grammar since it is ambiguous how it should be rendered.
  • CR directly preceding NL is unambiguously part of the newline sequence. It is accepted by the grammar and removed by zig fmt, leaving only NL.
  • CR anywhere else is rejected by the grammar.

Inside Multi-Line String Literals

  • zig fmt may not mangle multi line string literals, and therefore the control character TAB are rejected by the grammar inside multi-line string literals.
  • CR inside the multiline string literal is also rejected for the same reason
  • However CR directly before NL is interpreted as only a newline and not part of the multiline string. zig fmt will delete the CR.

For string literals that want to include CR, TAB, or any other control sequences, they will need to use regular string literals and the ++ operator, or @embedFile.

Whitespace

  • TAB used as whitespace is unambiguous. It is accepted by the grammar and replaced by the canonical whitespace by zig fmt.
  • CR used as whitespace, whether directly preceding NL or stray, is still unambiguously whitespace. It is accepted by the grammar and replaced by the canonical whitespace by zig fmt.

astronaute-meme

@vrischmann
Copy link

vrischmann commented Jul 15, 2021

This

zig fmt may not mangle multi line string literals, and therefore the control characters TAB and CR are rejected by the grammar inside multi-line string literals.

causes a problem when using git on windows, unless I'm missing something.

If you clone this gist (which contains no CR) on Windows with the default git configuration, it produces this:

PS C:\Users\vincent\dev\tmp\zig-invalid-bytes2> xxd .\main.zig
00000000: 666e 2063 7265 6174 6554 6573 7454 6162  fn createTestTab
00000010: 6c65 7328 2920 2176 6f69 6420 7b0d 0a20  les() !void {..
00000020: 2020 2063 6f6e 7374 2062 6172 203d 2026     const bar = &
00000030: 5b5f 5d5b 5d63 6f6e 7374 2075 387b 0d0a  [_][]const u8{..
00000040: 2020 2020 2020 2020 5c5c 666f 6f62 6172          \\foobar
00000050: 0d0a 2020 2020 7d3b 0d0a 2020 2020 5f20  ..    };..    _
00000060: 3d20 6261 723b 0d0a 7d                   = bar;..}

which does not compile:

PS C:\Users\vincent\dev\tmp\zig-invalid-bytes2> zig build-exe .\main.zig
.\main.zig:3:17: error: expected ',', found invalid bytes
        \\foobar
                ^
.\main.zig:3:18: note: invalid byte: '\n'
        \\foobar
                 ^

it is the intended behaviour but IMO it's surprising. Currently as far as I can see you need to either configure git with core.autocrlf=false or use a gitattributes files.

@thejoshwolfe
Copy link
Sponsor Contributor

It still surprises me that Git's default configuration is to give you the wrong bytes on windows. Can you clarify which client you're using? is it provided by GitHub? by git-scm.com? by a cygwin package manager (probably not this one)?

As much as I'd like to say that Git is wrong (which i kinda did in the last paragraph), the conflict is arising from the concept of a "text file", which many people think is an array of lines rather than a single byte array. Sometimes discussions of character encoding get tangled up in the concept of a "text file" too.

But Zig is not interested in all the complexity of "text files", and specifies always UTF-8, always spaces over tabs, always LF line endings. There's some room for accepting and canonicalizing (in zig fmt) unambiguous alternatives for everyone's convenience, but CRLF line endings are officially the wrong way to end lines in Zig (along with NEL, LS, and PS). While it bothers me emotionally that some Git cient on Windows is wasting our time here talking about CRLF line endings, it seems important to get this right to avoid even more wasted time for everyone in the future.

With that preamble ramble out of the way: i think we should definitely canonicalize CRLF line endings in multiline string literals to LF line endings.

@vrischmann
Copy link

I'm using the installer from git-scm.

@Vexu
Copy link
Member

Vexu commented Jul 15, 2021

Couldn't we just add an error like file uses tabs/carriage returns; run 'zig fmt {path-to-file}' to convert it to the canonical form. It would allow us to completely ban tabs/CRLF while also providing a pretty decent user experience.

If mangling the files is an issue, the feature can be put behind a --convert-whitespace flag.

@nektro
Copy link

nektro commented Apr 28, 2022

could even make zig build call zig fmt on the whole directory before running, or otherwise tell them to run it on the whole folder since its likely not the one file thats mangled

@andrewrk
Copy link
Member Author

Couldn't we just add an error like file uses tabs/carriage returns; run 'zig fmt {path-to-file}' to convert it to the canonical form. It would allow us to completely ban tabs/CRLF while also providing a pretty decent user experience.

Yes, I'm thinking along the same lines. Make it a hard error, just like unused locals, and then have --autofix (as well as zig fmt) make the compilation "just work" while modifying the source files to comply.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants