Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Status of Windows build #317

Open
6 of 10 tasks
jondegenhardt opened this issue Oct 4, 2020 · 6 comments
Open
6 of 10 tasks

Status of Windows build #317

jondegenhardt opened this issue Oct 4, 2020 · 6 comments
Labels

Comments

@jondegenhardt
Copy link
Contributor

jondegenhardt commented Oct 4, 2020

This issues tracks the status of native Windows builds for the toolkit.

Windows is not currently a supported platform. Windows users are encouraged to run Linux builds of the tools using either Windows Subsystem for Linux (WSL) or Docker for Windows.

Acknowledging the above caveat, it is useful to track known issues on Windows and resolve them over time. At present, the toolkit is built on Windows as part of CI tests, and large parts of the test suite run successfully. Cases where the test suite does not run successfully are primarily due to limitations in the test suite, not the tools. At this time there are no known bugs in the tools, but there are gaps in the runnable test suite and limited real-world testing.

The two main issues for having Windows support are having a complete CI test suite that runs on Windows, and resolving inconsistencies in Windows newline handling. Most other known issues are minor, though a couple complicate having a test suite shared across Windows and Unix platforms. With PRs #314 and #320 newline consistency issues are largely addressed.

Windows CI Test suite status

A Windows CI test suite was setup using GitHub Actions. See PRs #313 and #315. Current status:

  • Build (compile/link) w/ dub, DMD 64-bit - Done (passses)
  • Build (compile/link) w/ dub, LDC 64-bit - Done (passses)
  • Build (compile/link) w/ make, DMD 64-bit - Done (passses)
  • Build (compile/link) w/ make, LDC 64-bit - Done (passses)
  • Built-in unit tests w/ make unittest, DMD 64-bit - Passes, with minor caveats. See "Disabled unit tests" below.
  • Built-in unit tests w/ make unittest, LDC 64-bit - Fails in tsv-sample due to differences in floating point format output. As an example, a floating point value formatted as 0.75346697377181959 on Linux/MacOs is formatted as 0.75346697377181970 on Windows. This looks like a round-off difference in a mathematical calculation, nothing more.
  • Command line unit tests w/ make test, both DMD and LDC. These fail from the beginning because of newline differences between the golden set files from the repo and the results produced in the tests. In particular, the "error" and "help" tests that generate messages to end users are getting written using Windows newlines, but the "gold" files with expected results were generated on Unix. The main complication is that many expected results should be newline equivalent, so different strategies are needed for each.
  • Input files with Windows newlines - The test suite needs to include files with Windows newlines, for every tool.

Notes:

  • Disabled unit tests - A pair of rounding tests for common.tsv-utils.numerics.formatNumber function fail on Windows and were disabled. The cases are formatNumber(0.6, 0) and formatNumber(-0.6, 0). These should return "1" and "-1" respectively, but on Windows return "0" and "-0". This is due to incorrect results from the calls being made to std.format.format. std.format.format is likely calling snprintf from platform's C library. The disabled unit tests are in common.tsv-utils.numerics.d starting at lines 220 and 334.
    Update (03/14/21): This has been addressed in DMD 2.096, PR #7757 by moving the work into Phobos. Unit tests have been conditioned on version.

Windows Newline handling

On Unix and MacOs tsv-utils requires and generates Unix newlines. However, a newline handling policy has never been identified for running on a Windows platform. As a result, tools are inconsistent in the manner they handled Windows newlines when running on Windows. Some possible newline handling policies:

  1. Read and write Unix newlines only, on all platforms.
  2. Read and write using platform preferred linefeeds.
  3. Read either Windows or Unix linefeeds; Write Unix linefeeds
  4. Full customization via command line arguments.

Option 1 is simplest policy to support and what is being done initially. It is the easiest to enforce in the current code, and easiest to support in the current test suite. And, it is a reasonable choice in many environments, especially in circumstances where a mix of Unix and Windows platforms are in use. If data files are being shared, Unix newlines will normally be preferred. Option 1 is also consistent with other choices made in the toolkit. In particular, supporting only one file format (UTF-8 TSV), and delegating conversion to that format to other tools (e.g. dos2unix, csv2tsv).

Option 1 is largely in place with PRs #314 and #320, but the test suite still needs work to test it fully. Tasks:

Option 2 might be the preferred option in many traditional applications, but it is not clear if this is a good choice for data mining tools. In particular, it is very common to share data files between people, platforms, and tools. In such environments Unix newlines will be preferred. Switching to Windows newlines on Windows machines may be more an annoyance than a benefit.

Option 3, reading both newline forms, but writing Unix newlines, has some nice properties. And, it might be easier done than expected, as most tools use bufferedByLine. In particular, bufferedByLine.front handles newlines. However, a number of tools have their own reader functionality, so it would still be necessary to have a test suite for each tool. And, it is not really necessary given the availability of tools like dos2unix. Still, this option may be worth consideration.

Option 4, full customization of newline handling, would provide the most complete solution. However, it has a material downsides. It creates additional user complexity in the form of additional command line arguments. It also creates complexity in the tools and test suite. At present these downsides seem to outweigh the benefits.

Other issues

  • Floating point number formatting - This is described under "Disabled unit tests" above. This affects primarily tsv-pretty when printing floats in formatted forms. Appears to affect a relatively small number of forms, though it is undesirable when it occurs.
    Update (03/14/21): This has been addressed in DMD 2.096, PR #7757 by moving the work into Phobos. This should eliminate this problem, especially when the LDC release with this version is available.
@porteusconf
Copy link

Does the merge into v2.1.2 for #320 imply that, for newline handling going forward, it might be easiest to adopt option 3?

Option 3, reading both newline forms, but writing Unix newlines

In any case, while waiting for a full windows release/build, could just csv2tsv.exe be made availalble, assuming it passes any needed tests? This would enable Windows users to generate valid tsv from csv without excel, perhaps by:

  • validate foo.csv with some tool(s), for example https://csvlint.io/ But be warned, if you download the "standardized" csv they offer, I think it silently adds double-quotes around every field, including numbers. For example if foo.csv has a row foo,22 it becomes "foo","22" in the "standardized" csv file (not sure why).
  • csv2tsv.exe foo.csv > foo.tsv Note that csv2tsv by default removes double-quotes where not needed, so foo.tsv would be fooTAB22

May need some documentation on how to pass escaped command-line arguments to csv2tsv.exe in windows if using cmd or powershell...
For example, on linux/macos, we can create a file with scsv (semi-colon separated values) using something like these:

csv2tsv --tsv-delim   $";"    foo.csv  > foo.scsv
csv2tsv --tsv-delim   $';'    foo.csv  > foo.scsv
csv2tsv --tsv-delim    \;     foo.csv  > foo.scsv

And I'm thinking none of the above command-lines would work on windows. Perhaps ^; would work per which-symbol-is-escape-character-in-cmd

But if you need to specify tab as a command line argument, then instead of cmd windows folks may need to use powershell, which can escape tab as backtick-t

`t

per About special characters in PowerShell docs

Finally, another work-around avoiding both cmd and powershell completely, just install git-for-windows (choco install git or some other bash shell for windows). Then run csv2tsv.exe in that shell, if csv2tsv.exe can handle arguments passed to it from bash.exe.

@jondegenhardt
Copy link
Contributor Author

Hi @porteusconf. Thanks for the feedback and suggestions. Some comments in-line below.

Does the merge into v2.1.2 for #320 imply that, for newline handling going forward, it might be easiest to adopt option 3?

Option 3, reading both newline forms, but writing Unix newlines

Option 1, Unix newlines only on both input and output is by far the easiest (lowest investment cost). Option 3 is a fair bit more expensive. Much of this comes from increased test suite cost. Some because there are a several tools that have their own reader functionality (for example, tsv-sample).

A relevant question is how much additional benefit would be seen investing in option 3? It's a question I don't know the answer to. How many users, how prevalent are the data files, and how onerous are the alternatives, such as invoking dos2unix on the data first.

In any case, while waiting for a full windows release/build, could just csv2tsv.exe be made availalble, assuming it passes any needed tests? This would enable Windows users to generate valid tsv from csv without excel, ...

Well, I'm reluctant to create pre-built binary packages for only a single tool. However, I see the merit behind this idea, perhaps there are ways to get the desired effect.

First, note that nothing prevents cloning the git repo and building the tools on Windows. The test suite is not complete for Windows, but that doesn't mean the tools won't work properly. And to your point, csv2tsv would likely passes a more complete test suite simply because the csv2tsv test suite already includes examples of files with Windows newlines.

What could be done in this regard is to: (a) Publish test suite status info for csv2tsv by itself; (b) Add any missing csv2tsv tests; (c) Add specific instructions describing how to build on Windows.

perhaps by:

  • validate foo.csv with some tool(s), for example https://csvlint.io/ But be warned, if you download the "standardized" csv they offer, I think it silently adds double-quotes around every field, including numbers. For example if foo.csv has a row foo,22 it becomes "foo","22" in the "standardized" csv file (not sure why).
  • csv2tsv.exe foo.csv > foo.tsv Note that csv2tsv by default removes double-quotes where not needed, so foo.tsv would be fooTAB22

csv2tsv doesn't have any trouble reading any of these formats, but as you point out, it always generates escape-free TSV.

May need some documentation on how to pass escaped command-line arguments to csv2tsv.exe in windows if using cmd or powershell...

Good thoughts, thank you.

Finally, another work-around avoiding both cmd and powershell completely, just install git-for-windows (choco install git or some other bash shell for windows). Then run csv2tsv.exe in that shell, if csv2tsv.exe can handle arguments passed to it from bash.exe.

Agreed, it might make sense to include this option in the documentation.

@Imperatorn
Copy link

Status?

@jondegenhardt
Copy link
Contributor Author

Status?

Status as described in the main description is up-to-date. It is updated as things change. At present, there are no known failure cases on Windows. But, since the test suite doesn't run fully, it leaves unknowns. Also, there's a lack of real-world use on Windows, or at least use that gets reported. So it is more about unknowns at this point.

Do you have specific questions?

@Imperatorn
Copy link

No, I was just wondering why there weren't any Windows binaries. I've put them here for anyone interested:
https://github.com/Imperatorn/tsv-utils/releases

@jondegenhardt
Copy link
Contributor Author

@Imperatorn Great to know there's someone trying the tools on Windows! Report any problems you have, Windows related or otherwise!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants