why don't you use mmfile? #4

Open
wilzbach opened this issue May 17, 2016 · 4 comments

Comments

@wilzbach
Contributor

A couple of weeks ago I did a (noobish) benchmark that compared different D functions with C, C++ and Python.
By far the fastest was the following code:

import std.algorithm, std.array, std.conv, std.mmfile;

void main(string[] args)
{
    long counter = 0;
    scope mmFile = new MmFile(args[1]);
    auto file = splitter(cast(string) mmFile[0 .. mmFile.length], '\n').filter!"!a.empty";
    file.popFront; // skip the header line
    // parse the body (or whatever you want to do with it)
    foreach (line; file)
    {
        int[] csv = line.splitter(' ').map!(to!int).array;
        counter += csv.sum;
    }
}

I am just wondering whether you didn't know about this or whether there was a reason against it.
Memory-mapping also works well with extremely large files.

https://dlang.org/phobos/std_mmfile.html
https://en.wikipedia.org/wiki/Memory-mapped_file

@jondegenhardt
Contributor

Thanks for the comment and the tip. I know about memory-mapped files, but I hadn't seen the mmfile facilities in Phobos. Even so, I'd be inclined to be cautious when they'd be used with machines and files that aren't my own. This is perhaps out-of-date info now, but in my prior experience with memory-mapped files it was always the case that system-specific aspects mattered. And these utilities sometimes get used with quite large files, multiple gigabytes.

Related approaches I deliberately chose not to take were slurping entire files into memory and writing my own buffering code.

This was partly philosophical. I wasn't setting out to create the absolute fastest utilities. I was trying to create utilities the way a data scientist who typically uses Python or similar might create them, and see how D's performance stacked up. D actually did pretty well. I had to avoid auto-decoding, and the csv2tsv converter is slow for reasons I haven't figured out yet. But overall pretty good.

But back to the approach rationale: the way I wrote it avoids a couple of complications. One is that standard input and files work the same way, no special casing. The other is that reading entire files generally bypasses the system-specific newline detection, so generic code needs to handle both forms of newline (e.g. CRLF on Windows, LF on Unix). And to be honest, I expect the underlying libraries to provide good buffering without needing to write my own.
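
To make that concrete, here is a minimal sketch of the byLine-style loop described above (an illustration only, not the actual tsv-utils code): named files and standard input go through the same path, a trailing CR is dropped so both CRLF and LF input work, and buffering is left to the standard library.

import std.stdio : File, stdin, writeln;

void main(string[] args)
{
    // "-" (or no file arguments at all) means standard input; otherwise open the named files.
    foreach (filename; args.length > 1 ? args[1 .. $] : ["-"])
    {
        auto inputStream = (filename == "-") ? stdin : File(filename);
        foreach (line; inputStream.byLine)
        {
            // byLine splits on '\n'; dropping a trailing '\r' handles CRLF input as well.
            if (line.length > 0 && line[$ - 1] == '\r') line = line[0 .. $ - 1];
            writeln(line);    // real per-line processing would go here
        }
    }
}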

There are definitely people who would disagree with these choices, and if my primary goal was the very fastest performing tools I would change a few things as well. As it is, they are actually pretty good. tsv-filter in particular runs very fast.

@wilzbach
Contributor Author

wilzbach commented Jun 1, 2016

> Thanks for the comment and the tip. I know about memory-mapped files, but I hadn't seen the mmfile facilities in Phobos. Even so, I'd be inclined to be cautious when they'd be used with machines and files that aren't my own.

As said, I just measured that it's twice as fast in my simple experiments, and I was interested in whether you had actually tried it.

> And these utilities sometimes get used with quite large files, multiple gigabytes.

" A possible benefit of memory-mapped files is a "lazy loading", thus using small amounts of RAM even for a very large file. Trying to load the entire contents of a file that is significantly larger than the amount of memory available can cause severe thrashing as the operating system reads from disk into memory and simultaneously writes pages from memory back to disk. Memory-mapping may not only bypass the page file completely, but the system only needs to load the smaller page-sized sections as data is being edited, similarly to demand paging scheme used for programs."

(from Wikipedia)

> One is that standard input and files work the same way, no special casing.

If you see a file as a range, then there's no special casing ;-)
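
To illustrate the point with a sketch of my own (not code from either side of this thread): if the per-line work is written against a generic range of lines, a memory-mapped file and standard input can feed the exact same code path.

import std.algorithm : each, splitter;
import std.mmfile : MmFile;
import std.stdio : stdin, writeln;

void main(string[] args)
{
    // Any range of lines works here, regardless of where the lines come from.
    static void processLines(LineRange)(LineRange lines)
    {
        lines.each!writeln;    // real per-line work would go here
    }

    if (args.length > 1)
    {
        scope mmFile = new MmFile(args[1]);
        processLines((cast(string) mmFile[0 .. mmFile.length]).splitter('\n'));
    }
    else
    {
        processLines(stdin.byLine);
    }
}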

> so generic code needs to handle both forms of newline (e.g. CRLF on Windows, LF on Unix)

byLine uses \n by default too ;-)

@dejlek

dejlek commented Jul 16, 2020

@wilzbach - your code maps the entire file, which is completely unnecessary, IMHO.
A good approach, I believe, would be to use iopipe (https://github.com/schveiguy/iopipe), as it does very clever buffering and offers many other goodies.

@jondegenhardt
Contributor

jondegenhardt commented Jul 16, 2020

Hi @dejlek -- FWIW, I did experiment with memory-mapped files at one point, a good bit after @wilzbach's original suggestion. I haven't used them so far, but there are places in the tools where they warrant consideration.

Mostly, at this point I haven't wanted to worry about the distinctions between reading infinite/indefinite size streams, streaming large vs small files, multiple files, and reading full files into memory. But also, I didn't see big performance wins on the tests I ran. I suspect this has more to do with the specific tests I ran than with the technique, but clearly it would take a bit more time investment to characterize the cases better.

There are cases in the toolset where MM files would really seem to make sense. For example, a couple of the sampling methods provided by tsv-sample require reading the full file into memory. When the data is backed by a single disk file (common), a memory-mapped file would seem the way to go.
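
As a rough illustration of that idea (a hypothetical sketch, not tsv-sample's actual code; the sampling step here is just std.random.randomSample): the file can be mapped, split into line slices that point into the mapping, and sampled without first copying the contents into GC-managed memory.

import std.algorithm : filter, min, splitter;
import std.array : array;
import std.mmfile : MmFile;
import std.random : randomSample;
import std.stdio : writeln;

void main(string[] args)
{
    scope mmFile = new MmFile(args[1]);
    auto text = cast(string) mmFile[0 .. mmFile.length];

    // The line array holds slices into the mapping, not copies of the data.
    auto lines = text.splitter('\n').filter!(l => l.length > 0).array;

    // Print a simple random sample of up to 10 lines (the sampling method is illustrative only).
    foreach (line; randomSample(lines, min(10, lines.length)))
        writeln(line);
}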

As to iopipe - tests I've run indicate it has superior overall performance for handling input streams. I didn't compare against MM files, but iopipe is definitely on the right track.
