Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clevercsv sniffer slows to a crawl on large-ish files (e.g. FEC data) #15

Open
jlumbroso opened this issue May 19, 2020 · 5 comments
Open

Comments

@jlumbroso
Copy link

Hello,

This is a very neat project! I was thinking "I should collect a bunch of CSV files from the web and do statistics to see what the dialects are, and their predominance, to be able to better detect them" and then I found your paper and Python package! Congrats on this very nice contribution.

I am trying to see how clevercsv performs on FEC data. For instance, let's consider this file:

https://www.fec.gov/files/bulk-downloads/1980/indiv80.zip

$ head -5 fec-indiv-1979–1980.csv
C00078279|A|M11|P|80031492155|22Y||MCKENNON, K R|MIDLAND|MI|00000|||10031979|400|||||CONTRIBUTION REF TO INDIVIDUAL|3062020110011466469
C00078279|A|M11||79031415137|15||OREFFICE, P|MIDLAND|MI|00000|DOW CHEMICAL CO||10261979|1500||||||3061920110000382948
C00078279|A|M11||79031415137|15||DOWNEY, J|MIDLAND|MI|00000|DOW CHEMICAL CO||10261979|300||||||3061920110000382949
C00078279|A|M11||79031415137|15||BLAIR, E|MIDLAND|MI|00000|DOW CHEMICAL CO||10261979|1000||||||3061920110000382950
C00078287|A|Q1||79031231889|15||BLANCHARD, JOHN A|CHICAGO|IL|60685|||03201979|200||||||3061920110000383914

When I try to open the file with clevercsv, it takes an inordinate of time, and seems to be hanging. So I tried to use the sniffer as suggested in your example Binder.

# downloaded, unzipped and renamed to a CSV file from:
# https://www.fec.gov/files/bulk-downloads/1980/indiv80.zip
content = open("fec-indiv-1979–1980.csv").read()
clevercsv.Sniffer().sniff(content, verbose=True)

It prints out this:

Running normal form detection ...
Not normal, has potential escapechar.
Running data consistency measure ...

and then a while later (a few minutes later) starts printing:

Considering 92 dialects.
SimpleDialect(',', '', ''):	P =    22104.867952	T =        0.003101	Q =       68.546737
SimpleDialect(',', '"', ''):	P =    13927.762095	T =        0.003668	Q =       51.090510
SimpleDialect(',', '"', '/'):	P =    13839.682333	T =        0.002461	Q =       34.060128
SimpleDialect(',', "'", ''):	P =    12072.093333	T =        0.003278	Q =       39.571560
SimpleDialect(';', '', ''):	P =      106.613556	T =        0.000003	Q =        0.000345
SimpleDialect(';', '"', ''):	P =       99.261000	T =        0.000000	Q =        0.000000
SimpleDialect(';', '"', '/'):	P =       50.238917	skip.
SimpleDialect(';', "'", ''):	P =       49.981222	skip.
SimpleDialect('', '', ''):	P =      308.696000	T =        0.000000	Q =        0.000000
SimpleDialect('', '"', ''):	P =      194.530000	T =        0.000000	Q =        0.000000
SimpleDialect('', '"', '/'):	P =       96.652000	T =        0.000000	Q =        0.000000
SimpleDialect('', "'", ''):	P =      144.787000	T =        0.000000	Q =        0.000000
SimpleDialect(' ', '', ''):	P =    17818.683137	T =        0.346978	Q =     6182.686103
SimpleDialect(' ', '', '/'):	P =    17818.565863	T =        0.346984	Q =     6182.762051
SimpleDialect(' ', '"', ''):	P =    11300.749933	T =        0.353544	Q =     3995.309179
SimpleDialect(' ', '"', '/'):	P =    10372.973520	T =        0.355343	Q =     3685.960429
SimpleDialect(' ', "'", ''):	P =     7231.699311	T =        0.343354	Q =     2483.032090
SimpleDialect(' ', "'", '/'):	P =     7231.658120	T =        0.343362	Q =     2483.075319
SimpleDialect('#', '', ''):	P =      163.330000	skip.
SimpleDialect('#', '"', ''):	P =      103.253000	skip.
SimpleDialect('#', '"', '/'):	P =       67.761333	skip.
SimpleDialect('#', "'", ''):	P =       78.132000	skip.
SimpleDialect('$', '', ''):	P =      155.096500	skip.
SimpleDialect('$', '"', ''):	P =       97.764000	skip.
SimpleDialect('$', '"', '/'):	P =       64.601000	skip.
SimpleDialect('$', "'", ''):	P =       72.892500	skip.
SimpleDialect('%', '', ''):	P =      104.950222	skip.
SimpleDialect('%', '', '\\'):	P =      104.783889	skip.
SimpleDialect('%', '"', ''):	P =       65.896889	skip.
SimpleDialect('%', '"', '/'):	P =       65.765333	skip.
SimpleDialect('%', '"', '\\'):	P =       65.730556	skip.
SimpleDialect('%', "'", ''):	P =       49.648556	skip.
SimpleDialect('%', "'", '\\'):	P =       49.482222	skip.
SimpleDialect('&', '', ''):	P =     2940.570750	skip.
SimpleDialect('&', '', '/'):	P =     2940.446000	skip.
SimpleDialect('&', '"', ''):	P =     1936.209667	skip.
SimpleDialect('&', '"', '/'):	P =     1441.305200	skip.
SimpleDialect('&', "'", ''):	P =     1340.900250	skip.
SimpleDialect('&', "'", '/'):	P =     1340.775500	skip.
SimpleDialect('*', '', ''):	P =      156.344000	skip.
SimpleDialect('*', '"', ''):	P =       97.514500	skip.
SimpleDialect('*', '"', '/'):	P =       65.599000	skip.
SimpleDialect('*', "'", ''):	P =       73.142000	skip.
SimpleDialect('+', '', ''):	P =      156.344000	skip.
SimpleDialect('+', '', '\\'):	P =      156.094500	skip.
SimpleDialect('+', '"', ''):	P =       99.011500	skip.
SimpleDialect('+', '"', '/'):	P =       65.266333	skip.
SimpleDialect('+', '"', '\\'):	P =       99.011500	skip.
SimpleDialect('+', "'", ''):	P =       73.890500	skip.
SimpleDialect('+', "'", '\\'):	P =       73.890500	skip.
SimpleDialect('-', '', ''):	P =     1456.570500	skip.
SimpleDialect('-', '"', ''):	P =      914.921750	skip.
SimpleDialect('-', '"', '/'):	P =      700.916267	skip.
SimpleDialect('-', "'", ''):	P =      687.513667	skip.
SimpleDialect(':', '', ''):	P =      155.096500	skip.
SimpleDialect(':', '"', ''):	P =       97.514500	skip.
SimpleDialect(':', '"', '/'):	P =       64.933667	skip.
SimpleDialect('<', '', ''):	P =      155.845000	skip.
SimpleDialect('<', '"', ''):	P =       97.764000	skip.
SimpleDialect('<', '"', '/'):	P =       65.100000	skip.
SimpleDialect('<', "'", ''):	P =       73.391500	skip.
SimpleDialect('?', '', ''):	P =      155.595500	skip.
SimpleDialect('?', '"', ''):	P =       98.512500	skip.
SimpleDialect('?', '"', '/'):	P =       96.652000	skip.
SimpleDialect('@', '', ''):	P =      156.344000	skip.
SimpleDialect('@', '"', ''):	P =       98.762000	skip.
SimpleDialect('@', '"', '/'):	P =       64.767333	skip.
SimpleDialect('@', "'", ''):	P =       73.391500	skip.
SimpleDialect('\\', '', ''):	P =      105.282889	skip.
SimpleDialect('\\', '"', ''):	P =       66.063222	skip.
SimpleDialect('\\', '"', '/'):	P =       66.098000	skip.
SimpleDialect('\\', "'", ''):	P =       74.140000	skip.
SimpleDialect('^', '', ''):	P =      154.597500	skip.
SimpleDialect('^', '"', ''):	P =       97.514500	skip.
SimpleDialect('^', '"', '/'):	P =       64.601000	skip.
SimpleDialect('^', "'", ''):	P =       72.643000	skip.
SimpleDialect('_', '', ''):	P =      156.094500	skip.
SimpleDialect('_', '"', ''):	P =       98.013500	skip.
SimpleDialect('_', '"', '/'):	P =       65.100000	skip.
SimpleDialect('_', "'", ''):	P =       73.391500	skip.
SimpleDialect('|', '', ''):	P =   293996.190476	T =        0.946519	Q =   278273.106576
SimpleDialect('|', '', '/'):	P =   146998.094048	skip.
SimpleDialect('|', '', '@'):	P =   146998.094048	skip.
SimpleDialect('|', '', '\\'):	P =   146998.092857	skip.
SimpleDialect('|', '"', ''):	P =   185266.666667	skip.
SimpleDialect('|', '"', '/'):	P =    46024.763214	skip.
SimpleDialect('|', '"', '@'):	P =    92633.332143	skip.
SimpleDialect('|', '"', '\\'):	P =    92633.330952	skip.
SimpleDialect('|', "'", ''):	P =    12535.572981	skip.
SimpleDialect('|', "'", '/'):	P =    12535.572981	skip.
SimpleDialect('|', "'", '@'):	P =    12535.572765	skip.
SimpleDialect('|', "'", '\\'):	P =    12535.572981	skip.

I would say this takes about 30 minutes, and finally it concludes:

SimpleDialect('|', '', '')

I think I understand what's going on: You designed this for small-ish datasets, and so you reprocess the whole file for every dialog to determine what makes most sense.

I would be tempted to think this is because I feed the data as a variable content, following your example, rather than provide the filename directly. However when I tried to read the read_csv method directly with the filename, it also was really, very, very slow. So I think in all situations currently, clevercsv trips on this file, and more generally this type of file.

When I take the initiative to truncate the data arbitrarily, clevercsv works beautifully. But shouldn't the truncating be something the library, as opposed to the user does?

clevercsv.Sniffer().sniff(content[0:1000], verbose=True)

provides in a few seconds:

Running normal form detection ...
Didn't match any normal forms.
Running data consistency measure ...
Considering 4 dialects.
SimpleDialect(',', '', ''):	P =        4.500000	T =        0.000000	Q =        0.000000
SimpleDialect('', '', ''):	P =        0.009000	T =        0.000000	Q =        0.000000
SimpleDialect(' ', '', ''):	P =        1.562500	T =        0.312500	Q =        0.488281
SimpleDialect('|', '', ''):	P =        8.571429	T =        0.952381	Q =        8.163265
SimpleDialect('|', '', '')

@GjjvdBurg If this is not a known problem, may I suggest using some variation of "infinite binary search"?

  • We start with a small size, and we truncate the file to that small size.
  • The detected dialect may be wrong because we are only working on a small subset of the file, so we double the amount of content that we provide the sniffer, and see if it answers the same thing.
  • We repeat a predetermined number of times (for instance 4 times), until we've asserted the sniffer has detected the same dialect for larger portions of the file.

I have implemented this algorithm here:
https://gist.github.com/jlumbroso/c123a30a2380b58989c7b12fe4b4f49e

When I run it on the above mentioned file, it immediately (without futzing) produces the correct answer:

In[3]: probe_sniff(content)
Out[3]: SimpleDialect('|', '', '')

And on the off-chance you would like me to add this algorithm to the codebase, where would it go?

@GjjvdBurg
Copy link
Collaborator

Hi @jlumbroso, thanks for the detailed bug report!

You're describing an issue that I've been thinking about for a while, but never had the time to seriously investigate, so I'm glad that you raised this. Yes, CleverCSV can be quite slow for large files, and chunking the file during detection would be a good solution to this. I agree this is something the library should handle if it can.

I should mention though that the read_csv wrapper and the clevercsv command line application both take a num_chars argument, added specifically for large files (moreover the Sniffer docs suggest using a sample). But if that's not clear from the CleverCSV documentation, that's on me and should be addressed.

I would definitely be open to adding a fix for this into the package, but I'm a bit undecided about the best approach. In any case I'd want to do some testing on the accuracy of this method compared to reading the full file. An alternative that I've had in mind is to keep the Parser object in memory for each dialect and feed these additional characters in the same way that you propose. That way the type detection wouldn't have to re-run on cells it has already seen from a previous batch, and we could be a bit more clever about dropping unlikely dialects. On the other hand, that would probably require more significant refactoring/rewriting of the code, whereas your algorithm could wrap the Detector.detect method quite easily.

All that said, I did some debugging on the specific file you mention and it appears it suffers from the same issue as in #13, but regarding the url regex. With the modified url regex suggested in that issue, I can detect the dialect in about 5 minutes on my machine. While that's still way too long, it's an improvement over the 30 min you found. (Using 10000 characters gives the dialect in about 0.6 seconds).

So I'll first update the package with that regex, and improve the documentation w.r.t. using a smaller sample on big files. Then, I'd like to add some sort of benchmark to the repo that allows us to evaluate the performance of the chunking method (it'll likely be a few weeks before I get around to this, unfortunately). If you'd like to work on adding your algorithm to the package, it would be great if you can prepare a pull request. We can discuss the implementation details there more easily (again, I think wrapping the detect method is probably the easiest at this point). How does that sound?

GjjvdBurg added a commit that referenced this issue May 19, 2020
The url regex also leads to catastrophic backtracking for the
file in #15, which causes a massive slowdown.
GjjvdBurg added a commit that referenced this issue May 20, 2020
* Update URL regex to avoid catastrophic backtracking and increase
  performance. See [issue
  #13](#13) and
  [issue #15](#15).
  Thanks to @kaskawu for the fix and @jlumbroso for re-raising the issue.
* Add ``num_chars`` keyword argument to ``read_as_dicts`` and ``csv2df``
  wrappers.
* Improve documentation w.r.t. handling large files. Thanks to @jlumbroso for
  raising this issue.
@GjjvdBurg
Copy link
Collaborator

Hi @jlumbroso,

This took a bit longer than expected, but I've now added a comparison study to the repo (see here). This experiment evaluates the accuracy and runtime of dialect detection as a function of the number of lines used for the detection. I've included a version of the "infinite binary search" you suggested, dubbed clevercsv_grow in the figures.

I changed the parameters of your algorithm a bit when testing, as the results depend on how many lines are used initially and how many steps are taken. In the comparison clevercsv_grow starts with a buffer of 100 lines, which explains why the accuracy is the same as for CleverCSV when line_count <= 100. The most relevant figures are those for dialect detection accuracy for files with at least 1000 or 10000 lines. These figures show that using a growing buffer unfortunately incurs a performance hit of a few percent. However, as the runtime plots show it does make quite a difference for large files. A potential solution to this might be to try some sort of hybrid method that uses the binary search technique only when the number of lines is above some threshold.

I'm curious to hear your thoughts on this!

@jlumbroso
Copy link
Author

Dear @GjjvdBurg,

First, I apologize for leaving your first message without an answer. I think I wrote a long response and thought I had sent it, but since it's not submitted, I assume it just got lost in my tabs until the browser was closed and then forgotten... So I wanted to express my appreciation to you for following up on this.

Next, congratulations on continuing to being data-driven in every aspect of this module: Those comparison benchmarks are really very comprehensive and interesting!! 🥇 I think this interesting enough it could be a Medium article (let me know if you want to write one together).

Here are some of my thoughts, happy to discuss more!

For autodetection, accuracy >> efficiency. I think that for a robust auto-detection library like clevercsv, it is usually/always better to prioritize accuracy over efficiency (especially when we are talking about a few points of efficiency for a larger gain in accuracy). The targeted audience is one that is self-selected both 1) not to care about the details of parameter detection, and 2) possibly do not have the expertise to understand/correct when the detection is incorrect. The best analogy I can think of is encoding detection libraries: They are great, except when they don't work, because the kind of people that would use that usually are the ones who don't understand the errors that may occur (hence a bunch of UnicodeDecodeError questions on StackOverflow).

Why is clevercsv_grow() not dominant in terms of accuracy? My biggest question when I look at your plots (which may require a bit of investigation in the code—but I don't want to leave you hanging again! so will look at this later), is why is it that the clevercsv_grow() method not dominate the clevercsv() method? Indeed, the core detection is the same in both, it is just the size of the sample that grows, i.e., is doubled in clevercsv_grow():

Importantly, the reason I suggested clevercsv_grow() is because the provided clevercsv detection was using too small a sample, i.e., was using a fixed partial instead of an adaptive one. I am not seeing this phenomena in the plots (not even an improved accuracy of grow over standard), so I am wondering whether I was just unlucky with my file? Or is the clevercsv heuristic here smarter?

Minor nitpick: In the clevercsv_grow algorithm implementation, I would not call the variable current_lines which contains an integer that is the number of lines. The name current_lines makes me think this is going to contain the actual lines (either as a string with \n or as a list of lines). I would suggest a name that makes it clear it is a scalar. I used current_size but perhaps current_line_count may be better.

@GjjvdBurg
Copy link
Collaborator

Hi @jlumbroso,

Thanks for your response, needless to say I'm not that fast with responding myself either, for which I apologize. That doesn't mean however that I haven't thought about this problem in the meantime. I certainly think this would make an interesting (Medium) article, as there's definitely a non-trivial problem here. I agree with your point on accuracy being more important than efficiency. Ideally (as we discussed above), CleverCSV should be able to determine when it is confident in the detected dialect being the right one.

In general, the problem with reading a file partially is that it is possible that evidence for a certain dialect can be hidden many rows down in the file. For example, the imdb example only reveals that it uses the \ as escape character on line 66. Imagine if this would be on row 900,000 of a million-line file, you'd have to read almost the entire file to detect it correctly. This is generally the problem with partially reading a file.

Regarding the accuracy figure you show, there are a number of things going on here. First, I made a mistake in the implementation of clevercsv_grow, in the sense that the while loop didn't run as much as it should. This has since been fixed, so the updated figure now looks as follows:

Detection results for files with at least 1000 lines

What's noticable is that starting from n = 300 the clevercsv_grow method performs worse than the clevercsv method. This is caused by the fact that the number of lines is doubled at every iteration, and when it exceeds the line limit it doesn't loop again (so for n = 300, since we start with 100 lines, the loop does two iterations). A solution would be to use current_line_count = min(max_line_count, current_line_count * 2) instead. I'll run that overnight to see how that improves things, but this response is long overdue already.

Couple minor comments:

  • In the meantime I'm also collecting some more large files for testing, but this is still ongoing.
  • I've recently pushed a commit that speeds up the part of the code that generates the list of potential dialects, which helps a little.
  • Another approach I've been thinking about is to frame the problem as a multi-armed bandit problem, where the "reward" is the increase in the score function that is received when a fixed number of lines is processed. I've written a proof-of-concept here. It generally works alright, but it's too slow currently to put it through the entire test suite. It would be certainly possible to speed it up, but this would take a bit more time.

@GjjvdBurg
Copy link
Collaborator

For completeness, when I run clevercsv_grow with the fix proposed in the previous message (i.e., current_line_count = min(max_line_count, current_line_count * 2)), there is no difference in accuracy up to 1000 lines:

figure_accuracy_1000

but after that it maxes out in performance:

figure_accuracy_10000

And only after 3,500 lines or so there is a runtime advantage:

figure_runtime_10000

Now, of course, all this is dependent on the specific parameters I've chosen for clevercsv_grow for these experiments (starting with 100 lines, requiring 5 times the same dialect), so there is a chance that changing these parameters will affect the results. In fact, once I have the large-file dataset I mention above I might try it with lines_start = 1000, and see how well that does in comparison. (to be continued!)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants