
Infinite loop when parsing lines longer than 10MB / ITERATION_CHUNK_SIZE #218

Open
sjoubert opened this issue Dec 21, 2023 · 1 comment

@sjoubert

If a line is longer than ITERATION_CHUNK_SIZE (a corner case, but unfortunately large datasets happen in the wild), CSVReader::read_row will loop forever, spawning read_csv worker threads indefinitely.
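
Something like this is enough to reproduce it (a minimal sketch; "huge_line.csv" is a placeholder name, and the oversized field just needs to exceed the 10 MB default of internals::ITERATION_CHUNK_SIZE):

```cpp
#include <fstream>
#include <string>
#include "csv.hpp"

int main() {
    // Write a CSV whose second record contains a single field larger
    // than the default internals::ITERATION_CHUNK_SIZE (10,000,000 bytes).
    {
        std::ofstream out("huge_line.csv");
        out << "a,b\n";
        out << "1," << std::string(11000000, 'x') << "\n";
    }

    csv::CSVReader reader("huge_line.csv");
    csv::CSVRow row;

    // Without a fix, this loop never terminates: read_row() keeps
    // re-spawning the read_csv worker thread without making progress.
    while (reader.read_row(row)) {
        // never reached for the oversized row
    }
    return 0;
}
```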

For now I've patched this using:

```diff
diff --git a/single_include/csv.hpp b/single_include/csv.hpp
index 9cebfc2..e24ddff 100644
--- a/single_include/csv.hpp
+++ b/single_include/csv.hpp
@@ -7517,6 +7517,7 @@ namespace csv {
      *
      */
     CSV_INLINE bool CSVReader::read_row(CSVRow &row) {
+        bool readRequested = false;
         while (true) {
             if (this->records->empty()) {
                 if (this->records->is_waitable())
@@ -7530,7 +7531,15 @@ namespace csv {
                     if (this->read_csv_worker.joinable())
                         this->read_csv_worker.join();
 
+                    // More reading was already requested and not successful
+                    if (readRequested && this->records->empty())
+                        // Other error, e.g. line is too large for ITERATION_CHUNK_SIZE
+                        throw std::runtime_error(
+                            "End of file not reached and no more records parsed, this may be due to a line longer than "
+                            + std::to_string(internals::ITERATION_CHUNK_SIZE) + " bytes.");
+
                     this->read_csv_worker = std::thread(&CSVReader::read_csv, this, internals::ITERATION_CHUNK_SIZE);
+                    readRequested = true;
                 }
             }
             else if (this->records->front().size() != this->n_cols &&
```

This avoids spawning a new thread if the previous one was unsuccessful (here, because the line is too long for the chunk and can't be parsed).
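
With this patch applied, calling code can at least catch the error instead of hanging. A sketch (reusing the placeholder file name from the reproducer above):

```cpp
#include <iostream>
#include <stdexcept>
#include "csv.hpp"

int main() {
    csv::CSVReader reader("huge_line.csv"); // placeholder file from the sketch above
    csv::CSVRow row;
    try {
        while (reader.read_row(row)) {
            // process rows as usual
        }
    } catch (const std::runtime_error& err) {
        // With the patch, an over-long line surfaces here instead of hanging:
        // "End of file not reached and no more records parsed, this may be
        // due to a line longer than 10000000 bytes."
        std::cerr << err.what() << '\n';
        return 1;
    }
    return 0;
}
```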

It looks like this should be handled better/elsewhere, but for now this seems to work for me.

@vincentlaucsb vincentlaucsb self-assigned this Mar 30, 2024
@vincentlaucsb
Owner

I'll see what I can do. I appreciate the example patch you've provided.

I didn't expect people to try to use this library on embedded devices (where ITERATION_CHUNK_SIZE is too large), nor did I expect people to try to parse CSVs containing single rows that take up more space than ITERATION_CHUNK_SIZE.
