Reading .bz2 files fails to decompress or segfaults #116

Open
unode opened this issue Jul 31, 2019 · 14 comments

@unode
Member

unode commented Jul 31, 2019

This was tested using the 1.0.0 conda build (is this one just the wrapped static build?) as well as with several different 'static' and containerized versions from 0.9 to 1.0.1.

In all cases, data loading failed at the same step, but depending on the version and how it was compiled, one of two errors was seen:

...
[Wed 31-07-2019 11:24:23] Line 13: Created & opened temporary file /tmp/preprocessed.singles...fq12609-4.gz
/.singularity.d/runscript: line 3: 12609 Segmentation fault      (core dumped) ngless "$@"

and

...
[Wed 31-07-2019 11:23:51] Line 13: Created & opened temporary file /tmp/preprocessed.singles...fq8945-4.gz
Exiting after internal error. If you can reproduce this issue, please run your script with the --trace flag and report a bug at http://github.com/ngless-toolkit/ngless/issues
user error (BZ2_bzDecompress: -1)

We didn't try the Docker containers, but those also use the static builds, so they should be equally affected.

I also ran the same binary on the bz2 files in our test suite and they all worked fine, which hints at a buffer- or file-size-related issue.

I'm currently creating a bz2 file big enough to trigger the error locally. If it's not too big, I'll add it to the test suite.

Credits to @jakob-wirbel for finding this bug.

@unode
Member Author

unode commented Jul 31, 2019

Some interesting findings.

If pbzip2 (the parallel version of bzip2) is used to create the files, ngless can only consume them up to a certain size. In the test case I set up locally, a FASTQ file with 9724 lines (266413 bytes compressed, 900170 uncompressed) causes ngless to fail with BZ2_bzDecompress: -1. Regular Unix bzip2 decompresses the same file without problems.

On the other hand, with files created by regular bzip2, I tried as many as 90000 lines and ngless still consumes them without error.

From the pbzip2 manual page:

Files that are compressed with pbzip2 are broken up into pieces and each individual piece is compressed.
This is how pbzip2 runs faster on multiple CPUs since the pieces can be compressed simultaneously.
The final .bz2 file may be slightly larger than if it was compressed with the regular bzip2 program 
due to this file splitting (usually less than 0.2% larger). Files that are compressed with pbzip2 will 
also gain considerable speedup when decompressed using pbzip2.

Files that were compressed using bzip2 will not see speedup since bzip2 packages the data into a 
single chunk that cannot be split between processors. 

This might be what is causing the problem.
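
To illustrate the hypothesis, here is a minimal Python sketch (Python is used only for illustration; it is not the NGLess or bzlib-conduit code). It assumes that pbzip2 output behaves like several independently compressed bzip2 streams concatenated together, and the sample reads are made up. It shows how a decoder that only expects one stream stops at the first stream boundary, while a multi-stream-aware call recovers everything.

```python
import bz2

# Hypothetical sample data: two pieces compressed independently and then
# concatenated, as an assumed stand-in for what pbzip2 writes.
piece_a = b"@read1\nACGT\n+\nIIII\n" * 1000
piece_b = b"@read2\nTTGA\n+\nIIII\n" * 1000
multi_stream = bz2.compress(piece_a) + bz2.compress(piece_b)

# A BZ2Decompressor object handles exactly one stream: it stops at the end
# of the first stream and leaves the remaining bytes in unused_data.
single = bz2.BZ2Decompressor()
first_only = single.decompress(multi_stream)
stream_ended = single.eof
leftover = single.unused_data

# bz2.decompress, by contrast, accepts concatenated streams.
everything = bz2.decompress(multi_stream)

print(len(first_only), len(everything), len(leftover) > 0)
```

A decoder that treats the leftover bytes as a continuation of the first stream, rather than as the start of a new one, would be expected to fail on such input. Whether that is exactly what happens inside NGLess is not established at this point in the thread.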

@unode
Member Author

unode commented Jul 31, 2019

Also:

% file *
DRR171944_1.fastq.bz2:       bzip2 compressed data, block size = 900k
DRR171944_2.fastq.bz2:       bzip2 compressed data, block size = 900k
DRR171944.singles.fastq.bz2: bzip2 compressed data, block size = 900k

% pbzip2 --help
...
 -1 .. -9        set BWT block size to 100k .. 900k (default 900k)
 -b#             Block size in 100k steps (default 9 = 900k)
...

That block size matches the 900170-byte uncompressed size above.
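
The arithmetic behind that observation can be sketched as follows. This assumes that pbzip2's "900k" means 9 × 100,000 bytes and that each block of input becomes its own compressed piece; both are assumptions, not something confirmed in this thread.

```python
import math

# Assumption: pbzip2 default "-b9" means 9 * 100,000 bytes per input block.
BLOCK_BYTES = 9 * 100_000

# Uncompressed size of the failing test file reported above.
uncompressed = 900_170

pieces = math.ceil(uncompressed / BLOCK_BYTES)  # number of input blocks
overflow = uncompressed - BLOCK_BYTES           # bytes spilling into block 2

print(pieces, overflow)  # 2 170
```

Under those assumptions, the failing file is the first size at which the input no longer fits in a single block, which would be consistent with smaller pbzip2 files working.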

unode added a commit that referenced this issue Jul 31, 2019
See GitHub issue #116 for additional info
@unode
Member Author

unode commented Jul 31, 2019

Commit c214d22 adds a compressed bz2 file that shows these symptoms.
One of the tests was also modified to use it and currently fails.
The test fails locally with both the latest 1.0.1 static build and a build compiled from master.

@luispedro
Member

Thanks! Fortunately, this fails on Travis too, so we have a test.

@luispedro
Member

Some other tests are now wrong because they all shared the same expected.fq file, but arguably they should not have been set up like this in the first place.

@unode
Member Author

unode commented Jul 31, 2019

Oops, I'll fix that.

@luispedro
Member

Actually, I was fixing it on my side, so give me a few minutes.

@unode
Member Author

unode commented Jul 31, 2019

ok

luispedro added a commit that referenced this issue Jul 31, 2019
Split the testing of mocat directory parsing from the bzip2 issue.

See discussion at #116
@luispedro
Member

The other tests are fixed by restoring them to how they were before and moving this issue to a new test.

For efficiency, it's good to have tests that cover a bunch of issues simultaneously, but this was the simplest way.

luispedro added a commit to luispedro/bzlib-conduit that referenced this issue Jul 31, 2019
Originally reported as a bug in NGLess (see
ngless-toolkit/ngless#116). After the original
report, @unode provided the following analysis:

> If using pbzip2 the parallel version of bzip2 to create the files,
> ngless is able to consume the files up to a certain size. In the
> test-case I setup locally a Fastq file with 9724 lines, (266413 bytes
> compressed, 900170 uncompressed) causes ngless to fail with
> BZ2_bzDecompress: -1. Regular unix bzip2 is able to decompress the file
> without problems.
>
> On the other hand if using regular bzip2, tried as many as 90000 lines
> and ngless is still able to consume the files without error.
@luispedro
Member

This is an upstream issue; I reported it there.

@luispedro
Member

This has been merged upstream (snoyberg/bzlib-conduit#7). Once we have a new release and that makes it into the stackage LTS, we can just bump the version that NGLess uses and close here.
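
For readers wondering what handling concatenated bzip2 streams involves in a streaming decoder in general, here is a hedged Python sketch. It is not the upstream bzlib-conduit change, whose details are not given in this thread; the function name `iter_decompress` is hypothetical. The idea is to start a fresh decompressor whenever the current stream ends and feed it the bytes that were left over.

```python
import bz2
from typing import Iterable, Iterator


def iter_decompress(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Yield decompressed data from chunks of one or more bz2 streams."""
    decomp = bz2.BZ2Decompressor()
    for chunk in chunks:
        data = chunk
        while data:
            out = decomp.decompress(data)
            if out:
                yield out
            if decomp.eof:
                # The current stream ended inside this chunk: any remaining
                # bytes belong to the next stream, so restart with them.
                data = decomp.unused_data
                decomp = bz2.BZ2Decompressor()
            else:
                # All input was consumed; wait for the next chunk.
                data = b""
```

This sketch does not report an error for a truncated final stream; a real reader would likely want to check for that.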

@uloeber

uloeber commented Oct 14, 2022

Hi,
I see this is an old thread, but this happens to me using ngless 1.5 too. What is the fix?
Cheers,
Ulrike

@luispedro
Member

luispedro commented Oct 23, 2022

Can you perhaps share one such file?

@luispedro luispedro reopened this Oct 23, 2022
@uloeber

uloeber commented Oct 26, 2022

yes, I will send you a link
