
bzip2: slow compression performance #62

Open
janrueth opened this issue Dec 12, 2017 · 4 comments

@janrueth

I'm currently converting a python tool that uses bzip2 for compression to go.

I stumbled upon missing bzip2 compression in golang and found your package (thanks for implementing it!).

However, bzip2 compression seems to be much slower than in my Python variant, and the speed appears independent of the compression level I apply.
According to pprof, most of the time is spent in computeSA_byte, which in turn spends most of its time in sortLMS2_int and induceSA_int.

My python implementation is at least 3x faster.

Do you have any hints on how I can speed it up?

My application feeds lines of (JSON) data to the bzip2.Writer, i.e., basically a loop over the lines that calls the Write method for each line (the lines are streamed from somewhere else, so I don't know when they end).

In my Python implementation I can manually call a flush method when I expect no more data (basically after a timeout without receiving further lines). The Go implementation, in contrast, seems to call flush automatically after each call to Write (I have only briefly looked at the code, but the functions mentioned above seem to originate from flush). I also found that your implementation produces larger files than my Python equivalent, which I suppose might also come from prematurely flushing data.

Thanks for any help
Jan

@janrueth
Author

After thinking about it further, I concluded that I should buffer my writes manually via a bufio.Writer.

However, I'm unsure which buffer size to choose; can you give a recommendation?
I read that at the highest compression level bzip2 compresses blocks of 900 kB. Should that be my buffer size, or can bzip2 profit from even larger bulk writes?

@dsnet dsnet changed the title bzip2: Low compression performance (speed) bzip2: slow compression performance Dec 12, 2017
@dsnet
Owner

dsnet commented Dec 12, 2017

The Python bzip2 module is a wrapper around the C bzip2 library. My benchmarks show the C library is typically 2x faster, but it being 3x faster on certain workloads is definitely possible.

Unlike most other compression algorithms, which rely on LZ77 as their primary backbone, bzip2 is unique in that it uses the Burrows-Wheeler Transform (BWT) as the primary mechanism for deduplicating strings. Since the BWT is the slowest part of bzip2, attempts to improve encoding performance should tackle the problem at that stage.
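For intuition, the BWT can be sketched naively by sorting all rotations of the input and taking the last column (naiveBWT and the `$` sentinel here are textbook illustrations, not code from this repo; real encoders use a suffix array instead of this O(n²·log n) sort):

```go
package main

import (
	"fmt"
	"sort"
)

// naiveBWT computes the Burrows-Wheeler Transform of s by sorting all
// cyclic rotations and taking the last column. A '$' sentinel (assumed
// smaller than any input byte) marks the end of the original string.
func naiveBWT(s string) string {
	s += "$"
	n := len(s)
	rot := make([]string, n)
	for i := 0; i < n; i++ {
		rot[i] = s[i:] + s[:i]
	}
	sort.Strings(rot)
	out := make([]byte, n)
	for i, r := range rot {
		out[i] = r[n-1] // last column of the sorted rotation matrix
	}
	return string(out)
}

func main() {
	// Equal characters cluster into runs, which is what makes the
	// later move-to-front and RLE stages of bzip2 effective.
	fmt.Println(naiveBWT("banana")) // annb$aa
}
```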

There are many approaches to performing the BWT:

  • Option 1 is to use a suffix array construction algorithm (SACA) and convert its output to a BWT. This is the approach taken here because it is simpler, but it is not as performant, since it needs to double the input and also perform a second pass over the data in order to convert the SA to a BWT.
  • Option 2 is to modify the SACA to directly output BWT. This is more efficient, but requires a deep understanding on SACAs in order to modify its structure.
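A minimal sketch of option 1, with a naive sort standing in for SAIS (bwtViaSA is an illustrative name, not code from this repo): double the input, build the suffix array of the doubled string, keep only the suffixes that start in the first half (these give the sorted rotations), and emit the character cyclically preceding each kept rotation:

```go
package main

import (
	"fmt"
	"sort"
)

// bwtViaSA computes a bzip2-style BWT (over cyclic rotations, no
// sentinel) using the doubled-input trick. sort.Slice stands in for a
// real SACA such as SAIS, which would produce the same order in O(n).
func bwtViaSA(s string) (bwt string, primary int) {
	n := len(s)
	d := s + s
	// Suffix array of the doubled string (naive comparison sort here).
	sa := make([]int, 2*n)
	for i := range sa {
		sa[i] = i
	}
	sort.Slice(sa, func(a, b int) bool { return d[sa[a]:] < d[sa[b]:] })
	// Second pass: suffixes starting in the first half correspond to
	// the sorted rotations of s; the BWT is the preceding character.
	out := make([]byte, 0, n)
	for _, i := range sa {
		if i < n {
			if i == 0 {
				primary = len(out) // row holding the original string
			}
			out = append(out, s[(i+n-1)%n])
		}
	}
	return string(out), primary
}

func main() {
	b, p := bwtViaSA("banana")
	fmt.Println(b, p) // nnbaaa 3
}
```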

In attempting to implement BWT, there are several questions:

  • What SACA algorithm should be used?
  • How do we modify the algorithm to output BWT directly?

What SACA algorithm should be used?

SACAs are a large area of active research. I'm not sure which algorithm the C implementation uses, but other implementations, like the Java one, use DivSufSort.

The algorithm used in this repo is called SAIS, and is a transliteration of the C implementation here.

I chose SAIS because it is simple (~600 lines) and decently fast for its simplicity. It is actually one of the faster algorithms, since it is O(n), while many other SACAs like DivSufSort are O(n·log n). Unfortunately, the bzip2 format limits blocks to 900 kB, so SAIS is unable to show its O(n) advantage over DivSufSort, since n is never large enough.

I did not use DivSufSort like the Java implementation because I was not comfortable with maintaining the complexity of the DivSufSort approach.

How do we modify the algorithm to output BWT directly?

I don't know. It's possible, but I haven't had the bandwidth to really pick apart any specific SACA and work out how to alter it to directly output the BWT required by bzip2.


I don't have the bandwidth these days to keep working on bzip2, so I don't have any plans to improve the bzip2 encoder performance.

I think it would be nice to keep the algorithm as SAIS, but if you can figure out how to have SAIS directly output the BWT, that would immediately increase performance by roughly 2x, since you wouldn't need to duplicate the input string.

@dsnet
Owner

dsnet commented Dec 12, 2017

After thinking about it further, I concluded that I should buffer my writes manually via a bufio.Writer.

I'm not sure how a bufio.Writer benefits your use case since the bzip2.Writer does buffering internally.

Since the BWT algorithm used in this repo is O(n), the block size has little impact on speed. Thus, it probably makes sense to use level 9 compression: it will cost you more memory, but it will produce smaller output files.

@janrueth
Author

I'm not sure how a bufio.Writer benefits your use case since the bzip2.Writer does buffering internally.

Yeah, I also realized that when I tested buffering manually...

Unfortunately, apart from having no time, I'm not an expert in compression, so I'm not up to improving this.

In the end, I'm streaming my data through an external bzip2 process instead.
