
Zstd dictionary compression in a streaming platform #1645

Closed
ga92yup opened this issue Jun 10, 2019 · 3 comments
ga92yup commented Jun 10, 2019

Hi,

I am writing here because I am not aware of anywhere else this question could be handled, and maybe this is a limitation of Zstd. I am working on dictionary compression with Zstd in Pulsar, which is based on Netty and the io.netty.buffer.ByteBuf datatype.

The issue is the following: when building the code locally, I can access the buffers' backing arrays directly, and the trained dictionary is > 10000 bytes. On the network, where I simulate the input and there is no backing array behind the sample buffers, the dictionary only gets as big as 500 bytes.
Could someone help me understand where the limitations are, or what I am doing wrong?

(The code examples are below.)

Local:

public static byte[] getZstdDictionaryAsBytesFromList(List<ByteBuf> messages) {
    int inputSize = messages.stream().mapToInt(b -> b.readableBytes()).sum();
    ZstdDictTrainer trainer = new ZstdDictTrainer(inputSize, DICTIONARY_SIZE);

    for (ByteBuf buf : messages) {
        trainer.addSample(buf.array());
    }
    return trainer.trainSamples(true);
}

Network:

private CompletableFuture<byte[]> buildDictionary(List<ByteBuf> messages) {
    final CompletableFuture<byte[]> future = new CompletableFuture<>();
    int inputSize = messages.stream().mapToInt(b -> b.readableBytes()).sum();
    this.dictionaryService.getDictionaryWorkerPool().submit(() -> {
        ZstdDictTrainer trainer = new ZstdDictTrainer(inputSize, 16 * 1024);
        for (ByteBuf msg : this.messageBuffer) {
            byte[] msgArray;
            if (msg.hasArray()) {
                msgArray = msg.array();
            } else {
                int howMany = msg.readableBytes();
                msgArray = new byte[howMany];
                int readerIndex = msg.readerIndex();
                msg.getBytes(readerIndex, msgArray);
            }
            trainer.addSample(msgArray);
        }
        byte[] byteDictionary = trainer.trainSamples(true);
        future.complete(byteDictionary);
    });
    return future;
}
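(One thing I am not sure about: in the local version I call buf.array(), which exposes the entire backing array rather than just the readable region, so the two paths may not feed identical bytes to the trainer. A copy that respects the readable window for both heap and direct buffers, using Netty's ByteBufUtil, would be something like:)

import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufUtil;

final class BufCopy {
    // Copies exactly the readable region [readerIndex, writerIndex),
    // regardless of whether the buffer is heap- or direct-backed.
    static byte[] readableBytesOf(ByteBuf buf) {
        return ByteBufUtil.getBytes(buf);
    }
}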

Really sorry if I am flooding a forum meant for more specific issues.
Thanks a lot,
Milena

felixhandte (Contributor) commented

Hi Milena,

It looks like you're using the zstd-jni binding maintained here? My ability to answer questions related to Java or the zstd binding in that language is going to be very limited, I'm sorry to say. I'll do my best, though!

Can you tell me more about what you mean by "on the network, I simulate input"? The dictionary trainer's goal is to compile the repetitively occurring substrings in your sample inputs and assemble them into the dictionary that's returned (along with a structured header). If you are generating uncorrelated synthetic samples (e.g., by constructing them from a random number generator or by reading them from /dev/urandom), there will likely be no runs of bytes shared by multiple inputs, and so the returned dictionary will be empty, with just the header present (a few hundred bytes).
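To make that concrete, here is a rough, untested sketch (I'm not a Java expert, so I'm assuming the zstd-jni ZstdDictTrainer API you appear to be using) contrasting training on correlated samples with training on random ones:

import com.github.luben.zstd.ZstdDictTrainer;
import java.nio.charset.StandardCharsets;
import java.util.Random;

public class TrainerDemo {

    public static void main(String[] args) {
        // Correlated samples: CSV-like lines sharing long common substrings,
        // so the trainer has repeated material to put into the dictionary.
        ZstdDictTrainer correlated = new ZstdDictTrainer(1024 * 1024, 16 * 1024);
        for (int i = 0; i < 2000; i++) {
            String line = "2019-06-10,sensor-" + (i % 7) + ",temperature,status=ok,reading=" + i + "\n";
            correlated.addSample(line.getBytes(StandardCharsets.UTF_8));
        }
        report("correlated", correlated);

        // Uncorrelated samples: random bytes share no substrings, so training
        // yields at best a near-empty dictionary, or fails outright.
        ZstdDictTrainer random = new ZstdDictTrainer(1024 * 1024, 16 * 1024);
        Random rng = new Random(42);
        for (int i = 0; i < 2000; i++) {
            byte[] noise = new byte[256];
            rng.nextBytes(noise);
            random.addSample(noise);
        }
        report("random", random);
    }

    private static void report(String label, ZstdDictTrainer trainer) {
        try {
            System.out.println(label + " dictionary: " + trainer.trainSamples().length + " bytes");
        } catch (RuntimeException e) { // zstd-jni throws ZstdException on failure
            System.out.println(label + " training failed: " + e.getMessage());
        }
    }
}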

There are a few other possibilities that could explain what you're experiencing, but that's what fits best in my mind, so I'd like to eliminate that first.

ga92yup commented Jun 10, 2019

Hi Felix,

thanks a lot for your reply. I now realize I might be in the wrong forum here, as the question seems to be quite Java-specific. The input I am simulating is not random; I am reading lines from the same .csv file. The lines share similarities and are hence a good target for dictionary compression.
When I feed the same lines (training data) to the trainer, I get:
On the local machine: size of training data (bytes): 259475, dictionary size (bytes): 16384
On the network: size of training data (bytes): 62290, dictionary size (bytes): 449
The problem looks more and more like it has nothing to do with Zstd, but rather with data that gets lost in the conversion.
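To verify that, I will fingerprint the samples on both paths right before they go into the trainer; a sketch of what I have in mind (using java.util.zip.CRC32 and Netty's ByteBufUtil):

import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufUtil;
import java.util.List;
import java.util.zip.CRC32;

final class SampleFingerprint {
    // Log total length and a CRC32 over all samples; if the local and
    // network paths print different values, bytes are lost or altered
    // before they ever reach the trainer.
    static String of(List<ByteBuf> messages) {
        CRC32 crc = new CRC32();
        long totalBytes = 0;
        for (ByteBuf msg : messages) {
            byte[] sample = ByteBufUtil.getBytes(msg); // readable bytes only
            crc.update(sample);
            totalBytes += sample.length;
        }
        return totalBytes + " bytes, crc32=" + Long.toHexString(crc.getValue());
    }
}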

Regards,
Milena

felixhandte (Contributor) commented

OK, based on the suspicion that the issue you're experiencing is unrelated to zstd itself, I'm going to close this issue for the moment. If you find, though, that you really are presenting the same data to the trainer and getting different results, please re-open.
