
Zstd dictionary compression in a streaming platform #1645

Closed
ga92yup opened this issue Jun 10, 2019 · 3 comments
ga92yup commented Jun 10, 2019

Hi,

I am writing here because I am not aware of anywhere else this question could be handled, and maybe this is a limitation of Zstd. I am working on dictionary compression with Zstd in Pulsar, which is based on Netty and the io.netty.buffer.ByteBuf datatype.

The issue is the following: when building the code locally, I can access the buffers' backing arrays directly, and the trained dictionary is > 10000 bytes. On the network, where I simulate the input and there is no backing array behind the sample buffers, the dictionary only gets as big as 500 bytes.
Could someone help me understand where the limitations are, or what I am doing wrong?

(The code examples are below.)

Local:

public static byte[] getZstdDictionaryAsBytesFromList(List<ByteBuf> messages) {
    int inputSize = messages.stream().mapToInt(b -> b.readableBytes()).sum();
    ZstdDictTrainer trainer = new ZstdDictTrainer(inputSize, DICTIONARY_SIZE);

    for (ByteBuf buf : messages) {
        trainer.addSample(buf.array());
    }
    return trainer.trainSamples(true);
}

Network:

private CompletableFuture<byte[]> buildDictionary(List<ByteBuf> messages) {
    final CompletableFuture<byte[]> future = new CompletableFuture<>();
    int inputSize = messages.stream().mapToInt(b -> b.readableBytes()).sum();
    this.dictionaryService.getDictionaryWorkerPool().submit(() -> {
        ZstdDictTrainer trainer = new ZstdDictTrainer(inputSize, 16 * 1024);
        for (ByteBuf msg : this.messageBuffer) {
            byte[] msgArray;
            if (msg.hasArray()) {
                msgArray = msg.array();
            } else {
                int howMany = msg.readableBytes();
                msgArray = new byte[howMany];
                int readerIndex = msg.readerIndex();
                msg.getBytes(readerIndex, msgArray);
            }
            trainer.addSample(msgArray);
        }
        byte[] byteDictionary = trainer.trainSamples(true);
        future.complete(byteDictionary);
    });
    return future;
}
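(One thing I am not sure about: in the local version I call buf.array(), which exposes the entire backing array rather than just the readable region, so the two paths may not feed identical bytes to the trainer. A copy that respects the readable window for both heap and direct buffers, using Netty's ByteBufUtil, would be something like:)

import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufUtil;

final class BufCopy {
    // Copies exactly the readable region [readerIndex, writerIndex),
    // regardless of whether the buffer is heap- or direct-backed.
    static byte[] readableBytesOf(ByteBuf buf) {
        return ByteBufUtil.getBytes(buf);
    }
}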

Really sorry if I am flooding a forum meant for more specific issues.
Thanks a lot,
Milena

felixhandte (Contributor) commented

Hi Milena,

It looks like you're using the zstd-jni binding maintained here? My ability to answer questions related to Java or the zstd binding in that language is going to be very limited, I'm sorry to say. I'll do my best, though!

Can you tell me more about what you mean by "on the network, I simulate input"? The dictionary trainer's goal is to compile the repetitively occurring substrings in your sample inputs and assemble them into the dictionary that's returned (along with a structured header). If you are generating uncorrelated synthetic samples (e.g., by constructing them from a random number generator or by reading them from /dev/urandom), there will likely be no runs of bytes shared by multiple inputs, and so the returned dictionary will be empty, with just the header present (a few hundred bytes).
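To make that concrete, here is a rough, untested sketch (I'm not a Java expert, so I'm assuming the zstd-jni ZstdDictTrainer API you appear to be using) contrasting training on correlated samples with training on random ones:

import com.github.luben.zstd.ZstdDictTrainer;
import java.nio.charset.StandardCharsets;
import java.util.Random;

public class TrainerDemo {

    public static void main(String[] args) {
        // Correlated samples: CSV-like lines sharing long common substrings,
        // so the trainer has repeated material to put into the dictionary.
        ZstdDictTrainer correlated = new ZstdDictTrainer(1024 * 1024, 16 * 1024);
        for (int i = 0; i < 2000; i++) {
            String line = "2019-06-10,sensor-" + (i % 7) + ",temperature,status=ok,reading=" + i + "\n";
            correlated.addSample(line.getBytes(StandardCharsets.UTF_8));
        }
        report("correlated", correlated);

        // Uncorrelated samples: random bytes share no substrings, so training
        // yields at best a near-empty dictionary, or fails outright.
        ZstdDictTrainer random = new ZstdDictTrainer(1024 * 1024, 16 * 1024);
        Random rng = new Random(42);
        for (int i = 0; i < 2000; i++) {
            byte[] noise = new byte[256];
            rng.nextBytes(noise);
            random.addSample(noise);
        }
        report("random", random);
    }

    private static void report(String label, ZstdDictTrainer trainer) {
        try {
            System.out.println(label + " dictionary: " + trainer.trainSamples().length + " bytes");
        } catch (RuntimeException e) { // zstd-jni throws ZstdException on failure
            System.out.println(label + " training failed: " + e.getMessage());
        }
    }
}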

There are a few other possibilities that could explain what you're experiencing, but that's what fits best in my mind, so I'd like to eliminate that first.

ga92yup commented Jun 10, 2019

Hi Felix,

thanks a lot for your reply. I now realize I might be in the wrong forum here, as the question seems to be quite Java-specific. The input I am simulating is not random; I am reading lines from the same .csv file. The lines share similarities and are hence a good target for dictionary compression.
When I feed the same lines (training data) to the trainer, I get:
On the local machine: size of training data (bytes): 259475, dictionary size (bytes): 16384
On the network: size of training data (bytes): 62290, dictionary size (bytes): 449
The problem looks more and more like it has nothing to do with Zstd, but rather with data that gets lost in the conversion.
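To verify that, I will fingerprint the samples on both paths right before they go into the trainer; a sketch of what I have in mind (using java.util.zip.CRC32 and Netty's ByteBufUtil):

import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufUtil;
import java.util.List;
import java.util.zip.CRC32;

final class SampleFingerprint {
    // Log total length and a CRC32 over all samples; if the local and
    // network paths print different values, bytes are lost or altered
    // before they ever reach the trainer.
    static String of(List<ByteBuf> messages) {
        CRC32 crc = new CRC32();
        long totalBytes = 0;
        for (ByteBuf msg : messages) {
            byte[] sample = ByteBufUtil.getBytes(msg); // readable bytes only
            crc.update(sample);
            totalBytes += sample.length;
        }
        return totalBytes + " bytes, crc32=" + Long.toHexString(crc.getValue());
    }
}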

Regards,
Milena

felixhandte (Contributor) commented

OK, based on the suspicion that the issue you're experiencing is unrelated to zstd itself, I'm going to close this issue for the moment. If you find, though, that you really are presenting the same data to the trainer and getting different results, please re-open.
