Zstd dictionary compression in a streaming platform #1645
Comments
Hi Milena,
It looks like you're using the zstd-jni binding maintained here? My ability to answer questions related to Java or the zstd binding in that language is going to be very limited, I'm sorry to say. I'll do my best, though!
Can you tell me more about what you mean by "on the network, I simulate input"? The dictionary trainer's goal is to compile the repetitively occurring substrings in your sample inputs and assemble them into the dictionary that's returned (along with a structured header). If you are generating uncorrelated synthetic samples (e.g., by constructing them from a random number generator or by reading them from […]), there won't be much shared content for the trainer to collect.
There are a few other possibilities that could explain what you're experiencing, but that's what fits best in my mind, so I'd like to eliminate that first.
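For intuition on why uncorrelated samples defeat the trainer: a dictionary only pays off when the data being compressed shares substrings with it. zstd-jni isn't needed to see the effect; the JDK's own `java.util.zip.Deflater` also supports a preset dictionary, and a small sketch with made-up CSV-like data (the strings and class name here are illustrative, not from this issue) shows a short record compressing smaller only when the dictionary actually contains its repeated substrings:

```java
import java.util.zip.Deflater;

public class DictDemo {
    // Deflate `input`, optionally priming the compressor with a preset dictionary.
    static int deflatedSize(byte[] input, byte[] dict) {
        Deflater d = new Deflater();
        if (dict != null) {
            d.setDictionary(dict); // must be set before any input is compressed
        }
        d.setInput(input);
        d.finish();
        byte[] out = new byte[1024];
        int n = d.deflate(out); // input is tiny, so one call drains it
        d.end();
        return n;
    }

    public static void main(String[] args) {
        // A short, CSV-like record, as in the .csv-based samples discussed here
        byte[] sample = "2023-01-01,sensor-42,temperature,21.5\n".getBytes();
        // A "dictionary" holding the substrings such records share
        byte[] dict = "2023-01-01,sensor-,temperature,".getBytes();
        System.out.println("without dictionary: " + deflatedSize(sample, null) + " bytes");
        System.out.println("with dictionary:    " + deflatedSize(sample, dict) + " bytes");
    }
}
```

If the "dictionary" were built from random bytes instead, the with-dictionary size would show no improvement, which is the behavior the trainer's output size reflects.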
Hi Felix, thanks a lot for your reply. I now realize I might be in the wrong forum here, as the question seems to be quite Java-specific. The input I am simulating is not random; I am reading lines from the same .csv file. The lines share similarities, hence a good target for dictionary compression. Regards,
OK, based on the suspicion that the issue you're experiencing is unrelated to zstd itself, I'm going to close this issue for the moment. If you find, though, that you really are presenting the same data to the trainer and getting different results, please re-open.
Hi,
I am writing here because I am not aware of where else the question could be handled, and maybe this is a limitation of Zstd. I am working on dictionary compression with Zstd in Pulsar, which is based on Netty and the io.netty.buffer.ByteBuf data type.
The issue is the following: when building the code locally, I can access direct buffers of the input, and the trained dictionary is > 10000 bytes. On the network, where I simulate input and there is no direct buffer backing the input, the dictionary only gets as big as 500 bytes.
Could someone help me understand where the limitations are, or maybe what I am doing wrong? (The code examples are below.)
Local:
```java
public static byte[] getZstdDictionaryAsBytesFromList(List<ByteBuf> messages) {
    int inputSize = messages.stream().mapToInt(b -> b.readableBytes()).sum();
    ZstdDictTrainer trainer = new ZstdDictTrainer(inputSize, DICTIONARY_SIZE);
    byte[] byteDictionary = null;
    // …
```
Network:
```java
private CompletableFuture<byte[]> buildDictionary(List<ByteBuf> messages) {
    final CompletableFuture<byte[]> future = new CompletableFuture<>();
    int inputSize = messages.stream().mapToInt(b -> b.readableBytes()).sum();
    this.dictionaryService.getDictionaryWorkerPool().submit(() -> {
        ZstdDictTrainer trainer = new ZstdDictTrainer(inputSize, 16 * 1024);
        for (ByteBuf msg : this.messageBuffer) {
            byte[] msgArray;
            if (msg.hasArray()) {
                // array() exposes the whole backing array; copy only the readable region
                int start = msg.arrayOffset() + msg.readerIndex();
                msgArray = Arrays.copyOfRange(msg.array(), start, start + msg.readableBytes());
            } else {
                msgArray = new byte[msg.readableBytes()];
                msg.getBytes(msg.readerIndex(), msgArray);
            }
            trainer.addSample(msgArray);
        }
        byte[] byteDictionary = trainer.trainSamples(true);
        future.complete(byteDictionary);
    });
    return future;
}
```
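One thing worth checking in the hasArray() branch: like java.nio.ByteBuffer, Netty's ByteBuf.array() returns the entire backing array, not just the readable region, so a heap buffer that is a view into a larger array can feed the trainer stray bytes around each message. A minimal stdlib sketch of the same pitfall (the class and method names here are mine, for illustration):

```java
import java.nio.ByteBuffer;

public class SliceDemo {
    // Copy only the readable region of a buffer, not its whole backing array.
    static byte[] readableBytes(ByteBuffer buf) {
        byte[] out = new byte[buf.remaining()];
        buf.duplicate().get(out); // duplicate() leaves the original position untouched
        return out;
    }

    public static void main(String[] args) {
        byte[] backing = "XXXXhello".getBytes();
        ByteBuffer slice = ByteBuffer.wrap(backing, 4, 5);  // view over "hello" only
        System.out.println(slice.array().length);           // 9 — the WHOLE backing array
        System.out.println(readableBytes(slice).length);    // 5 — just the readable region
    }
}
```

Passing the raw backing array as a sample would train the dictionary on the `XXXX` prefix too; the equivalent in Netty is to honor arrayOffset() and readerIndex() before copying.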
Really sorry if I am flooding a forum meant for more specific issues.
Thanks a lot,
Milena