Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow serialization with large Array data (> 60 Mbytes) #269

Open
hernanmd opened this issue Feb 10, 2023 · 14 comments
Open

Slow serialization with large Array data (> 60 Mbytes) #269

hernanmd opened this issue Feb 10, 2023 · 14 comments
Labels
pinned Never mark this issue stale

Comments

@hernanmd
Copy link

It seems that Fuel takes a long time serializing an Array of Arrays containing multiple Strings each one. When serialization passes around 60 Mbytes, it starts writing data in chunks of 100Kbytes and very slowly:

To reproduce please download this Fuel serialization: https://filesender.renater.fr/?s=download&token=b3809275-6258-4ee4-8d93-3b8871b7a0bd

Materialize it in a fresh Pharo 11 image:

FLMaterializer 
	materializeFromFileNamed: 'acImdb_49582_nodups_noartfcts_tokenized.fuel'.

and then from the Inspector window evaluate:

FLSerializer 
	serialize: self 
	toFileNamed: 'acImdb_49582_nodups_noartfcts_tokenized_test.fuel'.

We tested in Pharo 11 under macOS.

@theseion
Copy link
Owner

Interesting. I don't see any reason for Fuel to do that. There must be another mechanism at play. I currently don't have to look into this tough, in a week or two maybe. @tinchodias, do you have some time to spare maybe?

@theseion
Copy link
Owner

Can you give me a rough timing estimate of what you mean by "slowly"?

@hernanmd
Copy link
Author

Can you give me a rough timing estimate of what you mean by "slowly"?

It was around 100K/sec

@theseion
Copy link
Owner

This is very interesting. I found the issue to be time spent in FLLargeIdentityDictionary. Replacing one particular use of FLLargeIdentityDictionary with IdentityDictionary cut time down to ~40 seconds. FLLargeIdentityDictionary was specifically designed to be faster for large collections, so it's very curious that this happens.

@theseion
Copy link
Owner

Still investigating but interesting find: in Pharo 11 the string hashes appear to be badly distributed.

@hernanmd
Copy link
Author

Probably @guillep will be interested

@theseion
Copy link
Owner

I'm not sure the hash distribution is bad per se, it's just that the hashes have only uneven numbers when using modulo 4096.

It looks like FLLargeIdentityDictionary has been slower than we wanted it to be since at least Pharo 6.

@guillep
Copy link

guillep commented Feb 13, 2023

Interesting finding :)

@guillep
Copy link

guillep commented Feb 23, 2023

I checked the usage of identity dictionaries.
I wrote a silly (and unfinished) identity dictionary implementation that keeps a hash table with hash tables inside. The outer hash table uses the modulo of the identity hash and benefits from them being evenly distributed. This silly implementation outperformed the other identity dictionaries for the larger charges (yellow in the plot below). It's 4x faster than FLLargeIdentityDictionary and 50x faster than the standard identity dictionary for the largest set of objects.

imagen

This kind of tells me that a better identity dictionary will drastically improve the performance of fuel.
I'll add this to the GSOC proposal here: https://gsoc.pharo.org/serialization

@theseion
Copy link
Owner

Brilliant, thanks!

@jecisc
Copy link

jecisc commented Mar 24, 2023

Oh nice, not just Fuel. The system relies a lot on IdentityDictionaries :)

@stale
Copy link

stale bot commented May 23, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will remain open but will probably not come into focus. If you still think this should receive some attention, leave a comment. Thank you for your contributions.

@stale stale bot added the stale label May 23, 2023
@theseion theseion removed the stale label May 24, 2023
@stale
Copy link

stale bot commented Jul 23, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will remain open but will probably not come into focus. If you still think this should receive some attention, leave a comment. Thank you for your contributions.

@stale stale bot added the stale label Jul 23, 2023
@theseion theseion removed the stale label Jul 24, 2023
@stale
Copy link

stale bot commented Sep 23, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will remain open but will probably not come into focus. If you still think this should receive some attention, leave a comment. Thank you for your contributions.

@stale stale bot added the stale label Sep 23, 2023
@theseion theseion added pinned Never mark this issue stale and removed stale labels Sep 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pinned Never mark this issue stale
Projects
None yet
Development

No branches or pull requests

4 participants