Lower load_memory_map memory consumption #15
With the sparse matrix we already save a lot of memory, and switching to iterparse saved another ~30%. I think 1.3 GB for the Bible is acceptable for now (not great, though), because we were somewhere around 4-8 GB with a simple 100-page file. One of the main reasons to write hotpdf was to prevent our workers from going OOM: the 100-page file took >4 GB before and killed our workers. Given the pages we parse on average, our workers won't go OOM anymore. We went from 4 GB per file down to less than 300 MB. We can definitely take a look at reducing this even further, but with the multiple data structures and the nature of handling strings, I doubt we'll see a significant improvement. In any case, let's aim for <1 GB usage for the Bible. I'll take a look at redundant data storage in memory.
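To illustrate the iterparse idea mentioned above, here is a minimal, hypothetical sketch (not hotpdf's actual code) of streaming through a large XML document with `xml.etree.ElementTree.iterparse` and clearing elements as they are consumed, so the whole tree never sits in memory at once:

```python
# Hypothetical sketch: incremental XML parsing with iterparse.
# The <page>/<word> structure is an assumption for illustration only.
import io
import xml.etree.ElementTree as ET

def count_words_streaming(xml_bytes: bytes) -> int:
    """Count <word> elements without building the full tree in memory."""
    count = 0
    # iterparse yields each element when its closing tag is seen
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == "word":
            count += 1
        elem.clear()  # drop the element's contents to keep memory flat
    return count

doc = b"<page>" + b"<word>hi</word>" * 1000 + b"</page>"
print(count_words_streaming(doc))  # 1000
```

The key point is the `elem.clear()` call: without it, iterparse still accumulates the whole tree, and the memory savings disappear.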
Makes sense.
Yep! Next we can try experimenting with writing the underlying data structures in a language like Rust. We can take a look at PyO3 + maturin.
Let's do it!
Linking #34
Still, a couple more PRs to go before we can close this 😉
Right now, loading the Bible (an example of a big PDF file, ~700 pages) makes memory usage skyrocket to around 1.3 GiB.
The big memory allocation is, of course, at line 108 in memory_map.py.
Here's the memray flamegraph showing the allocation call:
Any ideas? 😅
@krishnasism @aptakhin
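The sparse-matrix approach discussed in this thread can be sketched roughly as follows. This is a hypothetical illustration, not hotpdf's actual memory map: it compares a dense per-page grid against a dict keyed by occupied `(x, y)` coordinates, which is where the savings come from when a page is mostly empty space.

```python
# Hypothetical illustration of the sparse-storage idea: store only
# occupied (x, y) cells in a dict instead of a dense per-page grid.
# The grid dimensions and cell count are made up for demonstration.
import sys

WIDTH, HEIGHT = 1000, 1000

# Dense grid: one slot per cell, mostly empty for a typical page
dense = [[None] * WIDTH for _ in range(HEIGHT)]
dense_bytes = sys.getsizeof(dense) + sum(sys.getsizeof(row) for row in dense)

# Sparse map: only the ~2000 cells that actually hold a character
sparse = {(x, y): "a" for x in range(200) for y in range(10)}
sparse_bytes = sys.getsizeof(sparse)

print(dense_bytes > sparse_bytes)  # the sparse map is far smaller here
```

`sys.getsizeof` undercounts a dict's keys and values, so the numbers are only indicative, but the gap between a million empty slots and a few thousand occupied ones dominates either way.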