Lower load_memory_map memory consumption #15

Open
callegarimattia opened this issue Jan 14, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

@callegarimattia
Contributor

Right now, loading the Bible (an example of a big PDF file, ~700 pages) makes memory usage skyrocket to around 1.3 GiB.

The big memory allocation is, of course, line 108 in memory_map.py:

char_x0, char_y0, char_x1, _ = [float(char_coord) for char_coord in char_bbox.split()]

Here's the memray flamegraph showing the allocation call:
[memray flamegraph image]

Any ideas? 😅
@krishnasism @aptakhin
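For what it's worth, one micro-optimization worth trying (a sketch only, not measured against hotpdf itself): unpacking through `map()` instead of a list comprehension skips materializing a 4-element float list for every character. The `char_bbox` value below is a hypothetical bbox string with the same shape as in the snippet above.

```python
# Hypothetical bbox string in the same "x0 y0 x1 y1" shape as the snippet above
char_bbox = "12.5 700.0 18.75 712.0"

# Current form: builds a temporary 4-element list per character
char_x0, char_y0, char_x1, _ = [float(c) for c in char_bbox.split()]

# Sketch: map() yields the floats lazily, so no intermediate float list is built
char_x0, char_y0, char_x1, _ = map(float, char_bbox.split())
```

Note that `char_bbox.split()` still allocates a list of strings either way, so the per-character saving is small; the dominant cost is likely just the number of characters being parsed.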

@callegarimattia callegarimattia added the enhancement New feature or request label Jan 14, 2024
@krishnasism
Contributor

With the sparse matrix we already saved a lot of memory, and switching to iterparse saved another 30%.

I think 1.3 GB for the Bible is acceptable for now (not great, but acceptable), because we were somewhere around 4-8 GB with a simple 100-page file.

One of the main reasons to write hotpdf was to prevent our workers from going OOM. For example, the 100-page file used to take >4 GB and killed our workers. Given the pages we parse on average, our workers won't go OOM anymore: we went from 4 GB per file down to less than 300 MB.

We can definitely take a look at reducing this even more, but with the multiple data structures and the nature of handling strings, I doubt we'll see a significant improvement.

In any case, let's aim for <1 GB usage for the Bible. I'll take a look at redundant data storage in memory.
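Since iterparse already bought a 30% saving, here's a minimal sketch of the pattern that keeps its memory bounded: clear each element after reading its data so the partially built tree never accumulates. The XML shape and attribute names below are assumptions for illustration, not hotpdf's actual intermediate format.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical per-character XML, roughly resembling pdfminer-style output
xml_data = io.BytesIO(
    b"<pages><text bbox='1 2 3 4'>a</text><text bbox='5 6 7 8'>b</text></pages>"
)

coords = []
for _, elem in ET.iterparse(xml_data, events=("end",)):
    if elem.tag == "text":
        # Read the bbox before clearing, since clear() drops attributes too
        x0, y0, x1, _ = map(float, elem.get("bbox").split())
        coords.append((x0, y0, x1))
    elem.clear()  # drop the element's content so the tree stays small
```

With `iterparse` only the elements currently in flight are alive, versus `ET.parse`, which holds the whole document tree in memory at once.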

@callegarimattia
Contributor Author

Makes sense.
To be fair, I think the biggest bottleneck is Python itself.
It may make sense to implement the sparse matrix and trie in a different language to handle memory better.
What do you think?

@krishnasism
Contributor

krishnasism commented Jan 15, 2024

Yep! Next we can try experimenting with writing the underlying data structures in a language like Rust.

We can take a look at PyO3 + maturin.

@callegarimattia
Contributor Author

Let's do it!

@callegarimattia
Contributor Author

Linking #34

@krishnasism
Contributor

Still a couple more PRs to go before we can close this 😉
I'll take a look at more ways to optimise. Also, feel free to point out any inefficient code you find, and we can rewrite it.
