Lower load_memory_map memory consumption #15

Open
callegarimattia opened this issue Jan 14, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

@callegarimattia
Contributor

Right now, loading the Bible (an example of a big PDF file, ~700 pages) makes memory usage skyrocket to around 1.3 GiB.

The big memory allocation is, of course, line 108 in memory_map.py:

char_x0, char_y0, char_x1, _ = [float(char_coord) for char_coord in char_bbox.split()]

Here's the memray flamegraph showing the allocation call:
[memray flamegraph image]

Any ideas? 😅
@krishnasism @aptakhin
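For what it's worth, one micro-optimization worth trying (a sketch only, not measured against hotpdf itself): unpacking through `map()` instead of a list comprehension skips materializing a 4-element float list for every character. The `char_bbox` value below is a hypothetical bbox string with the same shape as in the snippet above.

```python
# Hypothetical bbox string in the same "x0 y0 x1 y1" shape as the snippet above
char_bbox = "12.5 700.0 18.75 712.0"

# Current form: builds a temporary 4-element list per character
char_x0, char_y0, char_x1, _ = [float(c) for c in char_bbox.split()]

# Sketch: map() yields the floats lazily, so no intermediate float list is built
char_x0, char_y0, char_x1, _ = map(float, char_bbox.split())
```

Note that `char_bbox.split()` still allocates a list of strings either way, so the per-character saving is small; the dominant cost is likely just the number of characters being parsed.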

@callegarimattia callegarimattia added the enhancement New feature or request label Jan 14, 2024
@krishnasism
Contributor

With the sparse matrix we already saved a lot of memory, and switching to iterparse saved another 30%.

I think 1.3 GB for the Bible is acceptable for now (not great, but acceptable), because we were somewhere around 4-8 GB with a simple 100-page file.

One of the main reasons to write hotpdf was to prevent our workers from going OOM. For example, the 100-page file used to take >4 GB and killed our workers. Given the pages we parse on average, our workers won't go OOM anymore: we went from 4 GB per file down to less than 300 MB.

We can definitely take a look at reducing this even more, but with the multiple data structures and the nature of handling strings, I doubt we'll see a significant improvement.

In any case, let's aim for <1 GB usage for the Bible. I'll take a look at redundant data storage in memory.
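Since iterparse already bought a 30% saving, here's a minimal sketch of the pattern that keeps its memory bounded: clear each element after reading its data so the partially built tree never accumulates. The XML shape and attribute names below are assumptions for illustration, not hotpdf's actual intermediate format.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical per-character XML, roughly resembling pdfminer-style output
xml_data = io.BytesIO(
    b"<pages><text bbox='1 2 3 4'>a</text><text bbox='5 6 7 8'>b</text></pages>"
)

coords = []
for _, elem in ET.iterparse(xml_data, events=("end",)):
    if elem.tag == "text":
        # Read the bbox before clearing, since clear() drops attributes too
        x0, y0, x1, _ = map(float, elem.get("bbox").split())
        coords.append((x0, y0, x1))
    elem.clear()  # drop the element's content so the tree stays small
```

With `iterparse` only the elements currently in flight are alive, versus `ET.parse`, which holds the whole document tree in memory at once.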

@callegarimattia
Contributor Author

Makes sense.
To be fair, I think the biggest bottleneck is Python itself.
It may make sense to implement the sparse matrix and trie in a different language to handle memory better.
What do you think?

@krishnasism
Contributor

krishnasism commented Jan 15, 2024

Yep! Next we can try experimenting with writing the underlying data structures in a language like Rust.

We can take a look at PyO3 + maturin.

@callegarimattia
Contributor Author

Let's do it!

@callegarimattia
Contributor Author

Linking #34

@krishnasism
Contributor

Still a couple more PRs to go before we can close this 😉
I'll take a look at more ways to optimise. Also, feel free to point out any inefficient code you find, and we can rewrite it.
