
LMDB OOM's for a Large Dataset #4967

Open
benherber opened this issue Apr 29, 2024 · 8 comments
Labels: 🐞 bug (issue is a bug)

Comments

benherber commented Apr 29, 2024

Current Behavior

Currently exploring whether we can get one of the SAIL implementations to scale to the use cases we have (>= single-digit billions of triples in some cases). The LMDB SAIL seems like it may be able to handle this (#3706 (reply in thread)); however, I am getting an OOM error on some (but not all) queries.

More specifically, we are using the SP2B benchmark to test this (https://dbis.informatik.uni-freiburg.de/forschung/projekte/SP2B/), with its bundled generator to populate the store.

The first query we ran into trouble with was Q2:

PREFIX rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
PREFIX swrc:    <http://swrc.ontoware.org/ontology#>
PREFIX foaf:    <http://xmlns.com/foaf/0.1/>
PREFIX bench:   <http://localhost/vocabulary/bench/>
PREFIX dc:      <http://purl.org/dc/elements/1.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
		
SELECT ?inproc ?author ?booktitle ?title ?proc ?ee ?page ?url ?yr ?abstract
WHERE {
	?inproc rdf:type bench:Inproceedings .
	?inproc dc:creator ?author .
	?inproc bench:booktitle ?booktitle .
	?inproc dc:title ?title .
	?inproc dcterms:partOf ?proc .
	?inproc rdfs:seeAlso ?ee .
	?inproc swrc:pages ?page .
	?inproc foaf:homepage ?url .
	?inproc dcterms:issued ?yr
	OPTIONAL {
	     ?inproc bench:abstract ?abstract
	}
}
ORDER BY ?yr

Initially I ran this with the default JVM heap and it OOM'd after a while. I bumped the heap space up to 48G on my 96G machine and it hasn't OOM'd so far.

Expected Behavior

Given the iterator design, I would have expected the query to be slow but not to OOM during evaluation; is that understanding incorrect?

Steps To Reproduce

  1. Generate and load the LMDB store with the 1 billion triple dataset using SP2B: https://dbis.informatik.uni-freiburg.de/forschung/projekte/SP2B/
  2. Run SP2B Q2 over dataset using default JVM settings

Version

4.3.11

Are you interested in contributing a solution yourself?

Perhaps?

Anything else?

The store was able to load 1 billion triples on my machine in ~5.5-6 hrs using write batches of around 1,000 triples, which was really nice!
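The write-batching pattern described above can be sketched as follows. This is a minimal, self-contained illustration: in actual RDF4J loading code, the flush step would wrap `RepositoryConnection` `begin()`/`add()`/`commit()` calls, but here the sink is a plain `Consumer` so the sketch stays runnable on its own. `BatchedLoader` and its methods are hypothetical names, not part of any RDF4J API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of write batching: buffer statements and flush (i.e. commit one
// transaction) every `batchSize` additions, plus a final partial batch on
// close. The generic sink stands in for a per-batch commit.
public class BatchedLoader<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink; // e.g. one commit per batch
    private final List<T> buffer = new ArrayList<>();
    private int flushes = 0;

    public BatchedLoader(int batchSize, Consumer<List<T>> sink) {
        this.batchSize = batchSize;
        this.sink = sink;
    }

    public void add(T statement) {
        buffer.add(statement);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    public void close() { // flush any trailing partial batch
        if (!buffer.isEmpty()) {
            flush();
        }
    }

    private void flush() {
        sink.accept(new ArrayList<>(buffer));
        buffer.clear();
        flushes++;
    }

    public int flushCount() {
        return flushes;
    }
}
```

Larger batches amortize per-transaction overhead, which is why the 100k batch size suggested below can load noticeably faster than batches of ~1,000.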

benherber added the 🐞 bug label on Apr 29, 2024
kenwenzel (Contributor) commented Apr 29, 2024

It could be due to the ORDER BY clause, which needs to materialize all values in a sorted set.

@JervenBolleman I think you have worked on the persistent sets?

@benherber BTW, you should/could use write batches of 100k triples. It would also be better to use 5.0.0-SNAPSHOT, as the 4.x.x version has several bugs in the LmdbStore.
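The ORDER BY materialization mentioned above can be illustrated with a small sketch: a sort-based iterator must drain and buffer its entire input before it can emit the first sorted row, so memory grows with the result size instead of staying constant as it would for purely streaming operators. This mirrors the general shape of sort-based evaluation, not RDF4J's actual internal implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

// Why ORDER BY can exhaust the heap: the constructor consumes the *whole*
// upstream iterator into an in-memory buffer before sorting, so peak memory
// is proportional to the full result set, regardless of how the consumer
// iterates afterwards.
public class SortingIterator<T> implements Iterator<T> {
    private final Iterator<T> sorted;

    public SortingIterator(Iterator<T> input, Comparator<T> cmp) {
        List<T> buffer = new ArrayList<>();
        while (input.hasNext()) {     // materializes everything up front
            buffer.add(input.next());
        }
        buffer.sort(cmp);
        this.sorted = buffer.iterator();
    }

    @Override public boolean hasNext() { return sorted.hasNext(); }
    @Override public T next() { return sorted.next(); }
}
```

With ~1 billion triples, even a moderately selective query can produce enough solutions for this buffer to exceed a default-sized heap, which would match the observed OOM.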

benherber (Author) commented:

> It could be due to the ORDER BY clause, which needs to materialize all values in a sorted set.
>
> @JervenBolleman I think you have worked on the persistent sets?
>
> @benherber BTW, you should/could use write batches of 100k triples. It would also be better to use 5.0.0-SNAPSHOT, as the 4.x.x version has several bugs in the LmdbStore.

Ah, that would make sense. Good to know! Will try it out, thanks. Just to confirm: since 5.0.0 is coming down the pipe rather soon, does that mean the LMDB implementation in 4.x.x will not get future bug fixes?

kenwenzel (Contributor) commented:

The current implementation in 4.x.x is experimental, and I've made some fixes and enhancements in the develop branch.
Those could be backported to 4.x.x, but I don't know the correct procedure for doing this.

@hmottestad Could you give some advice here?

hmottestad (Contributor) commented:

You can create a PR with the fixes you want to backport and we can merge them into main. If the code ends up being identical between main and develop then there shouldn't be any problems. If not then we will need to be a bit more careful when merging main into develop later.
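One common way to carry out the backport described above is to cherry-pick the fix commits from develop onto a branch based on main and open the PR from that branch. The following is a self-contained demo in a throwaway repository; all branch names, file names, and commit messages are illustrative, not project policy.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo
git checkout -qb main
echo base > file.txt
git add file.txt && git commit -qm "initial commit"

git checkout -qb develop
echo fix >> file.txt
git commit -qam "fix LMDB bug"          # the fix lands on develop first
fix_sha=$(git rev-parse HEAD)

git checkout -qb backport/lmdb-fix main # backport branch based on main
git cherry-pick -x "$fix_sha"           # -x records the original commit SHA
git log --oneline -1                    # fix commit now sits on the backport branch
```

Because the cherry-picked commit is content-identical to the one on develop, a later merge of main into develop should resolve cleanly, matching the point made above about identical code avoiding merge problems.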

kenwenzel (Contributor) commented:

@benherber How do you execute the queries? Are all results materialized?

benherber (Author) commented:

> @benherber How do you execute the queries? Are all results materialized?

I just iterate through the result set, counting the bindings:

long count = 0;
try (final TupleQueryResult res = query.evaluate()) {
	for (final BindingSet set : res) {
		for (final var ignore : set) {
			count++;
		}
	}
}

kenwenzel (Contributor) commented:

OK, that looks good. Could you investigate the memory usage with VisualVM while running the query?

benherber (Author) commented:

> OK, that looks good. Could you investigate the memory usage with VisualVM while running the query?

Yeah, I'll probably get to that later this week if I get the chance. Will update once I do.
