
LMDB OOM's for a Large Dataset #4967

Open
benherber opened this issue Apr 29, 2024 · 8 comments
Labels: 🐞 bug (issue is a bug)

Comments

benherber commented Apr 29, 2024

Current Behavior

Currently exploring whether we can get one of the SAIL implementations to scale to the use cases we have (>= single-digit billions of triples in some cases). The LMDB SAIL seems like it may be able to handle this (#3706 (reply in thread)); however, I am getting an OOM error on some (but not all) queries.

More specifically, we are using the SP2B benchmark to test this (https://dbis.informatik.uni-freiburg.de/forschung/projekte/SP2B/), with its bundled generator to populate the store.

The first query we ran into trouble with was Q2:

PREFIX rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
PREFIX swrc:    <http://swrc.ontoware.org/ontology#>
PREFIX foaf:    <http://xmlns.com/foaf/0.1/>
PREFIX bench:   <http://localhost/vocabulary/bench/>
PREFIX dc:      <http://purl.org/dc/elements/1.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
		
SELECT ?inproc ?author ?booktitle ?title ?proc ?ee ?page ?url ?yr ?abstract
WHERE {
	?inproc rdf:type bench:Inproceedings .
	?inproc dc:creator ?author .
	?inproc bench:booktitle ?booktitle .
	?inproc dc:title ?title .
	?inproc dcterms:partOf ?proc .
	?inproc rdfs:seeAlso ?ee .
	?inproc swrc:pages ?page .
	?inproc foaf:homepage ?url .
	?inproc dcterms:issued ?yr
	OPTIONAL {
	     ?inproc bench:abstract ?abstract
	}
}
ORDER BY ?yr

Initially I ran this with the default JVM heap and it OOM'd after a while. I bumped the heap space up to 48G on my 96G machine and it hasn't OOM'd so far.

Expected Behavior

Given the iterator design, I would have expected the query to be slow but not to OOM during evaluation; is that understanding incorrect?

Steps To Reproduce

  1. Generate and load the LMDB store with the 1 billion triple dataset using SP2B: https://dbis.informatik.uni-freiburg.de/forschung/projekte/SP2B/
  2. Run SP2B Q2 over dataset using default JVM settings

Version

4.3.11

Are you interested in contributing a solution yourself?

Perhaps?

Anything else?

The store was able to load 1 billion triples on my machine in ~5.5-6 hrs using write batches of around 1,000 triples, which was really nice!
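The write-batching pattern described above can be sketched as follows. This is a minimal, self-contained illustration: in actual RDF4J loading code, the flush step would wrap `RepositoryConnection` `begin()`/`add()`/`commit()` calls, but here the sink is a plain `Consumer` so the sketch stays runnable on its own. `BatchedLoader` and its methods are hypothetical names, not part of any RDF4J API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of write batching: buffer statements and flush (i.e. commit one
// transaction) every `batchSize` additions, plus a final partial batch on
// close. The generic sink stands in for a per-batch commit.
public class BatchedLoader<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink; // e.g. one commit per batch
    private final List<T> buffer = new ArrayList<>();
    private int flushes = 0;

    public BatchedLoader(int batchSize, Consumer<List<T>> sink) {
        this.batchSize = batchSize;
        this.sink = sink;
    }

    public void add(T statement) {
        buffer.add(statement);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    public void close() { // flush any trailing partial batch
        if (!buffer.isEmpty()) {
            flush();
        }
    }

    private void flush() {
        sink.accept(new ArrayList<>(buffer));
        buffer.clear();
        flushes++;
    }

    public int flushCount() {
        return flushes;
    }
}
```

Larger batches amortize per-transaction overhead, which is why the 100k batch size suggested below can load noticeably faster than batches of ~1,000.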

benherber added the 🐞 bug label on Apr 29, 2024
kenwenzel (Contributor) commented Apr 29, 2024

It could be due to the ORDER BY clause, which needs to materialize all values in a sorted set.

@JervenBolleman I think you have worked on the persistent sets?

@benherber BTW, you should/could use write batches of 100k triples. It would also be better to use 5.0.0-SNAPSHOT, as the 4.x.x version has several bugs in the LmdbStore.
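The ORDER BY materialization mentioned above can be illustrated with a small sketch: a sort-based iterator must drain and buffer its entire input before it can emit the first sorted row, so memory grows with the result size instead of staying constant as it would for purely streaming operators. This mirrors the general shape of sort-based evaluation, not RDF4J's actual internal implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

// Why ORDER BY can exhaust the heap: the constructor consumes the *whole*
// upstream iterator into an in-memory buffer before sorting, so peak memory
// is proportional to the full result set, regardless of how the consumer
// iterates afterwards.
public class SortingIterator<T> implements Iterator<T> {
    private final Iterator<T> sorted;

    public SortingIterator(Iterator<T> input, Comparator<T> cmp) {
        List<T> buffer = new ArrayList<>();
        while (input.hasNext()) {     // materializes everything up front
            buffer.add(input.next());
        }
        buffer.sort(cmp);
        this.sorted = buffer.iterator();
    }

    @Override public boolean hasNext() { return sorted.hasNext(); }
    @Override public T next() { return sorted.next(); }
}
```

With ~1 billion triples, even a moderately selective query can produce enough solutions for this buffer to exceed a default-sized heap, which would match the observed OOM.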

benherber (Author) commented:

> It could be due to the ORDER BY clause, which needs to materialize all values in a sorted set.
>
> @JervenBolleman I think you have worked on the persistent sets?
>
> @benherber BTW, you should/could use write batches of 100k triples. It would also be better to use 5.0.0-SNAPSHOT, as the 4.x.x version has several bugs in the LmdbStore.

Ah, that would make sense. Good to know! Will try it out, thanks. Just to confirm: since 5.0.0 is coming down the pipe rather soon, does that mean the LMDB implementation in 4.x.x will not get future bug fixes?

kenwenzel (Contributor) commented:

The current implementation in 4.x.x is experimental, and I've made some fixes and enhancements in the develop branch.
Those could be backported to 4.x.x, but I don't know the correct procedure for doing this.

@hmottestad Could you give some advice here?

hmottestad (Contributor) commented:

You can create a PR with the fixes you want to backport and we can merge them into main. If the code ends up being identical between main and develop then there shouldn't be any problems. If not then we will need to be a bit more careful when merging main into develop later.
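One common way to carry out the backport described above is to cherry-pick the fix commits from develop onto a branch based on main and open the PR from that branch. The following is a self-contained demo in a throwaway repository; all branch names, file names, and commit messages are illustrative, not project policy.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo
git checkout -qb main
echo base > file.txt
git add file.txt && git commit -qm "initial commit"

git checkout -qb develop
echo fix >> file.txt
git commit -qam "fix LMDB bug"          # the fix lands on develop first
fix_sha=$(git rev-parse HEAD)

git checkout -qb backport/lmdb-fix main # backport branch based on main
git cherry-pick -x "$fix_sha"           # -x records the original commit SHA
git log --oneline -1                    # fix commit now sits on the backport branch
```

Because the cherry-picked commit is content-identical to the one on develop, a later merge of main into develop should resolve cleanly, matching the point made above about identical code avoiding merge problems.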

kenwenzel (Contributor) commented:

@benherber How do you execute the queries? Are all results materialized?

benherber (Author) commented:

> @benherber How do you execute the queries? Are all results materialized?

I just iterate through the result set, counting the bindings:

long count = 0;
try (final TupleQueryResult res = query.evaluate()) {
	for (final BindingSet set : res) {
		for (final var ignore : set) {
			count++;
		}
	}
}

kenwenzel (Contributor) commented:

OK, that looks good. Could you investigate the memory usage with VisualVM while running the query?

benherber (Author) commented:

> OK, that looks good. Could you investigate the memory usage with VisualVM while running the query?

Yeah, I'll probably get to that later this week if I get the chance. Will update once I do.
