Slow loading of the Wikidata .bz2 dump #105

Open
kermitt2 opened this issue Jun 12, 2020 · 2 comments

@kermitt2 (Owner)

The Wikidata dump has become very large, with 1.2 billion statements, which makes the initial loading of the .bz2 dump into LMDB particularly slow.

To speed up this step, we could try:

  • instead of making two passes over the dump, one to get the properties and one to get the statements, do both in a single pass and resolve the properties afterwards against the db

  • instead of reading line by line, try reading in larger buffer blocks (see the sketch below)
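
For the second point, a minimal sketch of what block-buffered reading could look like, assuming Apache Commons Compress for the bzip2 decoding (the actual entity-fishing reader may differ, and the buffer sizes are only illustrative):

    import java.io.*;
    import java.nio.charset.StandardCharsets;
    import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

    public class DumpReader {
        // Read the .bz2 dump through large explicit buffers instead of
        // unbuffered line-by-line reads on the raw stream.
        public static void process(File dump) throws IOException {
            try (InputStream fis = new FileInputStream(dump);
                 // 1 MB buffer between the file and the decompressor
                 InputStream bis = new BufferedInputStream(fis, 1 << 20);
                 // true = handle multi-stream bz2 files (e.g. pbzip2 output)
                 InputStream bz2 = new BZip2CompressorInputStream(bis, true);
                 // 1 MB character buffer on top of the decompressed stream
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(bz2, StandardCharsets.UTF_8), 1 << 20)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // parse one JSON entity per line here
                }
            }
        }
    }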

@kermitt2 (Owner Author)

Complementary info:

  • a full loading of Wikidata + 5 Wikipedia languages from compiled csv files now takes 22h45m from a mechanical hard drive... (ok, I should use an SSD...)

  • the Wikidata statement database in particular grew from 17GB to 62GB in a bit more than 2 years.

The good news is that the increase in Wikidata volume does not affect runtime, only the storage size.

@oterrier commented Oct 21, 2020

Hi Patrice
According to https://www.wikidata.org/wiki/Wikidata:Statistics, one of the main reasons for the explosion of the statement db is that most published scientific articles now have an entry in Wikidata.
They currently represent more than 22M concepts out of 71M.
I understand the interest of being able to build graphs between authors and articles, but it is not very useful for entity-fishing, given that these scholarly articles have no associated Wikipedia pages and have long titles that cannot be recognized by the current EF mention recognizers.
Take the "Attention Is All You Need" paper for example: https://www.wikidata.org/wiki/Q30249683
So one possible optimization of the statement db size would be to filter out some classes ("scholarly article" being one of them) when initially building the lmdb database.
Let's imagine you could define such filtering constraints somewhere (or hard-code them?), for example in the kb.yaml file:

#dataDirectory: /home/lopez/resources/wikidata/

# Exclude scholarly articles from the statement db
excludedConceptStatements:
  - conceptId:            # empty = match any concept
    propertyId: P31       # "instance of"
    value: Q13442814      # "scholarly article"
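
As a minimal sketch (assuming SnakeYAML; the actual entity-fishing configuration loading may differ, and the KBConfig class below is hypothetical, with fields mirroring the YAML keys above), such an entry could be deserialized like this:

    import java.io.FileReader;
    import java.util.List;
    import org.yaml.snakeyaml.Yaml;
    import org.yaml.snakeyaml.introspector.BeanAccess;

    // Hypothetical config bean mirroring the kb.yaml entry above.
    public class KBConfig {
        public static class ExcludedStatement {
            public String conceptId;   // null = match any concept
            public String propertyId;  // e.g. P31 ("instance of")
            public String value;       // e.g. Q13442814 ("scholarly article")
        }

        public List<ExcludedStatement> excludedConceptStatements;

        public static KBConfig load(String path) throws Exception {
            Yaml yaml = new Yaml();
            // read public fields directly instead of requiring getters/setters
            yaml.setBeanAccess(BeanAccess.FIELD);
            try (FileReader reader = new FileReader(path)) {
                return yaml.loadAs(reader, KBConfig.class);
            }
        }
    }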

When filling the statement db, if I detect a concept meeting the constraint ("instance of" "scholarly article", for example), then I skip this concept and do not store its statements:

    // 'exclude' is assumed to be reset to false for each new item
    if ((propertyId != null) && (value != null)) {
        if (excludedConceptStatements != null) {
            for (Statement excludedConceptStatement : excludedConceptStatements) {
                // a null field in a filter acts as a wildcard; compare
                // identifiers with equals() rather than ==
                exclude = (excludedConceptStatement.getConceptId() == null
                            || excludedConceptStatement.getConceptId().equals(itemId)) &&
                          (excludedConceptStatement.getPropertyId() == null
                            || excludedConceptStatement.getPropertyId().equals(propertyId)) &&
                          (excludedConceptStatement.getValue() == null
                            || excludedConceptStatement.getValue().equals(value));
                if (exclude)
                    break;
            }
        }
        Statement statement = new Statement(itemId, propertyId, value);
        //System.out.println("Adding: " + statement.toString());
        if (!statements.contains(statement))
            statements.add(statement);
    }
    ...
    ...
    if (statements.size() > 0 && !exclude) {
        try {
            // only persist the statements of items that matched no filter
            db.put(tx, KBEnvironment.serialize(itemId), KBEnvironment.serialize(statements));
            nbToAdd++;
            nbTotalAdded++;
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

I think this could considerably reduce the size of the statement db.

I can even propose a PR for such a mechanism.

Best regards
Olivier
