
speed-up solr indexing and reduce memory overhead #130

Open
artunit opened this issue Jul 29, 2020 · 8 comments

@artunit
Collaborator

artunit commented Jul 29, 2020

The implementation for Solr indexing in reader uses Solr's Data Import Handler (DIH), which has very recently been deprecated and seems destined to become a third-party package. Within DIH, the current implementation uses SortedMapBackedCache for the one-to-many tables in the database schema. This can work quite well, but the targeted tables have grown too large to fit comfortably into this cache architecture. One option would be to replace SortedMapBackedCache with a different, more sophisticated caching scheme, for example MapDB, but it does not seem like good timing to add custom Java to DIH. Instead, I would propose using SQLite's implementation of database views to bring the values from these tables together into one field per document, and leveraging DIH's support for script-based transformers to break the values apart when populating the index. This approach requires 6 views, as follows:

CREATE VIEW view_authors AS SELECT document_id, GROUP_CONCAT(REPLACE(author,',','_')) as authors FROM authors GROUP BY document_id;

CREATE VIEW view_keywords AS SELECT document_id, GROUP_CONCAT(keyword) as keywords FROM wrd GROUP BY document_id;

CREATE VIEW view_entities AS SELECT document_id, GROUP_CONCAT(DISTINCT(entity)) as entities FROM ent GROUP BY document_id;

CREATE VIEW view_types AS SELECT document_id, GROUP_CONCAT(DISTINCT(type)) as types FROM ent GROUP BY document_id;

CREATE VIEW view_sources AS SELECT document_id, GROUP_CONCAT(source) as sources FROM sources GROUP BY document_id;

CREATE VIEW view_urls AS SELECT document_id, GROUP_CONCAT(REPLACE(url,' ','')) as urls FROM urls GROUP BY document_id;

The views include a bit of streamlining. For authors, for example, the comma character is replaced by an underscore in the view to avoid conflicts with the default field separator, and the comma is added back during the Solr processing. The suggested DIHconfigfile.xml is attached (with a ".txt" extension, since GitHub won't accept ".xml"). This keeps the entire indexing implementation within standard Solr, without requiring custom Java, and, from my very limited testing, it appears to be dramatically faster and less memory-intensive.
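The effect of the view_authors view and the underscore masking can be sketched with an in-memory SQLite database. The sample documents and the split_authors helper below are invented for illustration (the real splitting would happen in a DIH script transformer), but the table, column, and view names mirror the schema described above:

```python
import sqlite3

# Hypothetical in-memory database illustrating view_authors; sample data
# is invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (document_id TEXT, author TEXT);
    INSERT INTO authors VALUES
        ('doc1', 'Smith, Jane'),
        ('doc1', 'Doe, John'),
        ('doc2', 'Roe, Richard');
    CREATE VIEW view_authors AS
        SELECT document_id,
               GROUP_CONCAT(REPLACE(author, ',', '_')) AS authors
        FROM authors
        GROUP BY document_id;
""")

# Each document now yields one row whose authors field holds every author,
# comma-separated, with the authors' own commas masked as underscores
# (e.g. 'Smith_ Jane' for doc1).
rows = dict(conn.execute("SELECT document_id, authors FROM view_authors"))

def split_authors(concatenated):
    """Mimic the script transformer: split on the field separator,
    then restore the masked commas."""
    return [a.replace("_", ",") for a in concatenated.split(",")]

for doc_id, authors in rows.items():
    print(doc_id, split_authors(authors))
```

The point of the mask is that GROUP_CONCAT's default separator is also a comma, so the authors' internal commas must be hidden before concatenation and restored after splitting.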

DIHconfigfile.xml.txt

@ericleasemorgan
Owner

ericleasemorgan commented Jul 29, 2020 via email

@ericleasemorgan
Owner

ericleasemorgan commented Jul 30, 2020 via email

@artunit
Collaborator Author

artunit commented Jul 30, 2020

The numbers in Slack are based on a copy of the CORD database with the views added. It would be useful to try this with @ralphlevan's SolrCloud implementation, which I think talks to the "main" CORD database.

@ericleasemorgan
Owner

Ralph and Art, how about if:

  1. I duplicate ./etc/cord.db
  2. I add the views
  3. Y'all try your new indexing technique on the view-added version of ./etc/cord.db

How does that sound?

@artunit
Collaborator Author

artunit commented Jul 30, 2020

I think it's just a matter of swapping in DIHconfig.xml and updating the paths, but I defer to @ralphlevan. I can zap my copy of cord to save disk space.

@ralphlevan
Collaborator

Eric, you know how to make changes to DIHconfig.xml. Update the ZooKeepers, delete the old database, make a new database using the cord configset, and fire off the Data Import Handler. Nothing ventured, nothing gained!

@ericleasemorgan
Owner

ericleasemorgan commented Aug 4, 2020 via email

@artunit
Collaborator Author

artunit commented Aug 4, 2020

Perfectly understandable.
