"Big" RecordLink example #23

Open
dmkoch opened this issue Mar 26, 2015 · 4 comments
Comments

@dmkoch
Contributor

dmkoch commented Mar 26, 2015

Are there any examples of using RecordLink on larger data sets that do not fit into memory -- something similar to the MySQL or PostgreSQL big deduplication examples?

@fgregg
Contributor

fgregg commented Apr 28, 2015

@lminer

lminer commented Jul 2, 2015

Would that example work if the canonical database is very large? The example reads the entire canonical file into memory when indexing. Would a more memory-efficient solution involve creating an inverted index via server-side queries, as in the MySQL example?
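
For illustration, the inverted-index idea raised above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not dedupe's API and not the code from the MySQL example: it uses SQLite from the Python standard library in place of MySQL, and the table names (`canonical`, `messy`), the columns (`id`, `name`), and the three-character prefix blocking rule are all hypothetical, chosen only to keep the example small.

```python
import sqlite3


def build_blocking_tables(conn):
    """Create one inverted-index table per data set, entirely inside the database."""
    for source in ("canonical", "messy"):
        # Assumed blocking rule: first three characters of the lowercased name.
        conn.execute(
            f"CREATE TABLE {source}_blocks AS "
            f"SELECT lower(substr(trim(name), 1, 3)) AS block_key, id AS record_id "
            f"FROM {source} WHERE trim(name) <> ''"
        )
        # Index the block key so the join below does not scan the whole table.
        conn.execute(
            f"CREATE INDEX {source}_blocks_key ON {source}_blocks (block_key)"
        )


def candidate_pairs(conn):
    """Lazily yield (messy_id, canonical_id) pairs that share a block key."""
    cursor = conn.execute(
        "SELECT DISTINCT m.record_id, c.record_id "
        "FROM messy_blocks AS m "
        "JOIN canonical_blocks AS c ON m.block_key = c.block_key "
        "ORDER BY m.record_id, c.record_id"
    )
    yield from cursor


def demo():
    """Build two tiny hypothetical tables and return their candidate pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE canonical (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE messy (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO canonical VALUES (?, ?)",
        [(1, "Acme Corp"), (2, "Globex"), (3, "Initech")],
    )
    conn.executemany(
        "INSERT INTO messy VALUES (?, ?)",
        [(10, "ACME Corporation"), (11, "globex inc"), (12, "Umbrella")],
    )
    build_blocking_tables(conn)
    return list(candidate_pairs(conn))
```

Because `candidate_pairs` is a generator over a database cursor, the pairs can be pulled and scored in batches, so neither data set has to be loaded into Python memory at once.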

@rebecca-burwei

Hi! I started looking for data sets to produce a big record link example. Does dedupe's RecordLink require training data with examples of matches? (From the small record link example, it appears not, since there isn't a column in the data sets for the target variable. I see that the program may ask the user to verify matches while running.)

I've been looking into matching preprints with records of journal publications. That would have been really useful to me early in grad school, when it was hard to tell whether a preprint on arxiv.org (the most popular math repository) had been published or not (perhaps under a different title, with different collaborators, etc.).

@fgregg
Contributor

fgregg commented Sep 7, 2017

Here's a gist of what this could look like, if someone wants to take it and make it into a full example: https://gist.github.com/fgregg/e45280fa32a9eee8daab65a95f385656

4 participants