How to create training file #91

Open
cviebrock opened this issue Jun 19, 2019 · 1 comment

@cviebrock

I've used csvdedupe to try to match a list of ~77,000 unmapped entries against a master list of ~141,000 known things. It worked and gave me a list of ~30,000 matches.

I've since done a bunch of manual work to check not only the ML mapping from csvdedupe but also mappings from some other sources, so I now have a definitive list of matches that I'd like to feed back to csvdedupe before rerunning it to refine and improve my results. I can't figure out how to do that.

The format of the training.json file seems pretty straightforward, but I can't tell what marks something as a positive or negative match ... or even if that is what that file is for. Can anyone help me?

@zambrana98

Noob user here. I had the same question, but then figured out the structure: first there's a "distinct" set with all pairs known to be distinct, and then a "match" set with all pairs you've identified as being the same.
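To make that layout concrete, here is a minimal sketch of a training file with one pair in each set. It assumes each pair is stored as an object with "__class__": "tuple" and a two-element "__value__" list; the record fields (title, year) and the pair() helper are made up for illustration, and real records would carry the same fields as your input data.

```python
import json

def pair(a, b):
    """Wrap two records in the tuple envelope used for each labeled pair."""
    return {"__class__": "tuple", "__value__": [a, b]}

# Hypothetical records, for illustration only.
rec_a = {"title": "Paper one", "year": "2001"}
rec_b = {"title": "Paper 1", "year": "2001"}
rec_c = {"title": "Unrelated report", "year": "1987"}

training = {
    "distinct": [pair(rec_a, rec_c)],  # pairs known to be different things
    "match": [pair(rec_a, rec_b)],     # pairs known to be the same thing
}

print(json.dumps(training, indent=2))
```

Building the structure as plain Python objects and serializing it in one go avoids having to get quotes, brackets, and commas right by hand.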

I have a list of publications by people with similar names and my goal is to identify those authored by one particular person, so I need that one cluster to be good and I don't care much about deduplicating other authors. I have a list of publications authored by the correct person and a list of some that are definitely not by her.

Here's the training file I created: say set A is the list of publications by the person I am interested in, and set B is the list of those not by her. The "distinct" part is the cartesian product of A and B, and the "match" part is the set of all pairs {a[i], a[j]} where a[i] and a[j] belong to A and i != j, together with a similar set for elements in B. The rest is making sure that the file has the correct format (double quotes instead of single quotes, the correct brackets, etc.). I'll add the code below.

The code I used is for a dataset with one variable called "match" that identifies the person I am interested in. The values are stored as strings: match=='1' means it is the correct person, match=='0' means it is someone else, and match=='' means we don't know (these are the cases I want dedupe to help me with). Like I said, noob user here, so I'm not sure if this is the best or even a correct way to go about it. But I hope it helps.

import json

# data_d (record id -> record dict) and training_file (output path)
# are assumed to be defined earlier in the script.
matched_0 = [value for value in data_d.values() if value['match'] == '0']
matched_1 = [value for value in data_d.values() if value['match'] == '1']

def as_pair(x, y):
    # Each labeled pair is written as a tuple object holding two records.
    return {"__class__": "tuple", "__value__": [x, y]}

# "distinct": every record labeled '0' paired with every record labeled '1'.
distinct = [as_pair(x, y) for x in matched_0 for y in matched_1]

# "match": every ordered pair of different records within the same group.
matches = [as_pair(x, y)
           for group in (matched_0, matched_1)
           for x in group for y in group if x != y]

# json.dump handles quoting, brackets and commas, including empty lists.
with open(training_file, "w") as file1:
    json.dump({"distinct": distinct, "match": matches}, file1)
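Since it's easy to end up with a stray comma or bracket in a file like this, a quick check before handing it to dedupe might help. This is only a sketch: summarize_training is a hypothetical helper (not part of dedupe or csvdedupe), and it assumes the "distinct"/"match" layout with "__value__" pairs described above. The demo writes a tiny made-up file to a temporary location.

```python
import json
import os
import tempfile

def summarize_training(path):
    """Load a training file and return the number of pairs per label."""
    with open(path) as f:
        training = json.load(f)  # raises ValueError if the JSON is malformed
    counts = {}
    for label in ("distinct", "match"):
        pairs = training.get(label, [])
        for p in pairs:
            # Every labeled pair should hold exactly two records.
            if len(p["__value__"]) != 2:
                raise ValueError("bad pair under %r" % label)
        counts[label] = len(pairs)
    return counts

# Demo on a tiny made-up file (the records are illustrative only).
sample = {
    "distinct": [{"__class__": "tuple",
                  "__value__": [{"title": "A"}, {"title": "B"}]}],
    "match": [{"__class__": "tuple",
               "__value__": [{"title": "A"}, {"title": "A."}]}],
}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(sample, tmp)

demo_counts = summarize_training(tmp.name)
os.remove(tmp.name)
print(demo_counts)
```

If the counts look off (for example, zero matches), the labels in the source data are the first thing to check.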
