Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

csvlink outputs duplicates #98

Open
mzagaja opened this issue Apr 21, 2020 · 1 comment
Open

csvlink outputs duplicates #98

mzagaja opened this issue Apr 21, 2020 · 1 comment

Comments

@mzagaja
Copy link

mzagaja commented Apr 21, 2020

The documentation claims:

csvlink operates in much the same way as csvdedupe, but will flatten both CSVs in to one output file similar to a SQL OUTER JOIN statement. You can use the --inner_join flag to exclude rows that don't match across the two input files, much like an INNER JOIN.

When I run csvlink against two CSV files like so:

csvlink CFF-LWP-CSR-Combined-to-Merge-deduped.csv  space-finder-name-scrape.csv --field_names_1 Name --field_names_2 "Space Finder Name" > CFF-LWP-CSR-Space-Finder-Combined.csv

I can verify no duplicates in the name column of the first input tables:

xsv frequency -s Name CFF-LWP-CSR-Combined-to-Merge-deduped.csv
field,value,count
Name,Artist Lofts at WestingHouse ^ The,1
Name,Whistler House Museum of Art,1
Name,Boston Photo Collaborative,1
Name,"Martha's Vineyard Film Society, Inc.",1
Name,Provincetown Center for Coastal Studies,1
Name,119 Braintree,1
Name,Arts United Fall River,1
Name,"Gardner Museum, Inc.^Isabella Stewart",1
Name,Andover Studio Building,1
Name,300 Summer Street,1

Only a couple in the second file...

xsv frequency -s "Space Finder Name" space-finder-name-scrape.csv
field,value,count
Space Finder Name,Norwood Space Center: Creative Studio,2
Space Finder Name,Boston Fit Body Bootcamp:  850ft - Dancing/Yoga/Barre Studio for rent (available now),2
Space Finder Name,Berkshire South Regional Community Center,2
Space Finder Name,Gateway City Arts,2
Space Finder Name,Hope & Feathers Framing and Printing: Hope & Feathers Gallery,1
Space Finder Name,Mass Audubon - Arcadia Wildlife Sanctuary: Event Space,1
Space Finder Name,The Rivers School Conservatory: A. Ramon Rivera Recital Hall,1
Space Finder Name,Lydia Pinkham Building: Lydia Pinkham Artist Studios,1
Space Finder Name,The Westfield Athenaeum: Elizabeth Stewart Reed Room,1
Space Finder Name,Williams Inn: Main Ballroom,1

I note that it outputs multiple rows for matches even if they weren't listed twice in the above files:

xsv frequency -s Name CFF-LWP-CSR-Space-Finder-Combined.csv
field,value,count
Name,(NULL),501
Name,Indian Orchard Mills,2
Name,Fountain Street Studios,2
Name,Fine Arts Work Center,2
Name,Third Life Studio,2
Name,South Shore Art Center,2
Name,Hopkinton Center for the Arts,2
Name,Sound Museum,2
Name,Historic Beaver Mill,2
Name,Eclipse Mill,2

My understanding is OUTER JOIN is defined as follows:

For those rows that do match, a single row will be produced in the result set (containing columns populated from both tables).

While this is a minor issue that I can solve using Pandas after the fact, it lead to confusion today and makes me think I am either missing something or the documentation is not quite correct.

@cah-stevenhaddix
Copy link

cah-stevenhaddix commented Dec 21, 2020

I believe the issue is that CSV Link finds exact matches first. Attempting to exclude them from the need to match at all. If they're the same don't bother guessing. The issue is that it is still passing them into the match function and merging them in again later

match rows = non-exact (input 1) + non-exact (input 2) + exact
result rows = match rows + exact (duplicate merge)

Current code:

# csvlink.py
        threshold = deduper.threshold(data_1, data_2,
                                      recall_weight=self.recall_weight)
        # `duplicateClusters` will return sets of record IDs that dedupe
        # believes are all referring to the same entity.

        logging.info('clustering...')
        clustered_dupes = deduper.match(data_1, data_2, threshold)

        clustered_dupes.extend(exact_pairs)

Updated to only pass in the nonexact to the match function:

match = non-exact (input 1) + non-exact (input 2)
result = match + exact

# csvlink.py
        threshold = deduper.threshold(nonexact_1, nonexact_2,
                                      recall_weight=self.recall_weight)
        # `duplicateClusters` will return sets of record IDs that dedupe
        # believes are all referring to the same entity.

        logging.info('clustering...')
        clustered_dupes = deduper.match(nonexact_1, nonexact_2, threshold)

        clustered_dupes.extend(exact_pairs)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants