Deduping a single df #55

Open
matt-erhart opened this issue Jun 2, 2020 · 1 comment

matt-erhart commented Jun 2, 2020

I'm looking to dedupe a single df (and then more than two). Is there a way to do this that I've missed in the docs?

@andrewRowlinson

I posted a potential solution in #56

I hacked together a solution for my dataset. It joins the dataframe to a copy of itself and excludes any links where the unique identifier matches. In my case it was for football players: I have one column to de-duplicate on ('Name') and an identifier ('player_id').

import pandas as pd
import os
from fuzzymatcher.data_preprocessor_default import DataPreprocessor
from fuzzymatcher.data_getter_sqlite import DataGetter
from fuzzymatcher.scorer_default import Scorer
from fuzzymatcher.matcher import Matcher

# load a dataframe

df_all_shots = pd.read_parquet(os.path.join('..', 'data', 'shots.parquet'))

# split into a left and right copy
df_left = df_all_shots[['player_id', 'Name']].drop_duplicates('player_id').copy()
df_right = df_all_shots[['player_id', 'Name']].drop_duplicates('player_id').copy()

# create a match object
dp = DataPreprocessor()
dg = DataGetter()
s = Scorer()
m = Matcher(dp, dg, s)

# match the data with a copy of itself
m.add_data(df_left, df_right, left_on='Name', right_on='Name',  left_id_col='player_id', right_id_col='player_id')
m.match_all()
df_duplicated = m.link_table

# subset the data where the ids don't match - these are the duplicates
df_duplicated = df_duplicated[df_duplicated['__id_left'] != df_duplicated['__id_right']].copy()

# some duplicates will be matched twice (once left, once right), so keep only one copy.
# I had a numeric id, so I created columns with the min/max of the ids and de-duplicated
# based on these new id columns
df_duplicated['id1'] = df_duplicated[['__id_left', '__id_right']].min(axis=1)
df_duplicated['id2'] = df_duplicated[['__id_left', '__id_right']].max(axis=1)
df_duplicated.drop(['__id_left', '__id_right', '__rank'], axis=1, inplace=True)
df_duplicated.drop_duplicates(inplace=True)

# merge back on the columns you want to keep
df_duplicated = df_duplicated.merge(df_left, how='left', left_on='id1', right_on='player_id')
df_duplicated = df_duplicated.merge(df_left, how='left', left_on='id2', right_on='player_id', suffixes=['_1', '_2'])

Here's my result:
[screenshot of the resulting dataframe]
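
For completeness, a similar self-join can probably be done through the public fuzzymatcher.link_table helper instead of the internal classes. This is only a sketch: it assumes link_table accepts left_id_col/right_id_col arguments like Matcher.add_data does and returns the same __id_left/__id_right columns as the snippet above.

import os
import pandas as pd
import fuzzymatcher

# load the data and build two identical copies, as above
df_all_shots = pd.read_parquet(os.path.join('..', 'data', 'shots.parquet'))
df_left = df_all_shots[['player_id', 'Name']].drop_duplicates('player_id').copy()
df_right = df_left.copy()

# score every candidate pair of names in one call
# (assumes link_table takes left_id_col/right_id_col, mirroring Matcher.add_data)
links = fuzzymatcher.link_table(df_left, df_right,
                                left_on='Name', right_on='Name',
                                left_id_col='player_id', right_id_col='player_id')

# drop self-matches; the remaining rows are the candidate duplicates
links = links[links['__id_left'] != links['__id_right']].copy()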
