Deduping a single df #55

Open
matt-erhart opened this issue Jun 2, 2020 · 1 comment

matt-erhart commented Jun 2, 2020

I'm looking to dedupe a single df (and then more than two). Is there a way to do this that I've missed in the docs?

@andrewRowlinson

I posted a potential solution in #56

I hacked together a solution for my dataset. It joins the dataframe to a copy of itself and excludes any links where the unique identifier matches. In my case it was for football players: I have one column to de-duplicate on ('Name') and an identifier ('player_id').

import pandas as pd
import os
from fuzzymatcher.data_preprocessor_default import DataPreprocessor
from fuzzymatcher.data_getter_sqlite import DataGetter
from fuzzymatcher.scorer_default import Scorer
from fuzzymatcher.matcher import Matcher

# load a dataframe

df_all_shots = pd.read_parquet(os.path.join('..', 'data', 'shots.parquet'))

# split into a left and right copy
df_left = df_all_shots[['player_id', 'Name']].drop_duplicates('player_id').copy()
df_right = df_all_shots[['player_id', 'Name']].drop_duplicates('player_id').copy()

# create a match object
dp = DataPreprocessor()
dg = DataGetter()
s = Scorer()
m = Matcher(dp, dg, s)

# match the data with a copy of itself
m.add_data(df_left, df_right, left_on='Name', right_on='Name',  left_id_col='player_id', right_id_col='player_id')
m.match_all()
df_duplicated = m.link_table

# subset the data where the ids don't match - these are the duplicates
df_duplicated = df_duplicated[df_duplicated['__id_left'] != df_duplicated['__id_right']].copy()

# some duplicates will be matched twice (once left, once right), so keep only one copy.
# I had a numeric id, so I created columns with the min/max of the ids and de-duplicated
# based on these new id columns
df_duplicated['id1'] = df_duplicated[['__id_left', '__id_right']].min(axis=1)
df_duplicated['id2'] = df_duplicated[['__id_left', '__id_right']].max(axis=1)
df_duplicated.drop(['__id_left', '__id_right', '__rank'], axis=1, inplace=True)
df_duplicated.drop_duplicates(inplace=True)

# merge back on the columns you want to keep
df_duplicated = df_duplicated.merge(df_left, how='left', left_on='id1', right_on='player_id')
df_duplicated = df_duplicated.merge(df_left, how='left', left_on='id2', right_on='player_id', suffixes=['_1', '_2'])

Here's my result:
[screenshot of the resulting dataframe]
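
For completeness, a similar self-join can probably be done through the public fuzzymatcher.link_table helper instead of the internal classes. This is only a sketch: it assumes link_table accepts left_id_col/right_id_col arguments like Matcher.add_data does and returns the same __id_left/__id_right columns as the snippet above.

import os
import pandas as pd
import fuzzymatcher

# load the data and build two identical copies, as above
df_all_shots = pd.read_parquet(os.path.join('..', 'data', 'shots.parquet'))
df_left = df_all_shots[['player_id', 'Name']].drop_duplicates('player_id').copy()
df_right = df_left.copy()

# score every candidate pair of names in one call
# (assumes link_table takes left_id_col/right_id_col, mirroring Matcher.add_data)
links = fuzzymatcher.link_table(df_left, df_right,
                                left_on='Name', right_on='Name',
                                left_id_col='player_id', right_id_col='player_id')

# drop self-matches; the remaining rows are the candidate duplicates
links = links[links['__id_left'] != links['__id_right']].copy()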
