Training not providing enough matches #1077

Open
tigerang22 opened this issue Aug 1, 2022 · 26 comments


@tigerang22

tigerang22 commented Aug 1, 2022

I have been using dedupe 2.0.6. Recently I ran into the KeyError issue with a dataset of 78,598 records. After I upgraded to version 2.0.17, the KeyError issue was resolved. However, while doing regression testing of 2.0.17 against the previous datasets, I noticed a dramatic memory increase, from ~300 MB to 8-10 GB, and roughly twice the runtime of 2.0.6, during the deduper.prepare_training() call on my Windows machine for a dataset with 121,420 records. (I have a Linux app service in Azure whose size I had to double; I haven't measured its actual memory consumption yet, so I can't give those metrics at the moment.) The more significant problem is that although there is supposedly better sampling according to this, my training session consistently ends up with 200-300 distincts and only 3-5 matches.
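
For context, the call in question follows the standard dedupe 2.x flow (a minimal sketch; the field definitions here are illustrative, not my actual schema):

import dedupe

# Illustrative fields only; the real schema has more columns.
fields = [
    {'field': 'name', 'type': 'String'},
    {'field': 'address', 'type': 'String', 'has missing': True},
]

deduper = dedupe.Dedupe(fields)

# data is a dict of {record_id: {field_name: value}}; this is the call
# where the memory spike and slowdown show up.
deduper.prepare_training(data)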

@fgregg, is this problem related solely to sampling, or have other things changed since 2.0.6 that could cause what I am experiencing, i.e. the memory, the performance, and the lack of matches during training? I noticed that the old sampling code, which caused the KeyError, was moved out of core.py into convenience.py, and that new sampling code is now being used.

Thanks in advance. Love the great work of this project!

@f-hafner

f-hafner commented Aug 8, 2022

I am having a similar issue with record linkage: the training session gives mostly distincts and only very few matches.

Problem: In version 2.0.17, the labelling gives a lot of pairs (>30) that are obvious non-links, and only 1-2 pairs that could be true links.

There are about 30k records in both data sets. The features I use are:

  • various parts of name (first, last, middle), each as String type
  • a set of tuples of "(employment year, employer name)" in both data sets, as a Custom type (see the sketch after this list)
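
For reference, a Custom variable takes a user-supplied comparator. A minimal sketch of how the fields above might be declared (the field names and the Jaccard-style scoring are assumptions, not the actual code):

def employment_overlap(set_1, set_2):
    # Jaccard similarity between two sets of (year, employer) tuples;
    # returns a number dedupe's learner can weight.
    if not set_1 or not set_2:
        return 0.0
    return len(set_1 & set_2) / len(set_1 | set_2)

fields = [
    {'field': 'first_name', 'type': 'String'},
    {'field': 'last_name', 'type': 'String'},
    {'field': 'middle_name', 'type': 'String', 'has missing': True},
    {'field': 'employment', 'type': 'Custom', 'comparator': employment_overlap},
]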

I manually inspected some of the records: there are links to be found.

I had used dedupe before on similar data and did not expect this. So I tried out different versions, and at least in version 2.0.11 the labelling works much better (i.e., many more pairs that are likely to be true links) with the same data.

Environment: Ubuntu 22.04. The project uses the following conda environment:

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_openmp_mutex             5.1                       1_gnu  
affinegap                 1.12                     pypi_0    pypi
argon2-cffi               21.3.0             pyhd8ed1ab_0    conda-forge
argon2-cffi-bindings      21.2.0           py38h0a891b7_2    conda-forge
asttokens                 2.0.5              pyhd8ed1ab_0    conda-forge
attrs                     21.4.0             pyhd8ed1ab_0    conda-forge
backcall                  0.2.0              pyh9f0ad1d_0    conda-forge
backports                 1.0                        py_2    conda-forge
backports.functools_lru_cache 1.6.4              pyhd8ed1ab_0    conda-forge
beautifulsoup4            4.11.1             pyha770c72_0    conda-forge
blas                      1.0                         mkl  
bleach                    5.0.1              pyhd8ed1ab_0    conda-forge
bottleneck                1.3.5            py38h7deecbd_0  
brotli                    1.0.9                he6710b0_2  
brotlipy                  0.7.0           py38h27cfd23_1003  
btrees                    4.10.0                   pypi_0    pypi
ca-certificates           2022.4.26            h06a4308_0  
categorical-distance      1.9                      pypi_0    pypi
certifi                   2022.6.15        py38h06a4308_0  
cffi                      1.15.0           py38hd667e15_1  
charset-normalizer        2.0.4              pyhd3eb1b0_0  
click                     8.0.4            py38h06a4308_0  
cryptography              37.0.1           py38h9ce1e76_0  
cycler                    0.11.0             pyhd3eb1b0_0  
datetime-distance         0.1.3                    pypi_0    pypi
dbus                      1.13.18              hb2f20db_0  
debugpy                   1.6.0            py38hfa26641_0    conda-forge
decorator                 5.1.1              pyhd8ed1ab_0    conda-forge
dedupe                    2.0.17                   pypi_0    pypi
dedupe-variable-datetime  0.1.5                    pypi_0    pypi
defusedxml                0.7.1              pyhd8ed1ab_0    conda-forge
doublemetaphone           1.1                      pypi_0    pypi
entrypoints               0.4                pyhd8ed1ab_0    conda-forge
et_xmlfile                1.1.0            py38h06a4308_0  
executing                 0.8.3              pyhd8ed1ab_0    conda-forge
expat                     2.4.4                h295c915_0  
fastcluster               1.2.6                    pypi_0    pypi
flit-core                 3.7.1              pyhd8ed1ab_0    conda-forge
fontconfig                2.13.1               h6c09931_0  
fonttools                 4.25.0             pyhd3eb1b0_0  
freetype                  2.11.0               h70c0345_0  
future                    0.18.2                   pypi_0    pypi
giflib                    5.2.1                h7b6447c_0  
glib                      2.69.1               h4ff587b_1  
gst-plugins-base          1.14.0               h8213a91_2  
gstreamer                 1.14.0               h28cd5cc_2  
haversine                 2.6.0                    pypi_0    pypi
highered                  0.2.1                    pypi_0    pypi
icu                       58.2                 he6710b0_3  
idna                      3.3                pyhd3eb1b0_0  
importlib-metadata        4.11.4           py38h578d9bd_0    conda-forge
importlib_resources       5.8.0              pyhd8ed1ab_0    conda-forge
intel-openmp              2021.4.0          h06a4308_3561  
ipykernel                 6.15.1             pyh210e3f2_0    conda-forge
ipython                   8.4.0            py38h578d9bd_0    conda-forge
ipython_genutils          0.2.0                      py_1    conda-forge
jedi                      0.18.1           py38h578d9bd_1    conda-forge
jinja2                    3.1.2              pyhd8ed1ab_1    conda-forge
joblib                    1.1.0              pyhd3eb1b0_0  
jpeg                      9e                   h7f8727e_0  
jsonschema                4.7.2              pyhd8ed1ab_0    conda-forge
jupyter_client            7.0.6              pyhd8ed1ab_0    conda-forge
jupyter_core              4.10.0           py38h578d9bd_0    conda-forge
jupyterlab_pygments       0.2.2              pyhd8ed1ab_0    conda-forge
kiwisolver                1.4.2            py38h295c915_0  
lcms2                     2.12                 h3be6417_0  
ld_impl_linux-64          2.38                 h1181459_1  
levenshtein-search        1.4.5                    pypi_0    pypi
libffi                    3.3                  he6710b0_2  
libgcc-ng                 11.2.0               h1234567_1  
libgfortran-ng            7.5.0               ha8ba4b0_17  
libgfortran4              7.5.0               ha8ba4b0_17  
libgomp                   11.2.0               h1234567_1  
libpng                    1.6.37               hbc83047_0  
libsodium                 1.0.18               h36c2ea0_1    conda-forge
libstdcxx-ng              11.2.0               h1234567_1  
libtiff                   4.2.0                h2818925_1  
libuuid                   1.0.3                h7f8727e_2  
libwebp                   1.2.2                h55f646e_0  
libwebp-base              1.2.2                h7f8727e_0  
libxcb                    1.15                 h7f8727e_0  
libxml2                   2.9.14               h74e7548_0  
libxslt                   1.1.35               h4e12654_0  
lxml                      4.9.1            py38h1edc446_0  
lz4-c                     1.9.3                h295c915_1  
markupsafe                2.1.1            py38h0a891b7_1    conda-forge
matplotlib                3.5.1            py38h06a4308_1  
matplotlib-base           3.5.1            py38ha18d171_1  
matplotlib-inline         0.1.3              pyhd8ed1ab_0    conda-forge
mistune                   0.8.4           py38h497a2fe_1005    conda-forge
mkl                       2021.4.0           h06a4308_640  
mkl-service               2.4.0            py38h7f8727e_0  
mkl_fft                   1.3.1            py38hd3c417c_0  
mkl_random                1.2.2            py38h51133e4_0  
munkres                   1.1.4                      py_0  
nbclient                  0.6.6              pyhd8ed1ab_0    conda-forge
nbconvert                 6.5.0              pyhd8ed1ab_0    conda-forge
nbconvert-core            6.5.0              pyhd8ed1ab_0    conda-forge
nbconvert-pandoc          6.5.0              pyhd8ed1ab_0    conda-forge
nbformat                  5.4.0              pyhd8ed1ab_0    conda-forge
ncurses                   6.3                  h5eee18b_3  
nest-asyncio              1.5.5              pyhd8ed1ab_0    conda-forge
nltk                      3.7                pyhd3eb1b0_0  
notebook                  6.4.12             pyha770c72_0    conda-forge
numexpr                   2.8.3            py38h807cd23_0  
numpy                     1.22.3           py38he7a7128_0  
numpy-base                1.22.3           py38hf524024_0  
openpyxl                  3.0.10           py38h5eee18b_0  
openssl                   1.1.1q               h7f8727e_0  
packaging                 21.3               pyhd8ed1ab_0    conda-forge
pandas                    1.4.3            py38h6a678d5_0  
pandoc                    2.18                 ha770c72_0    conda-forge
pandocfilters             1.5.0              pyhd8ed1ab_0    conda-forge
parso                     0.8.3              pyhd8ed1ab_0    conda-forge
pcre                      8.45                 h295c915_0  
persistent                4.9.0                    pypi_0    pypi
pexpect                   4.8.0              pyh9f0ad1d_2    conda-forge
pickleshare               0.7.5                   py_1003    conda-forge
pillow                    9.2.0            py38hace64e9_1  
pip                       22.1.2           py38h06a4308_0  
prometheus_client         0.14.1             pyhd8ed1ab_0    conda-forge
prompt-toolkit            3.0.30             pyha770c72_0    conda-forge
psutil                    5.9.1            py38h0a891b7_0    conda-forge
ptyprocess                0.7.0              pyhd3deb0d_0    conda-forge
pure_eval                 0.2.2              pyhd8ed1ab_0    conda-forge
pycparser                 2.21               pyhd8ed1ab_0    conda-forge
pygments                  2.12.0             pyhd8ed1ab_0    conda-forge
pyhacrf-datamade          0.2.6                    pypi_0    pypi
pylbfgs                   0.2.0.14                 pypi_0    pypi
pyopenssl                 22.0.0             pyhd3eb1b0_0  
pyparsing                 3.0.9              pyhd8ed1ab_0    conda-forge
pyqt                      5.9.2            py38h05f1152_4  
pyrsistent                0.18.1           py38h0a891b7_1    conda-forge
pysocks                   1.7.1            py38h06a4308_0  
python                    3.8.13               h12debd9_0  
python-dateutil           2.8.2              pyhd8ed1ab_0    conda-forge
python-fastjsonschema     2.15.3             pyhd8ed1ab_0    conda-forge
python_abi                3.8                      2_cp38    conda-forge
pytz                      2022.1           py38h06a4308_0  
pyzmq                     19.0.2           py38ha71036d_2    conda-forge
qt                        5.9.7                h5867ecd_1  
readline                  8.1.2                h7f8727e_1  
regex                     2022.3.15        py38h7f8727e_0  
requests                  2.28.1           py38h06a4308_0  
rlr                       2.4.6                    pypi_0    pypi
scikit-learn              1.1.2                    pypi_0    pypi
scipy                     1.7.3            py38hc147768_0  
send2trash                1.8.0              pyhd8ed1ab_0    conda-forge
setuptools                61.2.0           py38h06a4308_0  
simplecosine              1.2                      pypi_0    pypi
sip                       4.19.13          py38h295c915_0  
six                       1.16.0             pyh6c4a22f_0    conda-forge
soupsieve                 2.3.1              pyhd8ed1ab_0    conda-forge
sqlite                    3.38.5               hc218d9a_0  
stack_data                0.3.0              pyhd8ed1ab_0    conda-forge
terminado                 0.15.0           py38h578d9bd_0    conda-forge
threadpoolctl             3.1.0                    pypi_0    pypi
tinycss2                  1.1.1              pyhd8ed1ab_0    conda-forge
tk                        8.6.12               h1ccaba5_0  
tornado                   6.1              py38h27cfd23_0  
tqdm                      4.64.0           py38h06a4308_0  
traitlets                 5.3.0              pyhd8ed1ab_0    conda-forge
typing-extensions         4.3.0                    pypi_0    pypi
urllib3                   1.26.9           py38h06a4308_0  
wcwidth                   0.2.5              pyh9f0ad1d_2    conda-forge
webencodings              0.5.1                      py_1    conda-forge
wheel                     0.37.1             pyhd3eb1b0_0  
xz                        5.2.5                h7f8727e_1  
zeromq                    4.3.4                h9c3ff4c_1    conda-forge
zipp                      3.8.0              pyhd8ed1ab_0    conda-forge
zlib                      1.2.12               h7f8727e_2  
zope-index                5.2.0                    pypi_0    pypi
zope-interface            5.4.0                    pypi_0    pypi
zstd                      1.5.2                ha4553b6_0  

Other observations

  • One hypothesis is that blocking works differently. The behavior is similar when using blocked_proportion = 0.66 and blocked_proportion = 0.95 in linker.prepare_training() (see the sketch after this list).
  • I could not find a way to make a reproducible example, but I experimented with the dedupe-examples. I cannot reproduce this behavior in the record linkage and patent examples from there. There are small differences between the environments, but:
    • I can reproduce the behavior described above in my data with the environment settings from the dedupe examples, with the pandas and nltk dependencies added.
    • I cannot reproduce the behavior in the example patent data when using the project settings from above.
  • I also noted that the documentation shows version 2.0.11, while PyPI has version 2.0.17. But that could be a coincidence.
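
For reference, the two settings I tried (a minimal sketch; data_1 and data_2 stand for my two ~30k-record data sets):

# Behavior was similar with both proportions of blocked pairs in the sample.
linker.prepare_training(data_1, data_2, blocked_proportion=0.66)
linker.prepare_training(data_1, data_2, blocked_proportion=0.95)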

@fgregg
Contributor

fgregg commented Aug 11, 2022

there's been a number of changes that could impact the active labeling.

if you could isolate this to a specific release, that would be helpful.

if you could provide some example data where the current code seems to be performing worse, that would also be very helpful

@tigerang22
Author

@fgregg thanks for the response. Based on the previous comments from @f-hafner, I switched from 2.0.17 back to version 2.0.11 with my own fix for the KeyError issue, and the training now seems well balanced between distincts and matches. So the issue with not enough matches must have been introduced in 2.0.12 or later. I hit this issue consistently when testing 2.0.17 against three different datasets. Unfortunately, I can't share the data at the moment because of the PII in it. Interested if anyone has had this issue with any public dataset?

BTW, for those interested, here is my quick fix in core.py in version 2.0.11 for the KeyError encountered with dataset sizes between ~66,000 and ~92,000 records:

[screenshot of the code fix]
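
The screenshot isn't reproduced here, but for context: in that size range the candidate-pair count n * (n - 1) / 2 crosses 2**31, which suggests overflow or float-precision loss in the pair-index arithmetic. A hypothetical sketch of one defensive workaround, not the patch from the screenshot, that sidesteps the triangular-number inversion entirely:

import random

def random_pairs(n_records, sample_size):
    # Sample unique (i, j) index pairs by rejection instead of inverting
    # the triangular-number mapping, which can misbehave once
    # n_records * (n_records - 1) / 2 no longer fits in 32 bits.
    n_pairs = n_records * (n_records - 1) // 2  # Python ints never overflow
    sample_size = min(sample_size, n_pairs)
    seen = set()
    while len(seen) < sample_size:
        i = random.randrange(n_records)
        j = random.randrange(n_records)
        if i < j:
            seen.add((i, j))
    return sorted(seen)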

@fgregg
Contributor

fgregg commented Aug 11, 2022

could you narrow it down to a specific version between 2.0.11 and 2.0.17?

@tigerang22
Author

Let me do some testing and will let you know...

@f-hafner

I should be able to share a sample of my dataset where the issue occurs; I'll let you know.

@tigerang22
Author

@fgregg It looks like version 2.0.14 introduced the issue, while 2.0.13 is still OK despite the KeyError problem.

@f-hafner can you please try your record-linkage use case with 2.0.13 and 2.0.14?

@fgregg
Contributor

fgregg commented Aug 15, 2022

thank you very much!

@f-hafner

@tigerang22, I think I can confirm this. With 2.0.14, I stopped at 100 negatives, 1 positive. With 2.0.13, I stopped at 22 negatives, 18 positives.
I'll prepare the data extract now.

@f-hafner

Here is the repo with data and scripts: https://github.com/f-hafner/dedupe_training_example
I hope it works; let me know if I need to fix something.

@tigerang22
Author

@fgregg any insight on the issue, and when a future release might have the fix? Thanks

@fgregg
Contributor

fgregg commented Sep 21, 2022

i believe i have addressed this on main @f-hafner and @tigerang22. can you confirm that it works for your cases?

@f-hafner thank you for the example code, that was very helpful

@tigerang22
Author

@fgregg great! I will give it a try shortly.

@tigerang22
Author

tigerang22 commented Sep 22, 2022

@fgregg I encountered a KeyError related to the datetime field type, and it turns out that your commit from yesterday no longer has variables/date_time.py. Are we expected to add that as a custom type now? Please advise.

@fgregg
Contributor

fgregg commented Sep 22, 2022

ugh! this is probably related to #1085

@tigerang22
Author

@fgregg I solved the datetime type issue by resetting my virtual environment. I just completed a test of commit aa2b04e against my previous dataset. Unfortunately, the same problem still exists for me: 1 match and close to 100 distinct pairs before I stopped the testing.

@f-hafner have you had any luck with your scenarios?

@fgregg
Contributor

fgregg commented Sep 23, 2022

@tigerang22, that’s unfortunate! i used @f-hafner’s example code to debug. can you provide a reproducible example?

@f-hafner

I haven't tried it out yet, but I will let you know when I have.

@f-hafner

Hi @fgregg, @tigerang22, I tried using the GitHub version of dedupe (also on my sample data). It still gave almost only negatives. But I am not sure I got the right version.

I installed dedupe as follows:

conda install git pip 
python -m pip install "dedupe @ git+https://github.com/dedupeio/dedupe@522e7b2147d61fa36d6dee6288df57aee95c4bcc"

But then conda list | grep dedupe shows dedupe 2.0.18 pypi_0 pypi.

Details here: https://github.com/f-hafner/dedupe_training_example

What is the correct way to install the GitHub version?

@fgregg
Contributor

fgregg commented Sep 29, 2022

@f-hafner looks like you installed it okay. it's a bit simpler to do it like this:

pip install https://github.com/dedupeio/dedupe/archive/522e7b2147d61fa36d6dee6288df57aee95c4bcc.zip

that's very strange that the performance didn't get better for you. using your test repo, it seemed to be working very well for me. hmmm....

@johnmarkpittman

johnmarkpittman commented Jan 9, 2023

@fgregg is there any chance that this issue is related to the Dedupe and RecordLink DisagreementLearners when you don't already have a training file?

In these situations, it seems like a randomly chosen record is used to kick off the learning process and identify pairs of records for you to label. Is it possible that this randomly chosen record just isn't very helpful for learning the initial blocking rules and setting up the active learning session?

Also, since some initial blocking is occurring, I wonder if random_forest_candidates intermittently failing to find predicates (as I recently mentioned in issue #940) could be playing into this as well.
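
If the random starting record is the culprit, one cheap way to probe it (assuming dedupe draws from Python's random module and numpy's global RNG, which I haven't verified) would be to fix the seeds and rerun prepare_training() a few times:

import random
import numpy

# Fix both global RNGs before constructing the deduper/linker; if the
# match/distinct balance swings wildly from seed to seed, the choice of
# starting record is likely part of the problem.
random.seed(42)
numpy.random.seed(42)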

fgregg added a commit that referenced this issue Feb 17, 2023
@fgregg
Contributor

fgregg commented Feb 17, 2023

i think i have a fix for this in 2.0.23

@tigerang22
Author

@fgregg Great! I will give 2.0.23 a shot.

@tigerang22
Author

@fgregg I have just tested 2.0.23 and unfortunately the same issue exists. Are there any fine-tuning options that might affect this, such as calling deduper.prepare_training(temp_d) with an explicit sample_size and blocked_proportion instead of the default values? I also noticed that a previous version such as 2.0.13 would take 4-5 minutes, but with 2.0.23 it now takes more than 10 minutes to finish the prepare_training call.
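
For reference, the knobs in question (a minimal sketch; temp_d is the data dict from my pipeline, and the values here are illustrative, not recommendations):

deduper.prepare_training(
    temp_d,
    sample_size=15000,       # number of candidate record pairs to sample
    blocked_proportion=0.9,  # share of the sample drawn from blocked pairs
)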

@fgregg
Contributor

fgregg commented Mar 21, 2023

@tigerang22 can you check whether the example that @f-hafner posted also fails for you? (it works for me now.)

@boccheciampe

boccheciampe commented Jul 25, 2023

Hello,

I'm actually struggling with the same problem on version 2.0.23. I tried to go a little further and stopped at 10 positives and 2000 negatives.

My script is based on the pgsql_big_dedupe_example (hope it's up to date :), adapted to use Django 3.2's ORM, as I plan to build an identity manager with Dedupe.

My variables are very similar to @f-hafner's: I use distinct birth, last, first, and middle names (all 'String'), a few others (birth date, place, country, ...), and an interaction to boost the scores:

dedupe_fields = [
    {'field': 'u_birth_name', 'variable name': 'birth_name', 'type': 'String'},
    {'field': 'u_last_name', 'variable name': 'last_name', 'type': 'String', 'has missing': True},
    {'field': 'u_first_name', 'variable name': 'first_name', 'type': 'String'},
    {'field': 'u_second_name', 'variable name': 'second_name', 'type': 'String', 'has missing': True},
    {'field': 'text_birth_date', 'variable name': 'birth_date', 'type': 'ShortString', 'has missing': True},
    {'field': 'u_birth_place', 'variable name': 'birth_place', 'type': 'ShortString', 'has missing': True},
    {'field': 'birth_department_code', 'variable name': 'birth_department_code', 'type': 'Exact', 'has missing': True},
    {'field': 'birth_country_code', 'variable name': 'birth_country_code', 'type': 'Exact', 'has missing': True},
    {'field': 'nationality_code', 'variable name': 'nationality_code', 'type': 'Exact', 'has missing': True},
    {'type': 'Interaction',
     'interaction variables': ['birth_name', 'first_name', 'birth_date']},
]

It looks like only one field eventually gets used as a predicate (in my case, the logger shows it's the birth date, defined either as DateTime or String), and of course that's not enough to efficiently dedupe my 315k entries. Some entities end up with members whose only common data is the birth date.

Going back to 2.0.13, with the same variable definitions, I stopped at 47/10 positives, 1000/10 negatives, with the following predicates:
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(TfidfNGramCanopyPredicate: (0.6, u_last_name), LevenshteinCanopyPredicate: (1, u_birth_name))
INFO:dedupe.training:(SimplePredicate: (commonTwoTokens, u_birth_name), SimplePredicate: (hundredIntegersOddPredicate, text_birth_date))
INFO:dedupe.training:(SimplePredicate: (firstTwoTokensPredicate, u_first_name), SimplePredicate: (tokenFieldPredicate, u_birth_name))

With this, I end up with ~3000 entities (out of ~28000 I'm supposed to find).

@fgregg since it looks like it works for you, could it come from something I obviously missed in the variable definitions? Is the training engine more efficient with birth/last/first/... names split into multiple variables, or kept in a single string? (The question may apply to other variables too; see the sketch below.)

Or, since I have quite a lot of entries, does the training simply need a lot more samples, both with 2.0.13 and 2.0.23?
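
To make the first question concrete, the two set-ups being compared would look like this (a sketch; full_name is a hypothetical column pre-concatenated from the name parts, not one of my actual columns):

# Option A: one variable per name component (what I do now)
fields_split = [
    {'field': 'u_birth_name', 'type': 'String'},
    {'field': 'u_first_name', 'type': 'String'},
    {'field': 'u_last_name', 'type': 'String', 'has missing': True},
]

# Option B: a single concatenated name string
fields_single = [
    {'field': 'full_name', 'type': 'String'},
]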

Thanks for your answers.
