tekboart/Bigdata-duplicate_record_recognition

Recognition of duplicate records using BigData Analysis through Apache Spark

Python, PySpark, Pandas, Sklearn, Matplotlib, seaborn

The BigData ML algorithms used:

  • Classification
    1. Logistic Regression
    2. Linear SVM
    3. Random Forest

Requirements

Python, Pandas

Project Dir Structure

.
├── data
├── images
│   └── logos
├── logs
├── outputs
├── reports
└── utils

7 directories

Data

The data was obtained from here and was first introduced and used by Schmidtmann et al. [1].

Dataset info:

  • features
    1. id_1: internal identifier of the first record
    2. id_2: internal identifier of the second record
    3. cmp_fname_c1: agreement of first name, first component
    4. cmp_fname_c2: agreement of first name, second component
    5. cmp_lname_c1: agreement of family name, first component
    6. cmp_lname_c2: agreement of family name, second component
    7. cmp_sex: agreement of sex
    8. cmp_bd: agreement of date of birth, day component
    9. cmp_bm: agreement of date of birth, month component
    10. cmp_by: agreement of date of birth, year component
    11. cmp_plz: agreement of postal code
    12. is_match: matching status (TRUE for matches, FALSE for non-matches)
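
As a rough illustration of how the columns listed above might be read into typed values, here is a minimal sketch using only the Python standard library. It is not the loading code of this repository. It assumes the data is a CSV file with a header row, that missing agreement values are written as "?", and that agreement values are numeric; the helper names (parse_row, load_records) and the sample rows are made up for the example.

```python
import csv
import io

# Column names as listed in the dataset description above.
COLUMNS = [
    "id_1", "id_2",
    "cmp_fname_c1", "cmp_fname_c2",
    "cmp_lname_c1", "cmp_lname_c2",
    "cmp_sex", "cmp_bd", "cmp_bm", "cmp_by", "cmp_plz",
    "is_match",
]

# Assumption: a missing agreement value is written as "?" in the file.
MISSING = "?"


def parse_row(row):
    """Convert one CSV row (a dict of strings) into typed Python values."""
    record = {}
    for name in COLUMNS:
        raw = row[name].strip()
        if name in ("id_1", "id_2"):
            record[name] = int(raw)                 # record identifiers
        elif name == "is_match":
            record[name] = raw.upper() == "TRUE"    # label as a bool
        else:
            # Agreement score, or None when the value is missing.
            record[name] = None if raw == MISSING else float(raw)
    return record


def load_records(text):
    """Parse CSV text with a header line into a list of typed records."""
    reader = csv.DictReader(io.StringIO(text))
    return [parse_row(row) for row in reader]


# Made-up sample rows, for illustration only (not taken from the dataset).
SAMPLE = """id_1,id_2,cmp_fname_c1,cmp_fname_c2,cmp_lname_c1,cmp_lname_c2,cmp_sex,cmp_bd,cmp_bm,cmp_by,cmp_plz,is_match
1,2,1,?,1,?,1,1,1,1,1,TRUE
3,4,0.25,?,0,?,1,0,0,1,0,FALSE
"""

records = load_records(SAMPLE)
```

Keeping missing values as None, rather than silently replacing them with 0, leaves the imputation choice to the later modelling step.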

Contact

Should you have any questions, feel free to contact TekBoArt @tekboart.

Reference

[1] Irene Schmidtmann, Gael Hammer, Murat Sariyar, Aslihan Gerhold-Ay: Evaluation des Krebsregisters NRW Schwerpunkt Record Linkage. Technical Report, IMBEI 2009.

License

Shield: CC BY-NC-SA 4.0

  • Refer to the file LICENSE for more information regarding the license of this repository.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

About

Finding duplicate records using Record Linkage Comparison and BigData through Apache Spark
