Recognition of duplicate records using Big Data analysis with Apache Spark

The following machine learning algorithms are used:

Classification
1. Logistic Regression
2. Linear SVM
3. Random Forest

Requirements

Project Dir Structure

.
├── data
├── images
│   └── logos
├── logs
├── outputs
├── reports
└── utils

7 directories

Data

The data was obtained from here and was first introduced and used by Schmidtmann et al. [1].

Dataset info:

Features
1. id_1: internal identifier of the first record
2. id_2: internal identifier of the second record
3. cmp_fname_c1: agreement of first name, first component
4. cmp_fname_c2: agreement of first name, second component
5. cmp_lname_c1: agreement of family name, first component
6. cmp_lname_c2: agreement of family name, second component
7. cmp_sex: agreement of sex
8. cmp_bd: agreement of date of birth, day component
9. cmp_bm: agreement of date of birth, month component
10. cmp_by: agreement of date of birth, year component
11. cmp_plz: agreement of postal code
12. is_match: matching status (TRUE for matches, FALSE for non-matches)
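
The column layout above could be read into typed records roughly as follows. This is a minimal sketch using only the Python standard library, not the code from the notebooks. It assumes the data is a CSV file with a header row using the column names listed above, that agreement values are numeric, and that a missing comparison is written as "?". The helper names (parse_score, parse_row) and the sample rows are hypothetical and are not taken from the real dataset.

```python
import csv
import io
from typing import Optional

# Column names in the order listed in the dataset description.
ID_COLUMNS = ["id_1", "id_2"]
FEATURE_COLUMNS = [
    "cmp_fname_c1", "cmp_fname_c2", "cmp_lname_c1", "cmp_lname_c2",
    "cmp_sex", "cmp_bd", "cmp_bm", "cmp_by", "cmp_plz",
]
LABEL_COLUMN = "is_match"

# Assumption: a missing comparison value is encoded as "?".
MISSING = "?"


def parse_score(raw: str) -> Optional[float]:
    """Convert one agreement value to float, or None if it is missing."""
    raw = raw.strip()
    return None if raw in ("", MISSING) else float(raw)


def parse_row(row: dict) -> dict:
    """Turn one CSV row (a dict of strings) into a typed record."""
    record = {name: int(row[name]) for name in ID_COLUMNS}
    for name in FEATURE_COLUMNS:
        record[name] = parse_score(row[name])
    # The label is TRUE for matches and FALSE for non-matches.
    record[LABEL_COLUMN] = row[LABEL_COLUMN].strip().upper() == "TRUE"
    return record


# Hypothetical sample rows for illustration only.
SAMPLE_CSV = (
    ",".join(ID_COLUMNS + FEATURE_COLUMNS + [LABEL_COLUMN]) + "\n"
    "1,2,1.0,?,1.0,?,1,1,1,1,0,TRUE\n"
    "3,4,0.25,?,0.0,?,1,0,0,1,0,FALSE\n"
)

records = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
for rec in records:
    print(rec["id_1"], rec["id_2"], rec["is_match"])
```

Keeping missing values as None, rather than substituting 0, leaves the imputation decision to a later preprocessing step.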

Contact

Should you have any questions, feel free to contact TekBoArt @tekboart.

Reference

[1] Irene Schmidtmann, Gael Hammer, Murat Sariyar, Aslihan Gerhold-Ay: Evaluation des Krebsregisters NRW Schwerpunkt Record Linkage. Technical Report, IMBEI 2009.

License

Refer to the file LICENSE for more information regarding the license of this repository.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
