Skip to content

essraahmed/Data-Deduplication-using-AWS-Lake-Formation-FindMatches

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 

Repository files navigation

Data-Deduplication-using-AWS-Lake-Formation-FindMatches

Introduction

AWS Lake Formation FindMatches is a new machine learning (ML) transform that enables you to match records across different datasets as well as identify and remove duplicate records, with little to no human intervention. FindMatches is part of Lake Formation, a new AWS service that helps you build a secure data lake in a few simple steps. FindMatches helps you with the following:

  • Match customers: Link and integrate customer records across different datasets, even where fields do not match exactly (for example, due to different name spellings, address differences, and missing or inaccurate data).
  • Match products: Match products across different vendor catalogs and SKUs. You can do this even when records do not share a common structure.
  • Prevent fraud: Identify potentially fraudulent accounts compared to previously known bad actors.
  • Match other data: Match addresses, movies, parts lists, etc. In general, if a human being could look at your database rows and determine that they were a match, there’s a good chance that FindMatches can help you.

Project Description

Input:

  • Neat Dataset1 with some records (~2600).
  • Dataset2 with some records (~64,000). Messy, duplicated.
  • Matching records (labels) (~350).

Solution:
Combined the two datasets into a single file with a new column that indicates the source of each record, and provided a label file (the “perfect mapping”) in a format compatible with FindMatches.
The matching process includes the following steps:

  • Catalog the data with the AWS Glue Data Catalog.
  • Create a new FindMatches ML transform for the data.
  • Teach FindMatches by providing labeling examples of matching and non-matching records.
  • Review match quality metrics and upload more labels if match quality is not yet sufficient.
  • Create an AWS Glue ETL job that uses your FindMatches transform.
  • Review the output.

AWS Services Used

AWS Services Description
Amazon S3 (Simple Storage Service) used to store and protect any amount of data.
AWS Glue ETL service to categorize data for cleaning, enriching and transforming between data stores.

Author

Esraa Ahmed | esraa-ahmed-ibrahim2

About

Data Deduplication using AWS Lake Formation FindMatches

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published