Generating Pairs #150

Open
thbeh opened this issue Dec 6, 2020 · 2 comments

thbeh commented Dec 6, 2020

Hi, this is not an issue per se; I just need to understand the process.

I previously wrote some dedup code using RL. That dataset had about 1 million rows and 6 columns, mainly personal attributes such as FirstName and LastName. With Block('last'), around 185,224,110 pairs were generated.

I am now using the same code on about 110K rows with the same schema, and 703,701,657 pairs were generated. Could anyone explain the criteria by which the pairs are generated, and why there is such a huge jump even though there are fewer rows?

Thanks in advance. Cheers

utah-vabrandon commented Dec 7, 2020 via email

thbeh commented Dec 7, 2020

Yes, this is a dedup process that I am testing on one dataset.
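
For anyone landing here with the same question, here is a minimal sketch of how pair counts usually behave under single-key blocking in a dedup setting. It assumes that every two records sharing the same blocking value form one candidate pair, so a block of n records contributes n * (n - 1) / 2 pairs. The function name `count_block_pairs` and the sample data are hypothetical, and skipping missing keys is an assumption for the sketch, not a statement about how the library treats them.

```python
from collections import Counter


def count_block_pairs(keys):
    """Count candidate pairs for dedup when blocking on a single key.

    Assumption: records with the same key value are all paired with each
    other, so a block of n records yields n * (n - 1) // 2 pairs.
    Assumption: records with a missing key (None) form no block.
    """
    block_sizes = Counter(k for k in keys if k is not None)
    return sum(n * (n - 1) // 2 for n in block_sizes.values())


# Hypothetical blocking column with values spread over several blocks.
spread_out = ["smith", "jones", "lee", "smith", "tan", "lee"]

# Same number of rows, but one value dominates the column.
skewed = ["smith"] * 5 + ["lee"]

print(count_block_pairs(spread_out))  # 2
print(count_block_pairs(skewed))      # 10
```

Under this assumption the pair count depends on the block sizes rather than on the total row count, so a smaller dataset can produce more pairs if its blocking column is more skewed. For scale, a single block of 37,500 rows would contribute 703,106,250 pairs by this formula, so one heavily repeated value in the blocking column could account for a count of this size. That is only a guess; checking the value counts of the blocking column would confirm or rule it out.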
