Generating Pairs #150
Comments
The default blocking behavior is a union of all possible matches for each
indexer. If you are only blocking left/right on last_name, there is a good
chance that many of the rows share the same last name, and every pair of rows
within such a group becomes a candidate pair. Is this a dedup process? You
only mention one dataset, as opposed to two.
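To make the effect of block sizes concrete, here is a minimal sketch in plain Python (it does not use or reproduce the library's internals). It assumes a dedup run in which a block of n rows sharing a key yields n(n-1)/2 candidate pairs; the function name `dedup_pair_count` and the sample names are made up for illustration.

```python
from collections import Counter


def dedup_pair_count(keys):
    """Count candidate pairs when every row is paired with every other
    row that shares its blocking key (single dataset, unordered pairs)."""
    block_sizes = Counter(keys)
    # A block of n rows contributes n * (n - 1) / 2 unordered pairs.
    return sum(n * (n - 1) // 2 for n in block_sizes.values())


# Hypothetical data: 1,000 rows spread evenly over 100 last names ...
even = [f"name{i % 100}" for i in range(1000)]
# ... versus 500 rows where 400 share a single last name.
skewed = ["smith"] * 400 + [f"name{i}" for i in range(100)]

print(dedup_pair_count(even))    # 100 blocks of 10 rows -> 100 * 45 = 4500
print(dedup_pair_count(skewed))  # one block of 400 rows -> 79800
```

The smaller but skewed list produces far more pairs than the larger, evenly spread one: the pair count is driven by the size of the largest blocks, not by the total number of rows.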
On Sun, Dec 6, 2020 at 1:54 PM T H Beh wrote:
Hi, this is not an issue per se; I just need to understand the process.
I previously wrote some code for dedup using RL. That process had about
1 million rows and 6 columns, mainly people attributes, e.g. FirstName,
LastName, etc. With Block('last'), around 185,224,110 pairs were generated.
I am using the same code on about 110K rows with the same schema, and
703,701,657 pairs were generated. Could anyone help explain the criteria by
which the pairs are generated, and why there is such a huge jump even though
there are fewer rows?
Thanks in advance. Cheers
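As a rough sanity check on these figures, the sketch below compares each reported pair count with the number of pairs a full (unblocked) dedup index would produce, n(n-1)/2. The row counts are the approximate values quoted in the question, so the ratios are only indicative; `full_pairs` is a helper name introduced here for illustration.

```python
def full_pairs(n_rows):
    """Number of unordered pairs in a single dataset of n_rows rows."""
    return n_rows * (n_rows - 1) // 2


# (approximate rows, reported pairs), both taken from the question
runs = {
    "first run": (1_000_000, 185_224_110),
    "second run": (110_000, 703_701_657),
}

for label, (rows, pairs) in runs.items():
    share = pairs / full_pairs(rows)
    print(f"{label}: {share:.4%} of all possible pairs")
```

Under these assumptions the first run keeps well under 0.1% of all possible pairs, while the second keeps more than 10%. That is what one would expect if a few last-name values in the smaller dataset are shared by a very large number of rows.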
--
Vincent Brandon
Data Coordinator
Utah Data Research Center
140 East 300 South | Salt Lake City, UT 84111
(801) 526-9705
vbrandon@utah.gov
Yes, this is a dedup process that I am testing on one dataset.