Generating Pairs #150

thbeh · 2020-12-06T20:54:23Z

Hi, this is not an issue per se; I just need to understand the process.

I previously wrote some code for dedup using RL. That process had about 1 million rows and 6 columns, mainly people attributes, e.g. FirstName, LastName, etc. With Block('last'), the number of generated pairs was around 185,224,110.

I am now using the same code on about 110K rows with the same schema, and the number of generated pairs was 703,701,657. Could anyone help explain the criteria by which the pairs are generated, and why there is such a huge jump even though there are fewer rows?

Thanks in advance. Cheers

utah-vabrandon · 2020-12-07T15:51:46Z

The default blocking behavior is a union of all possible matches for each indexer. If you are only blocking left/right on last_name, there is a chance that many of the rows have the same last name, and every pair of rows sharing a value becomes a candidate pair. Is this a dedup process? You only mention one dataset, as opposed to two.
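To illustrate why fewer rows can still produce more pairs, here is a minimal sketch in plain Python (not the library's actual implementation). It assumes that deduplicating a single dataset with one blocking key yields every unordered pair of rows sharing that key, so a block of n rows contributes n(n-1)/2 pairs. The function name `count_block_pairs`, the sample data, and the choice to skip missing keys are all hypothetical.

```python
from collections import Counter


def count_block_pairs(keys):
    """Count candidate pairs for a single-dataset dedup blocked on one key.

    Assumption: rows with a missing key (None) form no pairs.
    """
    block_sizes = Counter(k for k in keys if k is not None)
    # A block of n rows yields n * (n - 1) / 2 unordered pairs.
    return sum(n * (n - 1) // 2 for n in block_sizes.values())


# Hypothetical larger dataset: 1000 rows, 500 distinct keys, 2 rows per key.
large_even = [f"name{i % 500}" for i in range(1000)]

# Hypothetical smaller dataset: 100 rows, but 90 of them share one key.
small_skewed = ["smith"] * 90 + [f"name{i}" for i in range(10)]

print(count_block_pairs(large_even))    # 500 blocks of 2 rows -> 500 pairs
print(count_block_pairs(small_skewed))  # one block of 90 rows -> 4005 pairs

# The largest blocks dominate the pair count, so they are worth inspecting.
print(Counter(small_skewed).most_common(1))
```

The pair count depends on how the rows are distributed over the blocking key, not on the row count alone, so checking the frequency of the most common values in the blocking column is a reasonable first diagnostic.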


thbeh · 2020-12-07T20:10:31Z

Yes, this is a dedup process that I am testing on one dataset.
