Results from link phase change between runs #601

Open
spentelow opened this issue Jun 2, 2023 · 5 comments

@spentelow

Describe the bug
The output data produced by the link phase changes each time the model is run.

To Reproduce
Steps to reproduce the behavior:

  1. Follow the quick start instructions:
docker pull zingg/zingg:0.3.4
docker run -it zingg/zingg:0.3.4 bash
  2. Change matchType in examples/febrl/configLink.json from 'exact' to 'fuzzy' (this resolves a problem with this example in version 0.3.4 related to Issue 427).

  3. Run the 'febrl' model in link mode:

./scripts/zingg.sh --phase link --conf examples/febrl/configLink.json
  4. Examine the output files (/tmp/zinggOutput).
  5. Re-run steps 3 and 4 (without changing the configuration or input files) and observe different results. The results differ in the number of output rows, the subset of input datasets included in the output, and the z_score values.

Expected behavior
My expectation is that sequential runs without config or input file changes would produce identical results (except, perhaps, in z_cluster labels).
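
One way to check this expectation is to compare two saved output directories while ignoring row order and, optionally, one column such as the cluster label. The sketch below is a minimal illustration and not part of Zingg. The function names `normalized_rows` and `same_results` are hypothetical, the output is assumed to be plain CSV part files (`part*.csv`) with no quoted commas, and the position of the z_cluster column is an assumption that would need to be confirmed against the actual output.

```shell
# normalized_rows DIR [DROP_COL]
# Print every row from DIR/part*.csv in sorted order.
# DROP_COL is a 1-based column index to omit (0 or empty keeps all columns).
# Assumption: fields are separated by plain commas with no quoted commas.
normalized_rows() {
  cat "$1"/part*.csv |
    awk -F, -v c="${2:-0}" '{
      line = ""; sep = ""
      for (i = 1; i <= NF; i++) if (i != c) { line = line sep $i; sep = "," }
      print line
    }' | LC_ALL=C sort
}

# same_results DIR_A DIR_B [DROP_COL]
# Succeed (exit status 0) when both directories hold the same set of rows,
# regardless of row order and of the dropped column.
same_results() {
  [ "$(normalized_rows "$1" "${3:-0}")" = "$(normalized_rows "$2" "${3:-0}")" ]
}
```

With two runs saved to placeholder directories, `same_results /tmp/run1 /tmp/run2 1 && echo identical` would report whether the runs agree once column 1 is ignored. Sorting first means a difference in the order Spark writes the part files does not count as a mismatch.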

@sonalgoyal
Member

Thanks for reporting, @spentelow. Will look into this.

Member

@vikasgupta

@vikasgupta78
Collaborator

I have started looking into this and can reproduce the issue. I will investigate further.

I ran the link phase twice and got 79 results the first time and 65 the second time.

Steps followed:

./scripts/zingg.sh --phase findTrainingData --conf examples/febrl/config.json --zinggDir /tmp/z_601
./scripts/zingg.sh --phase label --conf examples/febrl/config.json --zinggDir /tmp/z_601
./scripts/zingg.sh --phase trainMatch --conf examples/febrl/config.json --zinggDir /tmp/z_601
mv /tmp/zinggOutput /tmp/zinggOutput-match
./scripts/zingg.sh --phase link --conf examples/febrl/configLink.json --zinggDir /tmp/z_601
mv /tmp/zinggOutput /tmp/zinggOutput-link1
cd /tmp/zinggOutput-link1
cat part*.csv > part-combined1.csv   # => 79 records
./scripts/zingg.sh --phase link --conf examples/febrl/configLink.json --zinggDir /tmp/z_601
mv /tmp/zinggOutput /tmp/zinggOutput-link2
cd /tmp/zinggOutput-link2
cat part*.csv > part-combined2.csv   # => 65 records
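
The record counts above can also be taken without writing a combined file into the output directory. The helper below is a small sketch, not something shipped with Zingg: `count_records` is a hypothetical name, and it simply counts lines across the `part*.csv` files, so any header lines (if the output has them) would be included in the total.

```shell
# count_records DIR
# Print the total number of lines across DIR/part*.csv.
# Assumption: one record per line; header lines, if any, are counted too.
count_records() {
  cat "$1"/part*.csv | awk 'END { print NR }'
}

# Example calls, using the directories from the steps above:
#   count_records /tmp/zinggOutput-link1
#   count_records /tmp/zinggOutput-link2
```

Using `awk` rather than `wc -l` avoids padded output on some platforms and still counts a final line that lacks a trailing newline.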

@vikasgupta78
Collaborator

Pull request #617 raised

sonalgoyal added a commit that referenced this issue Jun 22, 2023
Issue #601 linker has inconsistent results
@vikasgupta78
Collaborator

Keeping the issue open: adding a count is more of a workaround that triggers the explain plan. We still need to find the root cause of why the joins behave this way when there is no action until the final join.

@sonalgoyal sonalgoyal added this to the 0.4 milestone Nov 28, 2023
@sonalgoyal sonalgoyal self-assigned this Dec 1, 2023