Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicate results #59

Open
pepa65 opened this issue Feb 25, 2023 · 8 comments
Open

Duplicate results #59

pepa65 opened this issue Feb 25, 2023 · 8 comments
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@pepa65
Copy link

pepa65 commented Feb 25, 2023

Not sure if this is to be expected somehow, but I noticed nothing in the Issues about this: I'm getting lots of duplicate results displayed, the same lines with the same values. When exported to csv, it is the same.

For example, the file useful-forks-https\ __github.com_zyedidia_micro_.csv has 104 lines (including the header), while with duplicate lines removed, it only has 59.

@payne911
Copy link
Collaborator

payne911 commented Feb 26, 2023

Good catch. For posterity, here's the result CSV content of the repo this user is talking about: https://justpaste.it/8evui

Worth noting: the CSV code is not responsible. The result table itself contains duplicates.
Subset example on the unfinished scan (this particular entry actually appears 3 times in the full table):
image

And then if I look at the Network tab in the Chrome Dev Tools, we see that it seems like GitHub's API is returning duplicates:
image

The simplest fix to this is to keep a HashSet of all entries added in the table, and to only append new rows if the repo isn't in the HashSet.
Interestingly, this filter could happen earlier, thus potentially saving a few network calls (i.e. avoid even requesting for the fork's information if it already appears in the table).
If someone feels like implementing that, I'll gladly review the PR.

I'll label this issue as a bug, despite the bug not being in our codebase.

@payne911 payne911 added the bug Something isn't working label Feb 26, 2023
@Ethkuil
Copy link
Contributor

Ethkuil commented Mar 1, 2023

The simplest fix to this is to keep a HashSet of all entries added in the table, and to only append new rows if the repo isn't in the HashSet.

The even better fix is to change TABLE_DATA into a set instead of an array.

@payne911
Copy link
Collaborator

payne911 commented Mar 2, 2023

to change TABLE_DATA into a set instead of an array.

Theoretically, a set has no ordering. Thus, the operation of sorting would make no sense against that data structure.

Also, it seems like JavaScript has no built-in set (and thus I regret saying "The simplest fix to this is to keep a HashSet"). We'd have to make one, and at that point it might become not worth it: the extra code increases complexity and reduces maintainability, while not providing much performance gains because we are talking about relatively small tables.

The simplest approach would be to simply scan the entire table array to check for a duplicate: that's an acceptable O(n).

@payne911
Copy link
Collaborator

Update: I have the fix ready. I'm simply waiting on #58 to get merged in as it'll build on it.

@payne911
Copy link
Collaborator

payne911 commented Mar 25, 2023

Odd. This issue is still happening, but it's not consistent.
image

There's most probably a race-condition happening. Could be in the Octokit library, or straight in my code. I'll have to try to manually bisect.

Or maybe my JavaScript knowledge is too narrow for me to even find the source of this bug. ¯\_(ツ)_/¯

@payne911 payne911 reopened this Mar 25, 2023
@payne911
Copy link
Collaborator

229d0ce

This was what I thought would fix it.

@andy0130tw
Copy link

@payne911
Copy link
Collaborator

Yes, although technically given I have done an implementation with a list which should have solved this issue, it is clear that the problem is not what I thought it was.

There's most probably a race-condition happening. Could be in the Octokit library, or straight in my code

Feel free to investigate as I do not have much free time these days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

4 participants