Check out "Winnowing algorithm" by Dr S Baltes #186

Open
Bhargav-Rao opened this issue Mar 2, 2019 · 6 comments
@Bhargav-Rao
Member

In issue #28, I also pointed to our implementation of the Winnowing algorithm, which serves a similar purpose. We already evaluated how suitable it is for comparing Stack Overflow posts.

Originally posted by @sbaltes in #160 (comment)

@Bhargav-Rao
Member Author

And here #28 (comment)

In a recent research paper, we evaluated different string similarity metrics to match Stack Overflow code and text blocks to their predecessors. Those results could also be interesting for your projects, because our implementations of the metrics are available on GitHub. You could, for example, use the Winnowing algorithm to compare code blocks.
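For readers unfamiliar with the technique, here is a minimal sketch of winnowing-style fingerprinting applied to two code blocks. It is not the implementation referenced above: the function names, the k-gram size `k`, the window size `w`, the CRC32 hash, and the use of Jaccard overlap as the final score are all assumptions chosen for illustration. It also simplifies the published algorithm by keeping only the selected hash values and dropping their positions.

```python
import re
import zlib


def kgram_hashes(text: str, k: int) -> list[int]:
    """Hash every k-gram of the lowercased text with all whitespace removed."""
    normalized = re.sub(r"\s+", "", text).lower()
    return [
        zlib.crc32(normalized[i:i + k].encode("utf-8"))
        for i in range(len(normalized) - k + 1)
    ]


def winnow(text: str, k: int = 5, w: int = 4) -> set[int]:
    """Select the minimum hash from every window of w consecutive k-gram hashes.

    k and w are illustrative defaults, not values taken from the thread.
    """
    hashes = kgram_hashes(text, k)
    if not hashes:
        return set()
    if len(hashes) < w:
        # Too short for a full window: keep the single smallest hash.
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def winnowing_similarity(a: str, b: str, k: int = 5, w: int = 4) -> float:
    """Jaccard overlap of the two fingerprint sets (an assumed scoring choice)."""
    fa, fb = winnow(a, k, w), winnow(b, k, w)
    if not fa or not fb:
        # Inputs shorter than k characters produce no fingerprints.
        return 1.0 if a == b else 0.0
    return len(fa & fb) / len(fa | fb)
```

Because whitespace is stripped before hashing, two code blocks that differ only in formatting get identical fingerprints, which is one reason this family of metrics is attractive for comparing code.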

@Bhargav-Rao
Member Author

[Image: screenshot, 2019-03-02 at 1:28:20 pm]

An actual score of 0.32 looks quite good for a start.

@FelixSFD added this to the 1.3 milestone Mar 3, 2019
@sbaltes

sbaltes commented Mar 3, 2019

Feel free to modify the similarity threshold according to your needs. In our evaluation, we used a combination of metrics and heuristics to match code blocks, so the results may not be completely transferable to your use case. Besides the Winnowing algorithm (which, by the way, I just implemented, not invented :-) ), you may check out simpler metrics such as tokenDiceNormalized, which scored best in a later evaluation.
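To make the idea of a token-based Dice metric with an adjustable threshold concrete, here is a rough sketch. It is not the `tokenDiceNormalized` implementation mentioned above: the tokenizer, the lowercasing step, the helper names, and the default threshold of 0.3 are all hypothetical and would need tuning against real posts.

```python
import re


def tokens(text: str) -> set[str]:
    """Split lowercased text into a set of word and punctuation tokens.

    This tokenization is an assumption; the referenced library may normalize differently.
    """
    return set(re.findall(r"\w+|[^\w\s]", text.lower()))


def token_dice(a: str, b: str) -> float:
    """Dice coefficient over token sets: 2 * |A & B| / (|A| + |B|)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        # Two empty inputs are treated as identical.
        return 1.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))


def is_similar(a: str, b: str, threshold: float = 0.3) -> bool:
    """Apply a similarity threshold; 0.3 is a placeholder, not a recommended value."""
    return token_dice(a, b) >= threshold
```

Raising `threshold` trades missed matches for fewer false positives, so the right value depends on what the check is used for.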

@Bhargav-Rao
Member Author

Thanks, @sbaltes, we'll check those out as well. I can somewhat see that the algorithm is trying its best to match within the code blocks. Perhaps we can use it for the code-only checks. It certainly looks very promising at this stage.

(We're testing it out in the Workshop. Let us know if you want to check it out, and we will provide you with write access to the room.)

@sbaltes

sbaltes commented Mar 4, 2019

Sure, if there is anything I can help you with, just let me know.

@sbaltes

sbaltes commented Mar 4, 2019

BTW: There is a related discussion on Stack Overflow Meta: https://meta.stackoverflow.com/questions/375761/how-to-handle-code-clones-on-stack-overflow
