Problem with identifying the short license text #9

Open
amanjain97 opened this issue Jun 7, 2018 · 3 comments
Labels
help wanted Extra attention is needed

Comments

@amanjain97
Collaborator

Generally, the license contained in a source code file is either a short license notice or a large block of full license text. This makes it difficult for information retrieval and similarity algorithms to classify the file efficiently.

Please suggest how this should be resolved before implementing other information retrieval (IR) algorithms.

@amanjain97 amanjain97 added the help wanted Extra attention is needed label Jun 7, 2018
@ag4ums
Collaborator

ag4ums commented Jun 7, 2018

From the discussion: let us first get working code for the large block of license text;
then we can fine-tune the algorithm or find a workaround.

@amanjain97
Collaborator Author

Please check with ca157e9

@mcjaeger
Member

It looks like the bigram cosine similarity returns a high number of BitTorrent results.
With the SPDX test files, BitTorrent-1.{0|1} repeatedly score high. For example, when given the 0BSD text, the bigram cosine similarity returns BitTorrent-1.0 with the highest score.

My rough idea of the cause is that the BitTorrent license texts are very long and cover a lot of different areas, so they contain a high number of bigrams that match many licenses. The score computation already takes into account the number of bigrams matching between the reference text and the scanned text. However, an additional weight on the temp value during the computation might be an approach to start tests with.
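To make the length effect concrete, here is a minimal sketch in Python. It is not the code from this repository: the function names `bigrams`, `shared_bigrams`, and `bigram_cosine` and the toy texts are assumptions made for illustration only. It compares a raw count of shared bigrams, which tends to favor a long reference text, with a cosine score that divides by the length of both bigram vectors.

```python
import math
import re
from collections import Counter


def bigrams(text):
    """Return a Counter of lowercase word bigrams (hypothetical helper)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(zip(words, words[1:]))


def shared_bigrams(a, b):
    """Raw number of bigrams the two texts share, with no length normalization."""
    return sum((bigrams(a) & bigrams(b)).values())


def bigram_cosine(a, b):
    """Cosine similarity between the bigram count vectors of two texts."""
    va, vb = bigrams(a), bigrams(b)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy data (invented): a short scanned notice, a short reference that mostly
# matches it, and a long reference that contains the notice plus unrelated text.
scanned = "permission to use copy modify and distribute this software"
short_ref = "permission to use copy modify and distribute"
long_ref = scanned + " " + " ".join(f"clause{i} filler{i}" for i in range(200))

for name, ref in (("short_ref", short_ref), ("long_ref", long_ref)):
    print(name, shared_bigrams(scanned, ref), round(bigram_cosine(scanned, ref), 3))
```

In this toy case the long reference wins on the raw shared-bigram count, while the short reference wins on the cosine score. If the existing score behaves more like the raw count, a weight based on reference length could be one thing to try; whether that applies to the actual computation here would need to be checked against the code.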


3 participants