Problem with identifying the short license text #9

amanjain97 · 2018-06-07T08:29:09Z

Generally, the license in a source code file is either a short license notice or a large block of license text, which makes it difficult for information retrieval and similarity-finding algorithms to classify it efficiently.

Please suggest how this should be resolved before implementing other IR (Information retrieval) algorithms.

ag4ums · 2018-06-07T10:58:38Z

From the discussion: let us first get working code for the large license blocks,
then we can fine-tune the algorithm or find a workaround.

amanjain97 · 2018-06-07T17:00:50Z

Please check commit ca157e9.

mcjaeger · 2018-07-19T08:29:38Z

It looks like the bigram cosine similarity returns a high number of BitTorrent results.
Given the SPDX test files, BitTorrent-1.{0|1} repeatedly score high. For example, for the 0BSD text, the bigram cosine similarity returns BitTorrent-1.0 with the highest score.

A rough explanation is that the BitTorrent license texts are very long and cover many different areas, so a large number of their bigrams match many licenses. The score computation already takes into account the number of bigrams matching between the reference text and the scanned text; however, applying an additional weight to the temp value during the computation could be an approach to start tests with.
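To make the weighting idea concrete, here is a minimal sketch of a bigram cosine similarity with an extra length-based weight. It is not the project's actual implementation: the function names (`bigrams`, `cosine`, `length_weight`, `weighted_score`) and the choice of a shorter-to-longer length ratio as the weight are assumptions made only for illustration.

```python
import math
import re
from collections import Counter


def bigrams(text):
    """Return a Counter of word bigrams from the lowercased text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(zip(words, words[1:]))


def cosine(a, b):
    """Cosine similarity between two bigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def length_weight(a, b):
    """Hypothetical penalty: shorter bigram total divided by the longer one."""
    len_a, len_b = sum(a.values()), sum(b.values())
    if not len_a or not len_b:
        return 0.0
    return min(len_a, len_b) / max(len_a, len_b)


def weighted_score(scanned, reference):
    """Cosine similarity scaled down when the two texts differ a lot in length."""
    a, b = bigrams(scanned), bigrams(reference)
    return cosine(a, b) * length_weight(a, b)
```

With a weight like this, a very long reference text such as BitTorrent-1.0 would be penalized when compared against a short scanned text such as 0BSD. Whether a length ratio is the right weight would need to be checked against the SPDX test files.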

amanjain97 added the "help wanted" label · Jun 7, 2018
