
Our Agents: Briefly Explained

Ayush Bhardwaj edited this page Jul 9, 2020 · 1 revision

AIM

The main aim of the scanning algorithms (called agents) is to detect which license(s) are present in a text file. We detect the license names and report a matching score that tells how closely the license text inside the file resembles the original license text.

Common major steps:

  1. Extract the commented part (the license text) from the text file.
  2. Apply the agent's algorithm to match it against the original license texts.
  3. Return the obtained result: license name(s) with matching score.
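The three common steps can be sketched as follows. This is a minimal illustration, not Atarashi's actual code: the helper names (`extract_comments`, `scan`) and the comment-extraction regex are assumptions made for the example.

```python
import re

def extract_comments(source: str) -> str:
    """Step 1 (rough sketch): pull '#', '//' and '/* ... */' style
    comments out of a source file and join them into one text blob."""
    parts = re.findall(r"/\*.*?\*/|//[^\n]*|#[^\n]*", source, re.DOTALL)
    return "\n".join(re.sub(r"^/\*|\*/$|^//|^#", "", p).strip() for p in parts)

def scan(source: str, licenses: dict, match) -> tuple:
    """Steps 2-3: run an agent's matching function against every known
    license text and return the best (name, score) pair."""
    header = extract_comments(source)
    return max(((name, match(header, text)) for name, text in licenses.items()),
               key=lambda pair: pair[1])
```

Here `match` stands in for any of the agents described below; each one is essentially a different scoring function plugged into this loop.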

Example of Header Texts

When matching licenses, it is usually file headers that have to be matched rather than full license texts, so recognition must work on headers as well. A few examples are listed below:

Apache 2.0 License

Apache-2.0

Copyright [yyyy] [name of copyright owner]

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

GPL-3.0+

Copyright (C) <year>  <name of author>

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program.  If not, see <https://www.gnu.org/licenses/>

LGPL-2.1+

Copyright (C) year  name of author

This library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
License as published by the Free Software Foundation; either
version 2.1 of the License, or (at your option) any later version.

This library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public
License along with this library; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301  USA

Our Agents

1. DLD (Damerau–Levenshtein distance)

Returns the edit distance between two words or sequences, i.e. the minimum number of operations needed to transform one into the other, using four operations:

  1. Insertion
  2. Deletion
  3. Substitution
  4. Transposition of two adjacent characters
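A minimal sketch of the restricted Damerau–Levenshtein (optimal string alignment) distance, which also counts transposition of adjacent characters as a single operation:

```python
def dl_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    m, n = len(a), len(b)
    # dp[i][j] = distance between the prefixes a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dp[i][j] = min(dp[i][j], dp[i - 2][j - 2] + 1)
    return dp[m][n]

print(dl_distance("licence", "license"))  # 1 (one substitution)
print(dl_distance("apahce", "apache"))    # 1 (one transposition)
```

A lower distance against a reference license text means a closer match.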

2. Tfidf (term frequency-inverse document frequency)

It helps us evaluate how relevant or important a word is within a document or corpus by assigning a weight to each word.

It weighs the relative frequency of a word in a specific document against the inverse of that word's frequency across the entire document corpus.

  • Tf (term frequency) counts the number of times a word occurs in a document, i.e. it measures how frequently a term occurs in that document.

tf(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

  • idf (inverse document frequency) emphasizes the rareness of a word in the corpus by devaluing common words, i.e. stop words such as the, is, are, etc.

idf(t) = log_e(Total number of documents / Number of documents with term t in it)

Combining these two parts gives the TF-IDF score: the higher the score, the more relevant that word is in that particular document.

tf-idf(t, d, D) = tf(t, d) * idf(t, D)

Where, t → term, d → document, D → document corpus

tf-idf is a classic information retrieval algorithm, but it is still used extensively today due to its simplicity and effectiveness.


3. wordFrequencySimilarity

As the name suggests, it calculates the frequency of each unique word in the file.

It then compares those frequencies with the frequencies in each original license text and computes a score. The license text with the highest score is returned as the result.

This algorithm seems trivial but gives good results, because license texts do not usually show much variation: the sentences are straightforward and are reused verbatim.
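A sketch of the idea, assuming a simple scoring rule (sum of shared word counts); the agent's exact formula may differ:

```python
import re
from collections import Counter

def word_frequency_similarity(input_text: str, license_text: str) -> int:
    """Score two texts by summing, over every shared word, the smaller
    of its two occurrence counts (an illustrative scoring choice)."""
    tokenize = lambda s: re.findall(r"[a-z]+", s.lower())
    a, b = Counter(tokenize(input_text)), Counter(tokenize(license_text))
    return sum(min(a[w], b[w]) for w in a.keys() & b.keys())

header = "this program is distributed without any warranty"
candidates = {
    "GPL-like": "this program is distributed in the hope that it will be "
                "useful but without any warranty",
    "MIT-like": "permission is hereby granted free of charge to any person",
}
best = max(candidates, key=lambda name: word_frequency_similarity(header, candidates[name]))
print(best)  # GPL-like
```

Because the header reuses the GPL boilerplate almost word for word, the GPL-like candidate wins by a wide margin.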


4. N-Gram

N-gram, as the name suggests, is a contiguous sequence of N terms. Taking N as 2 or 3 gives us bigrams and trigrams respectively.

It is adapted from the Markov Chain model.

To find items similar to a query string, it splits the query into N-grams and ranks candidates by a score based on the ratio of shared to unshared N-grams between the strings. We match the unique set of N-grams from the input text against each original license text; this matching yields a score, and the license with the highest N-gram score is returned.
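A sketch of the N-gram matching, assuming a Jaccard-style ratio of shared to total unique N-grams as the score (the agent's exact ratio may differ):

```python
def ngrams(text: str, n: int = 2) -> set:
    """Unique set of word-level n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_score(query: str, candidate: str, n: int = 2) -> float:
    """Ratio of shared n-grams to all unique n-grams across both strings."""
    q, c = ngrams(query, n), ngrams(candidate, n)
    if not q or not c:
        return 0.0
    return len(q & c) / len(q | c)

# 4 of the 5 unique bigrams are shared, so the score is 4/5 = 0.8
print(ngram_score("licensed under the apache license",
                  "licensed under the apache license version"))
```

As with the other agents, this score would be computed against every known license text and the highest-scoring license returned.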

To understand N-grams in terms of probability, please watch Prof. Daniel Jurafsky of Stanford University explain the maths behind them: Estimating N-gram Probabilities — [ NLP || Dan Jurafsky || Stanford University ]

To understand how the similarity is calculated, please read N-gram similarity and distance.