Google Summer of Code 2020 🚩 Report: "Accelerating Atarashi" 👨‍💻

A full report on my Google Summer of Code 2020 work with FOSSology

CONTRIBUTIONS

1. Nirjas ~ নির্যাস (Bengali for "extract")

A Python library for Comments and Source Code Extraction

To scan a file for various open-source licenses, one of the crucial steps is to extract the comments out of the code so that the base algorithms (agents) can detect the license. License texts live in the comment section, which keeps them separate from the actual codebase.

Kaushlendra and I developed a fully dedicated Python library from scratch for these tasks and published the initial version on PyPI before the first evaluation.

Nirjas is live on PyPI and can be installed with pip install nirjas.

The major task was to classify different types of comments and to write separate logic for each one of them. The types are:

  1. Single-line comments
  2. Multi-line comments
  3. Continuous single-line comments (consecutive lines, each commented out with single-line syntax)
  4. Inline comments (comments written after code on the same line)

The library can extract both comments and code from files in more than 20 popular programming languages. It also provides all the required metadata about the code, the comments, and the file itself. Nirjas is available for public use and can be adopted in projects across various domains.
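To make the classification concrete, here is a minimal illustrative sketch for a C-style language. This is my own simplification for the report, not Nirjas's actual implementation; it ignores edge cases such as comment markers inside string literals:

```python
import re

SINGLE = re.compile(r"^\s*//(.*)$")           # whole line is a comment
INLINE = re.compile(r"^\s*[^/\s].*?//(.*)$")  # code first, comment after

def classify_comments(source: str):
    """Yield (line_number, kind, text) for every comment line."""
    in_block = False
    run = 0  # length of the current run of single-line comments
    for lineno, line in enumerate(source.splitlines(), start=1):
        if in_block:
            yield lineno, "multi_line", line.replace("*/", "").strip()
            if "*/" in line:
                in_block = False
            continue
        if "/*" in line:
            in_block = "*/" not in line
            yield lineno, "multi_line", line.strip()
            run = 0
            continue
        m = SINGLE.match(line)
        if m:
            run += 1
            # The first line of a run is reported as "single_line";
            # later lines as "continuous_single" -- a simplification.
            kind = "continuous_single" if run > 1 else "single_line"
            yield lineno, kind, m.group(1).strip()
            continue
        run = 0
        m = INLINE.match(line)
        if m:
            yield lineno, "inline", m.group(1).strip()
```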


Major Pull Requests

The complete list of open and closed PRs can be found at Nirjas/Pull requests.

2. Integrating Nirjas with Atarashi

The next task was to replace the existing code comment extractor in Atarashi with Nirjas. At that point, the existing extractor was not working at all and threw an error whenever an agent was called, which confirmed that building our own code comment extractor was the right decision.

Nirjas currently supports almost all major programming languages and will be continuously developed and maintained by FOSSology itself.

The integration extracts and passes on only those comments that contain a license statement. Nirjas's comment classification played a big role here, combined with our customized list of tokens that helps us pick the actual license comment out of all the others. Previously, the comment extractor passed every comment, which made the input string noisier and licenses harder to detect.
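A minimal sketch of the filtering idea follows; the token list and threshold here are hypothetical examples, not the actual values used in the Atarashi integration:

```python
# Hypothetical token list -- not the actual list used in Atarashi.
LICENSE_TOKENS = {
    "license", "licence", "copyright", "spdx",
    "warranty", "permission", "redistribute", "gpl", "mit",
}

def is_license_comment(comment: str, min_hits: int = 2) -> bool:
    """Keep a comment only if it contains enough license-related tokens."""
    words = set(comment.lower().split())
    return len(words & LICENSE_TOKENS) >= min_hits

def filter_license_comments(comments):
    """Pass on only the comments likely to contain a license statement."""
    return [c for c in comments if is_license_comment(c)]
```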

A small change was also made to the Evaluator: the test files were zipped and the existing code was improved.

Pull Request

3. Implementing Inverted Index with TF-IDF

The main idea was to create an inverted index over all the license texts and then use TF-IDF scores to detect licenses. This was expected to decrease detection time drastically and make the agents faster.

Flowchart

The inverted index created is in the form:

{
    "keyword1": [
        [
            "doc1",
            TF-IDF Score
        ],
        [
            "doc2",
            TF-IDF Score
        ]
    ],
    "keyword2": [
        [
            "doc3",
            TF-IDF Score
        ],
        [
            "doc2",
            TF-IDF Score
        ],
        [
            "doc'n'",
            TF-IDF Score
        ]
    ]
}

Then, for every input comment, we extract the keywords and compare their TF-IDF scores against the postings in the inverted index file. The documents with the closest TF-IDF scores are ranked in order, and the top result is returned as the detected license.
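The sketch below shows the shape of this build-and-query pipeline. The names are illustrative and, for simplicity, the scores of matching postings are simply summed rather than reproducing the exact ranking the agent applies:

```python
import math
from collections import Counter, defaultdict

def build_inverted_index(docs: dict[str, str]) -> dict:
    """docs maps license name -> license text.
    Returns {keyword: [[doc, tfidf_score], ...]} as shown above."""
    n_docs = len(docs)
    tf = {name: Counter(text.lower().split()) for name, text in docs.items()}
    df = Counter()  # document frequency of each term
    for counts in tf.values():
        df.update(counts.keys())
    index = defaultdict(list)
    for name, counts in tf.items():
        total = sum(counts.values())
        for term, count in counts.items():
            idf = math.log(n_docs / df[term])
            index[term].append([name, (count / total) * idf])
    return index

def detect_license(comment: str, index: dict):
    """Accumulate posting scores for the comment's keywords
    and return the best-ranked license (or None)."""
    scores = defaultdict(float)
    for term in set(comment.lower().split()):
        for doc, score in index.get(term, ()):
            scores[doc] += score
    return max(scores, key=scores.get) if scores else None
```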

Although the algorithm decreased the scanning time from around 1200 seconds to 260 seconds (for 100 files), we were unfortunately unable to improve the accuracy. After trying various search techniques, the maximum accuracy we achieved was 50%, which is lower than the original TF-IDF agent's 59%.

result

In my view, the two main factors that affect the algorithm's accuracy are:

1. Irregularity in the size of license texts

License texts vary widely in size, and the huge differences in keyword counts distort the postings. Longer texts contain most of the unique keywords, which undermines the distinctiveness of keywords in the shorter texts; as a result, the longer license texts dominate the output. For better results, the texts should be normalized so that every license text gets an equal opportunity.
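One way such normalization could be applied (an idea for future work, not something we shipped) is to L2-normalize each license's TF-IDF vector so that document length stops dominating the ranking:

```python
import math

def l2_normalize(doc_scores: dict[str, float]) -> dict[str, float]:
    """Scale a document's term -> TF-IDF map to unit length,
    so long license texts no longer dominate short ones."""
    norm = math.sqrt(sum(s * s for s in doc_scores.values()))
    if norm == 0:
        return doc_scores
    return {term: s / norm for term, s in doc_scores.items()}
```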

2. License texts differ from traditional text corpora

In a traditional corpus, documents usually differ enough to be distinguished from one another. In license texts, however, most tokens appear in the majority of documents, with only slight variations in how they are combined into license statements. This makes sense: they are all open-source licenses talking about open-source software and permissions. These in-license similarities make them hard to differentiate with any traditional information retrieval algorithm.

Codebase

4. Creation of SPDX OSS license dataset

To implement any machine learning/deep learning algorithm, we need a better and bigger dataset of SPDX licenses. Unfortunately, no such dataset for open-source licenses exists on the web.

Our base approach to generating the dataset is to take n-grams of the paragraphs of each license text and generate different combinations of them. Suppose a license text has 5 paragraphs [1, 2, 3, 4, 5], in order. To build the dataset we include subsets like [1], [1, 2], [1, 2, 3], [1, 2, 3, 4], [1, 2, 3, 4, 5], and all such combinations starting from paragraphs 1, 2, 3, 4, and 5, each one with the same label.

Using this technique we were able to generate more than 1 million files from 447 SPDX license files.
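A sketch of one reading of this scheme is below (names are illustrative; the actual generator evidently enumerates more combinations than this to reach over a million files):

```python
def paragraph_spans(paragraphs: list[str], label: str):
    """Yield every contiguous run of paragraphs, each tagged with the
    license label: [1], [1,2], ..., [1..n], then [2], [2,3], ..., [n]."""
    n = len(paragraphs)
    for start in range(n):
        for end in range(start + 1, n + 1):
            yield "\n\n".join(paragraphs[start:end]), label

# Example: a 5-paragraph license yields 5 + 4 + 3 + 2 + 1 = 15 samples.
```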

Not all paragraphs are equally important, and many of them add a lot of noise to the dataset. To address this, we will select only the highly relevant paragraphs and then repeat the same process.

A few updates we still need to make:

  1. Shifting from txt files to the SPDX JSON endpoint
  2. Differentiating license headers from full license texts
  3. Adding the FOSSology Nomos agent's STRINGS.in regexes to dataset creation

Codebase

5. Documenting Nirjas & Atarashi

During the GSoC period, I found time to create and organize documentation for both Atarashi and Nirjas. The documentation covers all the user and developer information for the projects and is organized to be easily accessible to everyone.

The Documentation can be found at:

πŸ‘¨πŸ»β€πŸ« DELIVERABLES

| Task | Planned | Completed | Remarks |
| --- | --- | --- | --- |
| Creating Nirjas | Yes | ✔️ | Beta version is live; the project will be continuously developed & maintained |
| Publishing to PyPI | Yes | ✔️ | Nirjas is live and can be installed and used in projects |
| Integrating Nirjas with Atarashi | Yes | ✔️ | We can select the specific license comment from all comments |
| Implementing Inverted Index with TF-IDF | Yes | ✔️ | The desired accuracy could not be achieved with this algorithm |
| Creating the SPDX OSS dataset | No | ✔️ | The dataset can be improved further; development is ongoing |
| Implementing BERT | Yes (optional) | ❌ | Can only be implemented after dataset creation |

πŸš€ FUTURE PLANS

  1. Implementing complete regexes in Nirjas to cover most boundary cases
  2. Improving the generated SPDX OSS dataset
  3. Continuing to develop Nirjas and Atarashi
  4. Maintaining Nirjas and Atarashi
  5. Exploring other methods for license scanning

πŸ“š Things I learned from Google Summer of Code

  • Learned various NLP techniques by studying, testing, and implementing them
  • Learned about various open-source licenses and their importance in code, projects, and software
  • Learned to develop a complete library from scratch
  • Learned how Python projects are packaged and maintained
  • Sharpened my Git skills
  • Learned various information retrieval algorithms & traditional search techniques
  • Learned to create better, cleaner datasets
  • Improved my knowledge of data science
  • Learned the importance of time management and polished deliverables
  • Improved my documentation skills
  • Improved my communication & presentation skills

Selected Proposal - Proposal-Atarashi-GSoC2020


Let's get connected!
