hastagAB/GSoC-19


                           _______       _____             _____ _    _ _____ 
                        /\|__   __|/\   |  __ \     /\    / ____| |  | |_   _|
                       /  \  | |  /  \  | |__) |   /  \  | (___ | |__| | | |  
                      / /\ \ | | / /\ \ |  _  /   / /\ \  \___ \|  __  | | |  
                     / ____ \| |/ ____ \| | \ \  / ____ \ ____) | |  | |_| |_ 
                    /_/    \_\_/_/    \_\_|  \_\/_/    \_\_____/|_|  |_|_____|

🚩 ABSTRACT

Atarashi scans for license statements in open source software, focusing on text statistics and information retrieval algorithms. It was designed to work both stand-alone and with FOSSology. Previously it offered only a simple command-line interface; to be more useful, it needed to be integrated with FOSSology.

My project was to integrate Atarashi with FOSSology and to create a stable GUI that uses existing FOSSology features. The integration was done by adapting the Ninka agent as a wrapper that connects Atarashi to FOSSology.

To improve Atarashi and continue its development, I had to make a pip package of Atarashi and publish it to PyPI. Along with that, I had to build a machine learning model and implement a new algorithm using it. The new algorithm, Semantic Text Similarity, works on the Gensim implementation of Doc2Vec and finds the most similar documents using the cosine similarity of their vectors. With gradual improvements in the future, this model should become more accurate, and Atarashi more powerful and faster.

I also had to create an evaluation script for Atarashi's existing and upcoming algorithms to validate their accuracy and reliability. The evaluation is based on the time taken by the agents and the accuracy of their results. This helps identify the best open-source license scanning algorithms.

🌏 CONTRIBUTIONS

1. Package Atarashi and Publish to PyPI

To integrate Atarashi with FOSSology, we had to package it and publish it to PyPI so that we could pip install atarashi on the FOSSology system and use it from the software. A publishable Python package needs a well-organized directory and file structure. Atarashi's structure was already organized, and setup.py had every detail needed to publish. This saved me a lot of time and work.

There were some minor errors: the classifiers in the setup.py metadata differed from the standardized classifiers provided by PyPI. I changed "Development Status :: Pre-Alpha" to "Development Status :: 2 - Pre-Alpha" and "License :: OSI Approved :: GPL v2.0 License" to "License :: OSI Approved :: GNU General Public License v2 (GPLv2)". One more error occurred while publishing: PyPI did not accept the CodeComment dependency because it was external, so I had to remove it from setup.py.
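The classifier fix can be illustrated with a small sketch. The helper name `fix_classifiers` and the third classifier in the sample list are hypothetical and only serve the example; the two replacements are the ones described above.

```python
# Hypothetical helper, not part of Atarashi: maps the two non-standard
# classifier strings mentioned above to the names PyPI accepts.
CLASSIFIER_FIXES = {
    "Development Status :: Pre-Alpha":
        "Development Status :: 2 - Pre-Alpha",
    "License :: OSI Approved :: GPL v2.0 License":
        "License :: OSI Approved :: GNU General Public License v2 (GPLv2)",
}


def fix_classifiers(classifiers):
    """Return a copy of `classifiers` with known non-standard entries replaced."""
    return [CLASSIFIER_FIXES.get(name, name) for name in classifiers]


# Illustrative input; the last entry is an assumed example classifier.
fixed = fix_classifiers([
    "Development Status :: Pre-Alpha",
    "License :: OSI Approved :: GPL v2.0 License",
    "Programming Language :: Python :: 3",
])
```

The resulting list is what would be passed as the `classifiers` argument of `setup()` in setup.py.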

Currently, the Package is live at: https://pypi.org/project/atarashi/

Atarashi can be installed in the system using pip install atarashi

2. Integrating Atarashi into FOSSology

To integrate Atarashi with FOSSology, the Ninka agent was modified and used as a wrapper. Its codebase was adapted to work with Atarashi.

The Modifications are as follows:

  • Changed various classes and functions accordingly.
  • Added the command that runs the Atarashi agent in bash.
  • Converted Atarashi's JSON output to Ninka's result output format, using the JsonCpp library that FOSSology already used.
  • Modified the makefiles accordingly.
  • Compiled the atarashiWrapper codebase and connected it to the FOSSology database.
  • Added atarashiWrapper to the FOSSology scheduler.

The working UI was fixed by GMishx.

  • Cherry-picked the commit GMishx@3b31c17 into the PR to get a working UI.
  • Added the mod_deps files for the Python, pip and Atarashi installation.

The UI works, and the agent can be scheduled under Upload > From File > Select optional analysis.

atarashiWrapper

To test, do a fresh installation of FOSSology and upload a file with the Atarashi license scanning option selected.

The PR for the integration is: feat(atarashi): Add Atarashi to FOSSology

3. Algorithms Evaluator

Before creating any new algorithm for Atarashi, it was suggested that there should be a script to evaluate the algorithms, so that the license scanner agent is accurate and reliable. It also makes it easier to pick the best scanner agent among them.

The evaluation is done on the basis of two factors:

  1. Time taken by the agent
  2. Accuracy of the result

The script runs the agent command in bash and captures the output. The license name is parsed from the output and matched against the correct license name. The test dataset consists of files containing various open source license statements.

It was created from nomosTestFiles and spdx-test-files and contains 100 files in total. The script runs the commands on every file in the test dataset and reports the result at the end.
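The two evaluation factors can be sketched as follows. This is a minimal illustration, not the evaluator from the PR: the function name `evaluate` is hypothetical, and the scanner is passed in as a plain callable (an assumption) so the sketch runs without Atarashi installed, whereas the real script invokes the agent command in bash.

```python
import time
from typing import Callable, Dict, Tuple


def evaluate(scan: Callable[[str], str],
             expected: Dict[str, str]) -> Tuple[float, float]:
    """Run `scan` on every file and return (accuracy, elapsed_seconds).

    `expected` maps each test file path to its correct license name.
    """
    if not expected:
        raise ValueError("expected must contain at least one file")
    correct = 0
    start = time.perf_counter()
    for path, license_name in expected.items():
        # A result counts as correct only on an exact license-name match.
        if scan(path) == license_name:
            correct += 1
    elapsed = time.perf_counter() - start
    return correct / len(expected), elapsed
```

Calling `evaluate` once per agent with the same `expected` mapping gives comparable accuracy and timing figures for each agent.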

The PR for the evaluator is: feat(evaluator): Add Evaluation Script for algorithms

4. New Algorithm: Semantic Text Similarity

Semantic Text Similarity finds the similarity between documents based on their semantics.

Unlike word2vec, the Gensim implementation of Doc2Vec converts a whole document into a single vector associated with a label. The Doc2Vec model is trained on the full license dataset, using the filename as the label and the license text as the document. The current training dataset is the txt format of license-list-data provided by SPDX.

Once trained, the model is loaded and used to convert the scanned document into a vector.

The cosine similarity between this vector and the trained license vectors is calculated, and the highest score is returned as sim_score.
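As an illustration of the scoring step, here is a pure-Python sketch of cosine similarity and of picking the best-scoring label. The function names and the toy two-dimensional vectors are assumptions for the example; in Atarashi the vectors come from the Doc2Vec model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors `a` and `b` (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query, labelled_vectors):
    """Return (label, sim_score) for the vector most similar to `query`."""
    scores = {label: cosine_similarity(query, vector)
              for label, vector in labelled_vectors.items()}
    label = max(scores, key=scores.get)
    return label, scores[label]


# Toy vectors, for illustration only.
label, sim_score = best_match([0.9, 0.1], {"MIT": [1.0, 0.0], "GPL-2.0": [0.0, 1.0]})
```

A higher score means the documents point in a more similar direction in vector space, which is why the highest score is the one reported.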

Steps in training the model

  1. Loads the dataset.
  2. Reads each document and saves its content in memory.
  3. Tokenizes the content and lowercases it.
  4. Saves the tokens as list elements.
  5. Iterates over the tokens of each document and assigns the document a label.
  6. Starts the training with a fixed number of epochs and a fixed learning rate.
  7. Saves the model in binary format.
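The training steps above can be sketched with Gensim roughly as follows. This is not the training code from the PR: the function names, the regex tokenizer, and the hyperparameter values (`vector_size=100`, `epochs=50`, `alpha=0.025`) are assumptions, and the label is taken to be the file name without its extension.

```python
import re
from pathlib import Path


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def load_corpus(license_dir):
    """Read every .txt file and return a list of (label, tokens) pairs."""
    corpus = []
    for path in sorted(Path(license_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        corpus.append((path.stem, tokenize(text)))
    return corpus


def train(license_dir, model_path, epochs=50, alpha=0.025):
    """Train a Doc2Vec model on the license texts and save it to disk."""
    # Imported here so the helpers above also work without gensim installed.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    tagged = [TaggedDocument(words=tokens, tags=[label])
              for label, tokens in load_corpus(license_dir)]
    model = Doc2Vec(vector_size=100, min_count=1, epochs=epochs, alpha=alpha)
    model.build_vocab(tagged)
    model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)
    model.save(model_path)
    return model
```

Keeping tokenization in its own function matters because the same preprocessing has to be applied to a scanned document at inference time.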

Steps in Implementing the Algorithm

  1. Loads the trained model.
  2. Loads and reads the document at the given file path.
  3. Tokenizes the whole document and lowercases it.
  4. Saves the tokens as list elements.
  5. Finds the cosine similarity between the provided document and the documents of the trained model.
  6. Returns the labels of the top 10 most similar documents.
  7. The label with the highest sim_score is the result.
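A minimal sketch of these steps with Gensim might look like this. The function names are hypothetical, the tokenizer is an assumed stand-in for the real preprocessing, and `model.dv` is the attribute name in Gensim 4.x (older releases expose the same vectors as `model.docvecs`).

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def load_model(model_path):
    """Load a previously trained Doc2Vec model from disk."""
    # Imported here so the functions below can be tried without gensim.
    from gensim.models.doc2vec import Doc2Vec
    return Doc2Vec.load(model_path)


def scan(model, file_path, topn=10):
    """Return up to `topn` (label, sim_score) pairs, most similar first."""
    with open(file_path, encoding="utf-8", errors="ignore") as handle:
        tokens = tokenize(handle.read())
    vector = model.infer_vector(tokens)
    # most_similar ranks the trained document vectors by cosine similarity.
    return model.dv.most_similar([vector], topn=topn)


def best_license(model, file_path):
    """Return the (label, sim_score) pair with the highest score."""
    return scan(model, file_path)[0]
```

Usage would be along the lines of `best_license(load_model("model.bin"), "LICENSE")`, where both paths are placeholders.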

The label is the license name.

The PR for the Algorithm is: feat(doc2vec) : Semantic Text Similarity Algorithm with dataset & training code

🔧 PULL REQUESTS

Major Contributions

Other Contributions

👨🏻‍🏫 DELIVERABLES

| Tasks | Planned | Completed |
|---|---|---|
| Package Atarashi | Yes | ✔️ |
| Publish to PyPI | Yes | ✔️ |
| Integrate Atarashi with FOSSology | Yes | ✔️ |
| Create working UI for Atarashi | Yes | ✔️ |
| Create Algorithm Evaluator | No | ✔️ |
| Training and Implementing New ML Algorithm | Yes, but improved | ✔️ |

🚀 FUTURE PLANS

  1. Increase the size and quality of the training dataset for the ML model
  2. Improve text preprocessing using third-party libraries
  3. Work on code comment extractor
  4. Parallelize the evaluation script
  5. Maintain the Atarashi package published at PyPI

📚 Things I learned from Google Summer of Code

  • Learned about real-world software system workflow and architecture.
  • Learned about various open-source licenses and their importance in code, projects and software.
  • Started understanding a bit of PHP and C/C++.
  • Packaging of Python projects and how they are maintained and released.
  • Shell scripts and their use within various systems.
  • Sharpened my Git skills.
  • Various information retrieval algorithms, especially word2vec and doc2vec.
  • Learned to create a better and cleaner dataset.
  • Sharpened my knowledge of machine learning.
  • Learned the importance of time management as well as perfect deliverables.
  • Learned the importance of documentation.
  • Improved my communication skills.

📜 License

This repository is licensed under the GPL-2.0 © HastagAB.

About

Google Summer of Code 2019 🚩 Report On Project: "Continuation of Atarashi OSS" 👨‍💻
