🚩 ABSTRACT

                           _______       _____             _____ _    _ _____ 
                        /\|__   __|/\   |  __ \     /\    / ____| |  | |_   _|
                       /  \  | |  /  \  | |__) |   /  \  | (___ | |__| | | |  
                      / /\ \ | | / /\ \ |  _  /   / /\ \  \___ \|  __  | | |  
                     / ____ \| |/ ____ \| | \ \  / ____ \ ____) | |  | |_| |_ 
                    /_/    \_\_/_/    \_\_|  \_\/_/    \_\_____/|_|  |_|_____|

Atarashi scans for license statements in open source software, focusing on text statistics and information retrieval algorithms. It was designed to work both stand-alone and with FOSSology. Earlier, it offered only a simple command-line interface; to be more usable, it needed to be integrated with FOSSology.

My project was to integrate Atarashi with FOSSology and to create a stable GUI that uses various existing FOSSology features. The integration was done by adapting the Ninka agent, which serves as a wrapper around Atarashi inside FOSSology.

To improve Atarashi and continue its development, I had to build a pip package of Atarashi and publish it to PyPI. Along with that, I had to train a machine learning model and implement a new algorithm using it. The new algorithm, Semantic Text Similarity, is built on the Gensim implementation of Doc2Vec and finds the most similar documents using the cosine similarity of their vectors. Gradual improvements in the future should make this model more accurate and Atarashi more powerful and faster.

I also had to create an evaluation script for Atarashi's existing and upcoming algorithms to validate their accuracy and reliability. Evaluation is based on the running time and the accuracy of the agents. This helps identify the best open-source license scanning algorithms.

🌏 CONTRIBUTIONS

1. Package Atarashi and Publish to PyPI

To integrate Atarashi with FOSSology, we had to package it and publish it to PyPI so that it can be installed in the FOSSology system with pip install atarashi and used from there. A publishable Python package needs a certain directory and file layout. Atarashi's structure was already organized, and setup.py already had every detail needed to publish. This saved me a lot of time and work.

There were some minor errors: the classifiers in the setup.py metadata differed slightly from the standardized classifiers provided by PyPI. I changed "Development Status :: Pre-Alpha" to "Development Status :: 2 - Pre-Alpha" and "License :: OSI Approved :: GPL v2.0 License" to "License :: OSI Approved :: GNU General Public License v2 (GPLv2)". One more error occurred while publishing: the upload did not accept CodeComment because it was external, so I had to remove it from setup.py.
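The two classifier corrections can be written down as a small lookup. This is only a minimal sketch: the names `CLASSIFIER_FIXES` and `normalize_classifiers` are hypothetical and do not come from Atarashi's setup.py; only the classifier strings themselves are taken from the text.

```python
# Hypothetical helper (not part of Atarashi): maps the two non-standard
# classifiers mentioned above to the trove classifiers PyPI accepts.
CLASSIFIER_FIXES = {
    "Development Status :: Pre-Alpha":
        "Development Status :: 2 - Pre-Alpha",
    "License :: OSI Approved :: GPL v2.0 License":
        "License :: OSI Approved :: GNU General Public License v2 (GPLv2)",
}


def normalize_classifiers(classifiers):
    """Return a new list with known non-standard classifiers replaced."""
    return [CLASSIFIER_FIXES.get(item, item) for item in classifiers]


if __name__ == "__main__":
    old = [
        "Development Status :: Pre-Alpha",
        "License :: OSI Approved :: GPL v2.0 License",
    ]
    for line in normalize_classifiers(old):
        print(line)
```

The resulting list is what would be passed as the `classifiers` argument of `setup()`.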

The package is currently live at: https://pypi.org/project/atarashi/

Atarashi can be installed with: pip install atarashi

2. Integrating Atarashi to Fossology

To integrate Atarashi with FOSSology, the Ninka agent was modified and used as a wrapper, with its codebase adapted to Atarashi.

The modifications are as follows:

Changed various classes and functions accordingly.
Added the command that runs the Atarashi agent in bash.
Parsed Atarashi's JSON output into the result format used by Ninka, using the Jsoncpp library that was already in use in FOSSology.
Modified the makefiles accordingly.
Compiled the atarashiWrapper codebase and connected it to the FOSSology database.
Added atarashiWrapper to the FOSSology scheduler.
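The JSON parsing step can be illustrated in Python, although the actual wrapper is written in C++ with Jsoncpp. The sketch below is not the wrapper's code: the function name is hypothetical, and the JSON shape (a `results` list whose entries carry `shortname` and `sim_score`) is an assumption made for illustration.

```python
import json


def licenses_from_json(raw):
    """Extract (license_name, score) pairs from a JSON string.

    Assumed, hypothetical shape:
    {"file": "...", "results": [{"shortname": "...", "sim_score": 0.9}]}
    """
    data = json.loads(raw)
    pairs = []
    for entry in data.get("results", []):
        name = entry.get("shortname")
        if name:  # skip entries that carry no license name
            pairs.append((name, float(entry.get("sim_score", 0.0))))
    return pairs


if __name__ == "__main__":
    sample = '{"file": "a.c", "results": [{"shortname": "MIT", "sim_score": 1}]}'
    print(licenses_from_json(sample))
```

A wrapper would then write each pair into the result format the host agent expects.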

The working UI was fixed by GMishx.

Cherry-picked the commit GMishx@3b31c17 and added it to the PR for a working UI.
Added the mod_deps files for Python, pip and Atarashi installation.

The UI is working fine, and the agent can be scheduled from Upload > From File > Select optional analysis.

To test, do a fresh installation of FOSSology and upload a file with the Atarashi license scanning option selected.

The PR for the integration is: feat(atarashi): Add Atarashi to FOSSology

3. Algorithms Evaluator

Before creating any new algorithm for Atarashi, it was suggested that there should be a script to evaluate the algorithms, so that we end up with an accurate and reliable license scanner agent. It also makes it easier to pick the best scanner agent of all.

The evaluation is done on the basis of two factors:

Time taken by the agent
Accuracy of the result

The script runs the agent command in bash and captures the output. The license name is parsed from the output and compared with the correct license name. The test dataset consists of files containing various open source license statements.

This dataset is created from nomosTestFiles and spdx-test-files and contains a total of 100 files. The script runs the commands on every file in the test dataset and reports the result at the end.
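The two evaluation factors can be sketched as follows. This is not the evaluator script itself: `evaluate` is a hypothetical name, and the scanner is passed in as a plain callable so the example runs without shelling out to an agent. In a real run, that callable would wrap the agent command and parse its output.

```python
import time


def evaluate(scan, cases):
    """Run `scan` on every (path, expected_license) pair.

    `scan` is any callable that takes a file path and returns a license name.
    Returns the number of correct results, the accuracy and the elapsed time.
    """
    correct = 0
    start = time.perf_counter()
    for path, expected in cases:
        if scan(path) == expected:
            correct += 1
    elapsed = time.perf_counter() - start
    total = len(cases)
    accuracy = correct / total if total else 0.0
    return {"total": total, "correct": correct,
            "accuracy": accuracy, "seconds": elapsed}


if __name__ == "__main__":
    # Stand-in scanner: a lookup table instead of a real agent call.
    fake_results = {"a.txt": "MIT", "b.txt": "GPL-2.0"}
    cases = [("a.txt", "MIT"), ("b.txt", "Apache-2.0")]
    print(evaluate(fake_results.get, cases))
```

Because the scanner is injected, the same function can compare several agents on the same list of cases.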

The PR for the evaluator is: feat(evaluator): Add Evaluation Script for algorithms

4. New Algorithm: Semantic Text Similarity

Semantic Text Similarity finds the similarity between documents based on their semantics.

The Gensim implementation of Doc2Vec converts a whole document (unlike word2vec, which works on individual words) into a vector associated with a label. The Doc2Vec model is trained on the full license dataset, using the filename as the label and the license text as the document. The current training dataset is the txt format of license-list-data provided by SPDX.

The trained model is then loaded and used to convert the document being scanned into a vector.

The cosine similarity between the vectors is calculated and the highest score is returned as sim_score.
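The scoring step can be illustrated with plain Python lists standing in for document vectors. This is a sketch of cosine similarity (the score that is highest for the closest match), not Atarashi's code; `cosine_similarity` and `best_match` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def best_match(query, labelled_vectors):
    """Return (label, sim_score) for the stored vector closest to `query`."""
    scored = ((label, cosine_similarity(query, vec))
              for label, vec in labelled_vectors.items())
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    vectors = {"MIT": [1.0, 0.0], "GPL-2.0": [0.0, 1.0]}
    print(best_match([0.9, 0.1], vectors))
```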

Steps in training the model

Load the dataset.
Read each document and keep its content in memory.
Tokenize the content and lowercase it.
Save the tokens as list elements.
Iterate over the tokens of each document and give the document a label.
Start the training with a fixed epoch size and learning rate.
Save the model in binary format.
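The training steps above can be sketched with Gensim as follows. This is a minimal illustration, not the training code from the PR: the function names, the whitespace tokenizer and the hyperparameter values are assumptions. Gensim is imported inside `train` so that the helper functions work without it.

```python
import os


def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def load_corpus(corpus_dir):
    """Yield (tokens, label) per .txt file; the label is the file name without extension."""
    for name in sorted(os.listdir(corpus_dir)):
        if not name.endswith(".txt"):
            continue
        path = os.path.join(corpus_dir, name)
        with open(path, encoding="utf-8", errors="ignore") as handle:
            yield tokenize(handle.read()), os.path.splitext(name)[0]


def train(corpus_dir, model_path, epochs=100, alpha=0.025, vector_size=100):
    """Train a Doc2Vec model on the corpus and save it (values are illustrative)."""
    # Imported here so the helpers above can be used without gensim installed.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    docs = [TaggedDocument(words=tokens, tags=[label])
            for tokens, label in load_corpus(corpus_dir)]
    model = Doc2Vec(vector_size=vector_size, alpha=alpha,
                    min_count=1, epochs=epochs)
    model.build_vocab(docs)
    model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)
    model.save(model_path)
    return model
```

A call such as `train("licenses/", "doc2vec.model")` would train on a directory of license texts; both paths are placeholders.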

Steps in Implementing the Algorithm

Load the trained model.
Load and read the document at the given file path.
Tokenize the whole document and lowercase it.
Save the tokens as list elements.
Find the cosine similarity between the provided document and the documents of the trained model.
Return the top 10 similar document labels.
The label with the highest sim_score is the result.

The label is actually the license name.
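The scanning steps can be sketched in the same spirit. The code assumes the Gensim 4.x API, where document vectors are reached through `model.dv` (older releases used `model.docvecs`). The function names are hypothetical, and this is not the agent's actual implementation.

```python
def scan_text(model, text, topn=10):
    """Return (best_label, sim_score, ranked) for `text`.

    `model` is expected to behave like a gensim 4.x Doc2Vec model.
    """
    tokens = text.lower().split()
    vector = model.infer_vector(tokens)
    # List of (label, cosine similarity) pairs, most similar first.
    ranked = model.dv.most_similar([vector], topn=topn)
    if not ranked:
        return None, 0.0, []
    best_label, sim_score = ranked[0]
    return best_label, float(sim_score), ranked


def scan_file(model_path, file_path, topn=10):
    """Load a saved model and scan one file."""
    # Imported here so scan_text can be exercised without gensim installed.
    from gensim.models.doc2vec import Doc2Vec

    model = Doc2Vec.load(model_path)
    with open(file_path, encoding="utf-8", errors="ignore") as handle:
        return scan_text(model, handle.read(), topn=topn)
```

Keeping `scan_text` separate from model loading lets one loaded model be reused across many files.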

The PR for the Algorithm is: feat(doc2vec) : Semantic Text Similarity Algorithm with dataset & training code

🔧 PULL REQUESTS

Major Contributions

Other Contributions

👨🏻‍🏫 DELIVERABLES

Tasks	Planned	Completed
Package Atarashi	Yes	✔️
Publish to PyPI	Yes	✔️
Integrate Atarashi with FOSSology	Yes	✔️
Create working UI for Atarashi	Yes	✔️
Create Algorithm Evaluator	No	✔️
Training and Implementing New ML Algorithm	Yes, but improved	✔️

🚀 FUTURE PLANS

Increase the size and quality of the training dataset for the ML model
Improve text preprocessing using third party libraries
Work on code comment extractor
Parallelize the evaluation script
Maintain the Atarashi package published at PyPI

📚 Things I learned from Google Summer of Code

Learned about real-world software system workflow and architecture.
Learned about various open-source licenses and their importance in code, projects and software.
Started understanding a bit of PHP and C/C++.
Packaging of Python projects and how they are maintained and released.
Shell scripts and their use within various systems.
Sharpened my Git skills.
Various information retrieval algorithms, especially word2vec and doc2vec.
Learned to create a better and cleaner dataset.
Sharpened my knowledge of machine learning.
Learned the importance of time management as well as perfect deliverables.
Learned the importance of documentation.
Improved my communication skills.
