Work Report for GSoC-2021 project with Fossology.

Kaushl2208/GSoC2021

GSoC2021


Copyright False Positive Detection using ML @ FOSSology

Project Details | Contributions | Deliverables | Future Goals | Key Takeaways | Acknowledgements

Project Details

In the current scenario, most projects carry copyright statements that define how the product may be used. Software like Fossology uses rule-based approaches for copyright detection and scanning. Agents like nomos use regex-based approaches to extract the statements from a project; a regex-based agent then reports the copyright statements found in the project, followed by a set of agents that deactivate the copyrights that are false. Still, many false statements remain in the agent's findings, and a user then has to review them manually. This makes it a two-step process, which is not ideal.

My proposed ideas and objectives revolved entirely around Fossology: introducing a Natural Language Processing (NLP) based approach for pre-processing, and then recognising the patterns that distinguish a false copyright statement from a true one with the help of NLP and automation. Another piece of functionality was to remove the clutter from the originally extracted copyright statement. The overall goal of the proposal was to introduce these new functionalities into Fossology.


Contributions

1. Introducing NER and POS tagging for Copyright Statements

A Python based approach to analyse copyright statements

One thing about copyright statements is very intriguing: they look predictable, yet there are millions of variations in how they look and what they can contain, despite having a predictable architecture. The first task was to understand that architecture through the lens of text understanding, which in our case means the types of named entities and the parts of speech.

From there, I decided to predict a specific structure that most copyright statements follow despite their variations. Two lists, one of Named Entities and one of POS tags, were then hypothesised. These hypothesised lists served as the benchmark for an ideal structure, and they clarified the understanding and outline of the rest of the project.

According to NER and POS tagging, the structure looked like:

Statement: "Copyright (c) 2021, Kaushlendra Pratap (kaushlendra@xyz.com)"
Probable NER entities look like: ['DATE', 'PERSON', 'CARDINAL', 'ORG']
Probable POS tags look like: ['NOUN', 'NUM', 'PROPN', 'PROPN']
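As a rough illustration of how a statement's tags could be compared against the hypothesised lists, here is a minimal sketch. The scoring rule (share of expected tags that appear in the observed tags), the function name `structure_overlap` and the observed tag values are assumptions made for this example; they are not taken from the project's script.

```python
from typing import List

# Hypothesised "ideal" tag lists, copied from the example above.
EXPECTED_NER = ["DATE", "PERSON", "CARDINAL", "ORG"]
EXPECTED_POS = ["NOUN", "NUM", "PROPN", "PROPN"]


def structure_overlap(observed: List[str], expected: List[str]) -> float:
    """Return the fraction of distinct expected tags that occur in observed.

    This scoring rule is an assumption for illustration only.
    """
    wanted = set(expected)
    if not wanted:
        return 0.0
    return len(wanted & set(observed)) / len(wanted)


# Made-up tags, standing in for what an NLP tagger might return.
observed_ner = ["DATE", "PERSON"]
observed_pos = ["NOUN", "PUNCT", "NUM", "PUNCT", "PROPN", "PROPN"]

ner_score = structure_overlap(observed_ner, EXPECTED_NER)
pos_score = structure_overlap(observed_pos, EXPECTED_POS)
print(ner_score, pos_score)
```

A statement that scores high on both lists is close to the benchmark structure; where to put the threshold is a tuning decision.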

2. Hypothesis to Working Solution

After testing the predicted architecture and getting good accuracy in recognising most of the copyright statements from the provided datasets, the compilation of the script started. The script works as follows:
[Figure: workflow diagram]

The task was divided into three sections:

  • Text pre-processing to make the input data cleaner, with less clutter.
  • Defining a function that calculates the NER and POS tags for each statement, iteratively.
  • A two-stage filtering if-else ladder that sets "T" if a match is found and "F" if not. A new column, "is_copyright", was introduced in the CSV to hold this value.
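The three sections above can be sketched roughly as follows. This is an illustrative outline, not the project's script: the column name `content`, the hint sets, the cleanup rules and `toy_tagger` (a keyword-based stand-in for a real NER/POS tagger) are all assumptions.

```python
import re
from typing import Callable, Dict, List, Tuple

# A tagger returns (ner_labels, pos_tags) for one statement.
Tagger = Callable[[str], Tuple[List[str], List[str]]]

# Assumed hints; the real script's checks may differ.
NER_HINTS = {"DATE", "PERSON", "ORG"}
POS_HINTS = {"NUM", "PROPN"}


def preprocess(text: str) -> str:
    """Strip leading comment markers and collapse whitespace."""
    text = re.sub(r"^[\s*#/]+", "", text)
    return re.sub(r"\s+", " ", text).strip()


def classify(ner: List[str], pos: List[str]) -> str:
    """Two-stage if-else ladder: NER check first, POS check as fallback."""
    if NER_HINTS & set(ner):
        return "T"
    elif POS_HINTS <= set(pos):
        return "T"
    else:
        return "F"


def label_rows(rows: List[Dict[str, str]], tagger: Tagger) -> List[Dict[str, str]]:
    """Add an "is_copyright" value to every row."""
    labelled = []
    for row in rows:
        ner, pos = tagger(preprocess(row["content"]))
        labelled.append({**row, "is_copyright": classify(ner, pos)})
    return labelled


def toy_tagger(text: str) -> Tuple[List[str], List[str]]:
    """Hypothetical stand-in for a real tagger, for demonstration only."""
    ner, pos = [], []
    for token in text.split():
        if re.fullmatch(r"(19|20)\d{2},?", token):
            ner.append("DATE")
            pos.append("NUM")
        elif token[0].isupper():
            pos.append("PROPN")
        else:
            pos.append("NOUN")
    return ner, pos


rows = [
    {"content": "// Copyright (c) 2021, Kaushlendra Pratap"},
    {"content": "* copyright notice must be retained"},
]
labelled = label_rows(rows, toy_tagger)
for row in labelled:
    print(row["is_copyright"], row["content"])
```

Passing the tagger in as a function keeps the filtering logic independent of whichever NLP library produces the tags.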

3. Testing the script and Accuracy Calculations

Accuracy calculation:

The accuracy calculation was done with the help of several datasets that were manually labelled by humans. Iterating over the CSV:

IF ManualTag == AlgorithmTag: counter += 1
accuracy_score = (counter / total_occurrence) * 100

The accuracy was divided into: FP_accuracy, TP_accuracy, TN_accuracy and FN_accuracy.

Final Precision (overall accuracy) = (TP + TN)/(TP + FP + TN + FN)
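The counting loop can be written out as a short sketch. It assumes tags are stored as "T"/"F" strings and uses made-up labels; in standard confusion-matrix terms, the share of rows where the manual and algorithm tags agree equals (TP + TN) divided by the total.

```python
from collections import Counter
from typing import List


def confusion_counts(manual: List[str], predicted: List[str]) -> Counter:
    """Count TP, FP, TN and FN, treating "T" as the positive class."""
    counts = Counter()
    for truth, guess in zip(manual, predicted):
        if guess == "T":
            counts["TP" if truth == "T" else "FP"] += 1
        else:
            counts["TN" if truth == "F" else "FN"] += 1
    return counts


def accuracy_percent(manual: List[str], predicted: List[str]) -> float:
    """Percentage of rows where the manual and algorithm tags agree."""
    matches = sum(m == p for m, p in zip(manual, predicted))
    return matches * 100 / len(manual)


# Made-up labels for demonstration; not data from the project.
manual_tags = ["T", "T", "F", "F", "T"]
algorithm_tags = ["T", "F", "F", "T", "T"]

counts = confusion_counts(manual_tags, algorithm_tags)
score = accuracy_percent(manual_tags, algorithm_tags)
print(dict(counts), score)
```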

Results


[Figure: accuracy score]

Final Precision: 94.37 %


4. Clutter Removal from the Copyright Statements

Copyright statements rarely come in a clean, direct structure; they often have license statements appended at the end.

Normal Copyright with clutter:

Copyright (c) 2021, Kaushlendra Pratap Singh. Distributed Under the MIT license ....

Copyright with clutter removal:

Copyright (c) 2021, Kaushlendra Pratap Singh

The approach taken was:

IF is_copyright == "t":

    string = copyrightStatement

    IF 'ORG' or 'PERSON' in NER_LIST:

        cleaned = string[0 : string.index(org_name) + len(org_name)] (same way for person_name)
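A minimal runnable sketch of this slicing idea is shown below. It assumes the ORG/PERSON entity texts have already been extracted by the tagger, and it slices through the end of the entity so that the name itself is kept; the function name `remove_clutter` is hypothetical.

```python
from typing import List


def remove_clutter(statement: str, entity_texts: List[str]) -> str:
    """Cut the statement right after the first ORG/PERSON entity found.

    entity_texts holds entity strings in priority order. If none of them
    occurs in the statement, the statement is returned unchanged.
    """
    for entity in entity_texts:
        start = statement.find(entity)
        if entity and start != -1:
            return statement[: start + len(entity)]
    return statement


raw = ("Copyright (c) 2021, Kaushlendra Pratap Singh. "
       "Distributed Under the MIT license")
cleaned = remove_clutter(raw, ["Kaushlendra Pratap Singh"])
print(cleaned)
```

Cutting at the first matching entity is a design choice for this sketch: it avoids extending the statement to an organisation name that appears only inside the trailing license text.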

Results:

[Figure: clutter removal results]

5. Integrating the Script as Decider Agent

Fossology has several agents, such as Nomos, Monk, Ninka and Decider. The main goal was to introduce the Python script into the PHP code and then use it as a Decider agent.

The tasks in hand were:

  1. Create two flags in the UI: Copyright Deactivation, and Copyright Deactivation with Clutter Removal.
  2. Create two rules and two separate functions in DeciderAgent.php to call the Python script and then update the database with the true and deactivated copyright statements.
  3. Differentiate the functionality of the two functions and provide the exact $uploadID, $content, $action and $hash.
  4. Update the Makefile so that the script is installed with make install.
  5. Create a mod_deps file to declare and install the dependencies required to run the script.

Each task was accomplished and the agent was completely integrated.

Results

Last two entries:
[Figure: Decider agent options]

The results after running the agent:

No Clutter Removal flag:
[Figure: deactivation without clutter removal]

Clutter Removal flag:
[Figure: deactivation with clutter removal]

6. Documentation and Pull Request

  1. The pull request with the script, integration changes, UI change and database update code: Check the PR from here

  2. Progress was recorded every week and is kept in a separate wiki: Check WPRs from here

  3. The setup and user documentation for the script in the Decider agent: Check documentation

  4. The installation and user documentation for the Jupyter Notebook: README

Deliverables

| Tasks | Planned | Completed | Remarks |
|---|---|---|---|
| Introducing NER and POS tagging for Copyright Statements | Yes | ✔️ | This served as the proof of concept for the idea. |
| Implementing the hypothesis as a working product | Yes | ✔️ | The script works efficiently but can be improved further. |
| Accuracy score calculation and testing | Yes | ✔️ | The accuracy is acceptable but can be improved with more checks. |
| Integrating the script with Fossology | Yes | ✔️ | Integration is done and can be used with a Fossology installation. |
| Documenting the working of the script | Yes | ✔️ | None |

Future Goals

  1. Implementing further layers of checks to cover the edge cases.
  2. Exploring other NLP techniques to understand other perspectives on copyright statements.
  3. Maintaining the agent and working towards higher accuracy in clutter removal.
  4. Staying with the Fossology community as a contributor and helping future developers get started with Fossology, Atarashi and Nirjas.
  5. Continuing to maintain Atarashi and Nirjas.

Key Takeaways

  • Learnt the art of collaboration and working on real-world software development.
  • Improved programming skills, including OOP concepts and modular programming.
  • Learnt a lot about NLP techniques for pre-processing text.
  • Learnt about the importance of open-source copyrights and their detailed analysis.
  • Improved Git skills.
  • Understood how a full-fledged system like Fossology functions from the Model, View and Controller perspectives.
  • Got better at analysing code and debugging.
  • Understood the importance of a well-prepared dataset, and of creating one from scratch for training our own NER model.
  • Punctuality and adaptability according to time and situation.
  • Communicating properly, presenting the code and continuing to ask questions.

Acknowledgements

This year Google Summer of Code came with extra fun because it was my second time participating with Fossology, and it ends with a little sadness because it was my last time participating with Fossology as a student developer. There are several people to whom I want to extend my regards.

I want to thank and appreciate my mentors Michael C. Jaeger, Anupam Ghosh, Gaurav Mishra, Vasudev Maduri, Ayush Bharadwaj and Shaheem Azmal M MD. Without their help and support, all this would not have been possible.

I would also like to extend my regards to two very important figures who helped me steer through all the challenges (PS: Not just GSoC :P), Ayush Bharadwaj and Sahil Jha.

Finally, I am glad to have met all the fellow developers along the way. You guys are awesome; keep doing the great work.
