GSoC2021

Copyright False Positive Detection using ML @ FOSSology

Project Details

Today, most projects rely on copyright statements to define the terms of use for their product. Software like FOSSology uses rule-based approaches for copyright detection and scanning. Agents like Nomos use regex-based approaches to extract the statements from a project; a regex-based agent then shows the results with the copyright statements found in the project, followed by a set of agents that deactivate the copyrights which are false. Even so, many false statements are left in the agents' findings, and a user has to go through them manually. This makes it a two-step process, which is not ideal.

My proposed ideas and objectives revolved entirely around FOSSology: introducing a Natural Language Processing (NLP) based approach for pre-processing, and then using NLP and automation to recognise the patterns that distinguish a false copyright statement from a true one. Another planned feature was to remove the clutter from the originally extracted copyright statements. The overall goal was to introduce these new functionalities into FOSSology.

Contributions

1. Introducing NER and POS tagging for Copyright Statements

A Python based approach to analyse copyright statements

Codebase: GitHub
Documentation: FalsePositiveDetection-repo

One thing about copyright statements is very intriguing: they have a predictable architecture, yet there are millions of variations in how they look and in what they can contain. The first task was therefore to understand this architecture through the lens of text understanding, which in our case means the types of named entities and the parts of speech.

From there, I decided to predict a specific structure that most copyright statements follow despite their variations. I hypothesised two lists, one of named entities and one of POS tags. These hypothesised lists served as a benchmark for the ideal structure and clarified the outline of the rest of the project.

According to NER, the structure looked like:

Statement: "Copyright (c) 2021, Kaushlendra Pratap (kaushlendra@xyz.com)"
Probable NER entities look like: ['DATE', 'PERSON', 'CARDINAL', 'ORG']
Probable POS tags look like: ['NOUN', 'NUM', 'PROPN', 'PROPN']
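
One way to use such benchmark lists is to measure how many of the expected labels show up, in order, in the tags produced for a statement. The sketch below is a minimal illustration under that assumption. The helper name `ordered_overlap` and its scoring rule are hypothetical and are not taken from the project's code; only the two tag lists come from the example above.

```python
# Benchmark structure hypothesised for a typical copyright statement
# (values copied from the example in the text).
EXPECTED_NER = ["DATE", "PERSON", "CARDINAL", "ORG"]
EXPECTED_POS = ["NOUN", "NUM", "PROPN", "PROPN"]


def ordered_overlap(expected, observed):
    """Return the fraction of `expected` labels found in `observed`, in order.

    Hypothetical scoring rule: each expected label is searched for after the
    position of the previous match; labels that are missing are skipped.
    """
    if not expected:
        return 0.0
    position = 0
    matched = 0
    for label in expected:
        try:
            position = observed.index(label, position) + 1
        except ValueError:
            continue  # label not present after the last match
        matched += 1
    return matched / len(expected)
```

For instance, a statement whose NER labels come out as ['DATE', 'ORG'] scores 0.5 against `EXPECTED_NER`, because two of the four expected labels are present in the right order.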

2. Hypothesis to Working Solution

After testing the predicted architecture and getting good accuracy in recognising most of the copyright statements in the provided datasets, I started compiling the script. It works as follows.

The task was divided into three sections:

Text pre-processing to make the input data more accurate and less cluttered.
Defining a function that calculates the NER and POS tags for each statement, iteratively.
A two-stage filtering if-else ladder that sets the tag to "T" if a match is found and "F" if not. A new column, "is_copyright", was introduced in the CSV to hold this tag.
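
The three sections above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's script: the `tagger` callable stands in for whatever NER/POS library is used, the column name `content` is hypothetical, and the conditions inside the ladder are placeholders for the real matching rules.

```python
import csv
import io
import re


def preprocess(text):
    """Strip leading comment markers and collapse runs of whitespace."""
    text = re.sub(r"^[\s/*#]+", "", text)
    return re.sub(r"\s+", " ", text).strip()


def classify(statement, tagger):
    """Two-stage ladder: check named entities first, then POS tags."""
    ner_labels, pos_tags = tagger(statement)
    if "PERSON" in ner_labels or "ORG" in ner_labels:  # stage 1: NER (assumed rule)
        if "PROPN" in pos_tags or "NUM" in pos_tags:   # stage 2: POS (assumed rule)
            return "T"
    return "F"


def tag_csv(csv_text, tagger, column="content"):
    """Read CSV text, add an `is_copyright` column and return the new CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    fieldnames = list(reader.fieldnames) + ["is_copyright"]
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row["is_copyright"] = classify(preprocess(row[column]), tagger)
        writer.writerow(row)
    return out.getvalue()
```

Passing the tagger in as a callable keeps the ladder testable with a stub, independent of any particular NLP model.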

3. Testing the script and Accuracy Calculations

Accuracy Calculation:

The accuracy was calculated with the help of several datasets that were manually tagged by a human. Iterating over the CSV:

IF ManualTag == AlgorithmTag: counter += 1
accuracy_score = (counter / total_occurrence) * 100

The accuracy was divided into: FP_accuracy, TP_accuracy, TN_accuracy and FN_accuracy.

Final Precision = (TP + FP)/(TP + FP + TN + FN)
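
The comparison of manual and algorithm tags can be sketched in Python as below. This is a minimal sketch, assuming each CSV row has been reduced to a (manual tag, algorithm tag) pair of "T" or "F"; the function name and the returned keys are hypothetical. It reports the overall agreement percentage and the agreement within each manually assigned class, and does not attempt to reproduce the four per-category accuracies listed above.

```python
def agreement_scores(rows):
    """Compare manual and algorithm tags and return agreement percentages.

    `rows` is an iterable of (manual_tag, algorithm_tag) pairs, each "T" or "F".
    """
    totals = {"T": 0, "F": 0}
    matches = {"T": 0, "F": 0}
    for manual, predicted in rows:
        totals[manual] += 1
        if manual == predicted:
            matches[manual] += 1

    def percent(hit, total):
        # Avoid dividing by zero when a class has no rows.
        return 100.0 * hit / total if total else 0.0

    return {
        "overall": percent(sum(matches.values()), sum(totals.values())),
        "manual_T": percent(matches["T"], totals["T"]),
        "manual_F": percent(matches["F"], totals["F"]),
    }
```

Splitting the score by manual class shows whether disagreements are concentrated in true copyrights or in false positives.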

Results

Final Precision: 94.37 %

4. Clutter Removal from the Copyright Statements

Extracted copyright statements do not always have a clean structure; they often include license statements appended at the end.

Normal Copyright with clutter:

Copyright with clutter removal:

The approach taken was:

IF is_copyright == "t":
    string = copyrightStatement
    IF 'ORG', 'PERSON' in NER_LIST:
        clutter = string[0:string.index(org_name)]   (same way for person_name)
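
A possible Python reading of this approach is sketched below. It is an interpretation, not the project's implementation: it assumes the NER step yields (text, label) pairs, and, unlike the pseudocode, which slices at the index where the name starts, it keeps the name itself and drops only what follows. The function name `remove_clutter` is hypothetical.

```python
def remove_clutter(statement, entities):
    """Cut `statement` right after the last PERSON/ORG entity found in it.

    `entities` is a list of (text, label) pairs as an NER step might produce.
    If no PERSON or ORG text is found, the statement is returned unchanged.
    """
    end = -1
    for text, label in entities:
        if label in ("PERSON", "ORG") and text:
            start = statement.find(text)
            if start != -1:
                # Remember the furthest point covered by a name.
                end = max(end, start + len(text))
    return statement if end == -1 else statement[:end].rstrip()
```

Using `str.find` rather than `str.index` avoids an exception when an entity's text does not appear verbatim in the statement.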

RESULTS :

5. Integrating the Script as Decider Agent

FOSSology has several agents, such as Nomos, Monk, Ninka and Decider. The main goal was to introduce the Python script into the PHP code and then use it as a Decider agent.

The tasks in hand were:

Create two flags on the UI: Copyright Deactivation, and Copyright Deactivation with Clutter Removal.
Create two rules and two separate functions in DeciderAgent.php that call the Python script and then update the database with the true and deactivated copyright statements.
Differentiate the functionality of the two functions and provide each with the exact $uploadID, $content, $action and $hash.
Update the Makefile so that the script is installed with make install.
Create a mod_deps file to install the dependencies required to run the script.

Each task was accomplished and the agent was completely integrated.

RESULTS

Last two entries:

Results after running the agent:

No Clutter Removal Flag:

Clutter Removal Flag:

6. Documentation and Pull Request

The pull request with the script, integration changes, UI change and database update code: Check the PR from here
Progress was recorded every week in a separate wiki: Check WPRs from here
The setup and user documentation for the script in the Decider agent: Check documentation
The installation and user documentation for the Jupyter Notebook: README

Deliverables

Tasks	Planned	Completed	Remarks
Introducing NER and POS tagging for Copyright Statements	Yes	✔️	This served as the proof of concept (POC) for the idea.
Implementing the Hypothesis as a working product	Yes	✔️	The script works efficiently but can be improved further.
Accuracy Score calculation and Testing	Yes	✔️	The accuracy is acceptable but can be improved with more checks.
Integrating the Script with FOSSology	Yes	✔️	Integration is done and can be used with a FOSSology installation.
Documenting the working of Script	Yes	✔️	None

Future Goals

Implement further layers of checks to cover the edge cases.
Explore other NLP techniques to gain other perspectives on copyright statements.
Maintain the agent and work towards higher accuracy in the clutter removal techniques.
Stay with the FOSSology community as a contributor and help future developers get started with FOSSology, Atarashi and Nirjas.
Continue maintaining Atarashi and Nirjas.

Key Takeaways

Learnt the art of collaboration and working on real-world software development.
Improved programming skills, including OOP concepts and modular programming.
Learnt a lot about NLP techniques for pre-processing text.
Learnt about the importance of open-source copyrights and how to analyse them in detail.
Improved Git skills.
Understood how a full-fledged system like FOSSology functions from the Model, View and Controller perspectives.
Better analysis of code and easier debugging.
Learnt the importance of a well-prepared dataset and of creating one from scratch for training our own NER model.
Punctuality and adaptability according to time and situation.
Communicating properly, presenting the code and continuing to ask questions.

Acknowledgements

This year's Google Summer of Code came with extra fun because it was my second time participating with FOSSology, and it ends with a little sadness because it was my last time participating with FOSSology as a student developer. There are several people to whom I want to extend my regards.

I want to thank and appreciate my mentors Michael C. Jaeger, Anupam Ghosh, Gaurav Mishra, Vasudev Maduri, Ayush Bharadwaj and Shaheem Azmal M MD. Without their help and support, all this would not have been possible.

Now, I would like to extend my regards to two very important figures who helped me steer through all the challenges (PS: not just GSoC :P): Ayush Bharadwaj and Sahil Jha.

Finally, I am glad to have met all the fellow developers along the way. You guys are awesome; keep doing the great work.
