ML-EMBL-publication-project

Repository of the ML models and algorithms built for the EMBL publication project.

Clone

git clone https://github.com/0AlphaZero0/ML-EMBL-publication-project.git

Introduction

This prototype detects every EMBL paper within a list of PMIDs. An EMBL paper is a paper with at least one affiliation to either an EMBL site or an EMBL partnership.

There are currently 6 sites and 2 partnerships:

Australia (partnership)
Barcelona
Hinxton (EMBL-EBI Cambridge)
Grenoble
Hamburg
Heidelberg
Nordic (partnership)
Rome

The prototype uses two machine learning models and two vectorizers built during the FREYA project. This work was carried out for deliverable 4.6.

The full project is described in Confluence.

Running

As described above, the prototype needs a list of PMIDs. This list can come from a file provided by the user or from a string written directly in the script. Results are organized by search: each new search requires a .csv or .txt file containing PMIDs, and each search has its own folder in the searches folder. The results are a set of files, one per site plus one listing all EMBL PMIDs detected. For each site, the file is a .csv table like the following:

PMIDs	EMBL	Member states	Worldwide	Partnership
30537516	TRUE	TRUE	TRUE	TRUE
30496853	TRUE	TRUE	FALSE	FALSE
29330484	TRUE	FALSE	TRUE	FALSE
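
As an illustration of how such a result table might be consumed, the sketch below filters the PMIDs by one of the boolean columns. The function name pmids_where is hypothetical, and the tab delimiter is an assumption based on the sample above; a real output file may use commas instead.

```python
import csv
import io


def pmids_where(table_text, column, delimiter="\t"):
    """Return the PMIDs whose value in `column` is TRUE."""
    # Assumption: the first row is the header shown in the sample table.
    reader = csv.DictReader(io.StringIO(table_text), delimiter=delimiter)
    return [
        row["PMIDs"]
        for row in reader
        if row[column].strip().upper() == "TRUE"
    ]


# The sample rows from the table above, tab-separated.
sample = (
    "PMIDs\tEMBL\tMember states\tWorldwide\tPartnership\n"
    "30537516\tTRUE\tTRUE\tTRUE\tTRUE\n"
    "30496853\tTRUE\tTRUE\tFALSE\tFALSE\n"
    "29330484\tTRUE\tFALSE\tTRUE\tFALSE\n"
)

print(pmids_where(sample, "Worldwide"))  # ['30537516', '29330484']
```

To read a file from disk instead, the same function can be given the text returned by open(path, encoding="utf-8").read().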

To run the prototype, use the following command:

python .\detect_EMBL.py

In the script, three variables must be set to run your search:

search_name="test"
search_file="test_pmid_EPMC.txt"
directory="./searches/"+search_name+"/"

search_name is a name of your choice; it is also the name of the search's directory inside the searches directory. search_file is the file in that directory that contains the PMIDs you want to process.
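
The following is a minimal sketch of how the PMIDs could be read from the search file once these variables are set. The helper load_pmids is hypothetical and is not taken from detect_EMBL.py, and the file layout (PMIDs separated by newlines, commas, semicolons or spaces) is an assumption.

```python
import re
from pathlib import Path


def load_pmids(directory, search_file):
    """Return the unique PMIDs found in directory/search_file, in file order."""
    text = (Path(directory) / search_file).read_text(encoding="utf-8")
    seen = set()
    pmids = []
    # Assumption: PMIDs are runs of digits separated by whitespace, commas
    # or semicolons; non-numeric tokens (e.g. a header) are skipped.
    for token in re.split(r"[\s,;]+", text):
        if token.isdigit() and token not in seen:
            seen.add(token)
            pmids.append(token)
    return pmids
```

With the variables shown above, the call would be load_pmids(directory, search_file), which reads ./searches/test/test_pmid_EPMC.txt.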

This algorithm uses multiprocessing so that it can process large numbers of PMIDs. The machine running it may therefore slow down while a search is in progress.
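
One common way to limit that slowdown is to leave a core free for the rest of the system. The sketch below shows the idea with the standard multiprocessing module. All function names are hypothetical, process_chunk is only a placeholder for the real per-PMID work, and nothing here is claimed to match how detect_EMBL.py distributes its work.

```python
import os
from multiprocessing import Pool


def worker_count(reserve=1):
    """Number of worker processes, leaving `reserve` cores free."""
    return max(1, (os.cpu_count() or 1) - reserve)


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_chunk(pmids):
    # Placeholder for the real per-PMID work (hypothetical): each PMID is
    # simply paired with the length of its string.
    return [(pmid, len(pmid)) for pmid in pmids]


def process_all(pmids, chunk_size=100, reserve=1):
    """Process PMIDs in parallel and return the flattened results."""
    with Pool(processes=worker_count(reserve)) as pool:
        results = pool.map(process_chunk, chunked(pmids, chunk_size))
    return [item for chunk in results for item in chunk]
```

On Windows, where new processes are spawned rather than forked, process_all should be called from inside an if __name__ == "__main__": block.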

Details

This prototype relies on two algorithms:

is_EMBL

This algorithm takes an affiliation string and returns a dictionary with prediction scores and the prediction methods used. It combines exact matches with predictions made either on the whole string or on sub-parts of the string.
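
The control flow described above (exact match first, then a prediction on the whole string, then on sub-parts) can be sketched as follows. This is not the repository's is_EMBL: the name list, the dictionary keys, the threshold, and above all placeholder_score, which stands in for the trained models and vectorizers, are assumptions made for illustration.

```python
import re

# Hypothetical list of exact names; the real script may use a different one.
EXACT_NAMES = ("european molecular biology laboratory", "embl")
# Hypothetical hint words used by the stand-in scorer below.
HINTS = {"european", "molecular", "biology", "laboratory"}


def placeholder_score(text):
    """Stand-in for a trained model: share of words that are hint words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in HINTS for token in tokens) / len(tokens)


def is_embl_sketch(affiliation, threshold=0.75):
    lowered = affiliation.lower()
    # Step 1: exact, whole-word match on a known name.
    for name in EXACT_NAMES:
        if re.search(r"\b" + re.escape(name) + r"\b", lowered):
            return {"score": 1.0, "method": "exact_match"}
    # Step 2: score the whole string.
    whole = placeholder_score(affiliation)
    if whole >= threshold:
        return {"score": whole, "method": "whole_string"}
    # Step 3: score comma-separated sub-parts and keep the best one.
    parts = [part.strip() for part in affiliation.split(",") if part.strip()]
    best = max((placeholder_score(part) for part in parts), default=0.0)
    method = "sub_string" if best >= threshold else "none"
    return {"score": max(whole, best), "method": method}
```

Scoring sub-parts separately matters because a long affiliation dilutes the signal: in this sketch, "Structural Biology Unit, Molecular Biology Laboratory, Heidelberg, Germany" falls below the threshold as a whole, while its second sub-part alone passes it.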

get_geoloc_from

This algorithm takes an affiliation string and returns a dictionary with the geolocation information found in that string. It is not the most accurate way to extract a geolocation from a string, so it is a good candidate to revisit in order to improve EMBL detection.
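
To make the input and output shape concrete, here is a deliberately simple sketch based on a lookup table. It is not the repository's get_geoloc_from: the function name, the dictionary keys and the table (restricted to the site cities named in the Introduction) are assumptions.

```python
import re

# Hypothetical lookup table, limited to the EMBL site cities named in the
# Introduction; a real implementation would need far broader coverage.
CITY_TO_COUNTRY = {
    "barcelona": "Spain",
    "grenoble": "France",
    "hamburg": "Germany",
    "heidelberg": "Germany",
    "hinxton": "United Kingdom",
    "rome": "Italy",
}


def geoloc_sketch(affiliation):
    """Return {'city': ..., 'country': ...} for the first known city found."""
    # Whole-word tokens only, so "Jerome" does not match "rome".
    for token in re.findall(r"[a-z]+", affiliation.lower()):
        if token in CITY_TO_COUNTRY:
            return {
                "city": token.capitalize(),
                "country": CITY_TO_COUNTRY[token],
            }
    return {"city": None, "country": None}
```

A table lookup like this misses any city it does not list and cannot disambiguate cities that share a name, which illustrates why geolocation extraction is a natural place to look for improvements.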
