WE4LKD

Word Embeddings For Latent Knowledge Discovery

Who are we?

WE4LKD is a Brazilian research group of undergraduate, master's, doctoral, and postdoctoral students with a strong focus on Artificial Intelligence (AI) and Natural Language Processing (NLP). Our primary objective is to study, analyze, explore, and propose effective real-world NLP applications. Using word embeddings, we aim to uncover hidden knowledge and patterns in textual data, extract valuable insights, and improve applications in different fields.

Accelerating Discoveries in Medicine using Distributed Vector Representations of Words

Berto, Matheus V. V.; De Freitas, Breno L.; Scarton, Carolina E.; Neto, João A. M.; Almeida, Tiago A.

This study extends a recently proposed strategy by combining different unsupervised models to accelerate discoveries in medicine. We trained distributed vector representations of words on a large corpus of medical papers related to Acute Myeloid Leukemia (AML), a highly malignant form of cancer, and show that established therapies could have been developed years before they were first proposed. The results open new avenues toward faster medical discoveries through more effective drug and gene testing, enabling better treatments that promote a healthier, longer life for patients.

Starting from 1963, the first explicit occurrence of AML in our corpus, we generated yearly prediction rankings for a set of 21 target compounds. We then calculated the percentage of these predicted AML treatments later reported in the literature, either considering only the compounds in the top-3 predictions (orange curve) or without that restriction (blue curve). Five years after the first predictions, testing only the top-3 predictions would accelerate the discovery of treatments by up to 2.3x compared with testing drugs at random.
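The calculation described above can be sketched as follows. This is one possible reading of the metric, not the code used in the study: the function name `hit_rate`, the `horizon` parameter, and the compound names and years are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def hit_rate(ranking: List[str], first_reported: Dict[str, int],
             prediction_year: int, horizon: int, top_k: int = 3) -> float:
    """Share of the top_k ranked compounds that are first reported as
    treatments within `horizon` years after `prediction_year`."""
    candidates = ranking[:top_k]
    if not candidates:
        return 0.0
    hits = sum(
        1 for compound in candidates
        if compound in first_reported
        and prediction_year < first_reported[compound] <= prediction_year + horizon
    )
    return hits / len(candidates)


# Toy inputs (hypothetical names and years, NOT results from the study).
ranking_1963 = ["compound_a", "compound_b", "compound_c", "compound_d"]
first_reported = {"compound_a": 1966, "compound_c": 1975, "compound_d": 1965}

rate = hit_rate(ranking_1963, first_reported, prediction_year=1963, horizon=5)
print(f"top-3 hit rate after 5 years: {rate:.2f}")
```

Repeating this for every prediction year and several horizons would give a curve comparable in spirit to the ones described, under the assumptions stated above.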

Finally, our models were able to identify and suggest testing of some of the currently known compounds used to treat AML up to 11 years before they were explicitly mentioned in the literature, as illustrated below. The remainder of this repository describes the evolution of the project.

Contributing

We encourage you to contribute to our project! Please check out the Issues page.

Built With

Getting Started

This section provides a high-level quick start guide.

Prerequisites

To use this project, you need Python installed on your machine. The project was developed with Python 3.6. You will also need pip, the Python package manager, to install the project's other requirements.
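A quick way to check these prerequisites is sketched below. The helper name `meets_minimum` is hypothetical, and treating 3.6 as a minimum is an assumption: the README only states that the project used Python 3.6, so newer interpreters may or may not work with the pinned requirements.

```python
import importlib.util
import sys


def meets_minimum(version=sys.version_info, minimum=(3, 6)):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version[:2]) >= minimum


# pip is importable as a module when it is installed for this interpreter.
has_pip = importlib.util.find_spec("pip") is not None

print("Python", sys.version.split()[0], "OK" if meets_minimum() else "too old")
print("pip available:", has_pip)
```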

Clone the repository

git clone https://github.com/matheusvvb-19/WE4LKD-leukemia_w2v.git
cd WE4LKD-leukemia_w2v/

Set up a Python virtual environment

# create venv
python3 -m venv venv
# activate venv
source venv/bin/activate
# install requirements
pip3 install --ignore-installed -r requirements.txt

Usage

1. Optionally, edit the search phrases in the /data/search_strings.txt file.

2. Run crawler.py:

  mkdir results
  python3 crawler.py

  Alternatively, download, decompress, and place this file into /pubchem/. If you do this, skip to step 5.

3. Run merge_txt.py, which generates the .txt files with all articles between periods:

  mkdir results_aggregated
  python3 merge_txt.py

4. Run /pubchem/clean_summaries.py, which cleans the merged .txt files:

  python3 clean_summaries.py

5. Train the Word2Vec or FastText incremental models:

  cd word2vec
  python3 train_yoy.py
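To make the idea of period-based aggregation and incremental training more concrete, here is a minimal sketch. It assumes that every period starts at the first year of the corpus and ends one year later than the previous period; the actual windows used by merge_txt.py and train_yoy.py may differ. The function names and the toy articles are hypothetical.

```python
from typing import List, Tuple


def cumulative_periods(first_year: int, last_year: int) -> List[Tuple[int, int]]:
    """Return (start, end) pairs that all begin at first_year and grow by one year."""
    if last_year < first_year:
        raise ValueError("last_year must not be earlier than first_year")
    return [(first_year, end) for end in range(first_year, last_year + 1)]


def articles_in_period(articles: List[Tuple[int, str]],
                       period: Tuple[int, int]) -> List[str]:
    """Select the texts whose publication year falls inside the period (inclusive)."""
    start, end = period
    return [text for year, text in articles if start <= year <= end]


# Toy corpus of (year, text) pairs; the texts are placeholders.
toy_articles = [(1963, "abstract one"), (1964, "abstract two"), (1965, "abstract three")]

for period in cumulative_periods(1963, 1965):
    texts = articles_in_period(toy_articles, period)
    print(period, len(texts), "articles")
```

Under this assumption, one model per period would be trained on the texts selected for that period, so each later model sees everything the earlier ones saw plus one more year.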

Streamlit web app

To complement this project, we developed two web applications using the Streamlit Python package. The Embeddings Viewer allows users to explore the vector space of our Word2Vec models by searching for specific tokens and analyzing their neighborhood, applying filters to refine the results if necessary.
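The kind of neighborhood search the Embeddings Viewer offers can be illustrated with a small, dependency-free sketch. This is not the app's implementation: the function names, the `allowed` filter, and the two-dimensional toy vectors are hypothetical. With a trained gensim model, a comparable lookup is usually done with `model.wv.most_similar(token, topn=...)`.

```python
import math
from typing import Dict, List, Optional, Sequence, Set, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_tokens(query: str, vectors: Dict[str, Sequence[float]],
                   top_n: int = 3,
                   allowed: Optional[Set[str]] = None) -> List[Tuple[str, float]]:
    """Rank tokens by cosine similarity to `query`, optionally keeping
    only the tokens listed in `allowed`."""
    if query not in vectors:
        raise KeyError(query)
    scored = [
        (token, cosine(vectors[query], vec))
        for token, vec in vectors.items()
        if token != query and (allowed is None or token in allowed)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy 2-dimensional vectors; real embeddings have hundreds of dimensions.
toy_vectors = {
    "aml": [1.0, 0.0],
    "token_a": [0.9, 0.1],
    "token_b": [0.0, 1.0],
    "token_c": [0.7, 0.7],
}

neighbors = nearest_tokens("aml", toy_vectors, top_n=2)
print([token for token, _ in neighbors])
```

Passing a set such as `allowed={"token_b", "token_c"}` restricts the ranking to those tokens, which mirrors the idea of applying filters to refine the results.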

Acknowledgements

This work was supported by the Brazilian agencies FAPESP (grant 2021/13054-8), CAPES, and CNPq. The authors thank Priscila Portela Costa for helping conceptualize this project. We also thank the Computer Science Department of the University of Sheffield for receiving Matheus during his research internship for this project.

Contact

Please do not hesitate to contact us through any of the links below.

Matheus Vargas Volpon Berto,
Computer Science B.Sc. student, Federal University of São Carlos (UFSCar), Sorocaba, Brazil.

References

"Unsupervised word embeddings capture latent knowledge from materials science literature", Nature 571, 95–98 (2019)

