publication-domain-discernibility

Statistical analysis of word counts to check how well publication domains can be told apart.

Instructions

Dataset Gathering

First, find publications using the Microsoft Academic Knowledge API. The discover package is responsible for this step.

Specify the domains of interest in config.json:

{
  ...
  "DOMAINS": [
    "cancer",
    "another_domain"
  ]
}

Find your Microsoft Academic Knowledge API key here. Copy Key 1 and use it in the next step.
Discover relevant publications with the discover package. Request roughly 5 times as many papers as you actually want to download, because not all papers are downloadable.

python -m discover --api-key <your-copied-key> --count <count-of-papers>

The discovery files are written to the data/pubs directory, named <domain_name>.json.

Download the publications as PDF files with the download module. Here you provide the actual number of papers to download.

python -m download --count <count-of-papers>

The downloaded publications are stored in the data/pubs/<domain-name> directory, named <publication-title>.pdf.

Convert the downloaded publications to TXT format with the convert module.

python -m convert

The converted publications are stored in the data/pubs/<domain-name> directory, named <publication-title>.txt.
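
The file layout described in the steps above can be sketched in Python as follows. This is only an illustration, not the code of the discover, download, or convert modules: the helper names (parse_domains, discovery_count, expected_paths, converted_texts) are hypothetical, and the factor of 5 is taken from the instruction about requesting more papers than you need.

```python
import json
from pathlib import Path

# From the instruction: request about 5x the papers you actually want.
OVERSAMPLING_FACTOR = 5


def parse_domains(config_text):
    """Return the list stored under the "DOMAINS" key of a config.json document."""
    config = json.loads(config_text)
    return list(config.get("DOMAINS", []))


def discovery_count(wanted):
    """Number of papers to pass to discover for `wanted` downloadable papers."""
    return wanted * OVERSAMPLING_FACTOR


def expected_paths(domain, data_dir="data/pubs"):
    """Locations where the steps above place their outputs for one domain."""
    root = Path(data_dir)
    return {
        "discovery": root / f"{domain}.json",  # written by discover
        "papers": root / domain,               # PDFs and converted TXT files
    }


def converted_texts(domain, data_dir="data/pubs"):
    """List the TXT files produced by the convert step for one domain."""
    return sorted((Path(data_dir) / domain).glob("*.txt"))


if __name__ == "__main__":
    domains = parse_domains('{"DOMAINS": ["cancer", "another_domain"]}')
    for domain in domains:
        print(domain, expected_paths(domain)["discovery"])
    print("papers to discover for 20 downloads:", discovery_count(20))
```

Listing the converted files this way may be a convenient check that each domain ended up with enough text before moving on to feature extraction.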

Technical Feasibility Check

Feasibility was roughly checked during the topic selection classes. The proposed flow is as follows:

Specify the domains to gather papers for
Use the Microsoft Academic Knowledge API to find publications for each domain
Download the found publications
Convert the PDFs to TXT
Use TFIDF embedding to produce paper features
Use ANOVA or a nonparametric alternative to check which words make a difference
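
The last step of the flow can be illustrated with a minimal, pure-Python sketch of the one-way ANOVA F statistic for a single word. It is not the project's implementation: the function name anova_f_statistic and the example scores are hypothetical. In practice, scipy.stats.f_oneway (ANOVA) or scipy.stats.kruskal (Kruskal-Wallis, a nonparametric alternative) would likely be used instead.

```python
from statistics import fmean


def anova_f_statistic(groups):
    """One-way ANOVA F statistic.

    `groups` holds one list of values per domain, for example the TFIDF
    scores of a single word in every paper of that domain.
    Raises ZeroDivisionError if there is no within-group variation.
    """
    k = len(groups)
    n_total = sum(len(group) for group in groups)
    grand_mean = fmean(value for group in groups for value in group)
    group_means = [fmean(group) for group in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Variation of individual values around their own group mean.
    ss_within = sum(
        (value - mean) ** 2
        for group, mean in zip(groups, group_means)
        for value in group
    )
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


if __name__ == "__main__":
    # Hypothetical TFIDF scores of one word in three domains.
    scores_by_domain = [
        [0.10, 0.12, 0.11],
        [0.02, 0.03, 0.01],
        [0.05, 0.04, 0.06],
    ]
    print(anova_f_statistic(scores_by_domain))
```

A large F value suggests that the word's score differs between domains more than it varies within them, which is the sense in which a word "makes a difference".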

Microsoft Academic Knowledge API

The API flow is simple:

Select a domain by constructing a query expression with the interpret endpoint.
Use the evaluate endpoint with the resulting query expression to find papers in the domain.
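
The two calls above can be sketched by building the request URLs without sending them. This is a sketch under assumptions rather than the code of the discover package: the base URL, the example expression, and the helper names are assumptions and should be checked against the API documentation.

```python
from urllib.parse import urlencode

# Assumption: base URL of the Academic Knowledge API; verify before use.
BASE_URL = "https://api.labs.cognitive.microsoft.com/academic/v1.0"


def interpret_url(query, count=1):
    """URL asking the interpret endpoint to turn free text into query expressions."""
    params = {"query": query, "complete": 0, "count": count}
    return f"{BASE_URL}/interpret?{urlencode(params)}"


def evaluate_url(expr, count=10, attributes="Id,Ti,Y"):
    """URL asking the evaluate endpoint for papers matching a query expression."""
    params = {"expr": expr, "count": count, "attributes": attributes}
    return f"{BASE_URL}/evaluate?{urlencode(params)}"


def auth_headers(api_key):
    """Header that carries the subscription key (Key 1 from the instruction)."""
    return {"Ocp-Apim-Subscription-Key": api_key}


if __name__ == "__main__":
    print(interpret_url("cancer"))
    # Assumed example of an expression selecting a field of study by name.
    print(evaluate_url("Composite(F.FN=='cancer')", count=50))
```

Keeping URL construction separate from the HTTP call makes it possible to inspect the requests before spending any API quota.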

Several links may be useful:

Conversion of PDF to TXT

The pdftotext package is available for Python 2 and 3.

pip
GitHub
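
A minimal sketch of converting one PDF with pdftotext is shown below. It is not the repository's convert module: the helper names txt_path_for and convert_pdf are hypothetical, and the output naming simply follows the <publication-title>.txt convention from the instruction.

```python
from pathlib import Path


def txt_path_for(pdf_path):
    """Path of the TXT file stored next to the given PDF."""
    return Path(pdf_path).with_suffix(".txt")


def convert_pdf(pdf_path):
    """Extract the text of one PDF and save it next to the source file."""
    # Imported lazily so that txt_path_for works without pdftotext installed.
    import pdftotext

    with open(pdf_path, "rb") as pdf_file:
        pages = pdftotext.PDF(pdf_file)
        # A PDF object yields one string per page.
        text = "\n\n".join(pages)

    target = txt_path_for(pdf_path)
    target.write_text(text, encoding="utf-8")
    return target


if __name__ == "__main__":
    print(txt_path_for("data/pubs/cancer/example.pdf"))
```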

Extraction of TFIDF Text Features

The scikit-learn package provides a Python implementation. You can check it here. It exposes several parameters to tune, and understanding them may be key to success. Here you can find theoretical background.
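
The snippet below sketches how scikit-learn's TfidfVectorizer might be applied and which parameters are worth experimenting with. The documents and the parameter values are illustrative assumptions, not the settings used in this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny stand-in corpus; in the project each document would be a converted paper.
docs = [
    "tumor cells respond to chemotherapy",
    "chemotherapy reduces tumor growth",
    "neural networks learn image features",
]

vectorizer = TfidfVectorizer(
    lowercase=True,        # fold case before tokenizing
    stop_words="english",  # drop common English words such as "to"
    min_df=1,              # keep words that appear in at least 1 document
    max_df=0.9,            # drop words present in more than 90% of documents
    sublinear_tf=True,     # use 1 + log(tf) instead of the raw term count
)

# Sparse matrix with one row per document and one column per retained word.
features = vectorizer.fit_transform(docs)

# Maps each retained word to its column index in `features`.
vocabulary = vectorizer.vocabulary_

if __name__ == "__main__":
    print(features.shape)
    print(sorted(vocabulary))
```

Raising min_df or lowering max_df shrinks the vocabulary, which may matter when every remaining word is later tested for differences between domains.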

gmrukwa/publication-domain-discernibility
