Analysis of publication domain by statistical analysis of word counts.

gmrukwa/publication-domain-discernibility

publication-domain-discernibility

Instructions

Dataset Gathering

First, publications need to be found using the Microsoft Academic Knowledge API. This is the responsibility of the discover package.

  1. Specify the domains of interest in config.json:
{
  ...
  "DOMAINS": [
    "cancer",
    "another_domain"
  ]
}
  2. Obtain your Microsoft Academic Knowledge API key. Copy Key 1 and use it in the next step.
  3. Discover relevant publications with the discover package. Request about 5 times as many papers as you actually want to download, since not all papers are downloadable.
python -m discover --api-key <your-copied-key> --count <count-of-papers>

You can find the discovery files in the data/pubs directory, named <domain_name>.json.

  4. Download publications as PDF files via the download module. Here you provide the actual number of papers to download.
python -m download --count <count-of-papers>

You can find the downloaded publications in the data/pubs/<domain-name> directory, named <publication-title>.pdf.

  5. Convert the downloaded publications to TXT format via the convert module.
python -m convert

You can find the converted publications in the data/pubs/<domain-name> directory, named <publication-title>.txt.
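The three commands above can also be chained from a small driver script. The following is only a sketch, not part of the repository: the helper names `build_pipeline_commands` and `run_pipeline` and the `OVERSAMPLING` constant are made up for illustration, and the factor of 5 simply restates the rule of thumb from the discovery step.

```python
import subprocess
import sys

# Rule of thumb from the discovery step: ask for ~5x the papers you want,
# because not every discovered paper is downloadable.
OVERSAMPLING = 5


def build_pipeline_commands(api_key, wanted_papers, python=sys.executable):
    """Return the argv lists for the discover, download and convert steps."""
    discover_count = wanted_papers * OVERSAMPLING
    return [
        [python, "-m", "discover", "--api-key", api_key,
         "--count", str(discover_count)],
        [python, "-m", "download", "--count", str(wanted_papers)],
        [python, "-m", "convert"],
    ]


def run_pipeline(api_key, wanted_papers):
    """Run the three steps in order, stopping at the first failure."""
    for command in build_pipeline_commands(api_key, wanted_papers):
        subprocess.run(command, check=True)
```

Calling `run_pipeline("<your-copied-key>", 20)` from the repository root would discover 100 papers and then try to download 20 of them, assuming the modules behave as described above.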

Technical Feasibility Check

Feasibility was roughly checked during the topic selection classes. The proposed flow is as follows:

  1. Specify domains to gather papers for
  2. Use Microsoft Academic Knowledge API to find publications for a domain
  3. Download found publications
  4. Convert PDFs to TXT
  5. Use TF-IDF embedding to produce paper features
  6. Use ANOVA or a nonparametric alternative to check which words differ between domains
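To make the last step concrete, here is a minimal pure-Python sketch of ranking words by a one-way ANOVA F statistic. It assumes each paper is already represented as a `{word: weight}` dictionary grouped by domain; that input layout and the function names `anova_f` and `rank_words` are assumptions for illustration, not the repository's code. It computes only the F statistic, not a p-value.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of numeric samples."""
    k = len(groups)                       # number of groups (domains)
    n = sum(len(g) for g in groups)       # total number of samples (papers)
    grand_mean = mean([x for g in groups for x in g])
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    if ss_within == 0:
        # No variation inside the groups: F is undefined or unbounded.
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_words(features_by_domain):
    """Rank words by F statistic, most discriminative first.

    features_by_domain maps a domain name to a list of {word: weight}
    dictionaries, one per paper. A missing word counts as weight 0.
    """
    vocabulary = {word
                  for papers in features_by_domain.values()
                  for paper in papers
                  for word in paper}
    scores = {}
    for word in vocabulary:
        groups = [[paper.get(word, 0.0) for paper in papers]
                  for papers in features_by_domain.values()]
        scores[word] = anova_f(groups)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

For p-values or the nonparametric route, `scipy.stats.f_oneway` and `scipy.stats.kruskal` cover the same ground; since one test is run per word, some correction for multiple comparisons is probably worth considering.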

Microsoft Academic Knowledge API

The API flow is simple:

  1. Select the domain by constructing a query expression with the interpret endpoint.
  2. Use the evaluate endpoint with the resulting query expression to find papers in the domain.
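The two-step flow can be sketched as plain URL construction. Everything specific below is an assumption to verify against your own API subscription: the base URL, the parameter names (`query`, `expr`, `count`, `attributes`), the attribute codes `Ti` and `Y`, and the subscription-key header name. The helper names are hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode

# Assumed base URL of the Academic Knowledge API; check your subscription.
BASE_URL = "https://api.labs.cognitive.microsoft.com/academic/v1.0"


def interpret_url(domain, count=1):
    """Step 1: URL asking the interpret endpoint for a query expression."""
    return f"{BASE_URL}/interpret?" + urlencode({"query": domain, "count": count})


def evaluate_url(expression, count=10, attributes="Ti,Y"):
    """Step 2: URL asking the evaluate endpoint for papers matching the expression."""
    params = {"expr": expression, "count": count, "attributes": attributes}
    return f"{BASE_URL}/evaluate?" + urlencode(params)


def auth_headers(api_key):
    """Request headers carrying the API key (assumed header name)."""
    return {"Ocp-Apim-Subscription-Key": api_key}
```

The expression returned by the interpret step is passed unchanged into `evaluate_url`; `urlencode` takes care of escaping the quotes and parentheses such expressions usually contain.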

Conversion of PDF to TXT

The pdftotext package is available for both Python 2 and Python 3.
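A minimal conversion sketch with pdftotext might look as follows. The helper names `txt_path_for` and `convert_pdf` are hypothetical and the repository's convert module may work differently; the sketch only assumes that the .txt file should land next to the .pdf, as described in the instructions.

```python
from pathlib import Path


def txt_path_for(pdf_path):
    """Map <publication-title>.pdf to <publication-title>.txt in the same directory."""
    return Path(pdf_path).with_suffix(".txt")


def convert_pdf(pdf_path):
    """Extract the text of one PDF and write it next to the source file."""
    import pdftotext  # third-party package, imported lazily

    with open(pdf_path, "rb") as pdf_file:
        pages = pdftotext.PDF(pdf_file)   # iterable of per-page strings
        text = "\n\n".join(pages)
    target = txt_path_for(pdf_path)
    target.write_text(text, encoding="utf-8")
    return target
```

Scanned PDFs without a text layer will most likely come out empty, so it may be worth skipping files whose extracted text is very short.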

Extraction of TFIDF Text Features

An implementation is available in the Python package scikit-learn. It exposes several parameters to tune, and understanding them may be key to success.
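To show what the features actually are, here is a small pure-Python sketch of TF-IDF. It uses raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` and L2 normalization, which to the best of my understanding mirrors the defaults of scikit-learn's `TfidfVectorizer` (`smooth_idf=True`, `norm="l2"`, `sublinear_tf=False`). Tokenization is assumed to have happened already, and the function name `tfidf` is illustrative only.

```python
import math
from collections import Counter


def tfidf(documents):
    """Compute TF-IDF weights.

    documents is a list of token lists; the result is one
    {term: weight} dictionary per document, L2-normalized.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    # Smoothed idf, so that terms present in every document still get weight 1.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1
           for term, df in doc_freq.items()}
    vectors = []
    for doc in documents:
        counts = Counter(doc)             # raw term frequency
        weights = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors
```

In practice the parameters most worth experimenting with in scikit-learn are probably `min_df`, `max_df` and `stop_words`, since they decide which words enter the vocabulary at all.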
