ADD: Academic Disciplines Detector Based on Wikipedia

This repository contains code and evaluation results for the research paper “ADD: Academic Disciplines Detector Based on Wikipedia”. The purpose of the Academic Disciplines Detector (ADD) is to detect the academic disciplines defined in Wikipedia at a particular moment, in order to facilitate the timely detection of emerging or obsolete disciplines and to enable the study of their evolution. The sole purpose of this repository is to provide additional details on the paper.

Citing

A. Gjorgjevikj, K. Mishev and D. Trajanov, "ADD: Academic Disciplines Detector Based on Wikipedia," in IEEE Access, vol. 8, pp. 7005-7019, 2020.

Requirements

Pretrained Word Vectors and Models

The code available in this repository uses the following pretrained word embeddings and models:

FastText [2] word embeddings trained on Common Crawl (2 million word vectors).
InferSent [1] model (version 2), trained with fastText word embeddings.
Universal Sentence Encoder (USE) - Transformer [3] available in TensorFlow Hub.
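As a rough illustration of what the pretrained word vectors look like on disk, the sketch below reads the plain-text .vec format used by fastText: a header line with the vector count and dimension, followed by one word and its values per line. The function name load_vec and its limit parameter are hypothetical and are not part of this repository; the tiny in-memory sample stands in for the real Common Crawl file.

```python
import io


def load_vec(handle, limit=None):
    """Read fastText .vec text data from an open text handle.

    Hypothetical helper for illustration only; not part of this repository.
    """
    # Header line: "<number of vectors> <dimension>"
    _count, dim = map(int, handle.readline().split())
    vectors = {}
    for i, line in enumerate(handle):
        if limit is not None and i >= limit:
            break
        parts = line.rstrip("\n").split(" ")
        # Slice to `dim` values so a trailing space on the line is ignored.
        vectors[parts[0]] = [float(v) for v in parts[1:dim + 1]]
    return dim, vectors


# Tiny in-memory sample standing in for the real 2-million-vector file.
sample = io.StringIO("2 3\nscience 0.1 0.2 0.3\nphysics 0.4 0.5 0.6 \n")
dim, vectors = load_vec(sample)
print(dim, sorted(vectors))
```

Passing a small limit is a convenient way to experiment without loading all 2 million vectors into memory.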

Modules

To run each of the Academic Disciplines Detector (ADD) modules, see modules/demo.ipynb. The modules should be run in the specified order.

Text/Metadata Extractor

The Extractor class reads Wikipedia XML export files and produces JSON files containing the extracted metadata and text.

Usage

Copy the Wikipedia dump files into the data folder.
Run the code from modules/demo.ipynb.
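The general idea of turning a Wikipedia XML export into JSON records can be sketched as follows. This is a minimal illustration using only the Python standard library, not the repository's Extractor class; the function names iter_pages and local_name and the choice of output fields (title, id, text) are assumptions made for the example.

```python
import io
import json
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(xml_source):
    """Yield one dict per <page> element, streaming so large dumps need not fit in memory.

    Hypothetical helper; the fields kept here are an assumption.
    """
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        record = {"title": None, "id": None, "text": ""}
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                record["title"] = child.text
            elif name == "id" and record["id"] is None:
                record["id"] = child.text  # the first <id> is the page id
            elif name == "text":
                record["text"] = child.text or ""
        yield record
        elem.clear()  # release the subtree that was just processed


# Small in-memory sample standing in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Computer science</title>
    <ns>0</ns>
    <id>5323</id>
    <revision>
      <id>900000001</id>
      <text>Computer science is the study of computation.</text>
    </revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
print(json.dumps(pages, indent=2))
```

Streaming with iterparse and clearing each processed page keeps memory use bounded, which matters for full Wikipedia dumps.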

Basic Filter / Lead Section Excerpts Extractor

Filters out Wikipedia articles whose titles do not follow the patterns common to academic discipline titles, and extracts a short representative excerpt from the article’s lead section.

Usage

Make sure that Text/Metadata Extractor’s output files are present in the data folder.
Run the code from modules/demo.ipynb.
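The two steps of this module, title filtering and lead section excerpt extraction, could look roughly like the sketch below. The rejection patterns, the function names title_passes and lead_excerpt, and the two-sentence excerpt length are all hypothetical; the actual rules are defined by the code in the modules directory and described in the paper.

```python
import re

# Hypothetical rejection patterns, chosen only for illustration.
REJECT_TITLE = re.compile(
    r"^(List of|Outline of|Index of)\b|\(disambiguation\)$|\d"
)


def title_passes(title):
    """Return True if the title matches none of the hypothetical rejection patterns."""
    return REJECT_TITLE.search(title) is None


def lead_excerpt(lead_text, max_sentences=2):
    """Return the first few sentences of the lead section as a short excerpt."""
    # Naive sentence split: whitespace that follows '.', '!' or '?'.
    sentences = re.split(r"(?<=[.!?])\s+", lead_text.strip())
    return " ".join(sentences[:max_sentences])


titles = ["Computer science", "List of academic fields", "2019 in science"]
kept = [t for t in titles if title_passes(t)]
excerpt = lead_excerpt("A is B. C is D. E is F.")
print(kept, excerpt)
```

The naive sentence splitter will mis-split abbreviations such as "e.g."; a real pipeline would likely need something more careful.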

Text Classifier

Calculates the probability that a Wikipedia article is an academic discipline, using classifiers trained on the articles’ lead section excerpts.

Usage

Download the InferSent source code from its GitHub repository and specify the path to its models.py script in modules/text_classifier.py. InferSent is distributed by Facebook under the Creative Commons Attribution-NonCommercial 4.0 International Public License (for more information please see the InferSent GitHub repository).
Download the pretrained InferSent model (version 2) and copy it into the directory specified by the parameter MODELS_DIR.
Download the pretrained fastText word embeddings trained on Common Crawl (2 million word vectors) and copy them into the directory specified by the parameter WORD_VECTORS_DIR.
Make sure that Basic Filter’s output files are present in the data folder.
Run the code from modules/demo.ipynb. To reproduce the results without the option TextClassifierOptions.ALL, run the other three classifiers before running with the option TextClassifierOptions.ENSAMBLE.
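The ensemble option depends on the outputs of the other three classifiers, which is why they must run first. One common way to combine such outputs is soft voting, that is, averaging the per-classifier probabilities. The sketch below illustrates that idea only; whether the repository combines the classifiers this way is an assumption, and the function name and the classifier labels are hypothetical.

```python
def ensemble_probability(probabilities):
    """Average the per-classifier probabilities for one article (soft voting).

    Hypothetical helper; the repository's actual combining rule may differ.
    """
    if not probabilities:
        raise ValueError("at least one classifier probability is required")
    return sum(probabilities) / len(probabilities)


# Hypothetical per-classifier probabilities for a single article.
scores = {"classifier_a": 0.9, "classifier_b": 0.7, "classifier_c": 0.8}
combined = ensemble_probability(list(scores.values()))
print(round(combined, 3))
```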

Node Classifier

Calculates the probability that a candidate discipline is an academic discipline, using a classifier trained on centrality-based features of the disciplines graph.

Usage

Make sure that Text Classifier’s merged output CSV file is present in the data folder.
Run the code from modules/demo.ipynb. The final CSV file contains a probability score that a candidate Wikipedia article is an academic discipline.
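To give a sense of what a centrality-based feature is, the sketch below computes normalized degree centrality for a small undirected graph in pure Python. The specific centrality measures used by the Node Classifier are not listed here, so degree centrality is an assumed example, and the function name and the sample edges are hypothetical.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return normalized degree centrality for each node of an undirected graph.

    A node's score is its number of distinct neighbors divided by (n - 1),
    where n is the number of nodes. Hypothetical helper for illustration.
    """
    neighbors = defaultdict(set)
    for source, target in edges:
        if source == target:
            continue  # ignore self-loops
        neighbors[source].add(target)
        neighbors[target].add(source)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


# Hypothetical links between candidate disciplines.
edges = [
    ("Science", "Physics"),
    ("Science", "Chemistry"),
    ("Physics", "Chemistry"),
    ("Science", "Biology"),
]
centrality = degree_centrality(edges)
for node, score in sorted(centrality.items()):
    print(node, round(score, 3))
```

Feature values like these, computed per candidate discipline, are the kind of input a node classifier could consume alongside other graph measures.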

Evaluation

The test dataset errors made by the trained text classification and node classification models are available in the evaluation directory.

Text classifier: ensamble-classifier-test-errors.csv
Node classifier: node-classifier-test-errors.csv

Note

The textual content coming from Wikipedia dumps is available under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 3.0 License. For more information see the License information about Wikimedia dump downloads.

References

[1] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes, "Supervised learning of universal sentence representations from natural language inference data," 2017, arXiv:1705.02364. [Online]. Available: https://arxiv.org/abs/1705.02364

[2] T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, and A. Joulin, "Advances in pre-training distributed word representations," in Proc. Int. Conf. Lang. Resour. Eval. (LREC), 2018.

[3] D. Cer, Y. Yang, S.-Y. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, and C. Tar, "Universal sentence encoder," 2018, arXiv:1803.11175. [Online]. Available: https://arxiv.org/abs/1803.11175
