GitHub - Geekiac/Masters-Dissertation-Code: Using information extraction and machine learning to determine "Further Work" in Artificial Intelligence research papers

Repository contents (3 commits):

dataframe/
plots/
xml/
AcquireData.py
AnalyseCNNResults.py
AnalyseNBandSVMResults.py
AnalyseSentences.py
CNNTests.py
CleanUpData.py
ConvertPdfToXmlFiles.py
CreateXmlDataSet.py
GetArxivMetaData.py
GetPdfFiles.py
IMAT5314-Further Work Project.v9.pdf
Logging.py
Metrics.py
NBandSVMTests.py
PostAnnotationCleanup.py
Readme.txt
Regular expression searchs.txt
cnn_results.pickle
conclusions_dataframe.pickle
conclusions_with_fw.xml
conclusions_with_fw_15_lines_or_less.xml
conclusions_with_fw_15_lines_or_less_post_cleanup.xml
nb_and_svm_results.pickle
python_libraries_installed.txt
search_results_000.xml


See IMAT5314-Further Work Project.v9.pdf for an overview of this project.

PLEASE NOTE
===========

This minimal zip had to be smaller than 40MB, so some files have been removed.
The following files need to be downloaded separately:

The CERMINE jar file can be obtained from https://github.com/CeON/CERMINE and should be put into the folder structure as shown below:

./cermine-impl-1.13-jar-with-dependencies.jar

The Stanford NLP POS Tagger can be obtained from https://nlp.stanford.edu/software/tagger.shtml and the files should be put into the folder structure as shown below:

./stanford-postagger.jar
./models/english-bidirectional-distsim.tagger
./models/english-bidirectional-distsim.tagger.props
./models/english-left3words-distsim.tagger
./models/english-left3words-distsim.tagger.props
./models/README-Models.txt
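
Before running anything, it can help to confirm that the downloads listed above are in place. The following is a minimal sketch, not part of the project: the file names are copied from the lists above, while the helper name missing_files and the idea of checking from the project root are assumptions made for illustration.

```python
from pathlib import Path

# File names copied from the lists above; the check itself is a
# hypothetical helper and is not one of the project scripts.
REQUIRED_FILES = [
    "cermine-impl-1.13-jar-with-dependencies.jar",
    "stanford-postagger.jar",
    "models/english-bidirectional-distsim.tagger",
    "models/english-bidirectional-distsim.tagger.props",
    "models/english-left3words-distsim.tagger",
    "models/english-left3words-distsim.tagger.props",
    "models/README-Models.txt",
]


def missing_files(root="."):
    """Return the required files that are not present under `root`."""
    base = Path(root)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing downloads:")
        for name in missing:
            print("  ./" + name)
    else:
        print("All external files are in place.")
```

Run it from the project root; an empty result means every external file was found.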


Executable Scripts
==================
All of these scripts can be executed without parameters:

e.g. python GetArxivMetaData.py

PLEASE NOTE: Most of these scripts will find no work to do, because the files
they generate already exist!

01. GetArxivMetaData.py
02. GetPdfFiles.py
03. ConvertPdfToXmlFiles.py
04. CreateXmlDataSet.py
05. Metrics.py
06. PostAnnotationCleanup.py
07. AnalyseSentences.py
08. NBandSVMTests.py
09. NBandSVMTests.py
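
The scripts above can also be run one after another from a small driver. This is only a sketch under two assumptions: that the order of the list is the intended execution order, and that each script is started from the project root without parameters, as described above. The names PIPELINE and run_pipeline are hypothetical and do not exist in the project.

```python
import subprocess
import sys

# Script names in the order they are listed above.
PIPELINE = [
    "GetArxivMetaData.py",
    "GetPdfFiles.py",
    "ConvertPdfToXmlFiles.py",
    "CreateXmlDataSet.py",
    "Metrics.py",
    "PostAnnotationCleanup.py",
    "AnalyseSentences.py",
    "NBandSVMTests.py",
    "CNNTests.py",
    "AnalyseNBandSVMResults.py",
    "AnalyseCNNResults.py",
]


def run_pipeline(scripts=PIPELINE, python=sys.executable, dry_run=False):
    """Run each script in order and return the commands that were built.

    With dry_run=True the commands are only printed. Otherwise
    check=True makes the run stop at the first script that fails.
    """
    commands = [[python, script] for script in scripts]
    for command in commands:
        print("Running:", " ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # Dry run by default, so nothing is executed by accident.
    run_pipeline(dry_run=True)
```

Call run_pipeline() without dry_run=True to execute the scripts for real.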

Supporting Files containing library functions
=============================================
01. AcquireData.py
02. CleanUpData.py
03. Logging.py

arXiv meta-data and conclusions xml files
=========================================
01. search_results_000.xml
02. conclusions_with_fw.xml
03. conclusions_with_fw_15_lines_or_less.xml
04. conclusions_with_fw_15_lines_or_less_post_cleanup.xml

Serialized Pandas DataFrames
============================
01. conclusions_dataframe.pickle
02. cnn_results.pickle
03. nb_and_svm_results.pickle
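
The pickle files above can be inspected without running any of the project scripts. The sketch below uses the standard pickle module; load_pickles is a hypothetical helper, and it assumes pandas is installed, since unpickling a DataFrame needs it. pandas.read_pickle(path) is an alternative for a single file. Only unpickle files you trust, because loading a pickle can execute arbitrary code.

```python
import pickle
from pathlib import Path

# File names copied from the list above.
PICKLE_FILES = [
    "conclusions_dataframe.pickle",
    "cnn_results.pickle",
    "nb_and_svm_results.pickle",
]


def load_pickles(folder=".", names=PICKLE_FILES):
    """Load every listed pickle that exists in `folder`, keyed by file name."""
    loaded = {}
    for name in names:
        path = Path(folder) / name
        if path.is_file():
            with path.open("rb") as handle:
                # Unpickling a DataFrame requires pandas to be importable.
                loaded[name] = pickle.load(handle)
    return loaded


if __name__ == "__main__":
    for name, obj in load_pickles().items():
        # DataFrames expose .shape; other objects print an empty string.
        print(name, type(obj).__name__, getattr(obj, "shape", ""))
```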

CERMINE application
===================
cermine-impl-1.13-jar-with-dependencies.jar
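
For reference, the jar can be run directly on a folder of PDFs. The sketch below only builds the command line; it follows the usage given in the CERMINE project's own README (the ContentExtractor class with the -path option) and assumes that Java is on the PATH. How ConvertPdfToXmlFiles.py actually calls CERMINE may differ, and cermine_command is a hypothetical helper.

```python
import shlex

# Jar name as listed above.
CERMINE_JAR = "cermine-impl-1.13-jar-with-dependencies.jar"


def cermine_command(pdf_dir, jar=CERMINE_JAR):
    """Build the java command that runs CERMINE over every PDF in `pdf_dir`."""
    return [
        "java",
        "-cp", jar,
        "pl.edu.icm.cermine.ContentExtractor",
        "-path", pdf_dir,
    ]


if __name__ == "__main__":
    # Print the command only; pass the list to subprocess.run to execute it.
    print(shlex.join(cermine_command("./pdf/")))
```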

Stanford NLP POS tagger
=======================
stanford-postagger.jar

./dataframe/ - Contains a backup of the DataFrames in pickle format
./pdf/ - Contains the PDF and XML files used in the experiments
./logs/ - Contains the TensorFlow logs generated during the CNN tests
./models/ - Contains the POS tagging models for stanford-postagger.jar
./plots/ - Contains the plots generated whilst analysing the results
./xml/ - Contains a backup of the generated XML files