Quickstart

You will need Maven 3 and Java 8 to build the project. The other dependencies, such as UIMA, uimaFIT, StanfordNLP and OpenNLP, are handled by Maven. You can find the list of dependencies in pom.xml.

Clone the repository: git clone https://github.com/daimrod/csa.git
Retrieve the PubMed Open Access Corpus and the accompanying CSV file list:
- ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.A-B.tar.gz
- ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.C-H.tar.gz
- ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.I-N.tar.gz
- ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.O-Z.tar.gz
- ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv
Extract the corpus into a directory of your choice; we will assume it is ~/corpus.
Go to the project directory and build it with Maven: mvn -Ddev=true package
Create a file annotator.conf with the following information:

inputDirectory
the directory containing the PubMed Corpus

outputDirectory
the directory used to store the results

listArticlesFilename
a file containing the name of the articles to read

mappingFilename
a file describing the patterns used (more on this later)

windowSize
the size of the citation context window
Run the Annotator: java -cp target/csa-1.0-SNAPSHOT.jar jgreg.internship.nii.WF.AnnotatorWF -config annotator.conf
Create a file statistic.conf with the following information:

inputDirectory
the directory containing the results previously computed

mappingFilename
same as before

outputFile
the file containing the extracted statistics

infoFile
the file containing some additional information
Run the Statistic module: java -cp target/csa-1.0-SNAPSHOT.jar jgreg.internship.nii.WF.StatisticsWF -config statistic.conf
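
The download and extraction steps above can be scripted. The following is a minimal sketch, assuming wget and tar are installed and that the corpus goes to ~/corpus. CORPUS_DIR, DRY_RUN, run and fetch_corpus are names made up for this example and are not part of the project. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: CORPUS_DIR, DRY_RUN, run and fetch_corpus are illustrative names.
CORPUS_DIR="${CORPUS_DIR:-$HOME/corpus}"
DRY_RUN="${DRY_RUN:-1}"          # set DRY_RUN=0 to really download and extract
BASE_URL="ftp://ftp.ncbi.nlm.nih.gov/pub/pmc"
ARCHIVES="articles.A-B.tar.gz articles.C-H.tar.gz articles.I-N.tar.gz articles.O-Z.tar.gz"

# run CMD...: print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Download the four archives listed in the Quickstart, extract each one,
# then download the CSV file list (which is not an archive).
fetch_corpus() {
    run mkdir -p "$CORPUS_DIR"
    for archive in $ARCHIVES; do
        run wget -c -P "$CORPUS_DIR" "$BASE_URL/$archive"
        run tar -xzf "$CORPUS_DIR/$archive" -C "$CORPUS_DIR"
    done
    run wget -c -P "$CORPUS_DIR" "$BASE_URL/file_list.csv"
}

fetch_corpus
```

Running it with DRY_RUN=0 performs the actual download; the archives are large, so the -c option lets wget resume an interrupted transfer.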

File formats

Configuration files

Here is an example of the annotator.conf file:

inputDirectory = ~/corpus/
outputDirectory = ~/workspace/output/
listArticlesFilename = ~/workspace/mylist.txt
mappingFilename = ~/workspace/hs-mapping.lst
windowSize = 1

The statistic.conf file has the exact same syntax:

inputDirectory = ~/workspace/output/
mappingFilename = ~/workspace/hs-mapping.lst
outputFile = ~/workspace/output/all-out.dat
infoFile = ~/workspace/output/info.dat
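
Because both files use the same key = value syntax, a missing key is easy to catch before launching a long run. Below is a minimal sketch, assuming each key starts its own line and is followed by an equals sign. check_conf is a helper name invented for this example; the project itself may validate its configuration differently.

```shell
#!/bin/sh
# check_conf FILE KEY...: report every KEY that has no "KEY = value" line in FILE.
# Returns 0 when all keys are present, 1 otherwise. (Hypothetical helper.)
check_conf() {
    conf_file="$1"
    shift
    rc=0
    for key in "$@"; do
        # Allow optional whitespace before the key and around the "=".
        if ! grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "$conf_file"; then
            echo "missing key in $conf_file: $key" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

For the Annotator, the call would be check_conf annotator.conf inputDirectory outputDirectory listArticlesFilename mappingFilename windowSize; for the Statistic module, check_conf statistic.conf inputDirectory mappingFilename outputFile infoFile.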

Mapping file

The mapping file describes the order of the annotations in the results and where to find the cue phrases for each annotation.

order = negative neutral positive

# Sentiment cue phrases
negative = ~/workspace/negative.pat
neutral = ~/workspace/neutral.pat
positive = ~/workspace/positive.pat
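
Every label listed on the order line needs its own line pointing to a pattern file. The sketch below checks that, assuming the layout shown above; lines starting with # never match the check. check_mapping is a hypothetical helper, not something the project ships.

```shell
#!/bin/sh
# check_mapping FILE: verify that every label named on the "order" line
# also has a "label = path" line in FILE. (Hypothetical helper.)
check_mapping() {
    mapping_file="$1"
    # Keep only the text after "order =" and let the shell split it on whitespace.
    labels="$(sed -n 's/^[[:space:]]*order[[:space:]]*=[[:space:]]*//p' "$mapping_file")"
    if [ -z "$labels" ]; then
        echo "no order line in $mapping_file" >&2
        return 1
    fi
    rc=0
    for label in $labels; do
        if ! grep -Eq "^[[:space:]]*${label}[[:space:]]*=" "$mapping_file"; then
            echo "label '$label' has no pattern file in $mapping_file" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

With the example above, check_mapping ~/workspace/hs-mapping.lst succeeds; removing the neutral line would make it fail and name the missing label.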

Pattern files

The Annotator module uses the Stanford NLP Token Sequence Matcher to match cue phrases. The accepted syntax is described in the Stanford TokensRegex documentation.

The pattern files must have one pattern per line. Here are some examples of accepted patterns:

good
/state-of-the-art/
{ tag:"NN" } achieve
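
Patterns written as regular expressions may contain characters such as $ or backslashes that the shell would expand, so when a pattern file is generated from a script a quoted here-document is the safest choice. A small sketch, reusing the file name positive.pat from the mapping example; PATTERN_DIR is a variable introduced only for this illustration and defaults to a scratch directory so nothing is overwritten.

```shell
#!/bin/sh
# PATTERN_DIR is an illustrative variable; point it at your own workspace.
PATTERN_DIR="${PATTERN_DIR:-$(mktemp -d)}"
pattern_file="$PATTERN_DIR/positive.pat"

# The quoted delimiter ('EOF') keeps every character literal:
# no variable expansion, no backslash processing.
cat > "$pattern_file" <<'EOF'
good
/state-of-the-art/
{ tag:"NN" } achieve
EOF

# One pattern per line: count the non-empty lines. Blank lines are flagged
# as a precaution; whether the Annotator tolerates them is not stated.
pattern_count="$(grep -c . "$pattern_file")"
if grep -q '^[[:space:]]*$' "$pattern_file"; then
    echo "blank line found in $pattern_file" >&2
else
    echo "$pattern_count patterns written to $pattern_file"
fi
```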

How to run it in parallel?

You can dispatch the processing across N processes by splitting the list of articles into N chunks (e.g. using the split(1) command) and using the GNU Parallel tool.

For example, to use 20 cores:

split -n l/20 path/to/listArticlesFilename list-
ls list-* | parallel --halt 2 \
                     java -cp target/csa-1.0-SNAPSHOT.jar \
                     jgreg.internship.nii.WF.AnnotatorWF \
                     -config annotator.conf \
                     -listArticlesFilename {}

The split(1) command splits the file listArticlesFilename into 20 files prefixed with "list-". The parallel(1) command then runs one java process per chunk, overriding the listArticlesFilename parameter from the configuration file with a command-line parameter.
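
Before launching the jobs it can be worth confirming that the chunks together still contain every article exactly once. Here is a minimal sketch on a throwaway list, assuming GNU coreutils (the -n l/N option of split is a GNU extension). The article names and the chunk count of 4 are illustrative only.

```shell
#!/bin/sh
# Work in a scratch directory so no real list is touched.
work_dir="$(mktemp -d)"
cd "$work_dir" || exit 1

# Build a sample list of 10 made-up article names.
i=1
while [ "$i" -le 10 ]; do
    echo "article-$i.nxml"
    i=$((i + 1))
done > mylist.txt

# l/4 = four chunks, never breaking a line in two.
split -n l/4 mylist.txt list-

chunk_count="$(ls list-* | wc -l)"
# Concatenating the chunks in name order must reproduce the original list.
if cat list-* | cmp -s - mylist.txt; then
    chunks_ok=yes
else
    chunks_ok=no
fi
echo "chunks: $chunk_count, complete: $chunks_ok"
```

The same cmp check works on the real list: run it in the directory holding the list-* files and compare against the original listArticlesFilename.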
