Batch script for Named Entity Recognition

This is a single command-line script that will call the Stanford Named Entity Recognizer on each text file in a folder, count unique entities, and print the results into a spreadsheet.

The final spreadsheet (called entities.csv) will list the text filename, the entity recognized, the type of entity (organization, location, or person), and the number of times that entity occurred as that type within the document.

Note that the same word can be tagged as more than one type of entity within a document; each entity/type pair is counted separately.
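
To make the counting step concrete, here is a minimal sketch. It is not the code in batchner.sh. It assumes the tagger emits inline slash-tagged text such as Paris/LOCATION, and it counts single tokens rather than merging multi-word names. The function name count_entities is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of batchner.sh): count tagged tokens.
# Reads slash-tagged text on stdin, e.g. "Ada/PERSON visited/O Paris/LOCATION",
# and prints one CSV row per entity/type pair: filename,entity,type,count
count_entities() {
    filename=$1
    # One token per line, keep only the three entity types, then count duplicates.
    tr -s ' \t' '\n\n' |
        grep -E '/(PERSON|LOCATION|ORGANIZATION)$' |
        sort |
        uniq -c |
        awk -v file="$filename" '{
            # $1 is the count from uniq -c, $2 is "token/TYPE"
            n = split($2, parts, "/")
            type = parts[n]
            entity = substr($2, 1, length($2) - length(type) - 1)
            printf "%s,%s,%s,%d\n", file, entity, type, $1
        }'
}

# Usage: the same word tagged two ways yields two separate rows.
printf 'Washington/PERSON visited/O Washington/LOCATION\n' | count_entities demo.txt
```

Because the type is part of what gets counted, a word tagged both as a person and as a location ends up on two rows, each with its own count.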

Requirements

This script only works on text (.txt) files, but it will work on as many text files as you'd like without any further interaction on your part.

You will need to download the Stanford Named Entity Recognizer and the scripts in this repository.

Folder Setup

As is, the script will run Stanford NER on every text (.txt) file within a folder. It expects the text files and the batchner.sh script to be in the same folder, and the NER folder (as of this writing, stanford-ner-2018-10-16) to sit next to that folder, in the same parent directory.

(Note: you do not have to change the names of the .txt files—the filenames below are just for demonstration)

├──🗂 stanford-ner-2018-10-16
├──🗂 project folder
|   └──batchner.sh
|   └──batchner_markup.sh
|   └──file1.txt
|   └──file2.txt
|   └──file3.txt
|   └──file4.txt
|   └──file5.txt
|   └──file6.txt
|   └──etc.

If you're familiar with shell scripting and file navigation, you can fairly easily restructure this.
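
One way to restructure is to make both locations configurable. The sketch below is an assumption-laden illustration, not the contents of batchner.sh. The names NER_DIR, TEXT_DIR, list_text_files, and tag_file are hypothetical. The Java invocation is modeled on the ner.sh launcher bundled with Stanford NER and uses the -mx600m setting mentioned under Windows below. The ":" classpath separator is for Mac and Linux; Java on Windows expects ";".

```shell
#!/bin/sh
# Sketch only: configurable locations instead of a fixed folder layout.
# Defaults mirror the tree above; override them from the environment.
NER_DIR=${NER_DIR:-../stanford-ner-2018-10-16}
TEXT_DIR=${TEXT_DIR:-.}

# Print every .txt file in a folder, one per line (nothing if there are none).
list_text_files() {
    for f in "$1"/*.txt; do
        # An unmatched glob stays literal, so skip names that do not exist.
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}

# Tag one file with the 3-class English model (needs Java and the NER download).
# Defined here but not called, so loading this sketch runs nothing heavy.
tag_file() {
    java -mx600m -cp "$NER_DIR/stanford-ner.jar:$NER_DIR/lib/*" \
        edu.stanford.nlp.ie.crf.CRFClassifier \
        -loadClassifier "$NER_DIR/classifiers/english.all.3class.distsim.crf.ser.gz" \
        -textFile "$1"
}

# Possible usage:
#   list_text_files "$TEXT_DIR" | while IFS= read -r f; do tag_file "$f"; done
```

With this layout, running something like NER_DIR=/path/to/ner TEXT_DIR=/path/to/texts before the loop would point the sketch at folders stored anywhere.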

Running the Script

Mac OS X

Once all of your files are properly arranged as above:

Open Terminal.
Navigate to the folder containing these files using cd. For example, if you have a folder 'project' on the Desktop, type (without the $) $ cd Desktop/project
Type $ sh batchner.sh. This will take a little while to run (4-5 files will likely take about a minute) and will write all of the results to a file called entities.csv in the same folder.
Type $ sh batchner_markup.sh. This will take longer to run and will write two .txt files, one for all files where people are found and one for all files where places are found, marked up with /PERSON and /LOCATION respectively.
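
Once entities.csv exists, a quick look at the most frequent entities can be had from the shell. This is a minimal sketch under assumptions the README does not confirm: four comma-separated columns in the order described above (file, entity, type, count), no header row, and no commas inside fields. The function name top_entities is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the N highest-count rows of an entities.csv-style file.
# Assumes columns file,entity,type,count with no header row and no quoted commas.
top_entities() {
    csv=$1
    n=${2:-10}
    # Sort by column 4 numerically, largest first; break ties by entity name.
    sort -t, -k4,4nr -k2,2 "$csv" | head -n "$n"
}

# Possible usage:
#   top_entities entities.csv 20
```

If the file turns out to have a header row or quoted fields, a CSV-aware tool or a spreadsheet would be the safer way to sort it.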

Windows

Download and install Cygwin. Once your files are arranged as above:

Open batchner.sh in a text editor, remove the # at the start of line 8 (starts with nertext=$(java -mx600m -cp...), and add a # to the start of line 9 (starts with nertext=$(stanford-ner...).
Open Cygwin.
Navigate to the folder containing these files using cd. For example, if you have a folder 'project' on the Desktop, type (without the $) $ cd /cygdrive/c/Users/YOUR-USERNAME/Desktop/project
Type $ sh batchner.sh. This will take a little while to run (4-5 files will likely take about a minute) and will write all of the results to a file called entities.csv in the same folder.

Credit

Goes to @brandontlocke.
