APh Corpus

authors:

Matteo Romanello, matteo.romanello@gmail.com
Eric Rebillard

Data and Goal

The main purpose of this corpus is to support the extraction of named entities that are of interest to classical scholars from secondary sources such as commentaries and journal papers.

Content

catalog.csv : CSV file with five columns
1. ID
2. COLLECTION : (legacy information)
3. TOKEN_COUNT : number of tokens
4. LANG : abstract language
5. BiBLIO : bibliographic information about the publication the abstract is about
iob/ : contains the corpus in IOB format, one record per file (3 columns: token, POS tag, NE label)
- the name of each file, excluding the file extension, has a corresponding record in the catalog.csv file
txt/ : contains the corpus as plain text, one record per file
- the name of each file, excluding the file extension, has a corresponding record in the catalog.csv file
ann/
extra/
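
The link between catalog.csv and the per-record files can be sketched as below. This is only an illustration under several assumptions: the function names load_catalog and record_paths are hypothetical, the file is assumed to start with a header row and to be comma-delimited, and the files in both iob/ and txt/ are assumed to use a .txt extension.

```python
import csv
from pathlib import Path

# Column names in the order listed above. Whether the file really starts with
# a header row is an assumption, so it is exposed as a parameter.
FIELDS = ["ID", "COLLECTION", "TOKEN_COUNT", "LANG", "BIBLIO"]


def load_catalog(path, delimiter=",", has_header=True):
    """Return a dict mapping each record ID to its catalog row (a dict)."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        if has_header:
            next(reader, None)  # skip the header line
        return {row[0]: dict(zip(FIELDS, row)) for row in reader if row}


def record_paths(record_id, root="."):
    """Build the expected iob/ and txt/ paths for one catalog ID.

    The .txt extension for both directories is an assumption.
    """
    root = Path(root)
    return (root / "iob" / f"{record_id}.txt",
            root / "txt" / f"{record_id}.txt")
```

Because every value is read as a string, TOKEN_COUNT would need an explicit int() conversion before any arithmetic. Passing delimiter="\t" lets the same function read a tab-separated variant of the catalog.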

Visualizing and Annotating

Processing the Corpus

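To parse the IOB files using NLTK's CoNLL reader:

from nltk.corpus.reader.conll import ConllCorpusReader

# Columns in each IOB file: token, POS tag, NE label (read here as the chunk column)
corpus = ConllCorpusReader('./iob/', r'.*\.txt', ('words', 'pos', 'chunk'))
corpus.sents()               # sentences as lists of tokens
corpus.chunked_sents()       # sentences as trees, with NE labels as chunks
len(corpus.chunked_sents())  # number of sentences in the corpus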
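
Once the three columns are available, collecting the labelled entities amounts to grouping consecutive B-/I- labels. The following is a minimal sketch rather than part of the corpus tooling: iob_to_entities is a hypothetical helper, it assumes the NE labels follow the usual B-TYPE / I-TYPE / O scheme, and the label names in the usage lines are placeholders, not the corpus's actual tag set.

```python
def iob_to_entities(rows):
    """Group (token, pos, label) rows into (entity_type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []
    for token, _pos, label in rows:
        prefix, _, etype = label.partition("-")
        inside = prefix in ("B", "I")
        # A B- label, or an I- label of a different type, opens a new entity.
        starts_new = prefix == "B" or (inside and etype != current_type)
        if current_tokens and (not inside or starts_new):
            entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if inside:
            current_type = etype
            current_tokens.append(token)
    if current_tokens:  # flush an entity that runs to the end of the input
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# Placeholder rows; the labels PER and WORK are illustrative only.
rows = [
    ("Homer", "NNP", "B-PER"),
    ("wrote", "VBD", "O"),
    ("the", "DT", "O"),
    ("Iliad", "NNP", "B-WORK"),
]
print(iob_to_entities(rows))
```

The same function should accept the (token, POS tag, label) triples that NLTK's ConllCorpusReader exposes through iob_sents(), one sentence at a time.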

TODO

manual correction of POS tags
improve quality and readability of the biblio field
