emorynlp/character-mining

Character Mining

The Character Mining project challenges machine comprehension on multiparty dialogue. The objective of this project is to infer explicit and implicit contexts about individual characters through their conversations. This is an open-source project led by the Emory NLP research group that provides resources for the following tasks:

We welcome feedback and contributions from the community. Most of our annotations are crowdsourced, so errors are to be expected. Please open a pull request if you wish to fix errors in our datasets.

Dataset

Our dataset is based on the popular TV show Friends. We provide transcripts for all 10 seasons of the show, along with manual and crowdsourced annotations for parts of it. All text data are available in JSON files; please visit the individual task pages to retrieve the datasets designed specifically for those tasks.

Statistics

Each season consists of episodes, each episode is divided into scenes, and each scene comprises utterances. Each utterance is a list of sentences, and each sentence is split into tokens.

| Season ID | Episodes | Scenes | Utterances | Sentences | Tokens | Speakers |
|---|---|---|---|---|---|---|
| s01 | 24 | 326 | 5,968 | 10,790 | 81,453 | 107 |
| s02 | 24 | 293 | 5,747 | 9,337 | 81,910 | 107 |
| s03 | 25 | 348 | 6,495 | 10,858 | 90,753 | 108 |
| s04 | 24 | 338 | 6,318 | 10,889 | 87,289 | 100 |
| s05 | 24 | 311 | 6,220 | 11,133 | 83,907 | 107 |
| s06 | 25 | 350 | 6,458 | 11,496 | 90,384 | 112 |
| s07 | 24 | 332 | 6,314 | 11,340 | 84,974 | 94 |
| s08 | 24 | 288 | 6,220 | 11,714 | 86,164 | 107 |
| s09 | 24 | 302 | 6,322 | 11,831 | 93,773 | 99 |
| s10 | 18 | 219 | 5,247 | 9,345 | 69,493 | 78 |
| Total | 236 | 3,107 | 61,309 | 108,733 | 850,100 | 700 |
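As a rough illustration of how the counts above relate to the nested structure, the following Python sketch walks a season object and tallies each unit. The key names (`episodes`, `scenes`, `utterances`, `tokens`) and the toy data are assumptions made for this example and are not confirmed by this page; check the actual JSON files before relying on them.

```python
from collections import Counter


def count_units(season: dict) -> Counter:
    """Tally episodes, scenes, utterances, sentences, and tokens in one season.

    Assumes (hypothetically) that each utterance stores its tokenized text
    under "tokens" as a list of sentences, each a list of token strings.
    """
    counts = Counter()
    for episode in season["episodes"]:
        counts["episodes"] += 1
        for scene in episode["scenes"]:
            counts["scenes"] += 1
            for utterance in scene["utterances"]:
                counts["utterances"] += 1
                sentences = utterance["tokens"]
                counts["sentences"] += len(sentences)
                counts["tokens"] += sum(len(sentence) for sentence in sentences)
    return counts


# Toy data for illustration only; not taken from the real dataset files.
toy_season = {
    "episodes": [
        {
            "scenes": [
                {
                    "utterances": [
                        {"tokens": [["Let", "me", "get", "you", "some", "coffee", "."]]},
                        {"tokens": [["Thanks", "."], ["Sure", "."]]},
                    ]
                }
            ]
        }
    ]
}

if __name__ == "__main__":
    print(dict(count_units(toy_season)))
```

With a real season file, the same function could be applied after `json.load`, provided the key names match.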

Some utterances include action notes. In the following example, extracted from s01_e01_c01_u028, the speaker is talking to Ross, which is indicated by the action note:

"transcript": "Let me get you some coffee.",
"transcript_with_note": "(to Ross) Let me get you some coffee.",
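
The ID and the action note in this example can be handled with a few lines of Python. This is a minimal sketch under two assumptions not stated on this page: that IDs always follow the `sNN_eNN_cNN_uNNN` pattern seen above (with `c` taken to mean the scene), and that the note of interest is a single parenthesized span at the start of `transcript_with_note`. The helper names are hypothetical.

```python
import re
from typing import NamedTuple, Optional, Tuple


class UtteranceId(NamedTuple):
    season: int
    episode: int
    scene: int
    utterance: int


# Assumed ID layout, inferred from the single example "s01_e01_c01_u028".
_ID_PATTERN = re.compile(r"s(\d+)_e(\d+)_c(\d+)_u(\d+)")

# Matches one parenthesized note at the very start of the text.
_LEADING_NOTE = re.compile(r"^\((?P<note>[^)]*)\)\s*")


def parse_utterance_id(uid: str) -> UtteranceId:
    """Split an utterance ID into its numeric components."""
    match = _ID_PATTERN.fullmatch(uid)
    if match is None:
        raise ValueError(f"unrecognized utterance ID: {uid!r}")
    return UtteranceId(*(int(group) for group in match.groups()))


def split_leading_note(text: str) -> Tuple[Optional[str], str]:
    """Return (note, remainder); note is None when no leading note exists."""
    match = _LEADING_NOTE.match(text)
    if match is None:
        return None, text
    return match.group("note"), text[match.end():]


if __name__ == "__main__":
    print(parse_utterance_id("s01_e01_c01_u028"))
    print(split_leading_note("(to Ross) Let me get you some coffee."))
```

Notes that appear in the middle of an utterance are not handled by this sketch; comparing `transcript` with `transcript_with_note` directly is the safer route when both fields are available.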

The following table shows the statistics when action notes are included:

| Season ID | Utterances | Sentences | Tokens |
|---|---|---|---|
| s01 | 6,626 | 12,088 | 100,773 |
| s02 | 6,048 | 10,565 | 97,763 |
| s03 | 7,267 | 12,288 | 117,912 |
| s04 | 7,119 | 12,811 | 116,703 |
| s05 | 7,082 | 13,540 | 118,509 |
| s06 | 7,235 | 13,506 | 120,471 |
| s07 | 7,019 | 13,363 | 116,341 |
| s08 | 6,845 | 13,321 | 109,984 |
| s09 | 6,653 | 13,548 | 119,090 |
| s10 | 5,479 | 11,029 | 93,390 |
| Total | 67,373 | 126,059 | 1,110,936 |

Documentation

References

Contact
