hjian42/SVO_Automation

SVO Triplet Automation of Narrative Stories for Social Sciences

The goal of this project is to build a pipeline that automates the generation of SVO (subject-verb-object) triplets for social science research. For example, character relationships can be visualized as networks in Gephi based on the SVO triplets. Eventually, we want to integrate the pipeline into PC-ACE, the NLP software developed by Professor Roberto Franzosi of the Sociology Department at Emory University.

The whole pipeline is composed of three steps:

  • Data Cleaning
  • Anaphora Resolution
  • SVO Triplets Extraction

Data Cleaning

  • Clean data converted from pdf format
  • Extract titles and contents of Emory Lynching articles and separate them into two parts
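
The two cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual cleaning code: the function names `clean_pdf_text` and `split_title_and_content` are hypothetical, and the sketch assumes that the first non-empty line of each article is its title and that paragraphs are separated by blank lines.

```python
import re
from typing import Tuple


def clean_pdf_text(raw: str) -> str:
    """Undo common PDF-to-text damage: hyphenated line breaks and hard wraps."""
    text = raw.replace("\r\n", "\n")
    # Rejoin words split across lines, e.g. "at-\ntempted" -> "attempted".
    # Caveat: this also merges genuine compounds that happen to break at a line end.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single newlines (hard wraps) into spaces; keep blank-line paragraph breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


def split_title_and_content(article: str) -> Tuple[str, str]:
    """Assumption: the first non-empty line is the title, the rest is the content."""
    lines = [ln.strip() for ln in article.strip().splitlines()]
    title = lines[0] if lines else ""
    content = clean_pdf_text("\n".join(lines[1:]))
    return title, content
```

Keeping the title separate from the content lets each part be passed to the later steps on its own.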

Anaphora Resolution: Stanford CoreNLP

  • Replace mentions of entities (e.g. pronouns like "he" and "she") with their representative mentions using Stanford CoreNLP's coreference (anaphora) resolution
  • This makes SVO extraction more complete and easier to validate, because subjects are identified as named actors rather than pronouns

For example:

"Bill Cato Attempted to Assault Mrs. Vickers. He was shot to death." becomes "Bill Cato Attempted to Assault Mrs. Vickers. Bill Cato was shot to death." after anaphora resolution.
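
The substitution step in this example can be sketched in plain Python once coreference chains are available. The sketch below does not call CoreNLP. It assumes the chains have already been obtained and are shaped like the mentions in CoreNLP's JSON coreference output (`sentNum`, `startIndex`, `endIndex`, `text`, `isRepresentativeMention`, with 1-based indices and an exclusive end), which is an assumption about the input rather than a description of this project's code. The function name `resolve_mentions` is hypothetical.

```python
def resolve_mentions(sentences, chains):
    """Replace each non-representative mention with its chain's representative text.

    sentences: list of token lists, one list per sentence.
    chains:    list of chains; each chain is a list of mention dicts with
               'sentNum' (1-based), 'startIndex' (1-based, inclusive),
               'endIndex' (1-based, exclusive), 'text', 'isRepresentativeMention'.
    """
    resolved = [list(tokens) for tokens in sentences]
    replacements = []
    for chain in chains:
        rep = next((m for m in chain if m["isRepresentativeMention"]), None)
        if rep is None:
            continue  # no representative mention marked, so leave the chain alone
        for mention in chain:
            if mention is not rep:
                replacements.append((
                    mention["sentNum"] - 1,
                    mention["startIndex"] - 1,
                    mention["endIndex"] - 1,
                    rep["text"].split(),
                ))
    # Apply right-to-left so earlier token offsets in a sentence stay valid.
    for sent, start, end, rep_tokens in sorted(
        replacements, key=lambda r: (r[0], r[1]), reverse=True
    ):
        resolved[sent][start:end] = rep_tokens
    return [" ".join(tokens) for tokens in resolved]


sentences = [
    ["Bill", "Cato", "Attempted", "to", "Assault", "Mrs.", "Vickers", "."],
    ["He", "was", "shot", "to", "death", "."],
]
chains = [[
    {"sentNum": 1, "startIndex": 1, "endIndex": 3, "text": "Bill Cato",
     "isRepresentativeMention": True},
    {"sentNum": 2, "startIndex": 1, "endIndex": 2, "text": "He",
     "isRepresentativeMention": False},
]]
print(resolve_mentions(sentences, chains)[1])  # Bill Cato was shot to death .
```

Tokens are re-joined with single spaces, so punctuation ends up space-separated. Possessive pronouns such as "his" would need extra handling that this sketch omits.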

SVO Extraction: ClausIE

  • Convert the cleaned, coreference-resolved Emory Lynching Corpus (cleaned_corenlp_lynching.txt) into clausie_input.txt, the input format ClausIE expects
  • Extract only the SVOs from ClausIE's output (sentences-test-out.txt) into svo.txt
  • Filter the SVOs into terminal_svo.txt, keeping only triplets whose subject is a confirmed social actor
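
The final filtering step can be illustrated with a small sketch. How the project actually confirms that a subject is a social actor is not stated here, so the example simply checks the subject's last word against a hand-written set. Both the `SOCIAL_ACTORS` set and the `keep_actor_triplets` function are hypothetical stand-ins.

```python
# Hypothetical actor list, for illustration only.
SOCIAL_ACTORS = {"mob", "girl", "prisoner", "sheriff", "people"}


def keep_actor_triplets(triplets, actors=SOCIAL_ACTORS):
    """Keep (subject, verb, object) triplets whose subject names a known actor.

    Heuristic assumption: the last word of the subject phrase is its head noun,
    so "the angry mob" is matched through "mob".
    """
    kept = []
    for subject, verb, obj in triplets:
        words = subject.lower().split()
        if words and words[-1] in actors:
            kept.append((subject, verb, obj))
    return kept


triplets = [
    ("the angry mob", "estimate", "shooting"),
    ("rain", "delay", "trial"),
    ("sheriff", "convene", "court"),
]
print(keep_actor_triplets(triplets))
```

Here the triplet with the subject "rain" is dropped because "rain" is not in the actor set.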

The SVO results look like the following (verbs are reduced to their stems, so a stem such as estim stands for estimate):

  S: mob            , V: estimate       , O: shooting
  S: girl           , V: protect        , O: negro
  S: prisoner       , V: have           , O: neck
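
Lines in the format shown above can be read back into tuples with a short parser. This sketch assumes every line follows the `S: ..., V: ..., O: ...` layout exactly as displayed. The name `parse_svo_line` is hypothetical and is not taken from the repository.

```python
import re

# Matches "S: <subject> , V: <verb> , O: <object>" with arbitrary padding.
TRIPLET_RE = re.compile(
    r"S:\s*(?P<s>.*?)\s*,\s*V:\s*(?P<v>.*?)\s*,\s*O:\s*(?P<o>.*?)\s*$"
)


def parse_svo_line(line):
    """Return (subject, verb, object) for a formatted line, or None if it does not match."""
    match = TRIPLET_RE.search(line)
    if match is None:
        return None
    return (match["s"], match["v"], match["o"])


print(parse_svo_line("  S: mob            , V: estimate       , O: shooting       "))
# ('mob', 'estimate', 'shooting')
```

Returning `None` for non-matching lines lets a caller skip headers or blank lines without raising an error.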

Data Visualization

  • The output file can be loaded directly into Gephi

         Node1     Edge      Node2
    0   people     have      wrath
    1   people     have      hands
    2   county     have       duty
    3  sheriff  convene      court
    4  sheriff      try  criminals
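
One way to hand such a table to Gephi is to write it as an edge-list CSV. The sketch below is an illustration under assumptions, not the project's export code: it maps Node1, Node2, and Edge onto the `Source`, `Target`, and `Label` columns that Gephi's spreadsheet import recognizes for edge tables, and the function name `write_gephi_edges` is hypothetical.

```python
import csv
import io


def write_gephi_edges(triplets, handle):
    """Write (subject, verb, object) triplets as a Gephi-style edge list.

    Assumed mapping: subject -> Source, object -> Target, verb -> Label.
    """
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["Source", "Target", "Label"])
    for subject, verb, obj in triplets:
        writer.writerow([subject, obj, verb])


buffer = io.StringIO()
write_gephi_edges(
    [("people", "have", "wrath"), ("sheriff", "convene", "court")],
    buffer,
)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open(path, "w", newline="", encoding="utf-8")` as `handle`.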
    

Dependencies:

  • Stanford CoreNLP
  • NLTK
  • ClausIE
  • enchant

Version

Alpha version. It is still subject to change. Comments and advice are welcome.