Toolkit for extracting character personalities from literary novels, with experiments organized in separate folders

nicolay-r/deep-book-processing

Book Processing Framework

This repository contains the source code for a literary character personality formation workflow that 🔥 relies solely on book content 🔥, as described in the paper Personality Profiling for Literary Character Dialogue Agents with Human Level Attributes.

Workflow

This repository implements a workflow for processing literary novels.

Task: Studies propose the novel Character Comments Annotation problem, which refers to quotation annotation [paper].

This workflow relies on external text processing components: (1) NER and (2) automatic dialogue annotation. See the Dependencies section for details.

Datasets of character conversations are a byproduct of this data flow. Each dataset consists of dialogues whose utterances are annotated with speakers.

Personality Profiling Model

We adopt an adjective-pair lexicon as the source for the spectrum-based character profiling model. We provide an API for collecting information on characters and composing their personalities in the form of output matrices:

Each row of the matrix represents a character, and the columns correspond to that character's personality traits. There are two types of output personalities (see figure below): (left) individual and (right) inter-dependent / embeddings based on a personality factorization model.
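To make the row/column layout concrete, here is a minimal sketch of an individual personality matrix. It is not the API of this repository: the lexicon entries, the function names `spectrum_row` and `personality_matrix`, and the scoring rule (difference of pole counts divided by their sum) are all assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical adjective-pair lexicon: each pair spans one trait spectrum
# from a "low" pole to a "high" pole. These entries are illustrative only.
ADJECTIVE_PAIRS = [
    ("cowardly", "brave"),
    ("cruel", "kind"),
    ("lazy", "diligent"),
]


def spectrum_row(words, pairs=ADJECTIVE_PAIRS):
    """Return one matrix row: a value in [-1, 1] per adjective pair."""
    counts = Counter(w.lower() for w in words)
    row = []
    for low, high in pairs:
        n_low, n_high = counts[low], counts[high]
        total = n_low + n_high
        # 0.0 stands for "no evidence" on this trait (an assumed convention).
        row.append((n_high - n_low) / total if total else 0.0)
    return row


def personality_matrix(character_words, pairs=ADJECTIVE_PAIRS):
    """Build (names, matrix): one row per character, one column per pair."""
    names = sorted(character_words)
    return names, [spectrum_row(character_words[n], pairs) for n in names]


if __name__ == "__main__":
    # Toy input: words observed in the context of each character.
    observed = {
        "Alice": ["brave", "kind", "kind", "cruel"],
        "Bob": ["lazy", "cowardly"],
    }
    names, matrix = personality_matrix(observed)
    for name, row in zip(names, matrix):
        print(name, row)
```

The inter-dependent variant is not sketched here, since the factorization model is specified in the paper rather than in this README.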

Applications

This project targets the following research directions:

  • e_pairs -- response generation and response prediction for given dialogue pairs (a.k.a. CONV-turns);
  • e_se [legacy] -- extraction of speakers for utterances, from Subin Jung's thesis work;
  • e_rag [legacy] -- extraction of utterances and contexts, as well as forming a character knowledge base for RAG and augmenting Large Language Models (LLMs).

For each direction, we provide a pipeline (an ordered sequence of scripts) for resource construction and evaluation.
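As an illustration of the kind of data the e_pairs direction works with, the sketch below forms dialogue pairs from a speaker-annotated dialogue. It assumes a pair is two consecutive utterances by different speakers; the exact definition of CONV-turns used by the scripts may differ, and the `dialogue_pairs` helper is hypothetical.

```python
from typing import Iterator, List, Tuple

Utterance = Tuple[str, str]  # (speaker, text)


def dialogue_pairs(
    dialogue: List[Utterance],
) -> Iterator[Tuple[Utterance, Utterance]]:
    """Yield (query, response) pairs of consecutive utterances.

    Assumption: a pair is kept only when the two speakers differ.
    """
    for query, response in zip(dialogue, dialogue[1:]):
        if query[0] != response[0]:
            yield query, response


if __name__ == "__main__":
    dialogue = [
        ("Holmes", "You have been in Afghanistan, I perceive."),
        ("Watson", "How on earth did you know that?"),
        ("Watson", "I never mentioned it."),
    ]
    for (q_speaker, q_text), (r_speaker, r_text) in dialogue_pairs(dialogue):
        print(f"{q_speaker}: {q_text} -> {r_speaker}: {r_text}")
```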

Datasets

LDC

The common version of the resource is dubbed the Literature Dialogue Collection (LDC).

It consists of dialogues extracted from 17K books of the Project Gutenberg platform. This resource can be constructed automatically using the following steps:

  1. Downloading all the necessary books 📚 and resources (downloading takes ~3.5 hours ☕)
  2. Executing the scripts from the e_pairs directory.

We further cleaned the dataset of dialogue pairs between the 400 most frequently appearing characters, which results in the LDC-400 datasets.

LDC-400-RP

This dataset is for the Response Prediction problem.

We use the ParlAI framework for conducting experiments. To embed the extracted data, we use the related data formatter.

Link for ParlAI agents / task: [parlai-agents]

Candidates count: 20
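For orientation, the sketch below writes one response-prediction example in the tab-separated ParlAI dialog text format (`text`, `labels`, `label_candidates` joined by `|`, `episode_done`), with 20 candidates as stated above. It is not the data formatter from this repository: the `to_parlai_line` helper, the random choice of distractors, and the whitespace clean-up are assumptions for illustration, and the distractor pool is assumed not to contain the true response.

```python
import random


def _clean(text):
    # Simplification: tabs, newlines and "|" would break the one-line
    # format, so they are collapsed into single spaces here.
    return " ".join(text.replace("|", " ").split())


def to_parlai_line(query, response, distractors, n_candidates=20, seed=0):
    """Format one example as a single line of ParlAI dialog text."""
    if len(distractors) < n_candidates - 1:
        raise ValueError("not enough distractor utterances")
    rng = random.Random(seed)
    # The true response plus (n_candidates - 1) sampled distractors.
    candidates = rng.sample(distractors, n_candidates - 1) + [response]
    rng.shuffle(candidates)
    fields = [
        "text:" + _clean(query),
        "labels:" + _clean(response),
        "label_candidates:" + "|".join(_clean(c) for c in candidates),
        "episode_done:True",
    ]
    return "\t".join(fields)


if __name__ == "__main__":
    pool = ["utterance %d" % i for i in range(30)]
    print(to_parlai_line("How are you?", "Quite well.", pool))
```

The HLA-spectrum collections additionally carry personality information; how it is attached to each example is defined by the repository's formatter and is not reproduced here.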

| Collection-type | Format | train | valid | test |
|---|---|---|---|---|
| NO-HLA | ParlAI | Train w/o HLA | Valid w/o HLA | Not Applicable |
| HLA-spectrum | ParlAI | Train with HLA | Valid with HLA | Five speakers: [1] [2] [3] [4] [5] |
| Human Evaluation | Text | -- | -- | Five speakers: [1] [2] [3] [4] [5] |

NOTE: Please use the nicolay-r/parlai_bookchar_task repository to embed the task into ParlAI. All the resources listed in this section are downloaded automatically once the task is embedded into the ParlAI framework.

LDC-400-SR

This dataset is for the Speaker Recognition problem.

We use the ParlAI framework for conducting experiments. To embed the extracted data, we use the related data formatter.

Link for ParlAI agents / task: [parlai-agents]

Candidates count: 20

| Collection-type | Format | train | valid | test |
|---|---|---|---|---|
| HLA-spectrum | ParlAI | Train with HLA | Valid with HLA | Five speakers: [1] [2] [3] [4] [5] |

NOTE: Please use the nicolay-r/parlai_bookchar_task repository to embed the task into ParlAI. All the resources listed in this section are downloaded automatically once the task is embedded into the ParlAI framework.

Experiments

Open In Colab

Dependencies

  1. NER:
  2. Dialogue utterance extraction from literary novels:

Organizations

This work was carried out as part of my Research Fellow position at Newcastle University.

References

You can cite this work as follows:

TO BE ADDED
