Stylometric analysis of poetic texts based on their versification

versotym/stichometry

Parameters

lang :      'cs', 'de', 'es' ... 
  Language to process (subfolder in "pickle" folder)  

method :     'sticho', 'word', 'lemma', '3gram_t' ...
  Which data to use for attribution (file in pickle > lang folder)

Methods

Data filtering

reduce_features(filters)
Only for method == 'sticho'
Filter features (columns) which should be used for attribution. 
E.g. drop all statistics on rhyme, or keep only the stress profile.
  filters:     conditions to filter features (format accepted by pandas .query method)
               default: None    

mfi(n)
Only for method != 'sticho'
Select how many of the most frequent items (words, lemmata, n-grams) will be analyzed.
  n:           int
               number of mfi
               default: 500    
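
As a rough illustration of what selecting the most frequent items involves, here is a minimal sketch using only the standard library. The helper name most_frequent_items and the token lists are hypothetical and are not taken from this repository; the actual mfi(n) method may count and rank items differently.

```python
from collections import Counter

def most_frequent_items(documents, n=500):
    """Return the n most frequent items across all documents (hypothetical helper)."""
    counts = Counter()
    for tokens in documents:
        counts.update(tokens)  # accumulate frequencies over the whole corpus
    # most_common sorts by frequency, highest first
    return [item for item, _ in counts.most_common(n)]

# Toy corpus: each inner list stands for the tokens of one dataset.
docs = [
    ["the", "rose", "is", "red"],
    ["the", "violet", "is", "blue"],
    ["the", "sea"],
]
top = most_frequent_items(docs, n=2)
print(top)
```

Only the items returned here would then be used as feature columns.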

reduce_sets(filters, n_min, remove_singles)
Filter datasets (rows) according to specified conditions.
  filters:         conditions to filter datasets (format accepted by pandas .query method)
                   default: None
  n_min:           int
                   minimum number of all features to keep dataset
                   default: 0
  remove_singles:  boolean
                   whether to drop datasets whose author is not the author of any other dataset
                   default: True
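
The three arguments can be read as three successive row filters. The sketch below shows one possible reading with pandas; the function name reduce_sets_sketch, the split into a metadata table and a feature table, the 'author' and 'year' columns, and the interpretation of n_min as the row sum of all feature counts are all assumptions made for illustration, not the repository's actual data layout.

```python
import pandas as pd

def reduce_sets_sketch(meta, features, filters=None, n_min=0, remove_singles=True):
    """Filter datasets (rows); meta and features share the same index (assumed layout)."""
    keep = meta
    if filters is not None:
        keep = keep.query(filters)               # same string format as DataFrame.query
    totals = features.loc[keep.index].sum(axis=1)
    keep = keep[totals >= n_min]                 # assumed meaning of n_min
    if remove_singles:
        per_author = keep["author"].map(keep["author"].value_counts())
        keep = keep[per_author > 1]              # drop authors left with one dataset
    return keep, features.loc[keep.index]

idx = ["a1", "a2", "b1", "c1", "c2"]
meta = pd.DataFrame(
    {"author": ["A", "A", "B", "C", "C"], "year": [1850, 1860, 1855, 1900, 1910]},
    index=idx,
)
features = pd.DataFrame({"f1": [10, 20, 30, 1, 40], "f2": [5, 5, 5, 1, 5]}, index=idx)

kept_meta, kept_features = reduce_sets_sketch(meta, features, filters="year < 1905", n_min=5)
print(list(kept_meta.index))
```

In this toy run the query removes c2, the n_min threshold removes c1, and remove_singles then removes b1 because author B has no second dataset.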

Normalization

zscores()
Normalize data to z-scores across datasets.
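
For reference, a z-score rescales each feature (column) so that it has mean 0 and standard deviation 1 across datasets. The sketch below uses only the standard library. The function name zscores_sketch is hypothetical, and using the population standard deviation and mapping constant columns to 0.0 are assumptions; the repository's zscores() may make different choices.

```python
from statistics import mean, pstdev

def zscores_sketch(rows):
    """rows: list of equal-length feature vectors, one per dataset."""
    columns = list(zip(*rows))                 # transpose to per-feature columns
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]  # population std (assumption)
    return [
        # constant columns have sigma == 0; map them to 0.0 to avoid dividing by zero
        [(x - mu) / sd if sd else 0.0 for x, mu, sd in zip(row, mus, sigmas)]
        for row in rows
    ]

data = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
z = zscores_sketch(data)
print(z)
```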

Attribution

nearest_neighbour()
Classification by nearest neighbour (various distance metrics)
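
The idea behind nearest-neighbour attribution is to assign a dataset to the author of the closest known dataset under some distance metric. The following is a minimal standard-library sketch of that idea; the function names, the two metrics shown, and the toy vectors are illustrative assumptions and do not reflect which metrics nearest_neighbour() actually implements.

```python
def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Straight-line distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def nearest_neighbour_sketch(target, candidates, distance=manhattan):
    """candidates: list of (author, vector); return the author of the closest vector."""
    author, _ = min(candidates, key=lambda c: distance(target, c[1]))
    return author

# Toy z-scored feature vectors with known authors.
known = [
    ("Author A", [0.1, 1.2, -0.5]),
    ("Author B", [1.5, -0.3, 0.8]),
    ("Author A", [0.0, 1.0, -0.7]),
]
guess = nearest_neighbour_sketch([0.2, 1.1, -0.4], known)
guess_euclid = nearest_neighbour_sketch([0.2, 1.1, -0.4], known, distance=euclidean)
print(guess, guess_euclid)
```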

svm(multiclass, **kwargs)
Classification by support vector machine
  multiclass:      boolean
                   whether to perform multiclass or binary classification
                   when 'True' each dataset is assigned to one author
                   when 'False' a one-vs.-rest classifier is trained for every author resulting in:
                      (a) assigning author to the dataset if precisely one classifier 
                          gives a decision other than 'rest'
                      (b) "I don't know" answer in other cases
                   default: True
  **kwargs:        Parameters for sklearn.svm.SVC (e.g. kernel, gamma...)
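
The decision rule described for multiclass == False can be sketched independently of the classifier itself. In the snippet below, the function name combine_one_vs_rest and the representation of the per-author decisions as a dict of booleans are assumptions made for illustration; None stands in for the "I don't know" answer.

```python
def combine_one_vs_rest(decisions):
    """decisions: dict mapping author -> True if that author's classifier
    gave a decision other than 'rest' for the dataset."""
    claimed = [author for author, is_author in decisions.items() if is_author]
    # (a) exactly one classifier claims the dataset: assign that author
    # (b) zero or several claims: "I don't know"
    return claimed[0] if len(claimed) == 1 else None

one_claim = combine_one_vs_rest({"Author A": False, "Author B": True, "Author C": False})
no_claim = combine_one_vs_rest({"Author A": False, "Author B": False, "Author C": False})
two_claims = combine_one_vs_rest({"Author A": True, "Author B": True, "Author C": False})
print(one_claim, no_claim, two_claims)
```

The same rule applies to random_forest when multiclass is False.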
  
random_forest(multiclass, **kwargs)  
Classification by random forest
  multiclass:      boolean
                   whether to perform multiclass or binary classification
                   when 'True' each dataset is assigned to one author
                   when 'False' a one-vs.-rest classifier is trained for every author resulting in:
                      (a) assigning author to the dataset if precisely one classifier 
                          gives a decision other than 'rest'
                      (b) "I don't know" answer in other cases
                   default: True
  **kwargs:        Parameters for sklearn.ensemble.RandomForestClassifier 
                   (e.g. n_estimators, class_weight...)

Evaluation

evaluate()
Print an evaluation of each method that was applied

dendrograms()
Plot dendrograms (only if nearest_neighbour has been applied)

complete_results(pickle, filename)
Return a dictionary with complete results
  pickle:          boolean
                   whether to pickle dict into a file (stored in 'pickle' folder)
                   default: True
  filename:        specifies the name of a pickled file
                   default: method name (e.g. sticho, word...)
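
A pickled results dictionary can be read back with the standard pickle module. The round trip below is only a sketch: the keys and values of the dictionary, the file name sticho.pickle, and its extension are placeholders, since the actual structure and naming of the stored file are not specified here. A temporary directory is used instead of the 'pickle' folder so the example leaves no files behind.

```python
import pickle
import tempfile
from pathlib import Path

# Placeholder structure; the real dictionary's keys are not documented here.
results = {"method": "sticho", "predictions": {"dataset_1": "Author A"}}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sticho.pickle"   # assumed file name and extension
    with open(path, "wb") as handle:
        pickle.dump(results, handle)     # what pickle=True would roughly do
    with open(path, "rb") as handle:
        restored = pickle.load(handle)   # reading the results back later

print(restored == results)
```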
