Un classique inconnu #98

Open
mbwolff opened this issue Nov 30, 2018 · 3 comments

Comments

mbwolff commented Nov 30, 2018

The code for this project attempts to generate a French text in rhymed couplets of alexandrines, using the Théâtre Classique's online collection of French plays from the sixteenth to the nineteenth centuries.

The goal is to transform a canonical text systematically through an analogy based on a major theme in the text. By rewriting the text with verses from other texts of the same period, it may be possible to reinvent the text rhetorically from an orthogonal perspective. Or something like that.

Here's the procedure:

  1. Scrape the links for all the XML files of the plays from this page and then download them to build a corpus.
  2. Make a vector space for all words in the corpus using Gensim's Word2Vec module. The words are first lemmatized with spaCy to simplify the vector space.
  3. Build a tf-idf matrix for all the verses in all the plays in the corpus.
  4. Choose a play in the corpus (such as Racine's Phèdre) and a pair of words to form the basis of an analogy (femme and homme, for instance). The pair will enable a modification of the play by replacing words according to the analogy (roi is to homme as reine is to femme).
  5. Take the first verse in the original play and modify the verse with word substitutions based on the vector space.
  6. Construct a tf-idf vector for the modified verse based on the matrix for the whole corpus.
  7. Find the verse in the corpus that is most similar to the modified verse, using cosine similarity. The selected verses should follow the pattern aa bb cc dd ..., with successive couplets alternating between feminine rhymes (the last word of the verse ends in a silent e) and masculine rhymes (the last word ends in some other letter). The epitran module is useful for transliterating text into IPA, although it is imperfect (as its authors acknowledge) because the relationship between spelling and pronunciation in French is complicated.
  8. Return to step 5, taking the next verse in the original play, and continue until every verse in the original play has been modified, vectorized, and replaced with another verse from the corpus.
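
Steps 5 to 7 can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions, not the project's code: the two-dimensional `VECTORS` stand in for a trained Word2Vec model, `CORPUS` is a handful of invented verses rather than the Théâtre Classique corpus, every function name is hypothetical, lemmatization is skipped, and the rhyme check only tests rhyme gender by the spelling rule given in step 7 (no IPA, no matching of rhyme sounds).

```python
import math
import re
from collections import Counter

# Hypothetical 2-D "embeddings" (royalty, gender); a real run would use Word2Vec.
VECTORS = {
    "roi": (1.0, 1.0), "reine": (1.0, -1.0),
    "homme": (0.0, 1.0), "femme": (0.0, -1.0),
    "prince": (0.5, 1.0), "princesse": (0.5, -1.0),
}

# Invented toy verses standing in for the corpus.
CORPUS = [
    "la reine garde le silence",
    "le roi revient du combat",
    "la reine attend son retour",
    "un homme parle au jardin",
    "une femme pleure sa peine",
]

def tokenize(verse):
    """Lowercase the verse and split it into word tokens."""
    return re.findall(r"\w+", verse.lower())

def dense_cosine(u, v):
    """Cosine similarity of two equal-length tuples."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    return dot / norm if norm else 0.0

def analogy(word, a, b):
    """Return the word closest to vec(word) - vec(a) + vec(b), or word itself."""
    if word not in VECTORS or word in (a, b):
        return word
    target = tuple(w - x + y for w, x, y in
                   zip(VECTORS[word], VECTORS[a], VECTORS[b]))
    candidates = [c for c in VECTORS if c not in (word, a, b)]
    return max(candidates, key=lambda c: dense_cosine(VECTORS[c], target))

def modify(verse, a, b):
    """Step 5: substitute every word that has a vector, following a -> b."""
    return " ".join(analogy(tok, a, b) for tok in tokenize(verse))

def build_idf(docs):
    """Smoothed inverse document frequency for every token in docs."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + count)) + 1 for tok, count in df.items()}

def tfidf(verse, idf):
    """Step 6: sparse tf-idf vector (a dict) over the corpus vocabulary."""
    counts = Counter(tokenize(verse))
    return {tok: c * idf[tok] for tok, c in counts.items() if tok in idf}

def sparse_cosine(u, v):
    """Cosine similarity of two sparse dict vectors."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def is_feminine(verse):
    """Spelling-only approximation: the last word ends in the letter e."""
    words = tokenize(verse)
    return bool(words) and words[-1].endswith("e")

def best_match(modified, corpus, idf, feminine):
    """Step 7: most similar corpus verse with the requested rhyme gender."""
    query = tfidf(modified, idf)
    candidates = [v for v in corpus if is_feminine(v) == feminine]
    return max(candidates, key=lambda v: sparse_cosine(query, tfidf(v, idf)))

idf = build_idf(CORPUS)
modified = modify("le roi garde le silence", "homme", "femme")
print(modified)
print(best_match(modified, CORPUS, idf, feminine=True))
```

The substitution is deliberately naive: articles are left untouched, so the modified verse need not be grammatical. That matters little in this pipeline, since the modified verse serves only as a query and the verse that ends up in the output always comes from the corpus.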

Here are the first lines of the generated text:

J'en aurais paru digne autant ou plus qu'un autre :  (CORNEILLEP_PULCHERIE.xml:994)
Doncques vous vous plaignez d'une ingrate maîtresse ?   (DESMARETS_VISIONNAIRES.xml:1368)
Mais un coup d'oeil peut subjuguer un sage.  (VOLTAIRE_DROITDUSEIGNEUR.xml:934)
Qui retient mon courage  (URFE_SYLVANIRE.xml:4031)
Mes feux, qu'ont redoublés ces propos adorables,  (CORNEILLEP_SUIVANTE.xml:913)
Je t'ai fait un secret dont la charge m'accable ;  (NIVELLE_PREJUGEALAMODE.xml:414)
On parle bien de vous, le Prince vous regarde   (AURE_GENEVIEVE70.xml:1357)
Paix, voici mon vieillard.   (AURE_DIPNE.xml:315)

At the end of each verse is a reference to its source text and line number.

Member

hugovk commented Dec 1, 2018

Do you think you can output 50k words?

Author

mbwolff commented Dec 1, 2018 via email

Author

mbwolff commented Dec 3, 2018

The code is now updated to generate a very long rhymed text. I had to change the algorithm because I could not find a dramatic text in French verse of at least 50,000 words. The new procedure randomly selects one of the 819,787 verses in the corpus of French plays and goes from there. I updated the README to explain the new procedure.
