Files to generate a master's thesis for the CCI course MA Internet Equalities at UAL.

lexahl/text-regurgitation

                                                                                             
            .(@@@@@@%/         *%/,,,,,,,..,*/((((/.                                                
          /@@@&%&&@@@@@@&.  ,,#,,,,,,,,,,,,,,&%(........,*/%(.                                      
         @@@&%%&&@@@@@@@@@@@% (,((((/,,,,,,*,,,,@,,....,.......,..,#%.                              
        &@@&&@@@@@@@@@@@@@//*(,............**,,,,,,*%.............,,..,..%,                         
       *@@@@@@@@@@@@@%(%/%///*(%..,............ ...,..,/.,,..,.............../*                     
       (@@@%**/*#@@@@@((/*&////%./.,.................#*//*&,..................,.*#                  
       *@/(***//////*///////////@,.#((,,..........,*(///////*%,..,.....,/((/**(*...,(               
         #((((/////*,*////////((#,#.(,,,,%(* .,,,,,(/*****///////%.,,,...#/........,.%.%            
             &*@%*(#/**////////(,,/,..#.*..,,,...%#%*/////////*,*//%....,....,......,*.,..          
                 &*//%#%/////////   %,...,/,..,,,,,,,,.*&#*/////(////#...,*((,,.,...#.,., /         
                       *,%*&*        ,(,...,.%.,,,,,,,,,,,,%*//////////#,,,,,,,,,,,&(#,,,,,.
                       .*#.            ,( *,.((,#.,,,,,,,,,,.@*/////////(*,,,,,,,*/**/#.,,.&
                       /,*x                ,&%@%/,./#,,,,,,,,*,./%*///////&,*,,%,*,#..#,,, %
                       (w*&                   &//////%&/ ,,,,,*,(.,.%//////#%.#%,,.,,/,,,,/  
                       *.#,                     &//(///#   /%.,,,,,,,**(*/((//,... ./.*/,.,         
                       #e,/                      .%//(/%   @(/,,.,*,,*,,&%*&//*,,,,**,,,.*          
                       (.*/                        ,(///( /.*****,%./#///*/////#,,,(.,*.(           
                       */#                           #///(/,,..,(#(//////***//&,*,(.**/.            
                       *,(                            /(//%,.,,,,%.##////*/,@.,,.(.,,#.             
                       /o%                             %///@.,*&.%.#%*#,#&,,,,,.#.,*/,              
                       /t%                            (//(/(.(% %,*%/%.%*#,,,,.#.,***               
                       .*(                          .#///(/*&#/..,/(#(,(,,,,,*.,,*,#                
                       ,q,                         (/(((((&*@&#((*((*//#&%*/#**/(((,                
                        %.                         (/%(#//(((((((%(*//////(/////%                   
                        ,.                         ,(#/%(&(((((&,,//////////(((.                    
                        (,                           ,#(//%((((% %//**//////*/                      
                                                      ,(////((((#//*//////%(                        
                                                       #////(((((%/////////&                        
                        ,#                             *(////(((((&(/////////*                      
                                                        #////(((((%///////////*                     
                                                        %(///(((((%#//////////%                     
                         A                              %#(//((((& #*/////(///%                     
                                                        %#(//((((# /*///////(/(                     
                                                        %#(/((((#, */////////%                      
                         z                               ##((((((%  *(///////#                       
                                                        #(((((((&  ,(/////(#@                       
                                                        #(/(((#(&  *//////(//                       
                                                      .#(///(#((%  ///////(#                        
                         ,                     ,%&&(//////((((((((& ,(/////(@                        
                                           //////(%##/(#%%(((((((( *//////(&                        
████████╗███████╗██╗░░██╗████████╗        ((%%%%###(########(((((% ////////@                        
╚══██╔══╝██╔════╝╚██╗██╔╝╚══██╔══╝                                 @///////@                        
░░░██║░░░█████╗░░░╚███╔╝░░░░██║░░░                              .%*////////@                        
░░░██║░░░██╔══╝░░░██╔██╗░░░░██║░░░                           ,&*/////////(/@.                       
░░░██║░░░███████╗██╔╝╚██╗░░░██║░░░                       &######%%#(/*///%%%@                       
░░░╚═╝░░░╚══════╝╚═╝░░╚═╝░░░╚═╝░░░                   *##(%#//(#%%%%#(##&#%%.                        
                                                    (#&&#@,%*(*(((&%%%(                             
                                                        .#@@%##%&&(                                 


██████╗░███████╗░██████╗░██╗░░░██╗██████╗░░██████╗░██╗████████╗░█████╗░████████╗██╗░█████╗░███╗░░██╗
██╔══██╗██╔════╝██╔════╝░██║░░░██║██╔══██╗██╔════╝░██║╚══██╔══╝██╔══██╗╚══██╔══╝██║██╔══██╗████╗░██║
██████╔╝█████╗░░██║░░██╗░██║░░░██║██████╔╝██║░░██╗░██║░░░██║░░░███████║░░░██║░░░██║██║░░██║██╔██╗██║
██╔══██╗██╔══╝░░██║░░╚██╗██║░░░██║██╔══██╗██║░░╚██╗██║░░░██║░░░██╔══██║░░░██║░░░██║██║░░██║██║╚████║
██║░░██║███████╗╚██████╔╝╚██████╔╝██║░░██║╚██████╔╝██║░░░██║░░░██║░░██║░░░██║░░░██║╚█████╔╝██║░╚███║
╚═╝░░╚═╝╚══════╝░╚═════╝░░╚═════╝░╚═╝░░╚═╝░╚═════╝░╚═╝░░░╚═╝░░░╚═╝░░╚═╝░░░╚═╝░░░╚═╝░╚════╝░╚═╝░░╚══╝

Text Regurgitation

Text Regurgitation aims to critique Large Language Models' often unacknowledged but harmful decontextualization of language through a parody of text generation algorithms. Text Regurgitation is simultaneously a commentary on western education systems and knowledge production. Regurgitation refers to the act of bringing up something that has been previously swallowed or digested. In the context of information, regurgitation refers to the repetition of previously learned information without understanding it. Language models cannot understand; they can only regurgitate without meaning, even if the produced text is seemingly coherent.

The project takes the form of multiple receipts, each containing a "thesis." These theses have been generated intentionally without using Large Language Models. Instead, the text is generated using various functions that take inspiration from algorithms, some over 100 years old (see below). The text corpus was created from the assigned readings for the course "MA Internet Equalities" and the syllabus "Book of Units" itself. This is the repository for the code that generates the thesis with created text corpora. Example theses generated using this code are available in this repository as output.txt and in the folder theses.

About the Algorithms

░█████╗░███████╗░██████╗░  ░██╗░░░░░░░██╗██╗████████╗██╗░░██╗  ███╗░░██╗██╗░░░░░████████╗██╗░░██╗
██╔══██╗██╔════╝██╔════╝░  ░██║░░██╗░░██║██║╚══██╔══╝██║░░██║  ████╗░██║██║░░░░░╚══██╔══╝██║░██╔╝
██║░░╚═╝█████╗░░██║░░██╗░  ░╚██╗████╗██╔╝██║░░░██║░░░███████║  ██╔██╗██║██║░░░░░░░░██║░░░█████═╝░
██║░░██╗██╔══╝░░██║░░╚██╗  ░░████╔═████║░██║░░░██║░░░██╔══██║  ██║╚████║██║░░░░░░░░██║░░░██╔═██╗░
╚█████╔╝██║░░░░░╚██████╔╝  ░░╚██╔╝░╚██╔╝░██║░░░██║░░░██║░░██║  ██║░╚███║███████╗░░░██║░░░██║░╚██╗
░╚════╝░╚═╝░░░░░░╚═════╝░  ░░░╚═╝░░░╚═╝░░╚═╝░░░╚═╝░░░╚═╝░░╚═╝  ╚═╝░░╚══╝╚══════╝░░░╚═╝░░░╚═╝░░╚═╝

About the Context-Free Grammar with NLTK Algorithm

A CFG (Context-Free Grammar) is a set of rules that describes all possible strings in a given formal language. In a CFG, symbols stand in for categories of words. NLTK is a Python platform for working with human language data (often used in Natural Language Processing) that can, among other functionalities, represent words as symbols in the form of part-of-speech (POS) tags.

import nltk
from nltk.tokenize import word_tokenize

ily = "I love you"
ily_t = word_tokenize(ily)  # -> ['I', 'love', 'you']
ily_td = nltk.pos_tag(ily_t)  # -> [('I', 'PRP'), ('love', 'VBP'), ('you', 'PRP')]

PRP = "personal pronoun"
VBP = "verb, non-3rd person singular present"

The grammar for ily would be PRP->VBP->PRP. Other VBPs (verbs in a non-3rd person singular present form) include "like," "hate," and "need," and other PRPs include "him" and "them." With this grammar, the VBP and the second PRP can be replaced to create "I hate him," and the result, being another string the grammar allows, would still be grammatically correct.
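As a rough illustration of how a tag sequence stands for many possible strings, the sketch below expands PRP->VBP->PRP over a tiny word list. The names `PATTERN`, `LEXICON`, and `expand` are hypothetical and are not taken from this repository's code.

```python
from itertools import product

# Hypothetical tag pattern and word lists, for illustration only.
PATTERN = ["PRP", "VBP", "PRP"]
LEXICON = {
    "PRP": ["I", "you"],
    "VBP": ["love", "hate", "need"],
}

def expand(pattern, lexicon):
    """Return every sentence obtained by filling each tag with each word that carries it."""
    choices = [lexicon[tag] for tag in pattern]
    return [" ".join(words) for words in product(*choices)]

sentences = expand(PATTERN, LEXICON)
for sentence in sentences:
    print(sentence)
```

Because POS tags are coarse, this also yields strings such as "I love I": the tag sequence keeps the grammatical shape, not the meaning.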

In this project, NLTK is used to generate the abstract of the thesis by tokenizing (splitting the text into words) and then part-of-speech (POS) tagging the words. The CFG with NLTK algorithm tokenizes and tags both an input text and an "ideal" abstract. The majority of the thesis is generated using a text corpus created from the readings assigned in the MA syllabus, but the abstract and introduction are generated using text from the syllabus (book_of_units.txt) itself, which describes the goals and topics discussed in the course. The algorithm then replaces each word in the "ideal" abstract with a word from the (randomly shuffled) syllabus that has the same POS tag.
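The replacement step can be sketched as follows. This is a minimal sketch, not the code in regurgitate.py: it assumes both texts are already tokenized and tagged into (word, tag) pairs, the shape that nltk.pos_tag returns, and the function name `refill_by_pos` and the sample data are hypothetical.

```python
import random
from collections import defaultdict

def refill_by_pos(template_tagged, source_tagged, seed=None):
    """Replace each template word with a shuffled source word that has the same POS tag."""
    rng = random.Random(seed)
    shuffled = list(source_tagged)
    rng.shuffle(shuffled)

    # Group the shuffled source words by tag.
    pools = defaultdict(list)
    for word, tag in shuffled:
        pools[tag].append(word)

    output = []
    for word, tag in template_tagged:
        pool = pools.get(tag)
        # Assumption: keep the template word when the source has no word left for this tag.
        output.append(pool.pop() if pool else word)
    return output

# Hand-written (word, tag) pairs standing in for tagged text.
template = [("I", "PRP"), ("love", "VBP"), ("you", "PRP")]
source = [("we", "PRP"), ("need", "VBP"), ("them", "PRP"), ("books", "NNS")]
print(" ".join(refill_by_pos(template, source, seed=1)))
```

Each source word is used at most once here; drawing with replacement would be an equally plausible reading of the description.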

Abstract and Conclusion

  • POS tagging is used to create grammars to generate the Abstract
  • The conclusion is the generated Abstract randomly shuffled.
  • Reference: NLTK

████████╗██████╗░░█████╗░██╗░░░██╗███████╗░██████╗████████╗██╗░░░██╗
╚══██╔══╝██╔══██╗██╔══██╗██║░░░██║██╔════╝██╔════╝╚══██╔══╝╚██╗░██╔╝
░░░██║░░░██████╔╝███████║╚██╗░██╔╝█████╗░░╚█████╗░░░░██║░░░░╚████╔╝░
░░░██║░░░██╔══██╗██╔══██║░╚████╔╝░██╔══╝░░░╚═══██╗░░░██║░░░░░╚██╔╝░░
░░░██║░░░██║░░██║██║░░██║░░╚██╔╝░░███████╗██████╔╝░░░██║░░░░░░██║░░░
░░░╚═╝░░░╚═╝░░╚═╝╚═╝░░╚═╝░░░╚═╝░░░╚══════╝╚═════╝░░░░╚═╝░░░░░░╚═╝░░░

About the Travesty Algorithm

A Travesty Generator for Micros by Hugh Kenner and Joseph O'Rourke was published in BYTE Magazine in 1984. Subtitled "nonsense imitation can be disconcertingly recognizable," the algorithm is an application of Markov chains: it scrambles text in a way that can feel familiar because the output preserves how often each pair (or longer group) of words or characters appears in the original text. (Read more here).

A text, such as a passage from a novel, is, among other things, a set of characters. It consists of so many e's, so many f's, and so on. It's also a set of character pairs (so many ex's, so many ch's, etc.) and of triplets (die's, wkw's, etc.), and so on. For any same-size group of characters — call the size n — it's possible to make a frequency table for a particular text. From that table, another text can be constructed that shares statistical properties but only those properties with the first one. That's what Travesty does. It produces an output text that duplicates the frequencies of n-character groups in the input text. (Description source: Virtual Muse: Experiments in Computer Poetry by CO Hartman, 1996)
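The frequency-table idea described above can be sketched in a few lines. This is an illustrative sketch, not the travesty.py in this repository: the function names, the defaults, and the fixed seed are assumptions, and n must be at least 2.

```python
import random
from collections import Counter, defaultdict

def build_table(text, n):
    """Map each (n-1)-character context to a Counter of the characters that follow it."""
    table = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        context, following = text[i:i + n - 1], text[i + n - 1]
        table[context][following] += 1
    return table

def travesty(text, n=3, length=60, seed=0):
    """Emit text whose n-character groups all occur in the input text."""
    if n < 2:
        raise ValueError("n must be at least 2")
    rng = random.Random(seed)  # assumption: a fixed seed, so the same input gives the same output
    table = build_table(text, n)
    out = text[:n - 1]
    while len(out) < length:
        followers = table.get(out[-(n - 1):])
        if not followers:  # this context only occurs at the very end of the input
            break
        chars, weights = zip(*followers.items())
        out += rng.choices(chars, weights=weights)[0]
    return out

sample = "the theory of the thesis then thins"
print(travesty(sample, n=3, length=40))
```

A larger n makes the output look more like the input; a smaller n makes it more scrambled.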

As stated above, the introduction is generated using text from the syllabus (book_of_units.txt) itself, which describes the goals and topics discussed in the course. In the Travesty part of the algorithm, the word "course" is swapped with "thesis." Because this implementation is not random, the introduction is the same whenever the input text is the same.


███╗░░░███╗░█████╗░██████╗░██╗░░██╗░█████╗░██╗░░░██╗
████╗░████║██╔══██╗██╔══██╗██║░██╔╝██╔══██╗██║░░░██║
██╔████╔██║███████║██████╔╝█████═╝░██║░░██║╚██╗░██╔╝
██║╚██╔╝██║██╔══██║██╔══██╗██╔═██╗░██║░░██║░╚████╔╝░
██║░╚═╝░██║██║░░██║██║░░██║██║░╚██╗╚█████╔╝░░╚██╔╝░░
╚═╝░░░░░╚═╝╚═╝░░╚═╝╚═╝░░╚═╝╚═╝░░╚═╝░╚════╝░░░░╚═╝░░░

About the Markov Algorithm

The "Markov" algorithm refers to the algorithms in this project that generate text based on Markov chains. Markov chains are stochastic models that represent a sequence of possible events, where the probability of each event depends only on the current state. In the context of text generation, Markov chains can regurgitate text by selecting the next word based on the previous one(s), using the probabilities of which words follow them in the source text. A "character-level" Markov chain algorithm uses individual characters ('a', 'b', ...) and the combinations in which they occur in a text to make predictions, while a "word-level" Markov chain algorithm uses whole words ('hello', 'goodbye', ...) and the order in which they occur. Markov algorithms require a prompt (starting point) to begin. The algorithms in this project randomly select a starting word/character from the text corpus.

  • Python adaptions of Markov chains have been used to generate the "Literature Review" and "Methods" sections in the thesis.
  • "Literature Review" is word-level, and "Methods" is character-level
  • Reference: N-grams and Markov chains by Allison Parrish, license below.
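A word-level chain of the kind described in this section might look like the sketch below. It is a simplified illustration, not the code used in regurgitate.py or in the referenced tutorial; the function names, the `order` parameter, and the sample text are hypothetical.

```python
import random
from collections import defaultdict

def build_chain(words, order=1):
    """Map each run of `order` words to the list of words that follow it in the text."""
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(words, order=1, length=20, seed=None):
    """Start at a random position in the corpus and follow the chain."""
    rng = random.Random(seed)
    chain = build_chain(words, order)
    start = rng.randrange(len(words) - order + 1)
    out = list(words[start:start + order])
    while len(out) < length:
        followers = chain.get(tuple(out[-order:]))
        if not followers:  # nothing ever followed this state in the corpus
            break
        # Followers are stored with repeats, so common continuations are picked more often.
        out.append(rng.choice(followers))
    return out

corpus = "the cat sat on the mat and the cat ran".split()
print(" ".join(generate(corpus, order=1, length=8)))
```

Passing `list(text)` instead of `text.split()` turns the same functions into a character-level chain.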

██████╗░░█████╗░██████╗░░█████╗░
██╔══██╗██╔══██╗██╔══██╗██╔══██╗
██║░░██║███████║██║░░██║███████║
██║░░██║██╔══██║██║░░██║██╔══██║
██████╔╝██║░░██║██████╔╝██║░░██║
╚═════╝░╚═╝░░╚═╝╚═════╝░╚═╝░░╚═╝

About the Dada Algorithm

"Dada" text regurgitating algorithms in this project are based on the guide "To Make a Dadaist Poem" (Pour Faire Un Poème Dadaiste, 1920) by Tristan Tzara.

Take a newspaper. Take some scissors. Choose from this paper an article of the length you want to make your poem. Cut out the article. Next, carefully cut out each of the words that make up this article and put them all in a bag. Shake gently. Next, take out each cutting one after the other. Copy conscientiously in the order in which they left the bag. The poem will resemble you. And there you are - an infinitely original author of charming sensibility, even though unappreciated by the vulgar herd. (Translation reference: Pour faire un poème dadaïste [traduction en anglais] by Alma Barroca)

Prenez un journal. Prenez des ciseaux. Choisissez dans ce journal un article ayant la longueur que vous comptez donner à votre poème. Découpez l’article. Découpez ensuite avec soin chacun des mots qui forment cet article et mettez-les dans un sac. Agitez doucement. Sortez ensuite chaque coupure l’une après l’autre. Copiez les consciencieusement dans l’ordre où elles ont quitté le sac. Le poème vous ressemblera. Et vous voilà un écrivain infiniment original et d’une sensibilité charmante, encore qu’incomprise du vulgaire.

The algorithms first split up the text (take some scissors): the character-level algorithm splits the text into characters, and the word-level algorithm splits it into words (carefully cut out each of the words that make up this article). They then randomly (shake gently) select words/characters one at a time to regurgitate text (copy conscientiously in the order in which they left the bag).

  • Python adaptions of "To Make a Dadaist Poem" have been used to generate the "Presentation of Work" and "Discussion" sections in the thesis.
  • "Presentation of Work" is word-level, and "Discussion" is character-level
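Tzara's instructions translate almost directly into code. The sketch below is one possible reading, assuming each cutting leaves the bag exactly once (a shuffle); the function name `dada` and its parameters are hypothetical and may differ from the functions in regurgitate.py.

```python
import random

def dada(text, level="word", seed=None):
    """Cut the text into words or characters, shake the bag, and copy the pieces out in order."""
    rng = random.Random(seed)
    pieces = text.split() if level == "word" else list(text)  # take some scissors
    rng.shuffle(pieces)                                        # shake gently
    separator = " " if level == "word" else ""
    return separator.join(pieces)                              # copy conscientiously

article = "Take a newspaper. Take some scissors."
print(dada(article, level="word"))
print(dada(article, level="char"))
```

The output always contains exactly the same words (or characters) as the input; only the order changes.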

How to Run (and Print) This Code

Download or clone the repository to a computer. In the terminal/command line, navigate to the repository folder: `cd .../text-regurgitation-main`

First run the command below to generate the introduction:

python3 travesty.py book_of_units.txt > travesty-intro.txt

Once the above command is complete, run the following command to produce a thesis:

python3 regurgitate.py

A file output.txt will be written with a "generated" thesis. The thesis will be different each time the above command is run. The thesis will also show in the terminal/command line.

To print using CUPS, use the command line prompt below:

lp -o lpi=10 -o cpi=17 output.txt

Modifications: Adjust the values of lpi (lines per inch) and cpi (characters per inch) as needed. This code was made to create a file with formatting for printing on a thermal printer. If you would like to use your own files, move your text files into the folder as sources.txt, references.txt, book_of_units.txt after removing/renaming the original files. Edit the formatting directly in the regurgitate.py file in the ## formatting variables section.

Help: If travesty.py does not run, make sure your syllabus text file is larger than 2000 characters. If you get IndexError: list index out of range, try adjusting the defaults in travesty.py. If there is trouble with NLTK, see Installing NLTK for installation help.

LICENSES

**These must be preserved with this repository's MPL-2.0 license.**

"Travesty in Python" - MIT License, Copyright (c) 2019 Rodney Shupe

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.



"N-grams and Markov chains" - Copyright © 2018 Allison Parrish

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
