Skip to content

formats text in a zigzag (boustrophedon) manner and then introduces a kind of "compression loss" by corrupting tokens

Notifications You must be signed in to change notification settings

EveryOneIsGross/boustroLOSS

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 

Repository files navigation

boustroLOSS


Boustrophedon and Large Language Models: Context Preservation in Truncated Text

example of boustrophedon:

forgive me i did this on my phone by sight

here is my idea I have a
o seidob gnitacnurt tuob
f text when a portion of
b derucsbo ro tsol si ti
y truncation or physical
 I .tnemucod a ot egamad
 suspect it might help m
 ........txetnoc niatnia

here is my idea I have a
x seidob gnitacnurt tuob
xxtext when a portion of
xxxerucsbo ro tsol si ti
xxxxuncation or physical
xxxxxnemucod a ot egamad
xxxxxxct it might help m
xxxxxxx..txetnoc niatnia

boustrophedon2

Boustrophedon, derived from Greek "βουστροφηδόν" (meaning "as the ox turns"), is an ancient writing style where lines alternate in reading direction. It mirrors the pattern an ox creates when plowing a field, reversing at the end of each row. Boustrophedon is an ancient way of writing manuscripts and inscriptions, where lines alternate between reading left-to-right and right-to-left. The name comes from the Greek word "βουστροφηδόν," which means "as the ox turns," a reference to the pattern created by an ox as it plows a field, turning at the end of each row to start the next.

Boustrophedon could potentially maximize the use of contextual information in a setting where the beginning and end of a text segment are cut off, i.e., token truncation. The logic here is that in a boustrophedon style, the end of one line is the beginning of the next, so if the beginning or end of a text is lost, you still maintain some context because you have remnants of those lost tokens in the adjacent lines. You'd essentially be interlacing the text, meaning each line would have some information from both the preceding and following lines. This could provide more robust information for the model, allowing it to make better sense of the middle section of the text even if parts are missing.

This study hypothesizes that boustrophedon can serve as a method to manage token truncation in Large Language Models (LLMs), such as GPT. In the case of token truncation, the beginning and end of a text segment are removed, leaving only the middle portion. The central premise is that the boustrophedon style, due to its interlacing nature, could help retain more context within these truncated segments.

In a boustrophedon paragraph divided into 'n' lines of equal length, each line potentially provides context for the ones before and after it, forming a 'n' x 'n' contextual matrix. If truncation occurs, the boustrophedon style could facilitate retrieval of some lost contextual information from adjacent lines, acting as a 'contextual chain'.

In a normal left-to-right or right-to-left paragraph, if either the start or the end is truncated, the context is entirely lost. However, with boustrophedon, each line becomes a bridge to the line above and below it, creating a sort of 'contextual chain'. Therefore, if truncation occurs, some parts of the lost context could potentially be retrieved from the adjacent lines.

The usage of boustrophedon in ancient Greek inscriptions on stone tablets might provide an interesting historical context. While maximizing the available writing space could have been a primary reason for this style, a speculative hypothesis could be that the boustrophedon style served as a mechanism for context preservation in case a part of the tablet was damaged. Given the effort involved in engraving text onto stone, boustrophedon allows for the maximization of available space, as the engraver wouldn't need to return to the left side of the tablet for each new line. But in my thoughts if a part of the tablet was damaged, the boustrophedon style could provide a mechanism for preserving some of the context from the lost portion.

Designing an experiment to test this hypothesis would involve:

  1. Data Preparation: Transforming a text corpus into boustrophedon format.

  2. Model Training: Retraining a language model like GPT on the transformed data.

  3. Experiment Design: Comparing the performance of the boustrophedon-trained model with a traditionally trained model on a token truncation task.

  4. Evaluation: Defining evaluation metrics, analyzing results, and determining whether the boustrophedon style offers any performance advantages.

Th attached functioning 💅 python script boustroLOSS.py lets you input a body of text, state how many "tokens" per line and then which side to truncate and corrupt before printing it out formatted in a boustrophedon "tablet".

FUTURE THOUGHTS:

IDEA: Context loss over large token limits might be reduced by shrank logic "tablets" too, which might benefit from boustrolossy shadowing 🤷‍♂️

About

formats text in a zigzag (boustrophedon) manner and then introduces a kind of "compression loss" by corrupting tokens

Topics

Resources

Stars

Watchers

Forks

Languages