Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SegmentedLine #152

Open
karwa opened this issue May 21, 2022 · 3 comments
Open

SegmentedLine #152

karwa opened this issue May 21, 2022 · 3 comments
Labels
enhancement New feature or request

Comments

@karwa
Copy link

karwa commented May 21, 2022

I had a need for a data structure recently and couldn't find any good package implementations. I think it might be a nice addition to this package.

Essentially, the problem was that I needed to generate some Unicode data tables. This required a structure that allowed me to model the entire 21-bit space of Unicode Code Points, and essentially "paint" regions with data I parsed from the Unicode files. The second part was that I had to optimise the data - for example, of the dozen or so Bidi_Class values, I only needed data to group scalars in to 4 categories. Certain code points would be disallowed by earlier stages of the process and could be skipped, and I wanted to combine data sets from different files in order to reduce lookups. At the same time, Unicode scalars are sparse; most of that 21-bit space is unassigned.

Anyway, I had a brief look at the various tree candidates and nothing really suited me. I don't really care so much about lookup time, since the emphasis was on creating something that could support flexible editing operations. I eventually made a simple sort of number-line, which, well, paints regions of a comparable space with a value:

var line = SegmentedLine<Int, String?>(bounds: 0..<100, value: nil)

// After setting values <5 to "small" and values >10 to "large",
// the gap is left with its previous value, "medium".

line.set(0..<20,  to: "medium")
line.set(0..<5,   to: "small")
line.set(10..<60, to: "large")
print(line)
// | [0..<5]: "small" | [5..<10]: "medium" | [10..<60]: "large" | [60..<100]: nil |
let string = "Bob is feeling great"

// Create a SegmentedLine for the collection's contents.
// Start by setting a font attribute over the entire string.

var tags = SegmentedLine(
  bounds: string.startIndex..<string.endIndex,
  value: [Font.custom("Comic Sans")] as [Any]
)

// Set each word to a different color.
// Use 'modify' to append the attribute, but only for the region
// we're modifying.

for word: Substring in string.split(separator: " ") {
  tags.modify(word.startIndex..<word.endIndex) { attributes in
    attributes.append(Color.random())
  }
}

// Check the result.
// - ✅ Every segment still contains the font attribute.
// - ✅ Each word also contains its individual color attribute.

for (range, attributes) in tags.segments {
  print(#""\#(string[range])""#, "-", attributes)
}

// "Bob"     [Font.custom("Comic Sans"), Color.orange]
// " "       [Font.custom("Comic Sans")]
// "is"      [Font.custom("Comic Sans"), Color.green]
// " "       [Font.custom("Comic Sans")]
// "feeling" [Font.custom("Comic Sans"), Color.pink]
// " "       [Font.custom("Comic Sans")]
// "great"   [Font.custom("Comic Sans"), Color.yellow]
// ℹ️ Imagine we have a complex SegmentedLine with lots of small segments
//    capturing granular details, and we'd like to simplify it.

enum ComplexData {
   case categoryA, categoryB, categoryC // ...
}
let complexLine: SegmentedLine<Int, ComplexData> = // ...
print(complexLine)
// | [0..<2]: categoryA | [2..<4]: categoryB | [4..<12]: categoryC | ...

// 1️⃣ Perhaps we can map these to a smaller number of states.

enum SimplifiedData {
   case valid, invalid
}
var simplifiedLine = complexLine.mapValues { complex in
  SimplifiedData(validating: complex)
}
print(simplifiedLine)
// | [0..<2]: valid | [2..<4]: valid | [4..<12]: valid | ...

// 2️⃣ Notice that we have lots of segments for boundaries which
//    which are no longer important. 'combineSegments' can clean them up.

simplifiedLine.combineSegments()
print(simplifiedLine)
// | [0..<2000]: valid | [2000..<2024]: invalid | [2024..<2056]: valid | ...

Implementation:
https://github.com/karwa/swift-url/blob/df08ccec114350c4bb845d28c5d5850c20521cca/Sources/UnicodeDataStructures/Shared/SegmentedLine.swift

It has been super helpful to have this thing. In particular, to do something like reduce the Bidi_Class values, there are 2 dead-simple operations to map the elements, and then to perform a kind of restartable left-fold to gather elements in to larger regions. Like this. I actually generate an indexed static data table straight from the SegmentedLine. It's pretty nifty, I think.

It's nothing particularly groundbreaking, but it has been surprisingly effective and, as I said, I couldn't find any good package solution to solve this. I think it's an interesting design space to think of operations that this kind of structure could make easier.

@karwa karwa added the enhancement New feature or request label May 21, 2022
@lorentey
Copy link
Member

Cc @Azoy

@karwa

This comment was marked as outdated.

@karwa karwa changed the title RangeTable/NumberLine SegmentedLine Jun 9, 2022
@karwa
Copy link
Author

karwa commented Jun 9, 2022

@lorentey I added a modify operation and some better examples to the original post, along with a link to the finished implementation (now called SegmentedLine). There are tests in the package, too, of course.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants