SegmentedLine #152

karwa · 2022-05-21T17:31:58Z

I had a need for a data structure recently and couldn't find any good package implementations. I think it might be a nice addition to this package.

Essentially, the problem was that I needed to generate some Unicode data tables. This required a structure that allowed me to model the entire 21-bit space of Unicode Code Points, and essentially "paint" regions with data I parsed from the Unicode files. The second part was that I had to optimise the data - for example, of the dozen or so Bidi_Class values, I only needed data to group scalars in to 4 categories. Certain code points would be disallowed by earlier stages of the process and could be skipped, and I wanted to combine data sets from different files in order to reduce lookups. At the same time, Unicode scalars are sparse; most of that 21-bit space is unassigned.

Anyway, I had a brief look at the various tree candidates and nothing really suited me. I don't really care so much about lookup time, since the emphasis was on creating something that could support flexible editing operations. I eventually made a simple sort of number-line, which, well, paints regions of a comparable space with a value:

var line = SegmentedLine<Int, String?>(bounds: 0..<100, value: nil)

// After setting values <5 to "small" and values >10 to "large",
// the gap is left with its previous value, "medium".

line.set(0..<20,  to: "medium")
line.set(0..<5,   to: "small")
line.set(10..<60, to: "large")
print(line)
// | [0..<5]: "small" | [5..<10]: "medium" | [10..<60]: "large" | [60..<100]: nil |

let string = "Bob is feeling great"

// Create a SegmentedLine for the collection's contents.
// Start by setting a font attribute over the entire string.

var tags = SegmentedLine(
  bounds: string.startIndex..<string.endIndex,
  value: [Font.custom("Comic Sans")] as [Any]
)

// Set each word to a different color.
// Use 'modify' to append the attribute, but only for the region
// we're modifying.

for word: Substring in string.split(separator: " ") {
  tags.modify(word.startIndex..<word.endIndex) { attributes in
    attributes.append(Color.random())
  }
}

// Check the result.
// - ✅ Every segment still contains the font attribute.
// - ✅ Each word also contains its individual color attribute.

for (range, attributes) in tags.segments {
  print(#""\#(string[range])""#, "-", attributes)
}

// "Bob"     [Font.custom("Comic Sans"), Color.orange]
// " "       [Font.custom("Comic Sans")]
// "is"      [Font.custom("Comic Sans"), Color.green]
// " "       [Font.custom("Comic Sans")]
// "feeling" [Font.custom("Comic Sans"), Color.pink]
// " "       [Font.custom("Comic Sans")]
// "great"   [Font.custom("Comic Sans"), Color.yellow]

// ℹ️ Imagine we have a complex SegmentedLine with lots of small segments
//    capturing granular details, and we'd like to simplify it.

enum ComplexData {
   case categoryA, categoryB, categoryC // ...
}
let complexLine: SegmentedLine<Int, ComplexData> = // ...
print(complexLine)
// | [0..<2]: categoryA | [2..<4]: categoryB | [4..<12]: categoryC | ...

// 1️⃣ Perhaps we can map these to a smaller number of states.

enum SimplifiedData {
   case valid, invalid
}
var simplifiedLine = complexLine.mapValues { complex in
  SimplifiedData(validating: complex)
}
print(simplifiedLine)
// | [0..<2]: valid | [2..<4]: valid | [4..<12]: valid | ...

// 2️⃣ Notice that we have lots of segments for boundaries which
//    which are no longer important. 'combineSegments' can clean them up.

simplifiedLine.combineSegments()
print(simplifiedLine)
// | [0..<2000]: valid | [2000..<2024]: invalid | [2024..<2056]: valid | ...

Implementation:
https://github.com/karwa/swift-url/blob/df08ccec114350c4bb845d28c5d5850c20521cca/Sources/UnicodeDataStructures/Shared/SegmentedLine.swift

It has been super helpful to have this thing. In particular, to do something like reduce the Bidi_Class values, there are 2 dead-simple operations to map the elements, and then to perform a kind of restartable left-fold to gather elements in to larger regions. Like this. I actually generate an indexed static data table straight from the SegmentedLine. It's pretty nifty, I think.

It's nothing particularly groundbreaking, but it has been surprisingly effective and, as I said, I couldn't find any good package solution to solve this. I think it's an interesting design space to think of operations that this kind of structure could make easier.

The text was updated successfully, but these errors were encountered:

lorentey · 2022-05-23T21:59:09Z

Cc @Azoy

karwa · 2022-06-09T16:44:09Z

@lorentey I added a modify operation and some better examples to the original post, along with a link to the finished implementation (now called SegmentedLine). There are tests in the package, too, of course.

karwa added the enhancement New feature or request label May 21, 2022

This comment was marked as outdated.

Sign in to view

karwa changed the title ~~RangeTable/NumberLine~~ SegmentedLine Jun 9, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SegmentedLine #152

SegmentedLine #152

karwa commented May 21, 2022 •

edited

lorentey commented May 23, 2022

This comment was marked as outdated.

karwa commented Jun 9, 2022

SegmentedLine #152

SegmentedLine #152

Comments

karwa commented May 21, 2022 • edited

lorentey commented May 23, 2022

This comment was marked as outdated.

karwa commented Jun 9, 2022

karwa commented May 21, 2022 •

edited