
O(1) indexing on bytes #443

Open
oberblastmeister opened this issue Jun 11, 2022 · 12 comments
Labels
feature request question Requires more investigation

Comments

@oberblastmeister
Contributor

I am interested in making a PR for this. We would have to check that the index is on a correct char boundary. Partial and non-partial variants would be provided.

@Bodigrim
Contributor

Could you please elaborate with type signatures and semantics of suggested functions?

@oberblastmeister
Contributor Author

@Bodigrim

-- | O(1). Returns the Char whose first code unit is at the given index.
-- Panics if the index is not on a valid char boundary.
index :: HasCallStack => Int -> Text -> Char
indexMaybe :: Int -> Text -> Maybe Char

-- | O(1). Returns the slice of the text between the start and end code unit indices.
-- Panics if the indices are not on valid char boundaries.
slice :: HasCallStack => Int -> Int -> Text -> Text
sliceMaybe :: Int -> Int -> Text -> Maybe Text

Detecting if a byte is a char boundary can be done with b < 128 || b >= 192
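For concreteness, the check above can be written as a small helper (a sketch; isCharBoundary is a hypothetical name, not part of text's API):

```haskell
import Data.Word (Word8)

-- A UTF-8 code unit starts a new scalar value unless it is a continuation
-- byte of the form 0b10xxxxxx, i.e. in the range 128..191.
isCharBoundary :: Word8 -> Bool
isCharBoundary b = b < 128 || b >= 192
```

For example, 0x41 ('A') and 0xC3 (the leader of a two-byte sequence) are boundaries, while 0xA9 (a continuation byte) is not.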

@Bodigrim
Contributor

I’m not a fan of partial APIs, to be honest.

Isn’t Data.Text.Foreign enough for your purposes?

@oberblastmeister
Contributor Author

Well, we already have a lot of partial functions, and libraries like bytestring and vector provide similar ones. Functions that return Maybe can be offered too. More importantly, I think O(1) indexing matters: many algorithms on text, such as text search, are naturally implemented in terms of indices and return indices. Currently this can't be done, because there is no O(1) indexing. And I don't think silently ignoring incorrect indices is better than a partial function.

@Lysxia
Contributor

Lysxia commented Jun 11, 2022

Maybe a compromise is to add this to Data.Text.Unsafe, and, in the interest of promoting total functions, only the Maybe version(s). Although technically total, it still breaks the level of abstraction of "text as a sequence of Char". This isn't a rule currently, but I'm suggesting that it could become one for future additions.

Lots of algorithms on text are implemented naturally with indices, and will return indices, such as text search.

I recently added spanM, which lets you do a scan and returns well-formed text slices wherever you decide to stop. Isn't that sufficient to cover a large class of text search algorithms in a well-encapsulated manner, without having to expose knowledge of UTF-8 in their implementations?
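As a sketch of the kind of scan spanM supports (assuming text >= 2.0.1, where spanM :: Monad m => (Char -> m Bool) -> Text -> m (Text, Text) was added; takeAtMost is a hypothetical example name):

```haskell
import qualified Data.Text as T
import Control.Monad.Trans.State (State, evalState, get, put)

-- Split a Text after at most n Chars, threading the running count through
-- State; spanM keeps consuming while the predicate returns True, and both
-- halves of the result are well-formed Text slices.
takeAtMost :: Int -> T.Text -> (T.Text, T.Text)
takeAtMost n t = evalState (T.spanM step t) 0
  where
    step :: Char -> State Int Bool
    step _ = do
      k <- get
      if k < n
        then put (k + 1) >> pure True
        else pure False
```

Here the caller decides where to stop without ever seeing a byte offset.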

@oberblastmeister
Contributor Author

it still breaks the level of abstraction of "text as a sequence of Char".

How does this break the level of abstraction, and why does that matter? We are indexing by the code units that make up a Char; if an index falls in the middle of a Char, it is invalid. Furthermore, Text is fundamentally not a sequence of Char: it is a sequence of Unicode scalar values. I don't see the point of simplifying Text to a sequence of Char.

@Lysxia
Contributor

Lysxia commented Jun 11, 2022

Furthermore, Text is fundamentally not a sequence of Char, as it is a sequence of unicode scalar values. I don't see the point in simplifying Text to a sequence of Char.

OK, there's a subtlety there, but that's beside my point. Replace what I said with "index breaks the level of abstraction of 'text as a sequence of Unicode scalar values'" and I will stand by it.

How does this break the level of abstraction and why does that matter?

If you change the encoding of Text then you also have to change the indices that you are indexing with. This is not a property of the functions in Data.Text (except the "low-level" ones which I argue are out of place): they actually treat Text as a sequence of unicode scalar values and are agnostic to its representation (except for performance).

Maintaining this abstraction means that users don't have to know about the fact that Text is encoded as UTF-8 to effectively use Data.Text.

It is an arbitrary and fairly strong restriction, but it at least provides a well-defined threshold to control the growth of the Data.Text API. It also puts the onus on users who would want to import or reimplement index to convince themselves or their managers that what they want to do cannot be done using the safe and well-abstracted API.

@Bodigrim
Contributor

@oberblastmeister why https://hackage.haskell.org/package/text-2.0/docs/Data-Text-Foreign.html#g:5 are not sufficient for your purposes?

@oberblastmeister
Contributor Author

I think those functions would be good. I understand now that Text should be representation-agnostic. Thank you for explaining.

@oberblastmeister
Contributor Author

oberblastmeister commented Sep 9, 2022

I thought about this for a while, and I am reopening because I think this is useful; I also thought of a good compromise.

First, this function is useful, especially for compilers, parsers, and text-processing algorithms. In a compiler, storing only the start and end UTF-8 indices can be much more efficient than storing lines and columns or something else. It is also flexible, because UTF-8 indices can be converted to other positions easily and efficiently. The same applies to parsers: tracking only byte-level indices and converting them to other positions afterwards is much faster. For example, I made a parser combinator library that is 8-10x faster than attoparsec and megaparsec, and it also uses byte-level indices.

I recently added spanM, which lets you do a scan and returns well-formed text slices wherever you decide to stop. Isn't that sufficient to cover a large class of text search algorithms in a well-encapsulated manner, without having to expose knowledge of UTF-8 in their implementations?

spanM is good, but it doesn't cover use cases that are only slightly more complex. Indices let you move in any direction you like, while spanM can only go forward. If you want to write your own text search algorithm, you cannot use spanM.

@oberblastmeister why https://hackage.haskell.org/package/text-2.0/docs/Data-Text-Foreign.html#g:5 are not sufficient for your purposes?

Everything in that module is incredibly unsafe, and takeWord8 and dropWord8 are no exception. By contrast, UTF-8 indexing returns a Maybe, so there is no unsafety. If we were to add this function, it should not be placed in an unsafe module, because nothing about it is unsafe.

If you change the encoding of Text then you also have to change the indices that you are indexing with. This is not a property of the functions in Data.Text (except the "low-level" ones which I argue are out of place): they actually treat Text as a sequence of unicode scalar values and are agnostic to its representation (except for performance).

Maintaining this abstraction means that users don't have to know about the fact that Text is encoded as UTF-8 to effectively use Data.Text.

To solve this, we should name the function sliceUtf8 and put it in a specialized module such as Data.Text.Encoding. No abstraction is broken: the indexing function is not in the main API, and it is explicitly marked as UTF-8. If text ever changes its internal encoding, the function would still work, albeit more slowly. There is no reason sliceUtf8 should break the abstraction of Text any more than decodeUtf8 does. In fact, you can already implement sliceUtf8 with decodeUtf8, encodeUtf8, and bytestring indexing.
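A sketch of that public-API implementation (O(n) because of the encode/decode round trip, unlike the proposed O(1) version; sliceUtf8 here is the hypothetical function under discussion, not an existing export):

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8', encodeUtf8)

-- Slice by UTF-8 byte indices using only the public API. Indices that fall
-- inside a code point produce invalid bytes, which decodeUtf8' rejects, so
-- the function is total and returns Nothing on bad boundaries.
sliceUtf8 :: Int -> Int -> T.Text -> Maybe T.Text
sliceUtf8 start end t
  | start < 0 || end < start = Nothing
  | otherwise =
      case decodeUtf8' (BS.take (end - start) (BS.drop start (encodeUtf8 t))) of
        Right s -> Just s
        Left _  -> Nothing
```

For "héllo" (where é occupies bytes 1-2), sliceUtf8 1 3 recovers "é", while sliceUtf8 0 2 cuts é in half and returns Nothing.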

@Bodigrim
Contributor

Bodigrim commented Sep 9, 2022

Everything in that module is incredibly unsafe, and takeWord8 and dropWord8 are no exception.

What's unsafe w.r.t. takeWord8 and dropWord8?

@oberblastmeister
Contributor Author

I should have been more careful, because they are actually not unsafe. However, the interface is not ergonomic and the semantics are odd. Slicing is a more common operation than taking and dropping; I can't think of a use case where I would want take/drop over slicing. Silently accepting invalid UTF-8 indices seems dubious, and it is probably slower than just checking whether the given index is valid. Also, the Foreign module seems semi-internal, which doesn't seem good to rely on, and it's not clear what should happen if text changes its internal representation.

@Lysxia Lysxia added feature request question Requires more investigation labels Feb 4, 2023