peekCAText is less efficient than it could be #79

chessai · 2018-11-07T15:59:27Z

Currently it's just defined as:

peekCAText cp = Text.pack <$> peekCAString cp

So it not only calls the less efficient peekCAString from base, but then has to convert the resulting String into Text! This seems unnecessarily expensive. An alternative is something similar to Data.Text.Foreign.peekCStringLen, which goes from CStringLen -> IO Text. One drawback is that it only supports C strings that are valid UTF-8 and throws an exception otherwise. Another drawback is that there's only a CStringLen variant, but that's not hard to get around:

peekCString :: CString -> IO Text
peekCString cs = do
  bs <- Data.ByteString.Unsafe.unsafePackCString cs
  return $! decodeUtf8 bs

This is almost exactly like peekCStringLen, but calls a different function from Data.ByteString.Unsafe, since there's no length information present.
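For comparison, here is a minimal, self-contained sketch that puts the current definition next to the proposed one. The names peekCATextViaString, peekCTextUtf8 and withUtf8CString are hypothetical and only for illustration; the library calls are from base, bytestring and text. It assumes the C string is NUL-terminated and stays alive until decoding has finished.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Unsafe as BU
import Data.Text (Text)
import qualified Data.Text as Text
import Data.Text.Encoding (decodeUtf8, encodeUtf8)
import Foreign.C.String (CString, peekCAString)

-- Current definition, as quoted above: goes through String first.
peekCATextViaString :: CString -> IO Text
peekCATextViaString cp = Text.pack <$> peekCAString cp

-- Proposed definition (hypothetical name): wrap the NUL-terminated buffer
-- without copying, then decode it as UTF-8. decodeUtf8 throws on invalid input.
-- The ($!) forces the decode while the underlying buffer is still valid.
peekCTextUtf8 :: CString -> IO Text
peekCTextUtf8 cs = do
  bs <- BU.unsafePackCString cs
  return $! decodeUtf8 bs

-- Helper for trying both (hypothetical): marshal a Text as a temporary
-- NUL-terminated UTF-8 C string and run an action on it.
withUtf8CString :: Text -> (CString -> IO a) -> IO a
withUtf8CString t = BS.useAsCString (encodeUtf8 t)
```

For ASCII input both functions should return the same Text. They differ on non-ASCII bytes: peekCAString maps each byte to one Char, while the UTF-8 variant decodes multi-byte sequences.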

This seems like a good idea if you're willing to sacrifice support for non-UTF-8 input. You could always use something like isUtf8 from Data.ByteString.Encodings in bytestring-encodings (https://hackage.haskell.org/package/bytestring-encodings-0.2.0.2/docs/Data-ByteString-Encodings.html#v:isUtf8) to verify that the ByteString is UTF-8 encoded before proceeding, but then you'd probably have to return a 'Maybe Text', which doesn't seem worth it.
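For reference, a rough sketch of what the 'Maybe Text' variant could look like. It uses decodeUtf8' from Data.Text.Encoding rather than a separate isUtf8 pass, which is a swapped-in technique and not what is proposed above; the names peekCTextMaybe and peekBytesMaybe are hypothetical.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Unsafe as BU
import Data.Text (Text)
import Data.Text.Encoding (decodeUtf8')
import Foreign.C.String (CString)

-- Hypothetical non-throwing variant: Nothing when the bytes are not valid UTF-8.
peekCTextMaybe :: CString -> IO (Maybe Text)
peekCTextMaybe cs = do
  bs <- BU.unsafePackCString cs
  case decodeUtf8' bs of
    Left _  -> return Nothing
    -- Force the Text before returning, while the buffer is still valid.
    Right t -> t `seq` return (Just t)

-- Hypothetical helper for experimenting: copies the bytes into a temporary
-- NUL-terminated buffer and peeks it. The input must not contain NUL bytes.
peekBytesMaybe :: BS.ByteString -> IO (Maybe Text)
peekBytesMaybe bs = BS.useAsCString bs peekCTextMaybe
```

The cost is that every caller now has to handle the Nothing case, which is the trade-off mentioned above.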


AlexeyRaga · 2018-11-07T22:49:16Z

@chessai We should think about it...
In fact, UTF-8 is not valid in all cases. For example, you cannot create a topic with a non-ASCII UTF-8 name; the canonical Kafka implementation doesn't allow it:

Error while executing topic command : Topic name "test-日本語" is illegal, it contains a character other than ASCII alphanumerics, '.', '_' and '-'
[2018-11-07 22:45:02,085] ERROR org.apache.kafka.common.errors.InvalidTopicException: Topic name "test-日本語" is illegal, it contains a character other than ASCII alphanumerics, '.', '_' and '-'
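The character rule quoted in that error message can be sketched as a small predicate. This is only an illustration of the rule as the message states it (ASCII alphanumerics, '.', '_' and '-'); it does not cover any other constraints Kafka may place on topic names, and the function names are hypothetical.

```haskell
import Data.Char (isAlphaNum, isAscii)
import Data.Text (Text)
import qualified Data.Text as Text

-- A character is allowed if it is an ASCII letter or digit, or one of . _ -
isLegalTopicChar :: Char -> Bool
isLegalTopicChar c = (isAscii c && isAlphaNum c) || c `elem` "._-"

-- Checks only the character set; length and other rules are not checked here.
isLegalTopicName :: Text -> Bool
isLegalTopicName = Text.all isLegalTopicChar
```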

We should look at specs (if there are any) and think carefully about pros and cons...

That may be a bit late, though, because we are already exposing Text as the data type for things like topic names, if I am not mistaken?

chessai · 2018-11-08T00:18:24Z

Yeah, we're already using Text everywhere. Since ASCII is a subset of UTF-8, that would work "fine" with bytestring-encodings's isAscii function; by "fine" I mean that at least the same exception could be thrown.

AlexeyRaga · 2018-12-20T22:58:34Z

@chessai I think you are right, did you have a PR for this?

chessai · 2018-12-21T23:46:29Z

@AlexeyRaga I opened an issue against text, but there's no reason this couldn't live here until it's accepted into text. We can use isAscii, since it seems like these things must be ASCII anyway. I'm worried about throwing an exception though; these things will probably need to be wrapped in an Either or Maybe or something if we don't want to just throw exceptions (as decodeUtf8 does).
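A minimal sketch of the Either-wrapped, ASCII-only idea, under a few assumptions: BS.all (< 0x80) stands in for isAscii from bytestring-encodings so the example only needs bytestring and text, and the PeekError type and the function names are hypothetical.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Unsafe as BU
import Data.Text (Text)
import Data.Text.Encoding (decodeUtf8)
import Foreign.C.String (CString)

-- Hypothetical error type for this sketch.
data PeekError = NotAscii
  deriving (Eq, Show)

-- Hypothetical ASCII-only peek: reports non-ASCII input as a Left
-- instead of throwing.
peekCAsciiText :: CString -> IO (Either PeekError Text)
peekCAsciiText cs = do
  bs <- BU.unsafePackCString cs
  if BS.all (< 0x80) bs   -- stand-in for an isAscii check
    then do
      -- ASCII is valid UTF-8, so decodeUtf8 cannot throw on this input.
      let t = decodeUtf8 bs
      t `seq` return (Right t)
    else return (Left NotAscii)

-- Hypothetical helper for experimenting with raw bytes (no NUL bytes allowed).
peekAsciiBytes :: BS.ByteString -> IO (Either PeekError Text)
peekAsciiBytes bs = BS.useAsCString bs peekCAsciiText
```

Whether a Left, a Nothing, or an exception is the right failure mode here is still the open question.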

chessai · 2018-12-21T23:53:01Z

haskell/text#239

chessai mentioned this issue Nov 7, 2018

there is no need for both peekCText and peekCAText to exist #80

Open
