Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

peekCAText is less efficient than it could be #79

Open
chessai opened this issue Nov 7, 2018 · 5 comments
Open

peekCAText is less efficient than it could be #79

chessai opened this issue Nov 7, 2018 · 5 comments

Comments

@chessai
Copy link
Contributor

chessai commented Nov 7, 2018

currently it's just defined as

peekCAText cp = Text.pack <$> peekCAString cp

so, it not only calls the more inneficient peekCAString from base, but then it must convert the String into text! This seems unnecessarily expensive. There is an alternative in something similar to Data.Text.Foreign.peekCStringLen, which goes from CStringLen -> IO Text. There is a drawback to this, in that it only supports CStrings that are valid UTF-8, and throws an exception otherwise. Another drawback is that there's only a CStringLen variant, but that's not hard to get around:

peekCString :: CString -> IO Text
peekCStringLen cs = do
  bs <- Data.ByteString.Unsafe.unsafePackCString cs
  return $! decodeUtf8 bs

This is almost exactly like peekCStringLen, but calls a different function from Data.ByteString.Unsafe, since there's no length information present.

This seems like a good idea, if you're willing to sacrifice support for non-UTF-8. you could always use something like bytestring-encodings.Data.ByteString.Encoding.isUtf8 (https://hackage.haskell.org/package/bytestring-encodings-0.2.0.2/docs/Data-ByteString-Encodings.html#v:isUtf8) to verify that the ByteString is UTF-8 encoded before proceeding, but then you'd probably have to return a 'Maybe Text', which doesn't seem worth it.

@AlexeyRaga
Copy link
Member

@chessai We should think about it...
In fact UTF8 is not valid for all the cases. For example, you cannot create a topic with UTF8 name, the canonical Kafka implementation doesn't allow it:

Error while executing topic command : Topic name "test-日本語" is illegal, it contains a character other than ASCII alphanumerics, '.', '_' and '-'
[2018-11-07 22:45:02,085] ERROR org.apache.kafka.common.errors.InvalidTopicException: Topic name "test-日本語" is illegal, it contains a character other than ASCII alphanumerics, '.', '_' and '-'

We should look at specs (if there are any) and think carefully about pros and cons...

That can be a bit late though because we have already exposing Text as a data type for things like topic names etc. if I am not mistaken?

@chessai
Copy link
Contributor Author

chessai commented Nov 8, 2018

Yeah, we're already using Text everywhere. Since ASCII is a subset of UTF-8, that would work "fine" with bytestring-encodings's isAscii function, by "fine" i mean that at least the same exception could be thrown.

@AlexeyRaga
Copy link
Member

@chessai I think you are right, did you have a PR for this?

@chessai
Copy link
Contributor Author

chessai commented Dec 21, 2018

@AlexeyRaga I opened an issue against text, but there's no reason this couldn't live here until it's accepted into text. we can use isAscii, since it seems like these things must be ascii anyway. I'm worried about throwing an exception though; these things will probably need to be wrapped in an Either or Maybe or something if we don't want to just throw exceptions (as decodeUtf8 does.)

@chessai
Copy link
Contributor Author

chessai commented Dec 21, 2018

haskell/text#239

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants