Handling of BOM leading characters #19

Open
j256 opened this issue Dec 13, 2016 · 0 comments

From @yongminyan:

Hey @j256, I ran into problems parsing HTML content that starts with a BOM. One example is the byte array "-17, -69, -65, 60, 104, 116, 109, 108, 32" (the first three bytes are the UTF-8 BOM, followed by an <html tag). Another is "-1, -2, 60, 0, 104, 0, 116, 0, 109, 0, 108, 0" (the first two bytes are the UTF-16 little-endian BOM, followed by an <html tag). In both cases the library fails to detect the content as text/html. For this to work, I think we need to fix the issues first and then add proper magic entries, something like

+0      byte 0xEF               
+!:mime text/html
+>1     byte 0xBB               
+>>2    byte 0xBF               UTF-8 Unicode text with BOM
+>>>3   search/1/cb \<html              

and

+# UTF-16 LE
+0      byte 0xFF               
+!:mime text/html
+>1     byte 0xFE               
+>>2    lestring16 \<html                Little-endian UTF-16 Unicode text with BOM

I did not include the magic entries in the pull request because I feel they are not very generic: the same thing could happen with other types such as XML, and with other encodings. I am not sure what the best solution is.
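One possible generic direction, sketched here purely as an illustration, is to detect and skip a leading BOM before any content matching, so that per-type magic entries do not each need a BOM variant. The class `BomSketch` and its methods are hypothetical names and are not part of the library; the sketch only handles the UTF-8 and UTF-16 BOMs and ignores UTF-32.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of the library: skips a leading BOM and
// then checks whether the decoded text starts with "<html".
public class BomSketch {

    // Returns the charset implied by a leading BOM, or null if there is none.
    // UTF-32 BOMs are deliberately not handled in this sketch.
    static Charset bomCharset(byte[] d) {
        if (d.length >= 3 && (d[0] & 0xFF) == 0xEF
                && (d[1] & 0xFF) == 0xBB && (d[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (d.length >= 2 && (d[0] & 0xFF) == 0xFF && (d[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        if (d.length >= 2 && (d[0] & 0xFF) == 0xFE && (d[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        return null;
    }

    // Returns the BOM length in bytes: 3 for UTF-8, 2 for UTF-16, 0 if no BOM.
    static int bomLength(byte[] d) {
        Charset cs = bomCharset(d);
        if (cs == null) {
            return 0;
        }
        return cs.equals(StandardCharsets.UTF_8) ? 3 : 2;
    }

    // Decodes the bytes after the BOM and does a case-insensitive prefix check.
    // Without a BOM, the bytes are treated as single-byte text (an assumption).
    static boolean startsWithHtmlTag(byte[] d) {
        Charset cs = bomCharset(d);
        int skip = bomLength(d);
        if (cs == null) {
            cs = StandardCharsets.ISO_8859_1;
        }
        String text = new String(d, skip, d.length - skip, cs);
        return text.toLowerCase(Locale.ROOT).startsWith("<html");
    }

    public static void main(String[] args) {
        // The two byte arrays quoted in this issue.
        byte[] utf8 = {-17, -69, -65, 60, 104, 116, 109, 108, 32};
        byte[] utf16le = {-1, -2, 60, 0, 104, 0, 116, 0, 109, 0, 108, 0};
        System.out.println(startsWithHtmlTag(utf8));
        System.out.println(startsWithHtmlTag(utf16le));
    }
}
```

Whether this belongs in the matching code or in the magic entries themselves is the open question; the sketch only shows that the two sample inputs become recognizable once the BOM is stripped and the encoding is known.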

Also, I am not sure whether lestring16/bestring16 support the [Bbc] options. The magic5 spec does not say so, but I see that lestring16/bestring16 extend StringTypes. In other words, can we write something like lestring16/cb?

It would be great if you could take a look and answer my two questions above. Thanks a lot!
