Emojis in MySQL/MariaDB #20

Open
chrisgherbert opened this issue Aug 7, 2018 · 6 comments

Comments

@chrisgherbert

Has anyone gotten the emojis to work when the data is loaded into a MySQL or MariaDB database? I'm using utf8mb4 encoding and utf8mb4_unicode_ci collation, but only a small portion of the emojis are displaying properly for me.

@georgedumontier

I think so? I used the same collation on my MySQL db, but I haven't noticed any messed-up emojis. Can you give me an example of one of the tweets you noticed was broken? I'd like to check mine.

@chrisgherbert
Author

chrisgherbert commented Aug 7, 2018

Sure, here are a couple examples:

author: 4EVER_SUSAN
publish_date: 12/9/2015 21:26
content: �Today's the day! My limited edition @maccosmetics lipstick "Von Teese" is now on sale:… https://t.co/EjWdMcyNke

author: 6DRUZ
publish_date: 11/3/2016 20:36
content: Live to learn, Mario �� Hardwork really pays off �� @MariaSharapova #inspiringchildren https://t.co/0z09uExQqZ

Really wish these tweets had IDs.

@georgedumontier

Ah yeah, I've got the same problem...

It might be an issue with the original data. Looks like it's the same in the CSV? The first tweet is line 84788 in the first CSV file, and it's just the Unicode replacement character there too.
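For anyone who wants to check their own copy of the data, here is a minimal sketch that lists the rows whose text already contains U+FFFD before any database is involved. It assumes the CSV header uses the field names shown in the examples above (`author`, `content`, `publish_date`); the helper name `find_damaged_rows` and the in-memory sample are made up for illustration.

```python
import csv
import io

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def find_damaged_rows(csv_file, column="content"):
    """Yield (line_number, row) for rows whose `column` contains U+FFFD."""
    reader = csv.DictReader(csv_file)
    for row in reader:
        text = row.get(column) or ""
        if REPLACEMENT in text:
            # line_num is the physical line on which this row ended
            yield reader.line_num, row


# Hypothetical in-memory stand-in for one of the CSV files.
sample = io.StringIO(
    "author,content,publish_date\n"
    "6DRUZ,Live to learn \ufffd\ufffd Hardwork really pays off,11/3/2016 20:36\n"
    "SOMEONE,No emoji here,1/1/2016 00:00\n"
)

damaged = list(find_damaged_rows(sample))
print(damaged[0][0], damaged[0][1]["author"])
```

For a real file, pass `open(path, encoding="utf-8", newline="")` instead of the `StringIO` object. If rows show up here, the damage was in the source data and not introduced by MySQL or MariaDB.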

@EvanCarroll

This isn't a MySQL or MariaDB issue; I'm facing the same problem with PostgreSQL. The characters themselves are malformed.

@EvanCarroll

Thanks for the help tracking the issue down, @chrisgherbert. The emoji in question is 👏🏻 Clapping Hands: Light Skin Tone. You can figure that out with a hexdump of the characters from the tweet.

The actual UTF-8 byte sequence is f0 9f 91 8f f0 9f 8f bb.

Looking at the bytes in the stream, I see 20 ef bf bd ef bf bd 20, which is two U+FFFD replacement characters between spaces. That's completely different, which means this is likely another encoding error.
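The byte comparison can be reproduced with a short sketch like the one below. The `hexdump` helper is a made-up name, and `bytes.hex` with a separator argument needs Python 3.8 or later.

```python
def hexdump(text):
    """Return the UTF-8 bytes of `text` as space-separated hex pairs."""
    return text.encode("utf-8").hex(" ")


# U+1F44F CLAPPING HANDS SIGN followed by U+1F3FB (light skin tone modifier)
expected = "\U0001F44F\U0001F3FB"
# What the data contains: space, two U+FFFD, space
found = " \ufffd\ufffd "

print(hexdump(expected))  # f0 9f 91 8f f0 9f 8f bb
print(hexdump(found))     # 20 ef bf bd ef bf bd 20
```

Each `ef bf bd` triple is the UTF-8 encoding of one U+FFFD, so the original emoji bytes are no longer present in the data.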

@EvanCarroll

An important note if you're struggling to parse these Unicode characters: in my fork of the repository they're encoded as U+FFFD � REPLACEMENT CHARACTER. This is because we can't do anything with these corrupt Unicode characters.
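To illustrate why nothing can be done once the replacement has happened, here is a small sketch. It simulates corruption by dropping the last byte of each emoji, which is purely a hypothetical stand-in (how the dataset was actually damaged is not known), and the helper name `lossy_decode` is made up.

```python
def lossy_decode(raw):
    """Decode UTF-8, substituting U+FFFD for anything that is not valid."""
    return raw.decode("utf-8", errors="replace")


clap = "\U0001F44F".encode("utf-8")  # f0 9f 91 8f
tone = "\U0001F3FB".encode("utf-8")  # f0 9f 8f bb

# Hypothetical corruption: drop the final byte of each sequence.
damaged_clap = lossy_decode(clap[:-1])
damaged_tone = lossy_decode(tone[:-1])

# Two different emoji collapse to the same replacement text,
# so the original character cannot be recovered afterwards.
print(damaged_clap == damaged_tone)
print("\ufffd" in damaged_clap)
```

Since different inputs end up as identical output, a stored U+FFFD carries no information about which emoji it used to be.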
