Tag strings without a known encoding as ASCII-8BIT. #619

arthurschreiber · 2016-07-08T12:27:43Z

Almost all of the strings used inside a git repository are encoding-unaware.

This includes things like refnames, commit metadata, path names, and a lot more. The exception is commit messages, which can optionally be tagged with an encoding through a header in the commit metadata. We should only tag strings with an encoding if we know the exact encoding, and otherwise tag them as ASCII-8BIT (binary).
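A minimal Ruby sketch of that rule, for illustration only. `tag_git_string` is a hypothetical helper, not a Rugged function, and its `encoding_name` argument stands in for the value of a commit's encoding header (or `nil` when nothing says how the bytes are encoded).

```ruby
# Hypothetical helper, not part of Rugged: relabel raw bytes read from a repository.
# `encoding_name` is the commit's "encoding" header value, or nil when unknown.
def tag_git_string(bytes, encoding_name = nil)
  return bytes.b if encoding_name.nil?   # unknown: tag as ASCII-8BIT (binary)

  begin
    # Relabel a copy; the bytes themselves are not transcoded.
    bytes.dup.force_encoding(Encoding.find(encoding_name))
  rescue ArgumentError
    bytes.b                              # unrecognized encoding name: stay binary
  end
end

refname = tag_git_string("refs/heads/caf\xC3\xA9")       # no encoding information
message = tag_git_string("Gr\xFC\xDFe".b, "ISO-8859-1")  # header names the encoding

p refname.encoding == Encoding::BINARY      # => true
p message.encoding == Encoding::ISO_8859_1  # => true
p message.encode("UTF-8")                   # => "Grüße"
```

The caller decides what to do with a binary string; nothing is guessed on its behalf.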


carlosmn · 2016-07-08T15:07:47Z

ext/rugged/rugged_commit.c

-	return rugged_signature_new(
-		git_commit_author(commit),
-		git_commit_message_encoding(commit));
+	return rugged_signature_new(git_commit_author(commit));


This is its own can of worms, but for some use-cases we have assumed that the encoding of the signature is the same as that of the commit message. This sometimes even holds true, and this is a place where we kinda have a good idea of what the encoding is likely to be, rather than no idea like in the other cases.

I think this is probably an OK assumption to make. I haven't seen any exceptions regarding this in production (only the path and tag stuff).
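To make the assumption concrete, here is a rough Ruby sketch. `tag_signature_field` is a hypothetical helper and not how Rugged implements signatures. The fallback to binary when the bytes don't validate is my own addition, not something proposed in this thread.

```ruby
# Hypothetical helper: tag an author/committer field with the commit message's
# encoding, on the assumption that both were written with the same encoding.
def tag_signature_field(bytes, message_encoding = nil)
  # Git treats a commit without an "encoding" header as UTF-8.
  candidate = bytes.dup.force_encoding(message_encoding || "UTF-8")
  # Assumed fallback: if the bytes are invalid in that encoding, admit we don't know.
  candidate.valid_encoding? ? candidate : bytes.b
rescue ArgumentError
  bytes.b   # the header named an encoding Ruby doesn't recognize
end

latin1_name = tag_signature_field("Jos\xE9".b, "ISO-8859-1")
utf8_name   = tag_signature_field("Jos\xC3\xA9".b)
mismatch    = tag_signature_field("Jos\xE9".b)   # Latin-1 bytes, no header

p latin1_name.encoding == Encoding::ISO_8859_1   # => true
p utf8_name.encoding == Encoding::UTF_8          # => true
p mismatch.encoding == Encoding::BINARY          # => true
```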

tenderlove · 2016-07-08T16:19:33Z

US-ASCII 8-BIT

Just to be clear, US-ASCII and ASCII-8BIT are totally different. US-ASCII is a known encoding, where ASCII-8BIT means "we don't know what the encoding of this string is". All strings should have an encoding, ASCII-8BIT just means that we don't know what the encoding is (or that it's truly binary data like an image or something).
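A quick plain-Ruby illustration of the difference, using only core `String` and `Encoding` methods. The same bytes are tagged both ways.

```ruby
bytes = "caf\xC3\xA9"   # "café" as UTF-8 bytes

as_us_ascii = bytes.dup.force_encoding(Encoding::US_ASCII)
as_binary   = bytes.dup.force_encoding(Encoding::ASCII_8BIT)

# US-ASCII is a real 7-bit encoding, so bytes above 0x7F make the string invalid.
p as_us_ascii.valid_encoding?   # => false
# ASCII-8BIT accepts any byte; the tag only says "no known encoding".
p as_binary.valid_encoding?     # => true
# Encoding::BINARY is another name for the same encoding object.
p Encoding::BINARY == Encoding::ASCII_8BIT   # => true
# The bytes are identical either way; only the tag differs.
p as_us_ascii.bytes == as_binary.bytes       # => true

# Ruby refuses to guess how to transcode high bytes of a binary string.
begin
  as_binary.encode(Encoding::UTF_8)
rescue Encoding::UndefinedConversionError => e
  puts "cannot transcode: #{e.class}"
end
```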

tenderlove · 2016-07-08T16:25:14Z

I was thinking about this last night. I thought it might be nice if we could configure a repository with encodings. Something like this:

repo = Rugged::Repository.new(ARGV[0], path_encoding: ::Encoding::UTF_8)

That way if you're dealing with a repository where you know the encoding of the paths, you can just configure the repo with the encoding to use. We could provide options for the bits of data where we can't know the encoding like paths and tags (those are the only two I can think of, blobs should always be binary).

If we add this configuration value, then we can default it to UTF-8 in order to maintain backwards compatibility.
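A rough sketch of how such a configuration could behave, written as a standalone class so it runs without Rugged. `RepoEncodings`, `path_encoding:` and `tag_encoding:` are hypothetical names for illustration. As far as this thread shows, `Rugged::Repository.new` does not accept these options.

```ruby
# Hypothetical sketch of per-repository encoding settings; not an existing Rugged API.
class RepoEncodings
  # UTF-8 defaults mirror the backwards-compatibility idea described above.
  def initialize(path_encoding: Encoding::UTF_8, tag_encoding: Encoding::UTF_8)
    @path_encoding = path_encoding
    @tag_encoding  = tag_encoding
  end

  def path(bytes)
    retag(bytes, @path_encoding)
  end

  def tag_name(bytes)
    retag(bytes, @tag_encoding)
  end

  private

  # Returns a relabeled copy; the bytes are never transcoded.
  def retag(bytes, encoding)
    bytes.dup.force_encoding(encoding)
  end
end

raw_path = "docs/\xFCbersicht.txt".b   # Latin-1 bytes as they might sit in a tree entry

default_cfg = RepoEncodings.new
latin1_cfg  = RepoEncodings.new(path_encoding: Encoding::ISO_8859_1)

p default_cfg.path(raw_path).valid_encoding?   # => false (not valid UTF-8)
p latin1_cfg.path(raw_path).encode("UTF-8")    # => "docs/übersicht.txt"
```

With the UTF-8 default, a path that isn't UTF-8 still comes back tagged as UTF-8 but invalid, which is the situation this PR is trying to avoid.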

arthurschreiber · 2016-07-11T08:54:50Z

Just to be clear, US-ASCII and ASCII-8BIT are totally different.

Whoops, you're right, I mixed that up! 😄

arthurschreiber · 2016-07-11T08:58:28Z

I thought it might be nice if we could configure a repository with encodings.

I'm not sure this is a good solution. If you have different people that work on a repo, they might have set different encodings on their machines. I don't think that's too uncommon, especially for people working on Windows. 😞

ethomson · 2016-07-11T13:44:23Z

If you have different people that work on a repo, they might have set different encodings on their machines. I don't think that's too uncommon, especially for people working on Windows. 😞

On Windows, at least, NTFS stores filenames as UTF-16 and there is, AFAIK, no way to use any other encoding. This is fortunate, since the Windows Git implementations do an insane amount of UTF-16 <-> UTF-8 conversion to be able to talk to the Windows APIs, which all speak UTF-16 (or UCS-2 in some cases. Yay!)

Similarly, HFS+ will use a canonically decomposed UTF-16.
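For anyone who wants to see what those two transformations do to a filename, here is a small Ruby illustration. It uses Ruby's standard `:nfd` normalization as a stand-in; HFS+ uses its own variant of canonical decomposition, so treat the second half as an approximation.

```ruby
name = "caf\u00E9.txt"   # precomposed é (U+00E9), tagged UTF-8

# UTF-8 <-> UTF-16 conversion, as needed when talking to wide-character APIs.
utf16 = name.encode(Encoding::UTF_16LE)
p name.bytesize                           # => 9
p utf16.bytesize                          # => 16
p utf16.encode(Encoding::UTF_8) == name   # => true (lossless round trip)

# Canonical decomposition: é becomes "e" followed by U+0301 (combining acute).
nfd = name.unicode_normalize(:nfd)
p nfd.bytesize                            # => 10
p nfd == name                             # => false (different bytes, same text)
p nfd.unicode_normalize(:nfc) == name     # => true
```

The last three lines show why a path read back from the filesystem may not be byte-identical to the path stored in a tree, even when both are valid UTF-8.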

I'm just providing data here; I still think that returning these as ASCII-8BIT probably makes a lot of sense.

arthurschreiber · 2016-08-18T11:48:30Z

@tenderlove Is this going to break anything if we would push this into production as-is?


tenderlove · 2016-08-18T13:22:00Z

I think it will work, but we should test. I think we're retagging them as binary in all places, so this should be OK.

Aaron Patterson
http://tenderlovemaking.com/
I'm on an iPhone so I apologize for top posting.

On Aug 18, 2016, at 7:48 AM, Arthur Schreiber notifications@github.com wrote:

@tenderlove Is this going to break anything if we would push this into production as-is?


carlosmn reviewed Jul 8, 2016

Prefer String#b over String#force_encoding('ascii-8bit').

ef77734

Merge branch 'master' of https://github.com/libgit2/rugged into arthu…

7a9965c

…r/binary-all-the-things
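
For context on the "Prefer String#b over String#force_encoding('ascii-8bit')" commit above: the thread doesn't spell out the motivation, but the behavioral difference between the two core Ruby methods is easy to show. `String#b` returns a binary-tagged copy, while `String#force_encoding` relabels the receiver in place.

```ruby
original = +"refs/tags/v1.0"   # unary plus gives a mutable string

copy = original.b
p copy.encoding == Encoding::BINARY        # => true
p original.encoding == Encoding::UTF_8     # => true (receiver untouched)
p copy.equal?(original)                    # => false (a new object)

same = original.force_encoding("ascii-8bit")
p same.equal?(original)                    # => true (same object returned)
p original.encoding == Encoding::BINARY    # => true (changed in place)

# String#b also works on frozen strings, where force_encoding raises.
frozen = "HEAD".freeze
p frozen.b.encoding == Encoding::BINARY    # => true
begin
  frozen.force_encoding("ascii-8bit")
rescue FrozenError => e
  puts "force_encoding failed: #{e.class}"
end
```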
