Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String should be officially immutable & "abc" === "abc" #22193

Closed
StefanKarpinski opened this issue Jun 2, 2017 · 30 comments
Closed

String should be officially immutable & "abc" === "abc" #22193

StefanKarpinski opened this issue Jun 2, 2017 · 30 comments
Assignees
Labels
domain:strings "Strings!"

Comments

@StefanKarpinski
Copy link
Sponsor Member

As of 0.6, you can no longer do "abc".data and access a mutable Vector{UInt8} of the underlying bytes for a string. In 1.0 it would be desirable to have String object be truly immutable and have "abc" === "abc".

@yuyichao
Copy link
Contributor

yuyichao commented Jun 2, 2017

The way I want to implement this is vararg elements (similar to C variable length array elements). It can replace the special case for svec and string and also make strings more ABI comparible with C (the usefulness of that is a different question).

With the ref def and ops in #21912 it should be possible to implement the surface API.

@tkelman
Copy link
Contributor

tkelman commented Jun 2, 2017

You can still mutate strings on 0.6, just not by the previous .data name

julia> a = "abc"
"abc"

julia> v = Vector{UInt8}(a)
3-element Array{UInt8,1}:
 0x61
 0x62
 0x63

julia> v[2] = 0x50
0x50

julia> a
"aPc"

@yuyichao
Copy link
Contributor

yuyichao commented Jun 2, 2017

FWIW, doing this might actually make "abc".data accessible again, though it won't be a Vector{UInt8}.

@JeffBezanson JeffBezanson added the domain:strings "Strings!" label Jun 2, 2017
@sbromberger
Copy link
Contributor

You can still mutate strings on 0.6, just not by the previous .data name

This confuses me a bit. Why should this work, but

julia> push!(v, 0x50)
4-element Array{UInt8,1}:
 0x61
 0x62
 0x63
 0x50

julia> a
"abc"

not impact a?

@StefanKarpinski
Copy link
Sponsor Member Author

You can still mutate strings on 0.6, just not by the previous .data name

Yes, but that need not be the case in 1.0. Having genuinely immutable Strings has many advantages, and we're very well positioned to make such a change since it's already atypical to mutate them. There will always be some ways to mutate strings since you can always poke at memory, but as long as we make it sufficiently hard / put up dire enough warning signs, it's ok.

@yuyichao
Copy link
Contributor

yuyichao commented Jun 2, 2017

This confuses me a bit. Why should this work, but

Neither is defined behavior.

@sbromberger
Copy link
Contributor

@yuyichao ok, but what impact does it have? My interest is also security-related. What happens to that extra byte with respect to a?

@KristofferC
Copy link
Sponsor Member

@yuyichao What command causes undefined behaviour here?

julia> v = Vector{UInt8}(a)
3-element Array{UInt8,1}:
 0x61
 0x62
 0x63

julia> v[2] = 0x50
0x50

My point is, there is no unsafe-stuff used so doing this should be allowed.

@yuyichao
Copy link
Contributor

yuyichao commented Jun 2, 2017

My point is, there is no unsafe-stuff used so doing this should be allowed.

There's no unsafe-stuff used so doing this shouldn't crash.

@yuyichao
Copy link
Contributor

yuyichao commented Jun 2, 2017

What happens to that extra byte with respect to a?

Unrelated

@KristofferC
Copy link
Sponsor Member

Is there any docs what is undefined behaviour then? Like modifying an array from a string.

@sbromberger
Copy link
Contributor

sbromberger commented Jun 2, 2017

Unrelated

can you elaborate? If v is sort of a view into a (that is, &v == &a?), then what happens here? After the push, I cannot mutate a via v any more:

julia> a = "abc"
"abc"

julia> v = Vector{UInt8}(a)
3-element Array{UInt8,1}:
 0x61
 0x62
 0x63

julia> push!(v, 0x50)
4-element Array{UInt8,1}:
 0x61
 0x62
 0x63
 0x50

julia> a
"abc"

julia> v[2] = 0x50
0x50

julia> a
"abc"

@mbauman
Copy link
Sponsor Member

mbauman commented Jun 2, 2017

Resizing the vector forced a re-allocation:

julia> a = "abc"
"abc"

julia> v = Vector{UInt8}(a)
3-element Array{UInt8,1}:
 0x61
 0x62
 0x63

julia> pointer(v)
Ptr{UInt8} @0x000000011bf6f138

julia> push!(v, 0x50)
4-element Array{UInt8,1}:
 0x61
 0x62
 0x63
 0x50

julia> pointer(v)
Ptr{UInt8} @0x000000011be71b58

@sbromberger
Copy link
Contributor

@mbauman that's really interesting; thanks. This is too:

julia> pop!(v)
0x63

julia> a
"ab\0"

@yuyichao
Copy link
Contributor

yuyichao commented Jun 2, 2017

After the push, I cannot mutate a via v any more:

That's what I mean. The byte pushed (and also the new vector) is unrelated to the string.

@sbromberger
Copy link
Contributor

Fun with mutations:

julia> shuffle!(v)
3-element Array{UInt8,1}:
 0x62
 0x63
 0x61

julia> a
"bca"

@sbromberger
Copy link
Contributor

More fun:

a = "abc"
a.len = -9999
println(a)

(don't do this if you want to keep your julia session / terminal running.)

@vtjnash
Copy link
Sponsor Member

vtjnash commented Jun 2, 2017

There's no unsafe-stuff used so doing this shouldn't crash

Er, except for the modification of the array. I think the gc can get confused when you do that and end up freeing the wrong memory. So, word to the wise: don't try to write to a Vector generated from a String (unless you know for certain the Vector contains the only reference to the string)

@yuyichao
Copy link
Contributor

yuyichao commented Jun 2, 2017

except for the modification of the array

I'm pretty sure the GC is fine with that. I think I've also checked that the realloc one won't cause early free either. Modifying len is another story, it'll crash the GC. This is yet another thing that will be fixed with vararg member since the length will not be a field anymore.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Jun 3, 2017

I thought we decided the other way? Well that's good at least.

@JeffBezanson
Copy link
Sponsor Member

We could also hide the len field now if we wanted to.

@JeffBezanson
Copy link
Sponsor Member

len field has been removed, so strings no longer have any visible structure that can be mutated. We can't make them true immutable objects until we've implemented variable-size objects, which probably won't happen for 1.0. Also, AFAICT we'll still need the ability to share memory between vectors and strings for efficient I/O. I can look at making === work, but these concerns make it a bit dicey.

@JeffBezanson JeffBezanson modified the milestones: 1.x, 1.0 Jul 26, 2017
JeffBezanson added a commit that referenced this issue Jul 26, 2017
treat String as immutable in `===`. part of #22193
@samoconnor
Copy link
Contributor

If we're saying that String is an immutable opaque value, then I think it would make sense for it not to also be an Iterable{Char}. We have codepoints(::String) to get an Iterable{UInt8}. We could have chars(::String) -> Iterable{Char} instead of String being iterable itself.

This would also be consistent with lines(::String), graphemes(::String), words(::String), tokens(::String) etc.

julia> f(collection) = for i in collection println(i) end
julia> f([1,2,3])
1
2
3

julia> f(1)
1

julia> f(["one", "two", "three"])
one
two
three

julia> f("one")
o
n
e

should be ... ?

f("one")
one

@JeffBezanson
Copy link
Sponsor Member

That is highly related to the discussion on whether Strings are scalars in broadcast (#18618). That might be a reason to make strings scalar-like, but I don't think immutability has anything to do with it.

@JeffBezanson
Copy link
Sponsor Member

Also we now have "abc" === "abc" so I think this can be closed.

@EricForgy
Copy link
Contributor

EricForgy commented Feb 20, 2018

This bit me in WebSockets :(

#22193 (comment)

where we apply a mask to the byte representation of a String.

Thank you @samoconnor for finding this. I never would have found the problem myself. Shifu 🙏

image

@StefanKarpinski
Copy link
Sponsor Member Author

Do you have a take-away recommendation from that experience? I believe that in 0.7 you cannot mutate the memory backing a string without at some point doing something explicitly unsafe.

@EricForgy
Copy link
Contributor

My take away is that I am looking forward to v0.7 😊 At least there would have been a warning.

@samoconnor already has an unsafe version

HTTP.bytes("Foo")

https://github.com/JuliaWeb/HTTP.jl/blob/master/src/IOExtras.jl#L15-L32

We should encourage use of that (or similar). No recommended changes from this experience except I'm looking forward to v0.7 😊

That was painful though. I literally sunk a few days into that, but it is all good now 👍

@StefanKarpinski
Copy link
Sponsor Member Author

StefanKarpinski commented Feb 20, 2018

I can sympathize. One of the harder bugs I've ever tracked down involved a mutated string in Ruby where the mutator was deep in some obscure function call.

@samoconnor
Copy link
Contributor

Credit for HTTP.bytes goes to @quinnj. He came up with that as a work around when 0.7 started warning about future copying in the Vector{UInt8} constructor.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:strings "Strings!"
Projects
None yet
Development

No branches or pull requests

10 participants