
deepcopy: update the meaning for objects like BigInt #24875

Closed
wants to merge 2 commits

Conversation

rfourquet
Member

Starting from bug #16667, two fixes have been proposed: #16826, which implements deepcopy(::BigInt) as the identity, and #16999, which implements deepcopy by creating a totally independent new BigInt object.

As far as I could understand, the rationale for the latter (which was the one merged) was to follow a strict interpretation of the docstring for deepcopy, which mentions in particular a "fully independent object" and also "Calling deepcopy on an object should generally have the same effect as serializing and then deserializing it." @yuyichao argued, for example, that this definition forces us to implement deepcopy such that its result is indistinguishable from serializing and then deserializing.

I believe the correct fix to the initial bug is the first PR, as I see no point in wasting resources on copying the internals of BigInt when the semantics and the API for this type are to be immutable. If the definition of deepcopy prevents defining deepcopy(x::BigInt) = x, then the definition needs to be fixed, which is what I tried to do here (suggestions for improvements are very welcome), in addition to defining deepcopy as the identity. I also cherry-picked the tests implemented by @ScottPJones in his PR.
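For concreteness, a minimal sketch of the two competing definitions (an illustration only, written with the modern IdDict spelling and assuming the internal Base.GMP.MPZ.set wrapper; these are not the exact diffs of the two PRs):

# Option 1 (#16826-style): BigInt is immutable at the API level, so share it.
Base.deepcopy_internal(x::BigInt, stackdict::IdDict) = x

# Option 2 (#16999-style): duplicate the limb buffer for a fully independent
# object, remembering it in the stackdict to preserve aliasing.
function Base.deepcopy_internal(x::BigInt, stackdict::IdDict)
    get!(() -> Base.GMP.MPZ.set(x), stackdict, x)::BigInt
end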

It's possible that I missed a use case for the stricter interpretation (copying the internals), but none have been exhibited so far. On the other hand, the use case for deepcopy == identity is simply performance, for when one has to deepcopy a structure containing BigInts.

@rfourquet rfourquet added the domain:bignums BigInt and BigFloat label Dec 1, 2017
@vtjnash
Sponsor Member

vtjnash commented Dec 1, 2017

such that its result is indistinguishable from serializing and then deserializing

I think this is still true – that our serializer does make a new BigInt should be seen as simply an implementation detail which shouldn't be assumed generally. (for example, a hypothetical serializer might use a special representation for common numbers like 0 and 1).

On the other hand, it seems like deepcopy here is intended for cases where you explicitly want a new copy of the internals of BigInt. (otherwise, the user should perhaps have just used the more reliable and fast copy(x); map!(copy, x, x); etc.)

@nalimilan
Member

On the other hand, it seems like deepcopy here is intended for cases where you explicitly want a new copy of the internals of BigInt. (otherwise, the user should perhaps have just used the more reliable and fast copy(x); map!(copy, x, x); etc.)

If you want to copy a DataFrame so that its columns are also copies of the original columns, you need to call deepcopy, which will call deepcopy on all arrays recursively. I don't think in that case you would want to copy the internals of BigInt, which as you said are an implementation detail.
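Concretely (a sketch; df[:n] assumes the DataFrames column-indexing API, and the === results are what this PR would give):

using DataFrames

df = DataFrame(n = BigInt[1, 2, 3])
df2 = deepcopy(df)
df2[:n] === df[:n]        # false: the column vector is a fresh copy
df2[:n][1] === df[:n][1]  # true under this PR: the BigInts themselves are shared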

@rfourquet
Member Author

it seems like deepcopy here is intended for cases where you explicitly want a new copy of the internals of BigInt.

This is the question at stake here, depending on whether we consider deepcopy an internal function or a user-facing function. In the latter case, which I support, we (almost) never want to copy BigInt internals, and in the rare case where it's needed, a custom copying function can be implemented. I tried to explain a possible "rule" for deepcopy in the docstring, with "atoms" and "containers", which agrees with that; I hope it makes sense.

@thofma
Contributor

thofma commented Dec 2, 2017

You introduce the word "usually" so that deepcopy(x::BigInt) = x won't be an exception anymore?

I don't think this helps. This just opens the door for more and more arbitrary exceptions and less and less predictable behavior of deepcopy.

I don't understand why this has to be discussed again. Everything was said in #16999.

@rfourquet
Member Author

You introduce the word "usually" so that deepcopy(x::BigInt) = x won't be an exception anymore?

I introduce the word "usually" so that "fully independent object" is not taken too strictly or as a binding definition, which I think was the motivation for Yichao's PR.

This just opens the door for more and more arbitrary exceptions and less and less predictable behavior of deepcopy.

I'm fine with removing this part of the docstring altogether. For me, the part I introduced with "atoms" and "containers" makes more sense, and in this respect, BigInt is not anymore an exception in this PR.

I don't understand why this has to be discussed again. Everything was said in #16999.

Because no consensus was found in that PR. Yichao's work was to fix a bug in a way that agrees with his interpretation of the docstring of deepcopy. At least two of us had another interpretation of it, so I propose here to update the docstring to convey that alternative interpretation. I don't think either interpretation is strictly more correct, but I want to defend here the one I find more useful.

@chethega
Contributor

chethega commented Dec 3, 2017

I disagree. deepcopy is extremely convenient if you construct a big data structure in a nontrivial way, and afterwards want to improve memory-layout for faster access times.

Nm = 5_000_000; V1 = [BigInt(rand(1:100)) for i = 1:Nm]; shuffle!(V1); V2 = deepcopy(V1);

sum(V1); @time sum(V1); @time sum(V1);
  1.164729 seconds (7 allocations: 224 bytes)
  1.040056 seconds (7 allocations: 224 bytes)


sum(V2); @time sum(V2); @time sum(V2);
  0.135768 seconds (7 allocations: 224 bytes)
  0.129050 seconds (7 allocations: 224 bytes)

People are using deepcopy for a reason. No tool should presume it knows better than the user.

Edit: I think it would be more important to document the traversal order (I think DFS?) of deepcopy, so that users can better anticipate the resulting memory layout. Offering both DFS and BFS order would be nice but probably not worth the effort.

@rfourquet
Member Author

deepcopy is extremely convenient if you construct a big data structure in a nontrivial way, and afterwards want to improve memory-layout for faster access times.

That's very interesting, and as far as I can tell it's the first argument given in favor of copying the internals of BigInt, besides the case where one wants to mutate the BigInt objects, a use case which even Yichao said is not supported by Julia.

People are using deepcopy for a reason. No tool should presume it knows better than the user.

I intended to expand on my motivations here tomorrow, but the gist of it is that I believe there are two classes of use cases for deepcopy, with distinct definitions, and I want to discuss here whether we address both in Base or choose one. So I don't agree it's about presuming what is better for the user; it's rather about which use cases we serve. Again, it's nice that you exhibited this example. On the other hand, a user wanting to deepcopy a structure containing BigInt objects which have not been shuffled will have better luck with the version in this PR: for example, copy(V1) is 1000 times faster than deepcopy(V1) on my machine, which should be an indication of the gain we can get, if I'm not mistaken.

So I could imagine using a keyword argument to let the user choose, like deepcopy(x; internals=false).

@chethega
Contributor

chethega commented Dec 3, 2017

I thought cleaning up memory fragmentation and over-allocated Vectors was the main use case for deepcopy? If I want a shallow (fast) copy I would use copy instead of deepcopy.

While the shuffle is somewhat pathological, I often have temporary allocs or need to apply permutations (e.g. sorting) during construction of big objects, and afterwards need to do a lot of in-place read/write operations. I'm not a BigInt user though, so maybe the BigInt community has different use cases?

@rfourquet
Member Author

If I want a shallow (fast) copy I would use copy instead of deepcopy.

Of course, but if your BigInt objects are nested in a structure, e.g. an array of arrays of BigInt, you may need a deepcopy of it to also copy the inner arrays, with the intention to mutate the inner arrays, while not wanting to copy the BigInt internals.
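In code, the nested case looks like this (a sketch; the last comparison is what this PR would give):

v = [[big(1), big(2)], [big(3)]]
w = deepcopy(v)
push!(w[1], big(4))   # safe: the inner arrays are fresh copies
w[1][1] === v[1][1]   # true with this PR (BigInts shared), false on current master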

@chethega
Contributor

chethega commented Dec 3, 2017

Of course, but if your BigInt objects are nested in a structure, e.g. an array of arrays of BigInt, you may need a deepcopy of it to also copy the inner arrays, with the intention to mutate the inner arrays, while not wanting to copy the BigInt internals.

Fair enough. So in the end, you think deepcopy should rather try to approximate a lazy-copy?

Then one should probably create a separate function, i.e. deepcopy vs lazydeepcopy, and then think about whether one can do some hare-brained copy-on-write scheme for this?

A lazydeepcopy would actually be quite awesome, and possibly achievable by using shared (mmap-ed) arrays.

@rfourquet
Member Author

I thought cleaning up memory fragmentation and over-allocated Vectors was the main use case for deepcopy?

I was not aware at all of this use case; it was not mentioned in the discussions concerning BigInt (AFAICR), so it's nice to note it here, as it's also not made explicit in the docstring. So input is welcome here to clarify what we want to accomplish with deepcopy.

@rfourquet
Member Author

rfourquet commented Dec 3, 2017

So in the end, you think deepcopy should rather try to approximate a lazy-copy?

I didn't think in these terms... but lazy-copy would be a possible implementation of my understanding of deepcopy: copy is for when one wants to mutate an object while keeping a version of the original, and deepcopy for when one also wants to mutate the contained objects, while still keeping an original intact. But as BigInt is not supposed to be mutated, it is not to be copied during a deepcopy (again, according to this version/definition of deepcopy).

@oscardssmith
Member

I think a lot of the problem is that deepcopy is being used to do two very different things. The first is "get me a version of this object that will not change if I change the original". This should not mess with internal layout. The second is "make this BigInt use a more memory-efficient layout". I think it is clear that the problem here is that this second use is not asking for an identical copy as deepcopy wants, but rather a better version with distinct properties from the original. The solution, as I see it, would be to make a new method, maybe simplify, that performs the second function, and make deepcopy follow the path of this PR. We should probably also have a NEWS entry explaining this change.
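For illustration, one possible shape for such a function (the name simplify and this one-liner are hypothetical, not part of this PR):

simplify(v::Vector{BigInt}) = [x + 0 for x in v]  # x + 0 allocates a fresh BigInt

V2 = simplify(V1)  # contiguous, defragmented internals, independent of what deepcopy does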

@chethega
Contributor

chethega commented Dec 4, 2017

After thinking about it, you are right. Deepcopy should split into a deepcopy_fast and a different deepcopy_mem (whatever their names end up being).

Then the future development of both is also clear: deepcopy_mem should attempt to talk to the allocator (e.g. allocate fresh old-generation pools, create the objects as old, allow for easy paging-out of the copy, maybe play some offset games to prevent false aliasing, and ensure that the temporary allocs for the object-id dict do not fragment the same pools as the deepcopy). It should also become part of the performance tips.

deepcopy_fast, on the other hand, can look for more unreliable speedups, like skipping the bignum copies or (one can dream) copy-on-write. Especially a deepcopy_fast of a deepcopy_mem object should allow CoW, because the deepcopy_mem result is nicely contained and not splattered all over the heap.

@vtjnash
Sponsor Member

vtjnash commented Dec 4, 2017

If you want to copy a DataFrame so that its columns are also copies of the original columns, you need to call deepcopy, which will call deepcopy on all arrays recursively

Seems like this is something that should be addressed in DataFrames

Deepcopy should split into a deepcopy_fast and a different deepcopy_mem

I think you mean deepcopy_all and deepcopy_dwim 😜 .

edit: the Wiktionary link looks like a better reference source for those not familiar with DWIM

@oscardssmith
Member

Shouldn't it be split into deepcopy and make_it_faster? Based on the use case presented above it is being used not as "Get me a copy", but rather "simplify the memory layout". Therefore it should not be called deepcopy*, but rather simplify or something to that effect.

@rfourquet
Member Author

but rather simplify or something to that effect.

I would guess that some people would use this version to know that they can mutate e.g. a BigInt object without affecting the original, in which case simplify is not really the idea.

Also, I believe that the two versions would share their implementation in many cases, so it could be more economical for implementors to have only one function, with a keyword for discriminating the variant, which can be ignored or passed down recursively when it doesn't matter, e.g. (in a non-rigorous way and forgetting about deepcopy_internal and stackdict):

function deepcopy(x::Array; strict::Bool=false) # strict=true to deepcopy as deep as possible, even BigInt internals
    dest = similar(x)
    for i in eachindex(x)
        dest[i] = deepcopy(x[i], strict=strict)
    end
    return dest
end

although strict is not so descriptive either.

@nalimilan
Member

nalimilan commented Dec 5, 2017

If you want to copy a DataFrame so that its columns are also copies of the original columns, you need to call deepcopy, which will call deepcopy on all arrays recursively

Seems like this is something that should be addressed in DataFrames

@vtjnash It could certainly be changed in DataFrames if we wanted to. I mentioned it because that choice depends on how we define copy/deepcopy. The current behavior is useful if you want a shallow copy of a DataFrame so that you can add a column without modifying the original. If we changed copy to recursively call copy on the columns, that cheap operation would no longer be possible.

@StefanKarpinski
Sponsor Member

I suppose deepcopy could take a depth keyword argument. Then the two-level copy behavior would be deepcopy(df, depth=2).
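Roughly (a hypothetical sketch, ignoring deepcopy_internal, the stackdict, and object identity — which turns out to matter, see below):

function deepcopy_depth(x::AbstractArray, depth::Int)
    depth <= 0 && return x
    dest = similar(x)
    for i in eachindex(x)
        dest[i] = deepcopy_depth(x[i], depth - 1)
    end
    return dest
end
deepcopy_depth(x, depth::Int) = depth <= 0 ? x : deepcopy(x)  # non-array leaves

# deepcopy_depth(cols, 2) on a vector of column vectors copies the outer
# vector and the columns themselves, and shares everything below.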

@StefanKarpinski
Sponsor Member

StefanKarpinski commented Dec 5, 2017

FWIW, I haven't read all of this thread but I don't think that copy or deepcopy should need to duplicate BigInts or BigFloats since they are officially immutable. Since we have a "private" and unstable API for carefully mutating these values, there could be a private copy function (bigcopy?) as part of that API but since the public API is immutable, the public copy and deepcopy functions should behave accordingly. This should, of course, be thoroughly documented.

@StefanKarpinski
Sponsor Member

Also, in the future, we reserve the right to make bignums actually immutable, so anyone who is relying on the current behavior is going to have their code broken. We may as well break it now.

@chethega
Contributor

chethega commented Dec 5, 2017

Re @StefanKarpinski

Since you did not read the entire thread, I'll repeat myself. Please reconsider:

Nm = 5_000_000; V1 = [BigInt(rand(1:100)) for i = 1:Nm]; shuffle!(V1); V2 = deepcopy(V1);

sum(V1); @time sum(V1); @time sum(V1);
  1.164729 seconds (7 allocations: 224 bytes)
  1.040056 seconds (7 allocations: 224 bytes)

sum(V2); @time sum(V2); @time sum(V2);
  0.135768 seconds (7 allocations: 224 bytes)
  0.129050 seconds (7 allocations: 224 bytes)

Currently, deepcopy has the extremely useful side-effect of getting rid of heap-fragmentation and making subsequent access (in DFS==deepcopy order) much more cache-friendly. Regardless of the default behavior of deepcopy, this should remain conveniently accessible. In order to be more realistic, mentally replace shuffle! by some sort!.

@StefanKarpinski
Sponsor Member

With a private bignum copy API you would just replace V2 = deepcopy(V1) with V2 = bigcopy.(V1).
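Sketched out (bigcopy is hypothetical; it assumes the internal Base.GMP.MPZ.set wrapper):

bigcopy(x::BigInt) = Base.GMP.MPZ.set(x)  # fresh BigInt with its own limb buffer

V2 = bigcopy.(V1)  # broadcast: newly allocated, contiguously placed BigInts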

@chethega
Contributor

chethega commented Dec 5, 2017

@rfourquet I really like your strict version.

@StefanKarpinski depthlimit:

In principle, deepcopy is called on a directed graph, following object references and respecting object identity (graph homomorphism). What is the proposed behavior if an object X is reachable by paths of different lengths? Say:

Root <-> A1 <-> A2 <-> X

Now we must copy everything (if we want to preserve object identity), regardless of the depth limit and the length of the chain. Preserving object identity by depth level must do something like "take the transitive closure, choose a favorite way of assigning levels in the Hasse diagram, go that far for the copy but no further". This sounds far too fancy for Base.

@StefanKarpinski
Sponsor Member

I'm not sure I see the problem with depth-limiting deepcopy: put a limit on recursion equal to the value of depth and stop making copies once the limit reaches zero. You get a copy of something if it's reachable within that number of recursive steps, otherwise you don't. Regardless, the depth thing is irrelevant to whether copy and deepcopy should actually copy bignums or not.

@vtjnash
Sponsor Member

vtjnash commented Dec 6, 2017

You get a copy of something if it reaches something that is reachable within that number of recursive steps, otherwise you don't.

ftfy. It's a problem mostly just in that it's subtle, and would give even worse performance than deepcopy does now. (Also acceptable would be "is reached within", which is much faster, but may destroy aliasing.)

@chethega
Contributor

chethega commented Dec 8, 2017

Regardless, the depth thing is irrelevant to whether copy and deepcopy should actually copy bignums or not.

True. What about the strict option? So, say, copy and deepcopy don't copy unless strict=true? Then the default could change in a later version, without breaking anything, if strict turns out to be more useful in practice.

Also, won't the same question apply to strings if they ever become immutable (cf. #22193)?

@StefanKarpinski
Sponsor Member

StefanKarpinski commented Dec 8, 2017

The strict thing seems pretty weird to me. What does it mean in general? When can you pass the strict keyword to copy and deepcopy? Strings are already immutable on master.

@chethega
Contributor

Strict deepcopy would mean "really, really make a copy; make it so that the resulting object tree is indistinguishable from serializing-and-deserializing, even if I abuse pointers, and including heap placement". This means that we need to copy the actual bytes of bignums and strings into newly allocated space (allocated in DFS order, hence pretty predictably placed on the heap), and the deepcopy works even if someone later uses an unsafe_store! to mutate bignums in-place.

Non-strict would mean "give me a semantic deepcopy, as fast as possible", which allows us to be faster and allocate less memory.

In 99% of cases, we have no optimization for "strict=false" and both do the same thing.

After such a change, A=deepcopy(A, strict=true) would be the preferred way of cleaning up an object-tree with bad placement (maybe because A was constructed over a long time, or because the user applied a permutation like sorting some arrays), and B=deepcopy(A, strict=false) would be the preferred way of pulling a copy for algorithmic reasons.

Currently, only "strict=true" is implemented; this pull-request introduces the first optimization for "strict=false", and originally proposed to make the "strict=true" behavior entirely inaccessible. I want the optional argument because I think that strict copies are quite useful.

I think that a change from default "strict=true" to "strict=false" is more breaking than the other way (because of code doing evil pointer tricks, or crucially relying on good memory patterns); if non-strict is supposed to become the default, then it would probably be better to do this now?

Regarding strings: can we implement a deepcopy of strings without copying the actual buffer? Would you feel comfortable if a copy of a string only copied the pointer (with strict=false)? Wouldn't you feel that some way of saying "nope, I am going to unsafe_store! into this string/bignum, please allocate and memcpy!" should be provided by the language (e.g. via strict=true)?

Ultimately, there is only a difference for "weird" types that need a custom deepcopy function, like bignums (which abuse pointers and finalizers), or possibly types that directly hook into the runtime or are builtin.

@rfourquet rfourquet added the status:triage This should be discussed on a triage call label Dec 13, 2017
@rfourquet
Member Author

Time is a bit short for designing the strict version before the feature freeze. I propose we do that in the 1.x timeframe, and focus on the definition which I think should be the default, without a keyword. Here is the proposed definition (slightly updated from the PR diff).
There are informally two categories of objects:

  • containers: usual containers like Set, but also a few specific objects like Pair and maybe Nullable, whose contained objects are independent of the functioning of the container. For example, mutating the first element of a vector of mutable objects doesn't affect the Vector at all, and similarly for the .second field of a Pair;
  • "atoms": everything else; those atoms can contain other objects as an implementation detail, but modifying those can alter the proper functioning of the parent object. For example, manually modifying the .vals field of a MersenneTwister, which is a Vector, will make things go wrong.

We define copy of a container as the copy of the internals of the object, such that the new container contains exactly (as in ===) the same objects as the source.
We define copy for an atom as the copy of the internals of the object, such that the new object can function independently from the source. (Some atoms may have no copy defined.)

We define deepcopy for a container as the copy of it, except that the contained objects are recursively deepcopied.
We define deepcopy for an atom to be equal to copy.
We define deepcopy for atoms which have no copy method as the identity (e.g. copy(:symbol) doesn't work, but deepcopy(::Vector{Symbol}) should work).

I'm not sure if we should define a default fallback as deepcopy(x) = copy(x) or deepcopy(x) = x. I would tend to say the latter, and objects defining copy should also define deepcopy.
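In code, the proposed rules would look roughly like this (a simplified sketch, not the actual Base implementation, which goes through deepcopy_internal and a stackdict):

deepcopy(x) = x                                              # atom without a copy method
deepcopy(x::BigInt) = x                                      # atom: immutable API, nothing to copy
deepcopy(x::Array) = map(deepcopy, x)                        # container
deepcopy(x::Pair) = deepcopy(x.first) => deepcopy(x.second)  # container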

@StefanKarpinski
Sponsor Member

I think we can leave this as is for now and introduce a strict::Bool=true keyword in the future. Having the default behavior be to copy as much as possible is good; strict=false can then be added as an opt-in change. We should also document some of the known limitations of deepcopy with respect to pointers and objects with finalizers (a particularly dangerous combination).

@StefanKarpinski StefanKarpinski removed the status:triage This should be discussed on a triage call label Dec 14, 2017
@StefanKarpinski StefanKarpinski added this to the 1.x milestone Dec 14, 2017
@StefanKarpinski
Sponsor Member

Design notes for strict: the object cache may need options like an IOContext, which methods can choose to look at if they care to, including whether they should do strict copying or not. That way most definitions don't care, but types like bignums and strings can look at the attribute and decide whether to return the same object or make an actual copy of it.
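One possible shape for that design (hypothetical, loosely mirroring how IOContext carries attributes; MPZ.set is the assumed internal wrapper):

struct CopyContext
    stackdict::IdDict{Any,Any}
    attrs::NamedTuple
end

function deepcopy_internal(x::BigInt, ctx::CopyContext)
    get(ctx.attrs, :strict, true) || return x          # non-strict: share the object
    get!(() -> Base.GMP.MPZ.set(x), ctx.stackdict, x)  # strict: real copy of the limb buffer
end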

@vtjnash vtjnash closed this Apr 3, 2021
@vtjnash vtjnash deleted the rf/deepcopy branch April 3, 2021 04:36
@vtjnash
Sponsor Member

vtjnash commented Apr 3, 2021

Very outdated
