
deepcopy: update the meaning for objects like BigInt #24875

Closed
wants to merge 2 commits

Conversation

rfourquet
Member

Starting from bug #16667, two fixes have been proposed: #16826, which implements deepcopy(::BigInt) as the identity, and #16999, which implements deepcopy by creating a totally independent new BigInt object.

As far as I could understand, the rationale for the latter (which was the one merged) was to follow a strict interpretation of the docstring for deepcopy, which mentions in particular a "fully independent object" and also "Calling deepcopy on an object should generally have the same effect as serializing and then deserializing it." @yuyichao argued, for example, that this definition forces us to implement deepcopy such that its result is indistinguishable from serializing and then deserializing.

I believe the correct fix to the initial bug is the first PR, as I see no point in wasting resources on copying the internals of BigInt when the semantics and the API for this type are to be immutable. If the definition of deepcopy prevents defining deepcopy(x::BigInt) = x, then the definition needs to be fixed, which is what I tried to do here (suggestions for improvements are very welcome), in addition to defining deepcopy as the identity. I also cherry-picked the tests implemented by @ScottPJones in his PR.
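For concreteness, a minimal sketch of the two competing definitions (an illustration only, written with the modern IdDict spelling and assuming the internal Base.GMP.MPZ.set wrapper; these are not the exact diffs of the two PRs):

# Option 1 (#16826-style): BigInt is immutable at the API level, so share it.
Base.deepcopy_internal(x::BigInt, stackdict::IdDict) = x

# Option 2 (#16999-style): duplicate the limb buffer for a fully independent
# object, remembering it in the stackdict to preserve aliasing.
function Base.deepcopy_internal(x::BigInt, stackdict::IdDict)
    get!(() -> Base.GMP.MPZ.set(x), stackdict, x)::BigInt
end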

It's possible that I missed a use case for the stricter interpretation (copying the internals), but none have been exhibited so far. On the other hand, the use case for deepcopy == identity is simply performance, for when one has to deepcopy a structure containing BigInts.

@rfourquet rfourquet added the domain:bignums BigInt and BigFloat label Dec 1, 2017
@vtjnash
Sponsor Member

vtjnash commented Dec 1, 2017

such that its result is indistinguishable from serializing and then deserializing

I think this is still true – that our serializer does make a new BigInt should be seen as simply an implementation detail which shouldn't be assumed generally. (for example, a hypothetical serializer might use a special representation for common numbers like 0 and 1).

On the other hand, it seems like deepcopy here is intended for cases where you explicitly want a new copy of the internals of BigInt. (otherwise, the user should perhaps have just used the more reliable and fast copy(x); map!(copy, x, x); etc.)

@nalimilan
Member

On the other hand, it seems like deepcopy here is intended for cases where you explicitly want a new copy of the internals of BigInt. (otherwise, the user should perhaps have just used the more reliable and fast copy(x); map!(copy, x, x); etc.)

If you want to copy a DataFrame so that its columns are also copies of the original columns, you need to call deepcopy, which will call deepcopy on all arrays recursively. I don't think in that case you would want to copy the internals of BigInt, which as you said are an implementation detail.
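Concretely (a sketch; df[:n] assumes the DataFrames column-indexing API, and the === results are what this PR would give):

using DataFrames

df = DataFrame(n = BigInt[1, 2, 3])
df2 = deepcopy(df)
df2[:n] === df[:n]        # false: the column vector is a fresh copy
df2[:n][1] === df[:n][1]  # true under this PR: the BigInts themselves are shared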

@rfourquet
Member Author

it seems like deepcopy here is intended for cases where you explicitly want a new copy of the internals of BigInt.

This is the question at stake here, depending on whether we consider deepcopy an internal function or a user-facing function. In the latter case, which I support, we (almost) never want to copy BigInt internals, and in the rare case where it's needed, a custom copying function can be implemented. I tried to explain a possible "rule" for deepcopy in the docstring, with "atoms" and "containers", which agrees with that; I hope it makes sense.

@thofma
Contributor

thofma commented Dec 2, 2017

You introduce the word "usually" so that deepcopy(x::BigInt) = x won't be an exception anymore?

I don't think this helps. This just opens the door for more and more arbitrary exceptions and less and less predictable behavior of deepcopy.

I don't understand why this has to be discussed again. Everything was said in #16999.

@rfourquet
Member Author

You introduce the word "usually" so that deepcopy(x::BigInt) = x won't be an exception anymore?

I introduce the word "usually" so that "fully independent object" is not taken too strictly or as a binding definition, which I think was the motivation for Yichao's PR.

This just opens the door for more and more arbitrary exceptions and less and less predictable behavior of deepcopy.

I'm fine with removing this part of the docstring altogether. For me, the part I introduced with "atoms" and "containers" makes more sense, and in this respect, BigInt is not anymore an exception in this PR.

I don't understand why this has to be discussed again. Everything was said in #16999.

Because no consensus was found in that PR. Yichao's work was to fix a bug in a way that agrees with his interpretation of the docstring of deepcopy. At least two of us had another interpretation of it, so I propose here to update the docstring to convey that alternative interpretation. I don't think either interpretation is strictly more correct, but I want to defend here the one I find more useful.

@chethega
Contributor

chethega commented Dec 3, 2017

I disagree. deepcopy is extremely convenient if you construct a big data structure in a nontrivial way, and afterwards want to improve memory-layout for faster access times.

Nm = 5_000_000; V1 = [BigInt(rand(1:100)) for i = 1:Nm]; shuffle!(V1); V2 = deepcopy(V1);

sum(V1); @time sum(V1); @time sum(V1);
  1.164729 seconds (7 allocations: 224 bytes)
  1.040056 seconds (7 allocations: 224 bytes)


sum(V2); @time sum(V2); @time sum(V2);
  0.135768 seconds (7 allocations: 224 bytes)
  0.129050 seconds (7 allocations: 224 bytes)

People are using deepcopy for a reason. No tool should presume it knows better than the user.

Edit: I think it would be more important to document the traversal order (I think DFS?) of deepcopy, so that users can better anticipate the resulting memory layout. Offering both DFS and BFS order would be nice but probably not worth the effort.

@rfourquet
Member Author

deepcopy is extremely convenient if you construct a big data structure in a nontrivial way, and afterwards want to improve memory-layout for faster access times.

That's very interesting, and as far as I can tell it's the first argument given in favor of copying the internals of BigInt, besides the case where one wants to mutate the BigInt objects, a use case which even Yichao said is not supported by Julia.

People are using deepcopy for a reason. No tool should presume it knows better than the user.

I intended to expand on my motivations here tomorrow, but the gist of it is that I believe there are two classes of use cases for deepcopy, with distinct definitions, and I want to discuss here whether we address both in Base or choose one. So I don't agree it's about presuming what is better for the user; it's rather about which use cases we serve. Again, it's nice that you exhibited this example. On the other hand, a user wanting to deepcopy a structure containing BigInt objects which have not been shuffled will have better luck with the version in this PR: for example, copy(V1) is 1000 times faster than deepcopy(V1) on my machine, which should be an indication of the gain we can get, if I'm not mistaken.

So I could imagine using a keyword argument to let the user choose, like deepcopy(x; internals=false).

@chethega
Contributor

chethega commented Dec 3, 2017

I thought cleaning up memory fragmentation and over-allocated Vectors was the main use case for deepcopy? If I want a shallow (fast) copy I would use copy instead of deepcopy.

While the shuffle is somewhat pathological, I often have temporary allocs or need to apply permutations (e.g. sorting) during construction of big objects, and afterwards need to do a lot of in-place read/write operations. I'm not a BigInt user though, so maybe the BigInt community has different use cases?

@rfourquet
Member Author

If I want a shallow (fast) copy I would use copy instead of deepcopy.

Of course, but if your BigInt objects are nested in a structure, e.g. an array of arrays of BigInt, you may need a deepcopy of it to also copy the inner arrays, with the intention to mutate the inner arrays, while not wanting to copy the BigInt internals.
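In code, the nested case looks like this (a sketch; the last comparison is what this PR would give):

v = [[big(1), big(2)], [big(3)]]
w = deepcopy(v)
push!(w[1], big(4))   # safe: the inner arrays are fresh copies
w[1][1] === v[1][1]   # true with this PR (BigInts shared), false on current master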

@chethega
Contributor

chethega commented Dec 3, 2017

Of course, but if your BigInt objects are nested in a structure, e.g. an array of arrays of BigInt, you may need a deepcopy of it to also copy the inner arrays, with the intention to mutate the inner arrays, while not wanting to copy the BigInt internals.

Fair enough. So in the end, you think deepcopy should rather try to approximate a lazy-copy?

Then one should probably create a separate function, i.e. deepcopy vs lazydeepcopy, and then think about whether one can do some hare-brained copy-on-write scheme for this?

A lazydeepcopy would actually be quite awesome, and possibly achievable by using shared (mmap-ed) arrays.

@rfourquet
Member Author

I thought cleaning up memory fragmentation and over-allocated Vectors was the main use case for deepcopy?

I was not aware at all of this use case; it was not mentioned in the discussions concerning BigInt (AFAICR), so it's nice to note it here, as it's also not made explicit in the docstring. So input is welcome here to clarify what we want to accomplish with deepcopy.

@rfourquet
Member Author

rfourquet commented Dec 3, 2017

So in the end, you think deepcopy should rather try to approximate a lazy-copy?

I didn't think in these terms... but lazy-copy would be a possible implementation of my understanding of deepcopy: copy is for when one wants to mutate an object while keeping a version of the original, and deepcopy for when one also wants to mutate the contained objects, while still keeping an original intact. But as BigInt is not supposed to be mutated, it is not to be copied during a deepcopy (again, according to this version/definition of deepcopy).

@oscardssmith
Member

I think a lot of the problem is that deepcopy is being used to do two very different things. The first is "get me a version of this object that will not change if I change the original". This should not mess with internal layout. The second is "make this BigInt use a more memory-efficient layout". I think it is clear that the problem here is that this second use is not asking for an identical copy as deepcopy wants, but rather a better version with distinct properties from the original. The solution, as I see it, would be to make a new method, maybe simplify, that performs the second function, and make deepcopy follow the path of this PR. We should probably also have a NEWS entry explaining this change.
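For illustration, one possible shape for such a function (the name simplify and this one-liner are hypothetical, not part of this PR):

simplify(v::Vector{BigInt}) = [x + 0 for x in v]  # x + 0 allocates a fresh BigInt

V2 = simplify(V1)  # contiguous, defragmented internals, independent of what deepcopy does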

@chethega
Contributor

chethega commented Dec 4, 2017

After thinking about it, you are right. Deepcopy should split into a deepcopy_fast and a different deepcopy_mem (whatever their names end up being).

Then the future development of both is also clear: deepcopy_mem should attempt to talk to the allocator (e.g. allocate fresh old-generation pools, create the objects as old, allow for easy paging-out of the copy, maybe play some offset games to prevent false aliasing, and ensure that the temporary allocs for the object-id dict do not fragment the same pools as the deepcopy). It should also become part of the performance tips.

deepcopy_fast, on the other hand, can look for more unreliable speedups, like skipping the bignum copies or (one can dream) copy-on-write. Especially a deepcopy_fast of a deepcopy_mem object should allow CoW, because the deepcopy_mem result is nicely contained and not splattered all over the heap.

@vtjnash
Sponsor Member

vtjnash commented Dec 4, 2017

If you want to copy a DataFrame so that its columns are also copies of the original columns, you need to call deepcopy, which will call deepcopy on all arrays recursively

Seems like this is something that should be addressed in DataFrames

Deepcopy should split into a deepcopy_fast and a different deepcopy_mem

I think you mean deepcopy_all and deepcopy_dwim 😜 .

edit: the Wiktionary link looks like a better reference source for those not familiar with DWIM

@oscardssmith
Member

Shouldn't it be split into deepcopy and make_it_faster? Based on the use case presented above it is being used not as "Get me a copy", but rather "simplify the memory layout". Therefore it should not be called deepcopy*, but rather simplify or something to that effect.

@rfourquet
Member Author

but rather simplify or something to that effect.

I would guess that some people would use this version to know that they can mutate e.g. a BigInt object without affecting the original, in which case simplify is not really the idea.

Also, I believe that the two versions would share their implementation in many cases, so it could be more economical for implementors to have only one function, with a keyword for discriminating the variant, which can be ignored or passed down recursively when it doesn't matter, e.g. (in a non-rigorous way and forgetting about deepcopy_internal and stackdict):

function deepcopy(x::Array; strict::Bool=false) # strict=true to deepcopy as deep as possible, even BigInt internals
    dest = similar(x)
    for i in eachindex(x)
        dest[i] = deepcopy(x[i], strict=strict)
    end
    return dest
end

although strict is not so descriptive either.

@nalimilan
Member

nalimilan commented Dec 5, 2017

If you want to copy a DataFrame so that its columns are also copies of the original columns, you need to call deepcopy, which will call deepcopy on all arrays recursively

Seems like this is something that should be addressed in DataFrames

@vtjnash It could certainly be changed in DataFrames if we wanted to. I mentioned it because that choice depends on how we define copy/deepcopy. The current behavior is useful if you want a shallow copy of a DataFrame so that you can add a column without modifying the original. If we changed copy to recursively call copy on the columns, that cheap operation would no longer be possible.

@StefanKarpinski
Sponsor Member

I suppose deepcopy could take a depth keyword argument. Then the two-level copy behavior would be deepcopy(df, depth=2).
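Roughly (a hypothetical sketch, ignoring deepcopy_internal, the stackdict, and object identity — which turns out to matter, see below):

function deepcopy_depth(x::AbstractArray, depth::Int)
    depth <= 0 && return x
    dest = similar(x)
    for i in eachindex(x)
        dest[i] = deepcopy_depth(x[i], depth - 1)
    end
    return dest
end
deepcopy_depth(x, depth::Int) = depth <= 0 ? x : deepcopy(x)  # non-array leaves

# deepcopy_depth(cols, 2) on a vector of column vectors copies the outer
# vector and the columns themselves, and shares everything below.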

@StefanKarpinski
Sponsor Member

StefanKarpinski commented Dec 5, 2017

FWIW, I haven't read all of this thread but I don't think that copy or deepcopy should need to duplicate BigInts or BigFloats since they are officially immutable. Since we have a "private" and unstable API for carefully mutating these values, there could be a private copy function (bigcopy?) as part of that API but since the public API is immutable, the public copy and deepcopy functions should behave accordingly. This should, of course, be thoroughly documented.

@StefanKarpinski
Sponsor Member

Also, in the future, we reserve the right to make bignums actually immutable, so anyone who is relying on the current behavior is going to have their code broken. We may as well break it now.

@chethega
Contributor

chethega commented Dec 5, 2017

Re @StefanKarpinski

Since you did not read the entire thread, I'll repeat myself. Please reconsider:

Nm = 5_000_000; V1 = [BigInt(rand(1:100)) for i = 1:Nm]; shuffle!(V1); V2 = deepcopy(V1);

sum(V1); @time sum(V1); @time sum(V1);
  1.164729 seconds (7 allocations: 224 bytes)
  1.040056 seconds (7 allocations: 224 bytes)

sum(V2); @time sum(V2); @time sum(V2);
  0.135768 seconds (7 allocations: 224 bytes)
  0.129050 seconds (7 allocations: 224 bytes)

Currently, deepcopy has the extremely useful side-effect of getting rid of heap-fragmentation and making subsequent access (in DFS==deepcopy order) much more cache-friendly. Regardless of the default behavior of deepcopy, this should remain conveniently accessible. In order to be more realistic, mentally replace shuffle! by some sort!.

@StefanKarpinski
Sponsor Member

With a private bignum copy API you would just replace V2 = deepcopy(V1) with V2 = bigcopy.(V1).
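Sketched out (bigcopy is hypothetical; it assumes the internal Base.GMP.MPZ.set wrapper):

bigcopy(x::BigInt) = Base.GMP.MPZ.set(x)  # fresh BigInt with its own limb buffer

V2 = bigcopy.(V1)  # broadcast: newly allocated, contiguously placed BigInts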

@chethega
Contributor

chethega commented Dec 5, 2017

@rfourquet I really like your strict version.

@StefanKarpinski depthlimit:

In principle, deepcopy is called on a directed graph, following object references and respecting object identity (graph homomorphism). What is the proposed behavior if an object X is reachable by paths of different lengths? Say:

Root <-> A1 <-> A2 <-> X

Now we must copy everything (if we want to preserve object identity), regardless of the depth limit and the length of the chain. Preserving object identity by depth level must do something like "take the transitive closure, choose a favorite way of assigning levels in the Hasse diagram, go that far for the copy but no further". This sounds far too fancy for Base.

@StefanKarpinski
Sponsor Member

I'm not sure I see the problem with depth-limiting deepcopy: put a limit on recursion equal to the value of depth and stop making copies once the limit reaches zero. You get a copy of something if it's reachable within that number of recursive steps, otherwise you don't. Regardless, the depth thing is irrelevant to whether copy and deepcopy should actually copy bignums or not.

@vtjnash
Sponsor Member

vtjnash commented Dec 6, 2017

You get a copy of something if it reaches something that is reachable within that number of recursive steps, otherwise you don't.

ftfy. It's a problem mostly just in that it's subtle, and would give even worse performance than deepcopy does now. (Also acceptable would be "is reached within", which is much faster, but may destroy aliasing.)

@chethega
Contributor

chethega commented Dec 8, 2017

Regardless, the depth thing is irrelevant to whether copy and deepcopy should actually copy bignums or not.

True. What about the strict option? So, say, copy and deepcopy don't copy unless strict=true? Then the default could change in a later version, without breaking anything, if strict turns out to be more useful in practice.

Also, won't the same question apply to strings if they ever become immutable (cf. #22193)?

@StefanKarpinski
Sponsor Member

StefanKarpinski commented Dec 8, 2017

The strict thing seems pretty weird to me. What does it mean in general? When can you pass the strict keyword to copy and deepcopy? Strings are already immutable on master.

@chethega
Contributor

Strict deepcopy would mean "really, really make a copy; make it so that the resulting object tree is indistinguishable from serializing-and-deserializing, even if I abuse pointers, and including heap placement". This means that we need to copy the actual bytes of bignums and strings into newly allocated space (allocated in DFS order, hence pretty predictably placed on the heap), and the deepcopy works even if someone later uses an unsafe_store! to mutate bignums in-place.

Non-strict would mean "give me a semantic deepcopy, as fast as possible", which allows us to be faster and allocate less memory.

In 99% of cases, we have no optimization for "strict=false" and both do the same thing.

After such a change, A=deepcopy(A, strict=true) would be the preferred way of cleaning up an object-tree with bad placement (maybe because A was constructed over a long time, or because the user applied a permutation like sorting some arrays), and B=deepcopy(A, strict=false) would be the preferred way of pulling a copy for algorithmic reasons.

Currently, only "strict=true" is implemented; this pull-request introduces the first optimization for "strict=false", and originally proposed to make the "strict=true" behavior entirely inaccessible. I want the optional argument because I think that strict copies are quite useful.

I think that a change from default "strict=true" to "strict=false" is more breaking than the other way (because of code doing evil pointer tricks, or crucially relying on good memory patterns); if non-strict is supposed to become the default, then it would probably be better to do this now?

Regarding strings: can we implement a deepcopy of strings without copying the actual buffer? Would you feel comfortable if a copy of a string only copied the pointer (with strict=false)? Wouldn't you feel that some way of saying "nope, I am going to unsafe_store! into this string/bignum, please allocate and memcpy!" should be provided by the language (e.g. via strict=true)?

Ultimately, there is only a difference for "weird" types that need a custom deepcopy function, like bignums (which abuse pointers and finalizers), or possibly types that directly hook into the runtime or are builtin.

@rfourquet rfourquet added the status:triage This should be discussed on a triage call label Dec 13, 2017
@rfourquet
Member Author

Time is a bit short for designing the strict version before the feature freeze. I propose we do that in the 1.x timeframe, and focus on the definition which I think should be the default, without a keyword. Here is the proposed definition (slightly updated from the PR diff).
There are informally two categories of objects:

  • containers: usual containers like Set, but also a few specific objects like Pair and maybe Nullable, whose contained objects are independent of the functioning of the container. For example, mutating the first element of a vector of mutable objects doesn't affect the Vector at all, and similarly for the .second field of a Pair;
  • "atoms": everything else; those atoms can contain other objects as an implementation detail, but modifying those can alter the proper functioning of the parent object. For example, manually modifying the .vals field of a MersenneTwister, which is a Vector, will make things go wrong.

We define copy of a container as the copy of the internals of the object, such that the new container contains exactly (as in ===) the same objects as the source.
We define copy for an atom as the copy of the internals of the object, such that the new object can function independently from the source. (Some atoms may have no copy defined.)

We define deepcopy for a container as the copy of it, except that the contained objects are recursively deepcopied.
We define deepcopy for an atom to be equal to copy.
We define deepcopy for atoms which have no copy method as the identity (e.g. copy(:symbol) doesn't work, but deepcopy(::Vector{Symbol}) should work).

I'm not sure if we should define a default fallback as deepcopy(x) = copy(x) or deepcopy(x) = x. I would tend to say the latter, and objects defining copy should also define deepcopy.
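In code, the proposed rules would look roughly like this (a simplified sketch, not the actual Base implementation, which goes through deepcopy_internal and a stackdict):

deepcopy(x) = x                                              # atom without a copy method
deepcopy(x::BigInt) = x                                      # atom: immutable API, nothing to copy
deepcopy(x::Array) = map(deepcopy, x)                        # container
deepcopy(x::Pair) = deepcopy(x.first) => deepcopy(x.second)  # container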

@StefanKarpinski
Sponsor Member

I think we can leave this as is for now and introduce a strict::Bool=true keyword in the future. Having the default behavior be to copy as much as possible is good; strict=false can then be added as an opt-in change. We should also document some of the known limitations of deepcopy with respect to pointers and objects with finalizers (a particularly dangerous combination).

@StefanKarpinski StefanKarpinski removed the status:triage This should be discussed on a triage call label Dec 14, 2017
@StefanKarpinski StefanKarpinski added this to the 1.x milestone Dec 14, 2017
@StefanKarpinski
Sponsor Member

Design notes for strict: the object cache may need options like an IOContext, which methods can choose to look at if they care to, including whether they should do strict copying or not. That way most definitions don't care, but types like bignums and strings can look at the attribute and decide whether to return the same object or make an actual copy of it.
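One possible shape for that design (hypothetical, loosely mirroring how IOContext carries attributes; MPZ.set is the assumed internal wrapper):

struct CopyContext
    stackdict::IdDict{Any,Any}
    attrs::NamedTuple
end

function deepcopy_internal(x::BigInt, ctx::CopyContext)
    get(ctx.attrs, :strict, true) || return x          # non-strict: share the object
    get!(() -> Base.GMP.MPZ.set(x), ctx.stackdict, x)  # strict: real copy of the limb buffer
end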

@vtjnash vtjnash closed this Apr 3, 2021
@vtjnash vtjnash deleted the rf/deepcopy branch April 3, 2021 04:36
@vtjnash
Sponsor Member

vtjnash commented Apr 3, 2021

Very outdated
