
Support maps and sets that preserve insertion order #11

Open
pniederw opened this issue Sep 10, 2016 · 9 comments
@pniederw

It would be great to have maps and sets that preserve insertion order, a la https://github.com/frankiesardo/linked or https://github.com/amalloy/ordered.

My main motivation for maintaining insertion order is to have/provide a more deterministic user experience (cf. ES6 maps).

@GlenKPeterson
Owner

GlenKPeterson commented Sep 10, 2016

For the insertion-order Set, you can wrap a map. Here's an example of implementing ImSet in 46 lines of Java. Instead of wrapping another set like that example does, wrap an ImMap<E,Integer> like:

```java
class InsertionOrderSet<E> implements ImSet<E> {
    private final ImMap<E,Integer> inner;
    private final int index;

    InsertionOrderSet(ImMap<E,Integer> m, int i) { inner = m; index = i; }

    static <E> InsertionOrderSet<E> empty() {
        return new InsertionOrderSet<>(PersistentHashMap.empty(), 0);
    }

    @Override public ImSet<E> put(E e) {
        int j = index + 1;
        return new InsertionOrderSet<>(inner.put(e, j), j);
    }

    @Override public UnmodIterator<E> iterator() {
        return inner.toImSortedSet((a, b) -> a.getValue() - b.getValue())
                    .map(entry -> entry.getKey())
                    .toImList()
                    .iterator();
    }
    // ...the remaining ImSet methods (contains(), size(), etc.) delegate to inner.
}
```

For the map, do something similar. Still wrapping a map, but the inner map should be of type ImMap<K,Tuple2<V,Integer>> so that each value becomes a pair of the normal value and an index.
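As a JDK-only illustration of that value-plus-index pairing (Paguro's ImMap and Tuple2 would replace the stand-ins here; the class and method names below are hypothetical), each value carries its insertion index and key order is recovered by sorting on that index:

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical JDK-only stand-in for the ImMap<K,Tuple2<V,Integer>> idea:
// each value is paired with its insertion index, and the keys' insertion
// order is recovered by sorting the entries on that index.
class IndexedValueMapDemo {
    static <K, V> List<K> keysInInsertionOrder(Map<K, Map.Entry<V, Integer>> inner) {
        return inner.entrySet().stream()
                .sorted(Comparator.comparingInt(
                        (Map.Entry<K, Map.Entry<V, Integer>> e) -> e.getValue().getValue()))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Map.Entry<Integer, Integer>> m = new HashMap<>();
        m.put("banana", Map.entry(2, 0)); // value 2, inserted first (index 0)
        m.put("apple",  Map.entry(1, 1)); // value 1, inserted second (index 1)
        System.out.println(keysInInsertionOrder(m)); // [banana, apple]
    }
}
```

The inner map stays a plain hash map, so lookups keep hash-map characteristics; only iteration pays for the sort.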

If you don't need it immutable, there's always java.util.LinkedHashSet

In my own work, if I want things sorted, I use a SortedMap or SortedSet. I seem to always want everything alphabetized or in another order that can be calculated after the fact. If I want to preserve whatever order things came in, I use a List.

I'm not running off to implement this as part of Paguro (UncleJim) because:

  1. I don't know of any significant performance shortcuts to make it worth custom implementations of insertion-order sets and maps.
  2. I think the need for such things is low (but don't really know).
  3. It's pretty easy to write your own.

I think a paper describing a great implementation of a highly efficient, immutable, insertion-order map would be the most likely thing to change my mind. Demonstrating a need could help convince me too.

That's my first reaction, but I'll think about it. This Serialization thing took a long time to do. I'm also renaming the project, then I'd like to get back to the RRB-Tree, then get a Java 7 version out. Even if I felt strongly that this belonged in this project, it would probably start at priority 4 at the highest. I started this project 2-3 years ago with the idea of building a language on top of it, so I'm eager to move on to that at some point too.

@pniederw
Author

Just to be clear, the idea is to always use insertion-ordered maps instead of unsorted maps, in order to provide a more deterministic user experience. I believe this is why ES6 only has insertion-ordered maps (after all it comes at a cost). https://plus.google.com/+FrankieSardo/posts/1mHL85NgX4z provides some more rationale.

Sorting on every iteration, or maintaining a backing sorted map, seems expensive.

Some of the other Java persistent-collection libraries already provide insertion-ordered maps (e.g. https://github.com/anjbur/java-immutable-collections/tree/master/src/main/java/org/javimmutable/collections/inorder) or plan to (e.g. usethesource/capsule#5).

No worries if you don't consider this worth the effort. I just thought I'd voice my interest in this feature.

@GlenKPeterson
Owner

GlenKPeterson commented Sep 16, 2016

Map:
https://github.com/GlenKPeterson/Paguro/blob/2016-09-17_InsertionOrderMap/src/main/java/org/organicdesign/fp/collections/ImInsertOrderMap.java

Set:
https://github.com/GlenKPeterson/Paguro/blob/2016-09-17_InsertionOrderMap/src/main/java/org/organicdesign/fp/collections/ImInsertOrderSet.java

Change Log:
https://github.com/GlenKPeterson/Paguro/blob/2016-09-17_InsertionOrderMap/CHANGE_LOG.md

Here's the branch you can pull down and build to try out:
https://github.com/GlenKPeterson/Paguro/tree/2016-09-17_InsertionOrderMap

IMPORTANT: It looks like javac 1.8.0_31 will NOT build this code, but 1.8.0_91 will.

I'm still very ambivalent about whether to merge them or not. I agree they are useful and less surprising than the behavior of hash maps. But is that worth the performance hit? Worth a bigger jar file? I don't know.

We need to actually measure the relative performance. I like to measure many sizes in multiples of 10, timing creation and iteration with JMH (the Java Microbenchmark Harness). But before that, we probably need Mutable versions for speed. For some operations it's fair to compare against building PersistentHashMap/PersistentHashSet and assume similar mutable ratios, but I need to think about that more.

I'm thinking about splitting Paguro into Paguro-Core and Paguro-Extras. That way people who want an incredibly tiny jar get an even smaller one, and those who want more features get their features. It's a no-brainer to include this in Paguro-Extras. I'm not promising that today, but it's a thought for the future.

Regarding your helpful links:

NOTE: I redacted some of this in a later post
I can't see Sardo's code because his link is broken, and I can't comment there for some reason. But I disagree with his claim of updates in constant time. The argument for PersistentList's amortized constant time is that, when limited by the maximum size of a 32-bit integer, log32 is effectively a constant of not more than 7, and the tail (an internal optimization) makes this log32 operation happen only once in every 32 appends. I think that means the logarithmic factor is at most 7/32 and can thus be ignored in Big O, so you can call append() amortized constant. get(int i) is still log32, but it's a much faster operation in general due to radically lower constant factors.

PersistentHashSet is (I think) log8 for each put() and get(), which is harder to call constant, especially since there's no tail. You can also have duplicate hashcodes, thus exceeding the size of a 32-bit int. A TreeMap has log2 for put() and contains() with no mitigating factors; that's undeniably logarithmic. I know constants aren't included in Big O, but I'm expecting roughly half the speed and double the memory of either PersistentHashMap or PersistentTreeMap alone. The difference between log8/log32 and log2 means the slower TreeMap will determine the speed as the two collections grow.

I thought all immutable solutions used a PersistentTreeMap to track insertion order (WRONG). Everyone else's solution updates that second data structure on insert, and mine updates it only when the iterator is called.

I like converting it at the last moment for a few reasons:

  1. it only incurs the overhead of building the treeMap if you actually call the iterator, which you may never do.
  2. (Untested theory:) Building a sorted set/map in order (or reverse order) causes more shuffling of internal nodes than building in random order.[1] HashMaps iterate in random order, so the total cost of building the collection and iterating in insertion order should be lower on average.
  3. You could cache the sorted set/map that the iterator is generated from so that repeated calls to iterator() would be instantaneous. The extra variable adds to the constant cost of calling assoc() to create the structure, but only a little bit. I'm not recommending this, just pointing out the possibility.

Footnotes:

[1] Proof: if you're given elements in sorted order and you don't re-balance the binary tree, you end up with the internal structure of a linked list, which is O(n) lookup instead of O(log2 n).

Each node of the ideal tree should have half the sub-nodes on either side of it. Therefore, the ideal insertion order, which yields the ideal tree without any re-balancing, is the same as a breadth-first traversal of the completed balanced tree. One example of such an ordering (for a tree of size n):

n/2, n/4, 3n/4, n/8, 3n/8, 5n/8, 7n/8, n/16, 3n/16, ...

Random order would vary between worst case and ideal, but that's still better than a guaranteed worst case!
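That median-first ordering can be generated mechanically. Here's an illustrative sketch (class and method names are my own, for 0-based indices): process ranges breadth-first from a queue, emitting the median of each range before enqueuing its two halves:

```java
import java.util.*;

// Hypothetical illustration of the footnote: inserting indices 0..n-1 in
// "median-first" (breadth-first) order builds a balanced BST with no
// re-balancing, because each range's median is inserted before its halves.
class IdealInsertOrder {
    static List<Integer> order(int n) {
        List<Integer> out = new ArrayList<>();
        Deque<int[]> queue = new ArrayDeque<>();
        queue.add(new int[]{0, n - 1});          // inclusive index range
        while (!queue.isEmpty()) {
            int[] r = queue.poll();
            if (r[0] > r[1]) continue;           // empty range: nothing to emit
            int mid = (r[0] + r[1]) / 2;         // median of the range
            out.add(mid);
            queue.add(new int[]{r[0], mid - 1}); // left half
            queue.add(new int[]{mid + 1, r[1]}); // right half
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(order(8)); // [3, 1, 5, 0, 2, 4, 6, 7]
    }
}
```

For n = 8 this emits 3, 1, 5, 0, 2, 4, 6, 7 — the 0-based counterpart of the n/2, n/4, 3n/4, ... sequence above.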

@GlenKPeterson
Owner

In Clojure, small HashMaps preserve insertion order (it uses an ArrayMap for up to 8 or 9 items):
https://clojuredocs.org/clojure.core/array-map

Iterating through 8 items in an array happens as quickly as any data structure on the JVM. Maybe that's all you really need? Or do you really need big sorted HashMaps?

@pniederw
Author

Small maps will be the common case, but I need to maintain order in any case.

@GlenKPeterson
Owner

I re-read Frankie Sardo's post a few more times. I think the argument that the insertion order map has a deterministic iteration order and is therefore repeatable as it grows is a strong one. Though I will point out that for a given map, the insertion order does not change. It only changes when you add or remove items.

His idea of storing a pointer to the key instead of the node is quite clever. I missed that detail the first few times through. So he's not creating a TreeSet. I still think calling it "Amortized Constant Time" is a stretch, but I am ready to agree that his solution leaves the Big O characteristics of the Clojure HashMap unchanged. That's probably better than my solution if you ever iterate through the HashMap.

Preserving insertion order is a good idea. I'm glad you proposed it! Is it worth a 2.5x slow-down? That's where things get murky. My rule of thumb is that less-than-15% is a very reasonable price for simplicity, but 2x is too expensive. I'm not ready to replace Map and Set with insertion-order versions at this time. A more efficient algorithm would go a long way toward changing my mind.

You are welcome to use the code I provided under the Eclipse 1.0 license. I mean, you can use it under the Apache 2.0 license with the understanding that it calls Eclipse 1.0 code (originally from Clojure), so unless you plan to rewrite that part, you'd be bound by the terms of both license agreements. You can even use it in proprietary apps, without giving anything back besides attribution. Did I mention 100% test coverage? My consideration of whether to include it permanently in this project should not be holding you up anymore. If you find bugs, I'll probably patch them the following weekend.

I'm still leaning toward putting this in a Paguro-Extras package. Then I'd get a sense of how many people would choose to pay the performance price for it.

That said, I started looking at RRB-Tree again this weekend. I think that's a higher priority for me for no other reason than wanting to finish what I started. Also, I'm really excited about it potentially replacing the PersistentVector. Not sure which is a higher priority for you...

@GlenKPeterson
Owner

GlenKPeterson commented Sep 19, 2016

OK, I came up with a different implementation that uses java.util.TreeMap for speed inside the iterator. Iterators are inherently mutable and not thread-safe, so I might as well use the fastest sorted collection. It also only uses the temporary TreeMap when the wrapped hashmap iterator doesn't return things in the order we want:

https://github.com/GlenKPeterson/Paguro/blob/2016-09-17_InsertionOrderMap/src/main/java/org/organicdesign/fp/collections/ImInsertOrderMap2.java#L91

In theory, this only incurs a portion of the O(log2 n) cost of java.util.TreeMap. The first item could be anywhere in the wrapped hashmap, but on average it shows up about half-way through the iteration. The next few calls to next() are equally likely to grow or shrink the temporary TreeMap, so it probably stays at about half the size of the un-iterated portion of the wrapped hashmap, then shrinks linearly after that. So I think we only incur an average O(log2 (n/2)) at the beginning of the iteration, O(log2 (n/4)) half-way through, O(log2 (n/8)) three quarters of the way through, etc. It's still log2, but generally for small values of n.
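A JDK-only sketch of that buffering idea (the class name and the key-to-index map shape are my assumptions, not Paguro's actual internals): walk the hashmap's arbitrary-order iterator, emit an entry as soon as its insertion index is the next one expected, and park out-of-order entries in a temporary TreeMap keyed by index:

```java
import java.util.*;

// Hypothetical sketch: given a map from key -> insertion index (0..n-1, each
// index used exactly once), iterate keys in insertion order while only
// buffering the entries that arrive out of order.
class BufferedOrderIterator implements Iterator<String> {
    private final Iterator<Map.Entry<String, Integer>> src; // arbitrary order
    private final TreeMap<Integer, String> buffer = new TreeMap<>();
    private int nextIndex = 0; // the insertion index we expect to emit next

    BufferedOrderIterator(Map<String, Integer> indexed) {
        this.src = indexed.entrySet().iterator();
    }

    @Override public boolean hasNext() {
        return src.hasNext() || !buffer.isEmpty();
    }

    @Override public String next() {
        // Drain the buffer first if it already holds the expected index.
        if (!buffer.isEmpty() && buffer.firstKey() == nextIndex) {
            nextIndex++;
            return buffer.pollFirstEntry().getValue();
        }
        while (src.hasNext()) {
            Map.Entry<String, Integer> e = src.next();
            if (e.getValue() == nextIndex) { nextIndex++; return e.getKey(); }
            buffer.put(e.getValue(), e.getKey()); // out of order: park it
        }
        // Source exhausted: the smallest buffered index must be nextIndex.
        nextIndex++;
        return buffer.pollFirstEntry().getValue();
    }
}
```

The buffer only ever holds the entries seen so far whose turn hasn't come, which is the "portion of O(log2 n)" being described above.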

Iteration Benchmarks: Goes from roughly 3x slower than a PersistentHashMap at 10 items to 10x slower at 100K items. This is better than the first pass which went from 15x slower at 10 items to over 30x at 100K items. I need to time Sardo's code and look at this again when I've got more time.

@pniederw
Author

Thanks for all your work on this. I'll definitely give it a shot.

> Preserving insertion order is a good idea. I'm glad you proposed it! Is it worth a 2.5x slow-down? That's where things get murky. My rule of thumb is that less-than-15% is a very reasonable price for simplicity, but 2x is too expensive.

Given that I consider Paguro a very general-purpose collection library, I probably wouldn't make insertion-ordered sets/maps the default, but would offer them as an option. I think they warrant a (say) 15% increase in Jar size, but Paguro-Extras would also work for me.

> That said, I started looking at RRB-Tree again this weekend. I think that's a higher priority for me for no other reason than wanting to finish what I started. Also, I'm really excited about it potentially replacing the PersistentVector. Not sure which is a higher priority for you...

Makes total sense. I'm happy to use either!

@GlenKPeterson
Owner

RRB Tree is done, but converting to Kotlin and whatever I'm doing with Kategory has crept in at a higher priority. It still probably makes sense to add insertion-order maps at some point. I think about this request probably every week, but not ready to commit yet.

I keep thinking that a linked or doubly-linked list is the right data structure to manage the ordering. Sardo's explanation link is now broken, so I can't go back and look at his theory. Just his implementation. If you have a copy of his theory, that could help. Anyway, I feel a little bad asking you to do any work when I'm not planning to implement this for some months.

Since there are 2 data structures involved, I don't think it will affect the performance much to implement the slower one as a wrapper around the faster one. I've got to get through the Kotlin conversion first... I'll tentatively put this in for version 4.2.
