
Support maps and sets that preserve insertion order #11

Open
pniederw opened this issue Sep 10, 2016 · 9 comments
@pniederw

It would be great to have maps and sets that preserve insertion order, a la https://github.com/frankiesardo/linked or https://github.com/amalloy/ordered.

My main motivation for maintaining insertion order is to have/provide a more deterministic user experience (cf. ES6 maps).

@GlenKPeterson
Owner

GlenKPeterson commented Sep 10, 2016

For the insertion-order Set, you can wrap a map. Here's an example of implementing ImSet in 46 lines of Java. Instead of wrapping another set like that example does, wrap an ImMap<E,Integer> like:

```java
class InsertionOrderSet<E> implements ImSet<E> {
    private final ImMap<E,Integer> inner;
    private final int index;

    InsertionOrderSet(ImMap<E,Integer> m, int i) { inner = m; index = i; }

    static <E> InsertionOrderSet<E> empty() {
        return new InsertionOrderSet<>(PersistentHashMap.empty(), 0);
    }

    @Override public ImSet<E> put(E e) {
        int j = index + 1;
        return new InsertionOrderSet<>(inner.put(e, j), j);
    }

    @Override public UnmodIterator<E> iterator() {
        return inner.toImSortedSet((a, b) -> a.getValue() - b.getValue())
                    .map(entry -> entry.getKey())
                    .toImList()
                    .iterator();
    }
    // ...the remaining ImSet methods (contains(), size(), etc.) delegate to inner.
}
```

For the map, do something similar. Still wrapping a map, but the inner map should be of type ImMap<K,Tuple2<V,Integer>> so that each value becomes a pair of the normal value and an index.
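As a JDK-only illustration of that value-plus-index pairing (Paguro's ImMap and Tuple2 would replace the stand-ins here; the class and method names below are hypothetical), each value carries its insertion index and key order is recovered by sorting on that index:

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical JDK-only stand-in for the ImMap<K,Tuple2<V,Integer>> idea:
// each value is paired with its insertion index, and the keys' insertion
// order is recovered by sorting the entries on that index.
class IndexedValueMapDemo {
    static <K, V> List<K> keysInInsertionOrder(Map<K, Map.Entry<V, Integer>> inner) {
        return inner.entrySet().stream()
                .sorted(Comparator.comparingInt(
                        (Map.Entry<K, Map.Entry<V, Integer>> e) -> e.getValue().getValue()))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Map.Entry<Integer, Integer>> m = new HashMap<>();
        m.put("banana", Map.entry(2, 0)); // value 2, inserted first (index 0)
        m.put("apple",  Map.entry(1, 1)); // value 1, inserted second (index 1)
        System.out.println(keysInInsertionOrder(m)); // [banana, apple]
    }
}
```

The inner map stays a plain hash map, so lookups keep hash-map characteristics; only iteration pays for the sort.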

If you don't need it immutable, there's always java.util.LinkedHashSet

In my own work, if I want things sorted, I use a SortedMap or SortedSet. I seem to always want everything alphabetized or in another order that can be calculated after the fact. If I want to preserve whatever order things came in, I use a List.

I'm not running off to implement this as part of Paguro (UncleJim) because:

  1. I don't know of any significant performance shortcuts to make it worth custom implementations of insertion-order sets and maps.
  2. I think the need for such things is low (but don't really know).
  3. It's pretty easy to write your own.

I think a paper describing a great implementation of a highly efficient, immutable, insertion-order map would be the most likely thing to change my mind. Demonstrating a need could help convince me too.

That's my first reaction, but I'll think about it. This Serialization thing took a long time to do. I'm also renaming the project, then I'd like to get back to the RRB-Tree, then get a Java 7 version out. Even if I felt strongly that this belonged in this project, it would probably start at priority 4 at the highest. I started this project 2-3 years ago with the idea of building a language on top of it, so I'm eager to move on to that at some point too.

@pniederw
Author

Just to be clear, the idea is to always use insertion-ordered maps instead of unsorted maps, in order to provide a more deterministic user experience. I believe this is why ES6 only has insertion-ordered maps (after all it comes at a cost). https://plus.google.com/+FrankieSardo/posts/1mHL85NgX4z provides some more rationale.

Sorting on every iteration, or maintaining a backing sorted map, seems expensive.

Some of the other Java persistent-collection libraries already provide insertion-ordered maps (e.g. https://github.com/anjbur/java-immutable-collections/tree/master/src/main/java/org/javimmutable/collections/inorder) or plan to (e.g. usethesource/capsule#5).

No worries if you don't consider this worth the effort. I just thought I'd voice my interest in this feature.

@GlenKPeterson
Owner

GlenKPeterson commented Sep 16, 2016

Map:
https://github.com/GlenKPeterson/Paguro/blob/2016-09-17_InsertionOrderMap/src/main/java/org/organicdesign/fp/collections/ImInsertOrderMap.java

Set:
https://github.com/GlenKPeterson/Paguro/blob/2016-09-17_InsertionOrderMap/src/main/java/org/organicdesign/fp/collections/ImInsertOrderSet.java

Change Log:
https://github.com/GlenKPeterson/Paguro/blob/2016-09-17_InsertionOrderMap/CHANGE_LOG.md

Here's the branch you can pull down and build to try out:
https://github.com/GlenKPeterson/Paguro/tree/2016-09-17_InsertionOrderMap

IMPORTANT: It looks like javac 1.8.0_31 will NOT build this code, but 1.8.0_91 will.

I'm still very ambivalent about whether to merge them or not. I agree they are useful and less surprising than the behavior of hash maps. But is that worth the performance hit? Worth a bigger jar file? I don't know.

We need to actually measure the relative performance. I like to measure many sizes in multiples of 10, timing creation and iteration with JMH (the Java Microbenchmark Harness). But before that, we probably need Mutable versions for speed. For some operations it's fair to compare against building PersistentHashMap/PersistentHashSet and assume similar mutable ratios, but I need to think about that more.

I'm thinking about splitting Paguro into Paguro-Core and Paguro-Extras. That way people who want an incredibly tiny jar get an even smaller one, and those who want more features get their features. It's a no-brainer to include this in Paguro-Extras. I'm not promising that today, but it's a thought for the future.

Regarding your helpful links:

NOTE: I redacted some of this in a later post
I can't see Sardo's code because his link is broken, and I can't comment there for some reason. But I disagree with his claim of updates in constant time. The argument for PersistentList's amortized constant time is that, when limited by the maximum size of a 32-bit integer, log32 is effectively a constant of not more than 7, and the tail (an internal optimization) makes this log32 operation happen only once in every 32 appends. I think that means the logarithmic factor is at most 7/32 and can thus be ignored in Big O, so you can call append() amortized constant. get(int i) is still log32, but it's a much faster operation in general due to radically lower constant factors.

PersistentHashSet is (I think) log8 for each put() and get(), which is harder to call constant, especially since there's no tail. You can also have duplicate hashcodes, thus exceeding the size of a 32-bit int. A TreeMap has log2 for put() and contains() with no mitigating factors; that's undeniably logarithmic. I know constants aren't included in Big O, but I'm expecting roughly half the speed and double the memory of either PersistentHashMap or PersistentTreeMap alone. The difference between log8/log32 and log2 means the slower TreeMap will determine the speed as the two collections grow.

I thought all immutable solutions used a PersistentTreeMap to track insertion order (WRONG). Everyone else's solution updates that second data structure on insert, and mine updates it only when the iterator is called.

I like converting it at the last moment for a few reasons:

  1. it only incurs the overhead of building the treeMap if you actually call the iterator, which you may never do.
  2. (Untested theory:) Building a sorted set/map in order (or reverse order) causes more shuffling of internal nodes than building in random order.[1] HashMaps iterate in random order, so the total cost of building the collection and iterating in insertion order should be lower on average.
  3. You could cache the sorted set/map that the iterator is generated from so that repeated calls to iterator() would be instantaneous. The extra variable adds to the constant cost of calling assoc() to create the structure, but only a little bit. I'm not recommending this, just pointing out the possibility.

Footnotes:

[1] Proof: if you're given elements in sorted order and you don't re-balance the binary tree, you end up with the internal structure of a linked list, which is O(n) lookup instead of O(log2 n).

Each node of the ideal tree should have half the sub-nodes on either side of it. Therefore, the ideal insertion order, which yields the ideal tree without any re-balancing, is the same as a breadth-first traversal of the completed balanced tree. One example of such an ordering (for a tree of size n):

n/2, n/4, 3n/4, n/8, 3n/8, 5n/8, 7n/8, n/16, 3n/16, ...

Random order would vary between worst case and ideal, but that's still better than a guaranteed worst case!
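That median-first ordering can be generated mechanically. Here's an illustrative sketch (class and method names are my own, for 0-based indices): process ranges breadth-first from a queue, emitting the median of each range before enqueuing its two halves:

```java
import java.util.*;

// Hypothetical illustration of the footnote: inserting indices 0..n-1 in
// "median-first" (breadth-first) order builds a balanced BST with no
// re-balancing, because each range's median is inserted before its halves.
class IdealInsertOrder {
    static List<Integer> order(int n) {
        List<Integer> out = new ArrayList<>();
        Deque<int[]> queue = new ArrayDeque<>();
        queue.add(new int[]{0, n - 1});          // inclusive index range
        while (!queue.isEmpty()) {
            int[] r = queue.poll();
            if (r[0] > r[1]) continue;           // empty range: nothing to emit
            int mid = (r[0] + r[1]) / 2;         // median of the range
            out.add(mid);
            queue.add(new int[]{r[0], mid - 1}); // left half
            queue.add(new int[]{mid + 1, r[1]}); // right half
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(order(8)); // [3, 1, 5, 0, 2, 4, 6, 7]
    }
}
```

For n = 8 this emits 3, 1, 5, 0, 2, 4, 6, 7 — the 0-based counterpart of the n/2, n/4, 3n/4, ... sequence above.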

@GlenKPeterson
Owner

In Clojure, small HashMaps preserve insertion order (it uses an ArrayMap for up to 8 or 9 items):
https://clojuredocs.org/clojure.core/array-map

Iterating through 8 items in an array happens as quickly as any data structure on the JVM. Maybe that's all you really need? Or do you really need big sorted HashMaps?

@pniederw
Author

Small maps will be the common case, but I need to maintain order in any case.

@GlenKPeterson
Owner

I re-read Frankie Sardo's post a few more times. I think the argument that the insertion order map has a deterministic iteration order and is therefore repeatable as it grows is a strong one. Though I will point out that for a given map, the insertion order does not change. It only changes when you add or remove items.

His idea of storing a pointer to the key instead of the node is quite clever. I missed that detail the first few times through. So he's not creating a TreeSet. I still think calling it "Amortized Constant Time" is a stretch, but I am ready to agree that his solution leaves the Big O characteristics of the Clojure HashMap unchanged. That's probably better than my solution if you ever iterate through the HashMap.

Preserving insertion order is a good idea. I'm glad you proposed it! Is it worth a 2.5x slow-down? That's where things get murky. My rule of thumb is that less-than-15% is a very reasonable price for simplicity, but 2x is too expensive. I'm not ready to replace Map and Set with insertion-order versions at this time. A more efficient algorithm would go a long way toward changing my mind.

You are welcome to use the code I provided under the Eclipse 1.0 license. I mean, you can use it under the Apache 2.0 license with the understanding that it calls Eclipse 1.0 code (originally from Clojure), so unless you plan to rewrite that part, you'd be bound by the terms of both license agreements. You can even use it in proprietary apps, without giving anything back besides attribution. Did I mention 100% test coverage? My consideration of whether to include it permanently in this project should not be holding you up anymore. If you find bugs, I'll probably patch them the following weekend.

I'm still leaning toward putting this in a Paguro-Extras package. Then I'd get a sense of how many people would choose to pay the performance price for it.

That said, I started looking at RRB-Tree again this weekend. I think that's a higher priority for me for no other reason than wanting to finish what I started. Also, I'm really excited about it potentially replacing the PersistentVector. Not sure which is a higher priority for you...

@GlenKPeterson
Owner

GlenKPeterson commented Sep 19, 2016

OK, I came up with a different implementation that uses java.util.TreeMap for speed inside the iterator. Iterators are inherently mutable and not thread-safe, so I might as well use the fastest sorted collection. It also only uses the temporary TreeMap when the wrapped hashmap iterator doesn't return things in the order we want:

https://github.com/GlenKPeterson/Paguro/blob/2016-09-17_InsertionOrderMap/src/main/java/org/organicdesign/fp/collections/ImInsertOrderMap2.java#L91

In theory, this only incurs a portion of the O(log2 n) cost of java.util.TreeMap. The first item could be anywhere in the wrapped hashmap, but on average it shows up about half-way through the iteration. The next few calls to next() are equally likely to grow or shrink the temporary TreeMap, so it probably stays at about half the size of the un-iterated portion of the wrapped hashmap, then shrinks linearly after that. So I think we only incur an average O(log2 (n/2)) at the beginning of the iteration, O(log2 (n/4)) half-way through, O(log2 (n/8)) three quarters of the way through, etc. It's still log2, but generally for small values of n.
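A JDK-only sketch of that buffering idea (the class name and the key-to-index map shape are my assumptions, not Paguro's actual internals): walk the hashmap's arbitrary-order iterator, emit an entry as soon as its insertion index is the next one expected, and park out-of-order entries in a temporary TreeMap keyed by index:

```java
import java.util.*;

// Hypothetical sketch: given a map from key -> insertion index (0..n-1, each
// index used exactly once), iterate keys in insertion order while only
// buffering the entries that arrive out of order.
class BufferedOrderIterator implements Iterator<String> {
    private final Iterator<Map.Entry<String, Integer>> src; // arbitrary order
    private final TreeMap<Integer, String> buffer = new TreeMap<>();
    private int nextIndex = 0; // the insertion index we expect to emit next

    BufferedOrderIterator(Map<String, Integer> indexed) {
        this.src = indexed.entrySet().iterator();
    }

    @Override public boolean hasNext() {
        return src.hasNext() || !buffer.isEmpty();
    }

    @Override public String next() {
        // Drain the buffer first if it already holds the expected index.
        if (!buffer.isEmpty() && buffer.firstKey() == nextIndex) {
            nextIndex++;
            return buffer.pollFirstEntry().getValue();
        }
        while (src.hasNext()) {
            Map.Entry<String, Integer> e = src.next();
            if (e.getValue() == nextIndex) { nextIndex++; return e.getKey(); }
            buffer.put(e.getValue(), e.getKey()); // out of order: park it
        }
        // Source exhausted: the smallest buffered index must be nextIndex.
        nextIndex++;
        return buffer.pollFirstEntry().getValue();
    }
}
```

The buffer only ever holds the entries seen so far whose turn hasn't come, which is the "portion of O(log2 n)" being described above.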

Iteration Benchmarks: Goes from roughly 3x slower than a PersistentHashMap at 10 items to 10x slower at 100K items. This is better than the first pass which went from 15x slower at 10 items to over 30x at 100K items. I need to time Sardo's code and look at this again when I've got more time.

@pniederw
Author

Thanks for all your work on this. I'll definitely give it a shot.

> Preserving insertion order is a good idea. I'm glad you proposed it! Is it worth a 2.5x slow-down? That's where things get murky. My rule of thumb is that less-than-15% is a very reasonable price for simplicity, but 2x is too expensive.

Given that I consider Paguro a very general-purpose collection library, I probably wouldn't make insertion-ordered sets/maps the default, but would offer them as an option. I think they warrant a (say) 15% increase in Jar size, but Paguro-Extras would also work for me.

> That said, I started looking at RRB-Tree again this weekend. I think that's a higher priority for me for no other reason than wanting to finish what I started. Also, I'm really excited about it potentially replacing the PersistentVector. Not sure which is a higher priority for you...

Makes total sense. I'm happy to use either!

@GlenKPeterson
Owner

RRB Tree is done, but converting to Kotlin and whatever I'm doing with Kategory has crept in at a higher priority. It still probably makes sense to add insertion-order maps at some point. I think about this request probably every week, but not ready to commit yet.

I keep thinking that a linked or doubly-linked list is the right data structure to manage the ordering. Sardo's explanation link is now broken, so I can't go back and look at his theory. Just his implementation. If you have a copy of his theory, that could help. Anyway, I feel a little bad asking you to do any work when I'm not planning to implement this for some months.

Since there are 2 data structures involved, I don't think it will affect the performance much to implement the slower one as a wrapper around the faster one. I've got to get through the Kotlin conversion first... I'll tentatively put this in for version 4.2.
