Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Import GeoWave Indexing Code #20

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

lossyrob
Copy link
Member

Supersedes #18

This PR imports the GeoWave "index" subproject into SFCurve. With this, representatives of all of the GeoWave, GeoMesa and GeoTrellis indexing code will be in this repository. GeoTrellis plans to move to using this indexing by default for it's 2.0 release.

All namespaces have been moved to org.locationtech.sfcurve.geowave.

With all of our indexing code under one roof, we can then move to the phase where we try to bring it all together under one-api-to-rule-them-all. Pie in the sky? Maybe ☁️

The code in this PR is set for IP review by an Eclipse CQ https://dev.eclipse.org/ipzilla/show_bug.cgi?id=14322

James McClain added 6 commits June 7, 2017 18:31
Taken from GeoWave commit 73ae05af4358d5c8741963eb692a9189ebf616b9.

Signed-off-by: James McClain <jmcclain@azavea.com>
From 73ae05af4358d5c8741963eb692a9189ebf616b9
* Version 2.0 which accompanies this distribution and is available at
* http://www.apache.org/licenses/LICENSE-2.0.txt
******************************************************************************/
package org.locationtech.sfcurve.geowave.geotime.index.dimension;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd opine that naming any of our projects in the package names is inappropriate. We haven't done it for GeoMesa or GeoTrellis.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this, but also the GeoWave code is another implementation of Hilbert. What's unique about this approach is that it takes GeoWave's strategy specifically. I think it will make it a lot easier to reference and have everyone know what we are talking about. GeoTrellis's Z curve implementation was the original Z curve, and was only a Z curve - it didn't include a lot of the bells and whistles that the geowave-index project contains (tiering, periodicity). GeoMesa's bells and whistles still live in GeoMesa - you are using the baseline Z curve implementations as they've been defined and modified here.

If there's a better name for it than something like hilbert2, I'm up for a better option. It doesn't make me uncomfortable to reference GeoWave from the codebase, however, especially since the subproject is a straight up pull from geowave-index (in our PR for GeoTrellis, we have also labelled the indexing strategy that leans on this as GeoWave).

/cc @rfecher, not sure if putting GeoWave into the naming makes sense to you, or if you know of a better name that you'd like more.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me @lossyrob -- keep the name geowave or perhaps hilbert-geowave.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going back to this, the underlying Hilbert implementation is the Uzaygezen library.;)

@pomadchin
Copy link
Member

@jnh5y
Copy link
Contributor

jnh5y commented Mar 5, 2018

@pomadchin sweet! Looks like there's a quick merge conflict to resolve. Can you or @lossyrob knock that out?

@pomadchin
Copy link
Member

@jnh5y sure!

@pomadchin
Copy link
Member

pomadchin commented Mar 6, 2018

@jnh5y probably it's ready for your review

@lossyrob
Copy link
Member Author

I'm rethinking our approach here.

GeoWave's initial IP contribution has been approved by the IP team at Eclipse, so the IP considerations are no longer an issue. However, in the meantime this PR has gotten pretty stale as GeoWave's geowave-core-index project, which this pulls from, has had several commits in the meantime.

The original purpose of this PR was to get GeoWave indexing code into SFCurve so we could A. rely on the geowave indexing code in GeoTrellis and B. gain a dependency on SFCurve from GeoTrellis, which would push the SFCurve story along.

Considering the effort it would take to keep this SFCurve GeoWave code with the geowave-core-index project, it seems like it might be smarter to forgo (B) and accomplish (A) by simply relying on the geowave project in GeoTrellis. I'm of the mind to close this PR and move GeoTrellis to rely on geowave-core-index for this functionality.

Thoughts? @echeipesh @rfecher @jnh5y

@rfecher
Copy link

rfecher commented Mar 19, 2018

Agreed, it really only works if each of the projects are depending on the same baseline, or else it just ends up being divergent forks. To that end, geowave needs to change its dependency to sfcurve, or geotrellis needs to change its dependency to geowave-core-index - I'm game for either although currently the least moving parts and easiest for us in the short term will be to just support geowave-core-index with the idea that you can load up geowave's issues all the same with index concerns. We can evolve that fairly easily at any point if it ends up making sense to move it over to sfcurve. Or do you think we should have more immediate plans to move it over?

there's a significant change coming in geowave-core-index immediately following our 0.9.7 release, later this week. I talked to @echeipesh about it a couple months ago. It also will involve packages changing to org.locationtech.geowave.index in addition to fairly major code re-work - primarily geared around separating the well-sorted portion of an indexing approach from the specifically unsorted portion (so separate SortIndexStrategy and PartitionIndexStrategy). Previously, we had the indexing approach serve as a bi-directional map between multi-dimensional coordinates/bounds and single dimensional keys/ranges, but that was a bit too simplistic of a model because of the concerns of hotspotting (equally distributing reads/writes across nodes). With what you have here we combined the partition and sort concept together into a CompoundIndexStrategy. But the changes separating it out ends up being cleaner for hbase, accumulo, and bigtable - for example for things like keeping statistics, you end up wanting to keep a histogram of sort keys binned by partition keys. Moreover, it is really essential when the key-value store explicitly has a place for "hash" or "partition" key separated out from the "range" or "sort" key - like Cassandra and DynamoDB. For key-values stores like these it is even more helpful to keep the sorted and unsorted portions explicitly separate within the underlying "index strategy."

@jnh5y
Copy link
Contributor

jnh5y commented Mar 19, 2018

I'm fine with the code moving over whenever. Generally, I'm for moving common pieces out as projects like that. Admittedly, that does require more coordination.

For the short term, it might make sense for GeoTrellis to depend on a released version of the GeoWave indexing code. Meanwhile, the GeoWave folks can sort out the changes in the pipeline; once those are ironed out, they can contribute to SFCurve. As that happens, we can cut a quick release so that GW and GT can depend on the moved code.

Does that work for everyone?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants