
[Long term] Look into Supersonic query API #11

Open · velvia opened this issue Sep 3, 2015 · 9 comments

velvia (Member) commented Sep 3, 2015

samklr commented Sep 28, 2015

Link expired?

darkjh (Contributor) commented Sep 28, 2015

@velvia still expired

samklr commented Sep 28, 2015

Lol. Still expired ...

velvia (Member, Author) commented Oct 2, 2015

I finally found a live link, though I'm not sure how much longer this one will stay up either. Download the PDF while you can.
https://code.google.com/p/supersonic/downloads/list

velvia (Member, Author) commented Jan 10, 2016

So, Supersonic is C++. There is also Apache Drill, though Drill itself is Java-based.

velvia (Member, Author) commented Jan 13, 2016

I think that in the short term, playing with Spark's Catalyst optimizer to get columnar, or at least vectorized, execution is the best bet. Here is a video:

http://blog.madhukaraphatak.com/anatomy-of-spark-dataframe-api/

Some thoughts:

  • We could introduce an extra physical planner stage that does vector computation before passing it to the normal Aggregate* steps. However, we don't want to receive an RDD[InternalRow], but rather an RDD[Segment].
  • We could introduce something called "aggregation / expression pushdown", at first specific to the Filo data source only, that pushes the columnar expressions / aggregation and grouping expressions down into the source. The Filo data source could then do the computation on each segment and return an RDD[Row], hopefully with far fewer rows, for Spark to finish off. A rough strategy sketch follows this list.
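
Here is a minimal sketch of what that pushdown strategy could look like against the Spark 1.5.x planner APIs. Strategy, Aggregate, LogicalRelation, and SparkPlan are real Spark classes; FiloRelation stands for our data source's relation class, and FiloAggregateExec is a hypothetical physical operator (not written yet) that would hand the grouping / aggregate expressions to the data source.

    // Sketch only -- assumes a FiloRelation data source class and a hypothetical
    // FiloAggregateExec physical operator that does the per-segment computation.
    import org.apache.spark.sql.Strategy
    import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
    import org.apache.spark.sql.execution.SparkPlan
    import org.apache.spark.sql.execution.datasources.LogicalRelation

    object FiloAggregatePushdown extends Strategy {
      def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
        // Only fire when an Aggregate sits directly on top of a Filo scan
        case Aggregate(groupingExprs, aggExprs, lr: LogicalRelation)
            if lr.relation.isInstanceOf[FiloRelation] =>
          FiloAggregateExec(groupingExprs, aggExprs,
                            lr.relation.asInstanceOf[FiloRelation]) :: Nil
        // Otherwise fall through to Spark's normal strategies
        case _ => Nil
      }
    }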

velvia (Member, Author) commented Jan 19, 2016

More notes on where to look in the Spark codebase for the SQL optimizer stages (Spark 1.5.x); a snippet for inspecting these stages on a live query follows the list:

  • Overall query execution flow: SQLContext#QueryExecution inner class
  • Step 1: SQL (or DataFrame DSL) is converted to a LogicalPlan tied to a new DataFrame instance (see LogicalPlan.scala, and DataFrame.logicalPlan)
  • Step 2: org.apache.spark.sql.catalyst.analysis.Analyzer goes over LogicalPlan, resolves references, produces another LogicalPlan
  • Step 3: Spark calls the CacheManager to determine whether cached tables should be used, producing the withCachedData LogicalPlan
  • Step 4: org.apache.spark.sql.catalyst.optimizer.Optimizer optimizes the LogicalPlan
  • Step 5: SparkPlanner uses various SparkStrategies to convert the LogicalPlan into a SparkPlan.
    • These are all in the org.apache.spark.sql.execution package
    • For Joins, see SparkStrategies.{LeftSemiJoin, CanBroadcast, EquiJoinSelection}.
    • See the DataSourceStrategy for how pushdown predicates are implemented
  • Step 6: The SparkPlan's execute() method is called, which returns an RDD[InternalRow]
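
For reference, a quick way to poke at each of these stages on a live query from the 1.5.x shell (the query and table name here are just illustrative):

    val df = sqlContext.sql("SELECT actor2Code, count(*) FROM gdelt GROUP BY actor2Code")
    val qe = df.queryExecution   // the SQLContext#QueryExecution instance

    qe.logical          // Step 1: LogicalPlan from the parser / DataFrame DSL
    qe.analyzed         // Step 2: after the Analyzer resolves references
    qe.withCachedData   // Step 3: after the CacheManager substitutes cached tables
    qe.optimizedPlan    // Step 4: after the Optimizer
    qe.sparkPlan        // Step 5: physical SparkPlan chosen by the SparkPlanner
    qe.toRdd            // Step 6: calls execute(), returning an RDD[InternalRow]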

Custom execution strategies can be inserted without forking Spark -- see the SQLContext.experimental variable.
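
Registration is then a one-liner, assuming the FiloAggregatePushdown object sketched in the earlier comment:

    // extraStrategies are tried before the built-in SparkStrategies
    sqlContext.experimental.extraStrategies = Seq(FiloAggregatePushdown)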

Changing the optimizer steps might require a custom optimizer and a custom SQLContext/QueryExecution class.

velvia (Member, Author) commented Feb 10, 2016

A current Spark ticket for pushing down aggregations into DataSources:

https://issues.apache.org/jira/browse/SPARK-12449

See Santiago's comment right above mine for links to how the Druid, Magellan, HBase, and other connectors are modifying Spark Catalyst plans to get aggregation done on the server side.
