Use Spark DataFrames or Elasticsearch terms query to speed up dependencies extraction #88

Open
frittentheke opened this issue Mar 30, 2020 · 0 comments

frittentheke (Contributor) commented Mar 30, 2020

Requirement - what kind of business use case are you trying to solve?

Runtime and resource consumption are very high when running the dependencies job on a large dataset.

Problem - what in Jaeger blocks you from solving the requirement?

Currently the Spark job uses JavaEsSpark.esJsonRDD, which offers none of the optimizations that Spark DataFrames provide (in particular pushdown, see https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html#spark-pushdown). So apart from a custom query (e.g. as in PR #86), all documents are fetched and instantiated into full Span objects, even though not all fields of the spans are required for the dependency extraction. This transfers many gigabytes of data from Elasticsearch to the instance running the Spark job and creates a massive memory footprint as well as lots of churn / GC on the JVM running the job.
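
For illustration, a minimal sketch of the current RDD-based access pattern (the index name, ES endpoint, and Spark setup are assumptions here, not the exact job code): every matching span document is shipped to Spark as a full JSON string and deserialized on the Spark side, with no way to push projections or filters down into Elasticsearch.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

public class EsJsonRddSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("jaeger-dependencies-sketch")
                .set("es.nodes", "elasticsearch:9200"); // assumed ES endpoint
        JavaSparkContext sc = new JavaSparkContext(conf);

        // esJsonRDD returns (documentId, rawJson) pairs: every span document is
        // transferred in full and parsed on the Spark side, even though dependency
        // extraction only needs a handful of fields.
        JavaPairRDD<String, String> spans =
                JavaEsSpark.esJsonRDD(sc, "jaeger-span-2020-03-30"); // assumed daily span index

        System.out.println("fetched span docs: " + spans.count());
        sc.stop();
    }
}
```
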

Also, the extracted dependencies are not written via the dependencystore API (https://github.com/jaegertracing/jaeger/blob/master/storage/dependencystore/interface.go) but directly to Elasticsearch. While not a problem in itself, it means that any change to the storage always has to cover "both sides".

Proposal - what do you suggest to solve the problem or improve the existing situation?

  1. I propose to use Spark DataFrames when accessing and selecting data from Elasticsearch, so that pushdown applies filters and aggregations directly in the storage layer (see the sketch after this list).
  2. Potentially, using the terms query built into Elasticsearch could speed things up even more (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-terms-query.html#query-dsl-terms-lookup) -- this would need to be done in chunks of 65k terms, but it then runs fully inside Elasticsearch.
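
As an illustration of both points, here is a minimal sketch (the index name, field list, and ES endpoint are assumptions, not the job's actual configuration) of reading the span index through the Spark SQL connector with pushdown and source filtering, and of handing a chunked terms query to the connector via es.query:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class EsDataFrameSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("jaeger-dependencies-dataframe-sketch")
                .config("es.nodes", "elasticsearch:9200") // assumed ES endpoint
                .getOrCreate();

        // (1) DataFrame read with pushdown: selections and filters are translated
        // into an Elasticsearch query, and source filtering limits the returned
        // fields to what dependency extraction actually needs.
        Dataset<Row> spans = spark.read()
                .format("org.elasticsearch.spark.sql")
                .option("pushdown", "true") // default, shown for clarity
                .option("es.read.field.include",
                        "traceID,spanID,process.serviceName,references") // assumed field set
                .load("jaeger-span-2020-03-30"); // assumed daily span index

        Dataset<Row> projected = spans
                .select("traceID", "spanID", "process.serviceName", "references")
                .where("traceID is not null"); // example filter, pushed down to ES

        // (2) Alternatively, a terms query (chunked at <= 65k terms) can be passed
        // via es.query so the filtering runs entirely inside Elasticsearch.
        String termsQueryChunk =
                "{\"query\":{\"terms\":{\"traceID\":[\"trace-1\",\"trace-2\"]}}}"; // hypothetical chunk
        Dataset<Row> chunk = spark.read()
                .format("org.elasticsearch.spark.sql")
                .option("es.query", termsQueryChunk)
                .option("es.read.field.include",
                        "traceID,spanID,process.serviceName,references")
                .load("jaeger-span-2020-03-30");

        System.out.println(projected.count() + " / " + chunk.count());
        spark.stop();
    }
}
```
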