Use Spark DataFrames or Elasticsearch terms query to speed up dependencies extraction #88

Open
frittentheke opened this issue Mar 30, 2020 · 0 comments

frittentheke (Contributor) commented Mar 30, 2020

Requirement - what kind of business use case are you trying to solve?

Runtime and resource consumption are very high when running the dependencies job on a large dataset.

Problem - what in Jaeger blocks you from solving the requirement?

Currently the Spark job uses JavaEsSpark.esJsonRDD, which offers none of the optimizations that Spark DataFrames provide (in particular pushdown, see https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html#spark-pushdown). So apart from a custom query (e.g. as in PR #86), all documents are fetched and instantiated into full Span objects, even though not all fields of the spans are required for the dependency extraction. This transfers many gigabytes of data from Elasticsearch to the instance running the Spark job and creates a massive memory footprint as well as lots of churn / GC on the JVM running the job.
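
For illustration, a minimal sketch of the current RDD-based access pattern (the index name, ES endpoint, and Spark setup are assumptions here, not the exact job code): every matching span document is shipped to Spark as a full JSON string and deserialized on the Spark side, with no way to push projections or filters down into Elasticsearch.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

public class EsJsonRddSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("jaeger-dependencies-sketch")
                .set("es.nodes", "elasticsearch:9200"); // assumed ES endpoint
        JavaSparkContext sc = new JavaSparkContext(conf);

        // esJsonRDD returns (documentId, rawJson) pairs: every span document is
        // transferred in full and parsed on the Spark side, even though dependency
        // extraction only needs a handful of fields.
        JavaPairRDD<String, String> spans =
                JavaEsSpark.esJsonRDD(sc, "jaeger-span-2020-03-30"); // assumed daily span index

        System.out.println("fetched span docs: " + spans.count());
        sc.stop();
    }
}
```
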

Also, the extracted dependencies are not written via the dependencystore API (https://github.com/jaegertracing/jaeger/blob/master/storage/dependencystore/interface.go) but directly to Elasticsearch. While not a problem in itself, it means that any change to the storage always has to cover "both sides".

Proposal - what do you suggest to solve the problem or improve the existing situation?

  1. I propose to use Spark DataFrames when accessing and selecting data from Elasticsearch, so that pushdown applies filters and aggregations directly in the storage layer (see the sketch after this list).
  2. Potentially, using the terms query built into Elasticsearch could speed things up even more (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-terms-query.html#query-dsl-terms-lookup) -- this would need to be done in chunks of 65k terms, but it then runs fully inside Elasticsearch.
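
As an illustration of both points, here is a minimal sketch (the index name, field list, and ES endpoint are assumptions, not the job's actual configuration) of reading the span index through the Spark SQL connector with pushdown and source filtering, and of handing a chunked terms query to the connector via es.query:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class EsDataFrameSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("jaeger-dependencies-dataframe-sketch")
                .config("es.nodes", "elasticsearch:9200") // assumed ES endpoint
                .getOrCreate();

        // (1) DataFrame read with pushdown: selections and filters are translated
        // into an Elasticsearch query, and source filtering limits the returned
        // fields to what dependency extraction actually needs.
        Dataset<Row> spans = spark.read()
                .format("org.elasticsearch.spark.sql")
                .option("pushdown", "true") // default, shown for clarity
                .option("es.read.field.include",
                        "traceID,spanID,process.serviceName,references") // assumed field set
                .load("jaeger-span-2020-03-30"); // assumed daily span index

        Dataset<Row> projected = spans
                .select("traceID", "spanID", "process.serviceName", "references")
                .where("traceID is not null"); // example filter, pushed down to ES

        // (2) Alternatively, a terms query (chunked at <= 65k terms) can be passed
        // via es.query so the filtering runs entirely inside Elasticsearch.
        String termsQueryChunk =
                "{\"query\":{\"terms\":{\"traceID\":[\"trace-1\",\"trace-2\"]}}}"; // hypothetical chunk
        Dataset<Row> chunk = spark.read()
                .format("org.elasticsearch.spark.sql")
                .option("es.query", termsQueryChunk)
                .option("es.read.field.include",
                        "traceID,spanID,process.serviceName,references")
                .load("jaeger-span-2020-03-30");

        System.out.println(projected.count() + " / " + chunk.count());
        spark.stop();
    }
}
```
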