feat: Compute head offset for Spark connector micro batch mode. #439

jiangmichaellll · 2021-01-05T20:11:03Z

This PR does two things:

- Fixes a bug in micro-batch mode: the end offset must be computed inside the reader rather than passed in as a fixed value.
- Adds a rate-limited head offset reader that refreshes each topic's head offsets at most once per minute.
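For readers unfamiliar with the rate-limiting idea, here is a minimal sketch of a supplier that reloads its value at most once per interval. It is not the code in this PR: `RateLimitedSupplier`, its constructor parameters, and the injected nano clock are all hypothetical names chosen for illustration, and only JDK classes are used.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical sketch: caches a loaded value and reloads it at most once per interval. */
final class RateLimitedSupplier<T> implements Supplier<T> {
  private final Supplier<T> loader;      // the expensive lookup, e.g. a head offset query
  private final LongSupplier nanoClock;  // injected clock so the behavior can be tested
  private final long minIntervalNanos;
  private T cached;
  private long lastLoadNanos;
  private boolean loaded = false;

  RateLimitedSupplier(Supplier<T> loader, LongSupplier nanoClock, long interval, TimeUnit unit) {
    this.loader = loader;
    this.nanoClock = nanoClock;
    this.minIntervalNanos = unit.toNanos(interval);
  }

  @Override
  public synchronized T get() {
    long now = nanoClock.getAsLong();
    // Reload on first use, or once the minimum interval has elapsed since the last load.
    if (!loaded || now - lastLoadNanos >= minIntervalNanos) {
      cached = loader.get();
      lastLoadNanos = now;
      loaded = true;
    }
    return cached;
  }

  public static void main(String[] args) {
    AtomicInteger loads = new AtomicInteger();
    RateLimitedSupplier<Integer> reader =
        new RateLimitedSupplier<>(loads::incrementAndGet, System::nanoTime, 1, TimeUnit.MINUTES);
    reader.get();
    reader.get(); // within the same minute, so this is served from the cached value
    System.out.println("loads = " + loads.get());
  }
}
```

A production version would more likely sit on top of an existing cache library with time-based expiry; the hand-rolled version above only makes the "at most once per minute" contract explicit.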

codecov · 2021-01-05T20:31:53Z

Codecov Report

Merging #439 (fa3e803) into master (2099751) will increase coverage by 0.08%.
The diff coverage is 42.10%.

@@             Coverage Diff              @@
##             master     #439      +/-   ##
============================================
+ Coverage     71.08%   71.16%   +0.08%     
- Complexity      914      916       +2     
============================================
  Files           167      168       +1     
  Lines          4831     4855      +24     
  Branches        243      246       +3     
============================================
+ Hits           3434     3455      +21     
- Misses         1257     1259       +2     
- Partials        140      141       +1

| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| ...le/cloud/pubsublite/internal/TopicStatsClient.java | `0.00% <ø> (ø)` | `0.00 <0.00> (ø)` |
| ...loud/pubsublite/internal/TopicStatsClientImpl.java | `66.66% <0.00%> (-33.34%)` | `3.00 <0.00> (ø)` |
| ...m/google/cloud/pubsublite/spark/PslDataSource.java | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` |
| ...e/cloud/pubsublite/spark/PslDataSourceOptions.java | `8.77% <0.00%> (-2.34%)` | `2.00 <0.00> (ø)` |
| ...oud/pubsublite/spark/LimitingHeadOffsetReader.java | `64.70% <64.70%> (ø)` | `2.00 <2.00> (?)` |
| ...le/cloud/pubsublite/spark/PslMicroBatchReader.java | `70.45% <100.00%> (-7.05%)` | `12.00 <0.00> (ø)` |
| ...ud/pubsublite/internal/wire/SubscriberBuilder.java | `40.74% <0.00%> (-7.09%)` | `2.00% <0.00%> (ø%)` |
| ...oud/pubsublite/internal/wire/PublisherBuilder.java | `62.26% <0.00%> (-6.49%)` | `3.00% <0.00%> (ø%)` |
... and 6 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

dpcollins-google

LGTM following comments

...le-cloud-pubsublite/src/main/java/com/google/cloud/pubsublite/internal/TopicStatsClient.java

...-sql-streaming/src/main/java/com/google/cloud/pubsublite/spark/LimitingHeadOffsetReader.java

...-sql-streaming/src/main/java/com/google/cloud/pubsublite/spark/PerTopicHeadOffsetReader.java

...park-sql-streaming/src/main/java/com/google/cloud/pubsublite/spark/PslDataSourceOptions.java

dpcollins-google · 2021-01-08T21:44:30Z

...-sql-streaming/src/main/java/com/google/cloud/pubsublite/spark/LimitingHeadOffsetReader.java

-      return cachedHeadOffsets.get(CACHE_KEY);
+      Map<Partition, Offset> partitionOffsetMap = new HashMap<>();
+      for (int i = 0; i < topicPartitionCount; i++) {
+        partitionOffsetMap.put(Partition.of(i), cachedHeadOffsets.get(Partition.of(i)));


This still needs to happen asynchronously: as written, the fan-out serializes all of the per-partition lookups. Use AsyncLoadingCache.java and Futures.allAsList instead.

I used Caffeine (https://github.com/ben-manes/caffeine), since the Guava lib has been soft-deprecated (https://b.corp.google.com/issues/171496465, among many other places). I also need to pull in a future converter (similar to https://source.corp.google.com/search?q=net.javacrumbs.futureconverter.java8guava.FutureConverter.toCompletableFuture).
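To illustrate the asynchronous fan-out requested above, here is a rough sketch that starts every per-partition lookup before waiting on any of them. It swaps in the JDK's `CompletableFuture.allOf` for Guava's `Futures.allAsList` and does not use a cache at all. `AsyncFanoutSketch`, `headOffsets`, and the `lookup` parameter are hypothetical names, and plain `Integer`/`Long` stand in for `Partition` and `Offset`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.IntFunction;

/** Hypothetical sketch: issue all per-partition lookups up front, then combine the results. */
final class AsyncFanoutSketch {
  static CompletableFuture<Map<Integer, Long>> headOffsets(
      int partitionCount, IntFunction<CompletableFuture<Long>> lookup) {
    // Start every lookup before waiting on any of them, so they can run concurrently.
    List<CompletableFuture<Long>> futures = new ArrayList<>();
    for (int i = 0; i < partitionCount; i++) {
      futures.add(lookup.apply(i));
    }
    // allOf completes once every lookup has completed (or fails if any lookup fails).
    return CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
        .thenApply(
            ignored -> {
              Map<Integer, Long> result = new HashMap<>();
              for (int i = 0; i < partitionCount; i++) {
                // join() does not block here: every future is already complete.
                result.put(i, futures.get(i).join());
              }
              return result;
            });
  }

  public static void main(String[] args) {
    // Stand-in lookup: pretend partition p has head offset p * 10.
    Map<Integer, Long> offsets =
        headOffsets(3, p -> CompletableFuture.supplyAsync(() -> p * 10L)).join();
    System.out.println(offsets);
  }
}
```

With an asynchronous cache in front, `lookup` would return the cache's future for each partition, so repeated calls share in-flight or cached results.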

...park-sql-streaming/src/main/java/com/google/cloud/pubsublite/spark/PslDataSourceOptions.java

pubsublite-spark-sql-streaming/pom.xml

…headoffset-reader

headoffset

eca393c

jiangmichaellll requested a review from a team as a code owner January 5, 2021 20:11

product-auto-label bot added the api: pubsublite Issues related to the googleapis/java-pubsublite API. label Jan 5, 2021

google-cla bot added the cla: yes This human has signed the Contributor License Agreement. label Jan 5, 2021

format

2a9f9d9

jiangmichaellll assigned dpcollins-google Jan 5, 2021

upate

fa3e803

jiangmichaellll requested a review from a team as a code owner January 6, 2021 20:33

dpcollins-google requested changes Jan 7, 2021

update

49ad80e

jiangmichaellll requested a review from dpcollins-google January 7, 2021 21:34

dpcollins-google reviewed Jan 8, 2021

...park-sql-streaming/src/main/java/com/google/cloud/pubsublite/spark/PslDataSourceOptions.java

jiangmichaellll added 2 commits January 11, 2021 16:37

update

0b99b2d

upadte

d6d0241

jiangmichaellll requested a review from dpcollins-google January 11, 2021 21:51