
[BEAM-8376] Initial version of firestore connector JavaSDK #10187

Closed
wants to merge 14 commits

Conversation

djelekar

Added initial version of firestore connector for JavaSDK.

R: @chamikaramj

Post-Commit Tests Status (on master branch) and Pre-Commit Tests Status (on master branch): the per-language/per-runner build-status badge tables are omitted here, as the badge images did not survive extraction.

See .test-infra/jenkins/README for the trigger phrase, status, and link of all Jenkins jobs.

@chamikaramj (Contributor) left a comment:

Thanks!


import static org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkArgument;

public class FirestoreIO {
Contributor:

Please add experimental annotations to this and other public interfaces.

}

/**
* Returns a new {@link Write} that writes to the Cloud Firestore for the specified collection.
Contributor:

It would probably be good to add a brief description of collection IDs and/or point to a relevant page on the Firestore website.

}

/**
* Returns a new {@link Write} that writes to the Cloud Firestore for the specified collection key.
Contributor:

Ditto (on documentation). What's the difference between a collection ID and a key ID?

Author:

The collection ID is the unique identifier of a Firestore collection (e.g. "userSessions"), while the key ID is the unique identifier of a specific document within a collection (e.g. "5a0062ef-3fe2-490d-b1be-06f5bf94a39d").

I added a link to the Firestore data model page which explains it pretty well.

Do you think that is enough, or should we adjust naming/docs additionally?
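To illustrate the distinction, here is a minimal sketch (the helper name is hypothetical and not part of the connector's API): a Firestore document path is the collection ID followed by the document (key) ID.

```java
/**
 * Hypothetical helper illustrating the collection-ID / key-ID distinction.
 * A Firestore document path has the form "<collectionId>/<documentId>".
 */
final class FirestorePaths {
  static String documentPath(String collectionId, String documentId) {
    // Single-segment IDs must not contain the path separator.
    if (collectionId.contains("/") || documentId.contains("/")) {
      throw new IllegalArgumentException("IDs must not contain '/'");
    }
    return collectionId + "/" + documentId;
  }
}
```

So "userSessions" names the whole collection, while "userSessions/5a0062ef-3fe2-490d-b1be-06f5bf94a39d" names one document inside it.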

/**
* Writes a batch of mutations to Cloud Firestore.
*
* <p>If a commit fails, it will be retried up to {@link #MAX_RETRIES} times. All mutations in
Contributor:

Can this result in duplicate data for records that were written in the previous try ?

Author:

That doc comment is actually incorrect. I've used Firestore "batched writes", which makes writing the mutations an atomic operation: either all succeed or all fail, so there is no risk of duplicated entries. I updated the docs.

@VisibleForTesting
public class WriteBatcherImpl implements WriteBatcher, Serializable {
/** Target time per RPC for writes. */
static final int FIRESTORE_BATCH_TARGET_LATENCY_MS = 5000;
Contributor:

Please document these constants.
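One way such a target-latency constant is typically used (a sketch of the idea only; the names and bounds here are assumptions, not the PR's actual code): derive the next batch size from the observed per-write latency so that each commit RPC lands near the target latency, clamped to sane bounds.

```java
/**
 * Sketch of latency-based batch sizing. Constants and names are illustrative
 * assumptions, not the PR's actual implementation.
 */
final class BatchSizer {
  /** Target wall-clock time for a single commit RPC, in milliseconds. */
  static final int TARGET_LATENCY_MS = 5000;
  /** Lower bound on writes per commit. */
  static final int MIN_BATCH_SIZE = 1;
  /** Upper bound on writes per commit (Firestore caps commits at 500 writes). */
  static final int MAX_BATCH_SIZE = 500;

  /** Picks a batch size so that batchSize * avgMsPerWrite is roughly TARGET_LATENCY_MS. */
  static int nextBatchSize(double avgMsPerWrite) {
    if (avgMsPerWrite <= 0) {
      return MAX_BATCH_SIZE; // No latency data yet; start optimistic.
    }
    long size = Math.round(TARGET_LATENCY_MS / avgMsPerWrite);
    return (int) Math.max(MIN_BATCH_SIZE, Math.min(MAX_BATCH_SIZE, size));
  }
}
```

Fast writes push the batch size toward the 500-write cap; slow writes shrink it toward single-write commits.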

}

for (WriteResult result : future.get()) {
LOG.info("Update time : " + result.getUpdateTime());
Contributor:

It will probably be useful to log the firestoreKey here as well.

// Break if the commit threw no exception.
break;
} catch (FirestoreException exception) {
if (exception.getCode() == 4) {
Contributor:

So 4 means DEADLINE_EXCEEDED? Please clarify in the comment.
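For context: 4 is indeed the gRPC canonical status code for DEADLINE_EXCEEDED. A sketch of making the check self-documenting (in real code one would use the status enum shipped with the gRPC/Firestore client rather than redeclaring it):

```java
/**
 * The gRPC canonical status codes, reproduced here only to illustrate
 * replacing the magic number 4 with a named constant.
 */
enum CanonicalGrpcCode {
  OK(0), CANCELLED(1), UNKNOWN(2), INVALID_ARGUMENT(3), DEADLINE_EXCEEDED(4),
  NOT_FOUND(5), ALREADY_EXISTS(6), PERMISSION_DENIED(7), RESOURCE_EXHAUSTED(8),
  FAILED_PRECONDITION(9), ABORTED(10), OUT_OF_RANGE(11), UNIMPLEMENTED(12),
  INTERNAL(13), UNAVAILABLE(14), DATA_LOSS(15), UNAUTHENTICATED(16);

  final int value;

  CanonicalGrpcCode(int value) {
    this.value = value;
  }

  /** True when the numeric code is the retryable DEADLINE_EXCEEDED status. */
  static boolean isDeadlineExceeded(int code) {
    return code == DEADLINE_EXCEEDED.value;
  }
}
```

With this, the retry condition reads `CanonicalGrpcCode.isDeadlineExceeded(exception.getCode())` instead of `exception.getCode() == 4`.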

// Only log the code and message for potentially-transient errors. The entire exception
// will be propagated upon the last retry.
LOG.error(
"Error writing batch of {} mutations to Firestore ({}): {}",
Contributor:

Log the firestoreKey?

import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.util.MovingFunction;

class MovingAverage {
Contributor:

Please add Javadocs to clarify why we need this.
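For intuition, the core idea behind such a helper can be sketched with a simple count-based window (illustrative only; the PR's MovingAverage is built on Beam's time-bucketed MovingFunction, not on a sample count):

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Sketch of a fixed-window moving average: keeps the last N samples and
 * reports their mean, e.g. for tracking recent per-write latency.
 */
final class SimpleMovingAverage {
  private final int windowSize;
  private final Deque<Double> samples = new ArrayDeque<>();
  private double sum;

  SimpleMovingAverage(int windowSize) {
    this.windowSize = windowSize;
  }

  void add(double sample) {
    samples.addLast(sample);
    sum += sample;
    if (samples.size() > windowSize) {
      sum -= samples.removeFirst(); // Evict the oldest sample.
    }
  }

  double get() {
    return samples.isEmpty() ? 0.0 : sum / samples.size();
  }
}
```

The connector needs this kind of smoothing so that batch sizing and throttling react to recent latency rather than to any single slow RPC.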

* Tests for {@link AdaptiveThrottler}.
*/
@RunWith(JUnit4.class)
public class AdaptiveThrottlerTest {
Contributor:

Please also add unit tests for the sink. See here for some example unit tests that we could add for Firestore:
https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/datastore/DatastoreV1Test.java

In addition, please also consider adding an integration test similar to this one: https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/datastore/V1WriteIT.java

}

public WriteBatch batchWithKey(List<T> input, String collection, String key) {
WriteBatch batch = firestoreClient.batch();
@fredzqm (Dec 5, 2019):

Please note that a WriteBatch is an atomic transaction. Large WriteBatches can lead to contention and high error rates.

We are working on launching a batchWrite API for non-atomic data-ingestion use cases. Until it launches, the next best option is writing each document separately.

(BTW, there is a limit of 500 writes per commit.)
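Whatever write strategy is chosen, the 500-writes-per-commit limit means the connector has to split its input into compliant chunks. A minimal sketch (names are illustrative, not the PR's code):

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch: split a list of mutations into chunks that respect Firestore's
 * 500-writes-per-commit limit.
 */
final class BatchChunker {
  static final int MAX_WRITES_PER_COMMIT = 500;

  static <T> List<List<T>> chunk(List<T> mutations) {
    List<List<T>> chunks = new ArrayList<>();
    for (int i = 0; i < mutations.size(); i += MAX_WRITES_PER_COMMIT) {
      // subList is a view; the last chunk may be shorter than the limit.
      chunks.add(mutations.subList(i, Math.min(i + MAX_WRITES_PER_COMMIT, mutations.size())));
    }
    return chunks;
  }
}
```

Each chunk then becomes one commit (atomic or not, depending on the API used).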

Reviewer:

Please make sure this is addressed.

@fredzqm commented Dec 14, 2019:

Please note that a WriteBatch is an atomic transaction. Large WriteBatches can lead to contention and high error rates.

We are working on launching a batchWrite API for non-atomic data-ingestion use cases. Until it launches, the next best option is writing each document separately.

(BTW, there is a limit of 500 writes per commit.)

@chamikaramj (Contributor):

Thanks @fredzqm.

@djelekar please let us know when current comments are addressed.

@chamikaramj (Contributor):

Any updates?

Thanks.

@djelekar (Author):

Hi @chamikaramj, yes, the tests are there and I've addressed all your comments in the code. However, I'm having issues running the integration tests on my machine (Windows). Once that is working, I'll ping you for one more review.

Hi @fredzqm, I understand your concerns; however, we have had better experience running jobs with the WriteBatch command. Namely, when doing imports of large amounts of data, we observed faster processing times with WriteBatch than with single writes. Can you point me to some documentation or other material that supports your point?

Additionally, the code uses an AdaptiveThrottler with a max batch size of 500, so hopefully that stays within the limit.
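For reference, the client-side adaptive throttling idea (as described in the Google SRE book, which Beam's Datastore AdaptiveThrottler is modeled on) can be sketched as follows; the field names and the overload ratio of 2.0 here are illustrative, not the PR's actual constants:

```java
/**
 * Sketch of client-side adaptive throttling: track total requests vs.
 * successful requests, and locally reject requests with a probability that
 * grows as the backend's success rate drops.
 */
final class ThrottleSketch {
  private double requests;
  private double successes;
  /** How many requests per success the client tolerates before throttling. */
  private final double overloadRatio = 2.0;

  void record(boolean success) {
    requests++;
    if (success) {
      successes++;
    }
  }

  /** Probability with which the client should reject a request locally. */
  double rejectionProbability() {
    return Math.max(0, (requests - overloadRatio * successes) / (requests + 1));
  }
}
```

While everything succeeds the probability stays at zero; as failures accumulate it rises toward one, shedding load before the backend is overwhelmed.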

@clement commented Feb 5, 2020:

Hi @djelekar, I work on the Firestore backend, and I'm chiming in to second @fredzqm's point. There are two interlocking issues with using an atomic WriteBatch for large ingestion (throughput) jobs.

First, under load and based on size, Firestore will split your dataset across multiple servers. When writing atomically to multiple documents, this increases the chance that the write will need to coordinate a two-phase commit across multiple servers, which increases the latency of the operation.

Second, Firestore uses a pessimistic locking model under the hood. If the WriteBatch takes longer to execute (because of the issue above, or just because it is doing more work), it will hold locks longer and disrupt unrelated read/write traffic on the documents or index entries.

I can see reasons why the experience looks better with WriteBatch, for example:

  • when using single writes, those should be asynchronous, and can (and should) be parallelized more aggressively
  • if the ingested key range is not yet split, or not actively accessed by other processes, there will initially be no contention and good performance with WriteBatch; however, there is a limit to how much throughput you will get from it once the ingestion runs longer and ramps up to more parallelism.

Does that make sense? We are hoping to launch a dedicated feature for writing batches in a non-atomic fashion, but it is unclear at this point when it will be generally available, and as @fredzqm points out, single writes are the best option for now.
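The "asynchronous single writes, parallelized with back-pressure" suggestion can be sketched as below. The write itself is simulated here (a counter increment); in the connector it would be an asynchronous per-document Firestore write, and all names are illustrative:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Sketch: issue single-document writes in parallel while bounding the number
 * of in-flight requests, per the reviewers' advice.
 */
final class BoundedParallelWriter {
  static int writeAll(List<String> documents, int maxInFlight) throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(maxInFlight);
    Semaphore inFlight = new Semaphore(maxInFlight);
    AtomicInteger written = new AtomicInteger();
    for (String doc : documents) {
      inFlight.acquire(); // Back-pressure: at most maxInFlight outstanding writes.
      pool.execute(
          () -> {
            try {
              written.incrementAndGet(); // Stand-in for one single-document write.
            } finally {
              inFlight.release();
            }
          });
    }
    pool.shutdown();
    pool.awaitTermination(10, TimeUnit.SECONDS);
    return written.get();
  }
}
```

Because each write touches one document, there is no cross-server two-phase commit and no long-held locks, at the cost of losing batch atomicity.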

stale bot commented Apr 25, 2020

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.

@stale stale bot added the stale label Apr 25, 2020
willbattel commented Apr 25, 2020

Not stale

@stale stale bot removed the stale label Apr 25, 2020
@willbattel
Hey @fredzqm @clement are there any updates regarding the mentioned non-atomic batch writing capability?

@aaltay (Member) commented Jun 25, 2020:

What is the state of this PR? Do you need help?

stale bot commented Aug 29, 2020

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.

@stale stale bot added the stale label Aug 29, 2020
stale bot commented Sep 5, 2020

This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@stale stale bot closed this Sep 5, 2020
@willbattel
I noticed the aforementioned bulk-writer feature for Firestore has been made available in some clients, such as Java and Node. Can this PR now be continued using the new API for parallelized writes?

oeclint commented Dec 26, 2020

Does the new batch writer mean we could reopen this?


7 participants