[low-code CDK] Resumable full refresh support for low-code streams #38300

Merged: 16 commits into master on May 22, 2024

Conversation

brianjlai (Contributor) commented May 16, 2024

Closes https://github.com/airbytehq/airbyte-internal-issues/issues/7000 (nice issue id)

What

https://www.loom.com/share/97b0ee3b050448aba2cbfdd5b7d91e3d

This PR makes all low-code streams support resumable full refresh if they implement a paginator and are not already implemented as incremental syncs.

This approach should work with substreams, since the new cursor adheres to the same interface as DatetimeBasedCursor (which can be used by a per-partition cursor), but I've explicitly gated it off to reduce the scope.

How

At a high level the changes are:

  • A new checkpoint reader that uses a stream's underlying Cursor implementation to manage how state is persisted and accessed
  • A new ResumableFullRefreshCursor which adheres to the Cursor interface
  • Updating all pagination strategies to support optionally resetting based on an incoming page/offset/cursor
  • Updating the simple retriever to allow for reading single pages and the flow for restarting a sync from an incoming page state
  • Updating the model to component transformer to automatically instantiate the RFR cursor when the criteria match

A few notes on the design

The new CursorBasedCheckpointReader - Something that is really nice about this is that it is both low-code and Python stream agnostic. So long as a stream implements the newly promoted Cursor concept, we can avoid the checkpoint reader needing to scale with the complexities of state.
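To make that concrete, here is a minimal sketch of the idea (not the actual CDK class): the reader is constructed with a cursor, a set of slices, and a read_state_from_cursor flag, mirroring the constructor shown later in this PR, and every state question is answered by the cursor, so the reader never needs to know whether the stream is low-code or Python. The get_checkpoint method name is an assumption for illustration.

from typing import Any, Iterable, Mapping, Optional


class CursorBasedCheckpointReaderSketch:
    """Illustrative sketch: all state bookkeeping is delegated to the stream's Cursor,
    so the reader works for any stream implementing the promoted Cursor interface."""

    def __init__(
        self,
        cursor,  # any object implementing the promoted Cursor interface
        stream_slices: Iterable[Optional[Mapping[str, Any]]],
        read_state_from_cursor: bool = False,
    ) -> None:
        self._cursor = cursor
        self._stream_slices = iter(stream_slices)
        # True for resumable full refresh: what to read next comes from cursor state
        self._read_state_from_cursor = read_state_from_cursor

    def get_checkpoint(self) -> Optional[Mapping[str, Any]]:
        # The reader never inspects the shape of state; it just asks the cursor
        return self._cursor.get_stream_state()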

I've tried to leverage the existing paginator logic as much as possible and avoid making the retriever and cursor re-implement behavior that already exists. However, it did require changes to allow resetting to a new value, since there was previously no way to start pagination from anywhere but the beginning.
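As a hedged illustration of what that reset change looks like (a simplified stand-in, not the CDK's actual PaginationStrategy code), an offset-based strategy might now accept an optional value to resume from:

from typing import Any, Optional


class OffsetIncrementSketch:
    """Illustrative offset pagination strategy that can restart from a checkpointed offset."""

    def __init__(self, page_size: int) -> None:
        self._page_size = page_size
        self._offset = 0

    def next_page_token(self, records_in_last_response: int) -> Optional[int]:
        # Stop paginating when the API returns a short (or empty) page
        if records_in_last_response < self._page_size:
            return None
        self._offset += records_in_last_response
        return self._offset

    def reset(self, reset_value: Optional[Any] = None) -> None:
        # Previously a reset could only go back to the beginning; the optional value
        # lets resumable full refresh resume pagination from an incoming page state.
        self._offset = reset_value if reset_value is not None else 0

The same pattern applies to the page-increment and cursor-based strategies.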

You'll also notice that the entire flow leverages the Cursor and StreamSlice types that were recently promoted in #38077. This helps manage the complexity of the reader code since we can make clear assumptions about what is supplied to the source and passed back from the cursor.

Review guide

  1. resumable_full_refresh_cursor.py
  2. cursor_pagination_strategy.py
  3. page_increment.py
  4. offset_increment.py
  5. default_paginator.py
  6. simple_retriever.py
  7. checkpoint_reader.py
  8. model_to_component_transformer.py

User Impact

A few potential breaking changes:

  • PaginationStrategy.reset() can now take an optional parameter. This would affect custom pagination strategies, but a look through our repo only shows one example, in source-guardian-api, which is minimally used
  • Cursor.select_state() has been added to the interface. It was originally unused by the PerPartitionCursor, but we will eventually need it for substream state. I wired most of it up, but custom Cursor/DatetimeBasedCursor components need to implement it; it's unused by connectors right now. (A hedged sketch of both signatures follows this list.)
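For custom component authors, the migration surface is roughly the two signatures below. This is a hedged sketch; the exact parameter types live in the released CDK.

from abc import ABC, abstractmethod
from typing import Any, Mapping, Optional


class PaginationStrategySketch(ABC):
    @abstractmethod
    def reset(self, reset_value: Optional[Any] = None) -> None:
        """Custom strategies should accept (and may ignore) the optional reset value."""


class CursorSketch(ABC):
    @abstractmethod
    def select_state(self, stream_slice: Optional[Mapping[str, Any]] = None) -> Optional[Mapping[str, Any]]:
        """Return the state relevant to the given slice; for single-dimension cursors
        this is simply the whole stream state."""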

Can this PR be safely reverted and rolled back?

  • YES 💚
  • NO ❌


"""
return False

def select_state(self, stream_slice: Optional[StreamSlice] = None) -> Optional[StreamState]:
brianjlai (Contributor, Author) commented:

This is a new interface method that we need in order to sanely read state from the CheckpointReader. It isn't that useful for unnested streams, but it is critical for being able to parse substream state as we iterate over parent record state in the CheckpointReader when we do that work.
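As a hypothetical illustration of why select_state matters for substreams (the storage layout below is made up for the example, not how PerPartitionCursor actually persists state):

from typing import Any, Mapping, Optional


def select_state_sketch(
    per_partition_states: Mapping[str, Mapping[str, Any]],
    stream_slice: Optional[Mapping[str, Any]] = None,
) -> Optional[Mapping[str, Any]]:
    """Given substream state keyed by parent partition, return only the state for the
    slice's partition so the checkpoint reader can tell whether that parent still has
    pages left to read."""
    if stream_slice is None:
        return None
    partition_key = str(stream_slice.get("partition"))
    return per_partition_states.get(partition_key)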

"""
return True

def is_greater_than_or_equal(self, first: Record, second: Record) -> bool:
brianjlai (Contributor, Author) commented:

This is the one part of the existing cursor interface that really doesn't fit. We can default to False, and ultimately we don't even care about the record since we close the slice based on the page number.

self._current_slice: Optional[StreamSlice] = None
self._finished_sync = False

def next(self) -> Optional[Mapping[str, Any]]:
brianjlai (Contributor, Author) commented:

This method is a bit more complicated than it needs to be for streams because it can iterate in two dimensions: over an incoming static set of slices, and dynamically based on the current cursor's stream state.

Since this is only scoped to full refresh streams (not substreams nor incremental), there should only be one static slice to loop over. But we support it regardless, and it positions us better for substream low-code RFR when we prioritize it.
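A hedged sketch of that two-dimensional iteration (the terminal marker and the "cursor_slice" nesting below are placeholders for this example, not the CDK's actual keys):

from typing import Any, Iterator, Mapping, Optional


class TwoDimensionalReaderSketch:
    """Outer dimension: a static list of slices (a single one for plain full refresh).
    Inner dimension: the same slice is handed back, enriched with the cursor's latest
    state, until the cursor reports that the slice is finished."""

    def __init__(self, cursor, stream_slices: Iterator[Mapping[str, Any]]) -> None:
        self._cursor = cursor
        self._stream_slices = stream_slices
        self._current_slice: Optional[Mapping[str, Any]] = None

    def next(self) -> Optional[Mapping[str, Any]]:
        if self._current_slice is None:
            self._current_slice = next(self._stream_slices, None)
            if self._current_slice is None:
                return None  # no more static slices: the sync is done
        state = self._cursor.select_state(self._current_slice) or {}
        if state.get("__completed__"):  # placeholder terminal marker
            self._current_slice = None  # inner dimension exhausted; advance the outer one
            return self.next()
        # same slice again, carrying the latest page state observed by the cursor
        return {**self._current_slice, "cursor_slice": state}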

brianjlai marked this pull request as ready for review May 17, 2024 00:14
brianjlai requested a review from a team as a code owner May 17, 2024 00:14
girarda (Contributor) left a comment:

I'm a fan of this! This is a pretty big change but the abstractions make it manageable. ✅

I left a few comments, but it's mostly just requests for a few more comments. :shipit:

@@ -157,3 +159,8 @@ def state_checkpoint_interval(self) -> Optional[int]:
important state is the one at the beginning of the slice
"""
return None

def get_cursor(self) -> Optional[Cursor]:
Contributor commented:

more evidence the cursor belongs to the stream, not the retriever

elif hasattr(model.retriever, "paginator") and model.retriever.paginator and not stream_slicer:
# To incrementally deliver RFR for low-code we're first implementing this for streams that do not use
# nested state like substreams or those using list partition routers
return ResumableFullRefreshCursor(parameters={})
Contributor commented:

👍
Is there a follow-up issue to support substreams?

brianjlai (Contributor, Author) commented:

Yep, I have it filed here from when the original spec was written: https://github.com/airbytehq/airbyte-internal-issues/issues/7528

@@ -68,6 +71,7 @@ def __post_init__(self, parameters: Mapping[str, Any]) -> None:
self._last_record: Optional[Record] = None
self._parameters = parameters
self._name = InterpolatedString(self._name, parameters=parameters) if isinstance(self._name, str) else self._name
self._synced_partitions: MutableMapping[Any, bool] = dict()
Contributor commented:

can you add a comment clarifying what the True means here? Looking at the code, it seems to mean that we started syncing records from a partition.

It's also not obvious that this is only used for RFR.

Contributor commented:

+1, I made a comment down there but I'm getting stuck on this

brianjlai (Contributor, Author) commented:

good suggestion, will add!

# Before syncing the RFR stream, we check if the job's prior attempt was successful and don't need to fetch more records
# The platform deletes stream state for full refresh streams before starting a new job, so we don't need to worry about
# this value existing for the initial attempt
if stream_state.get(FULL_REFRESH_SYNC_COMPLETE_KEY):
Contributor commented:

thinking out loud: the fact that there are essentially two different code paths that are mutually exclusive for a given stream makes me think there's an opportunity to introduce a new abstraction where one implementation would be RFR and the other the standard read_records.

I'd love to simplify this, but I'm not sure what the right abstraction is yet

brianjlai (Contributor, Author) commented:

Yeah, I would agree with that. It's kind of our way of not needing to re-implement read_records() like we do for Python sources. It's not immediately clear yet at what level the abstraction needs to sit. But I can't imagine pulling this out into its own abstraction later will be much of a lift.
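To pin down the two mutually exclusive paths being discussed, a hypothetical sketch (the constant's value below is a placeholder; the CDK defines its own key for the terminal marker):

from typing import Any, Callable, Iterable, Mapping

FULL_REFRESH_SYNC_COMPLETE_KEY = "__example_sync_complete"  # placeholder, not the CDK's value


def read_with_prior_attempt_check(
    stream_state: Mapping[str, Any],
    read_pages: Callable[[Mapping[str, Any]], Iterable[Mapping[str, Any]]],
) -> Iterable[Mapping[str, Any]]:
    """Skip entirely when the prior attempt already completed; otherwise resume reading
    pages from whatever state the previous attempt checkpointed."""
    if stream_state.get(FULL_REFRESH_SYNC_COMPLETE_KEY):
        return []  # prior attempt finished; the platform clears this state before a new job
    return read_pages(stream_state)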

self._current_slice: Optional[StreamSlice] = None
self._finished_sync = False

def next(self) -> Optional[Mapping[str, Any]]:
Contributor commented:

can you add a docstring describing the algorithm in English?

brianjlai (Contributor, Author) commented:

done

to the Cursor interface.
"""

def __init__(self, cursor: Cursor, stream_slices: Iterable[Optional[Mapping[str, Any]]], read_state_from_cursor: bool = False):
Contributor commented:

can you add a comment explaining to a user of this class how to configure read_state_from_cursor?

self._current_slice = self._get_next_slice()
return self._current_slice
if self._read_state_from_cursor:
state_for_slice = self._cursor.select_state(self._current_slice.get("partition"))
Contributor commented:

nit: I think it's preferable to call self._current_slice.partition instead of get, as it slowly gets us away from thinking of the slice as an arbitrary dict.

brianjlai (Contributor, Author) commented:

Actually, I realize it should just be self._current_slice, since select_state() takes in the slice itself, not the partition. But regardless, this is fixed and we don't reference the arbitrary dict. Thanks!

brianjlai (Contributor, Author) commented:

Ran regression tests successfully against 3 sources/connections.

Comment on lines +325 to +326
# Always return an empty generator just in case no records were ever yielded
yield from []
Contributor commented:

Non-blocking: feels like this should somehow be baked into a base implementation of records_generator_fn instead of being an extra check here 🤔

Comment on lines +73 to +77
"""
Get the state value of a specific stream_slice. For incremental or resumable full refresh cursors which only manage state in
a single dimension this is the entire state object. For per-partition cursors used by substreams, this returns the state of
a specific parent delineated by the incoming slice's partition object.
"""
Contributor commented:

Helpful, thank you!

Comment on lines 82 to 83
Right now only low-code connectors provide cursor implementations, but the logic is extensible to any stream that adheres
to the Cursor interface.
Contributor commented:

Is this true? Thinking about file-based and concurrent. Is this because the Abstract class for the declarative cursor is separate from the ones that concurrent and file-based use?

brianjlai (Contributor, Author) commented:

I would say it is true for the current type of Cursor we've promoted up. We're in a messy spot where we have a couple of different cursor interfaces, though, which is not great. But I will clarify this comment since you're right that it's confusing.

Contributor commented:

Thank you!

Comment on lines +107 to +109
# Unlike RFR cursors that iterate dynamically based on how stream state is updated, most cursors operate on a
# fixed set of slices determined before reading records. They should just iterate to the next slice
self._current_slice = self._get_next_slice()
Contributor commented:

Just an observation: seeing a parallel between partition generators that generate partitions in advance and the fact that we were talking about creating partitions dynamically based on whether a next page cursor exists 🤔

)
assert exp == airbyte_stream
assert airbyte_stream == exp
Contributor commented:

❤️

brianjlai (Contributor, Author) commented May 22, 2024:

freshdesk and greenhouse: passing regression tests (screenshots).

brianjlai merged commit 040f141 into master May 22, 2024
30 checks passed
brianjlai deleted the brian/rfr_low_code branch May 22, 2024 20:23
brianjlai restored the brian/rfr_low_code branch May 23, 2024 09:59
sentry-io bot commented May 28, 2024

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

Labels: CDK Connector Development Kit
Projects: None yet
4 participants