
🚨 🚨 ✨ Source Tik Tok Marketing: Migration to Low-Code #38316

Open · wants to merge 37 commits into master

Conversation

@darynaishchenko (Collaborator) commented May 17, 2024

What

resolved: https://github.com/airbytehq/airbyte-internal-issues/issues/7824

How

Migrated the source from the Python CDK to the low-code CDK.
Regression tests are described here: #38316 (comment)
Main changes:

  • State: Previously, all incremental streams used a single flat state without partitioning, which was incorrect. On the low-code CDK, all incremental streams use per-partition state.
  • Lifetime reports: The previous implementation sent lifetime=true as a request param, which is deprecated in API v1.3. Lifetime reports now use query_lifetime=true; with this param, start_date and end_date must not be provided. Exception: advertiser_lifetime_report: API v1.3 doesn't allow query_lifetime=true with advertiser reports, so this stream was implemented exactly as in the py version, with start_date and end_date query params (range >= 365d).
  • Advertiser Ids stream: the schema was updated so advertiser_id uses the type the API docs declare.
  • Discover for configs with granularity: the py implementation was missing streams (campaigns_audience_reports, ad_group_audience_reports_by_platform, ad_group_audience_reports_by_country, ads_audience_reports_by_country, advertisers_audience_reports_by_country, campaigns_audience_reports_by_platform, advertisers_audience_reports_by_platform, ads_audience_reports_by_platform, ads_audience_reports_by_province) that users with a provided granularity could actually use, but the streams method didn't return them. For configs with granularity, the source removes granularity from the stream name, keeping the previous naming.
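The per-partition state change is the most impactful item above. A minimal sketch of the two state shapes, assuming the low-code CDK's "states"/"partition"/"cursor" convention; the cursor field name and advertiser ids are illustrative:

```python
# Hedged sketch: flat legacy state vs. per-partition state. The keys
# ("states", "partition", "cursor") follow the low-code CDK convention;
# the cursor field "modify_time" and the ids are illustrative only.

legacy_state = {"modify_time": "2024-05-01T00:00:00Z"}  # one cursor shared by all advertisers

per_partition_state = {
    "states": [
        {
            "partition": {"advertiser_id": "11111111", "parent_slice": {}},
            "cursor": {"modify_time": "2024-05-01T00:00:00Z"},
        },
        {
            "partition": {"advertiser_id": "22222222", "parent_slice": {}},
            "cursor": {"modify_time": "2024-04-15T00:00:00Z"},
        },
    ]
}

# Each partition now advances independently instead of sharing one cursor:
cursors = {
    s["partition"]["advertiser_id"]: s["cursor"]["modify_time"]
    for s in per_partition_state["states"]
}
```

Because the saved state shape changes, existing connections cannot resume from the old flat cursor, which is why this is listed as a breaking change.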

Review guide

User Impact

Breaking change: users will need to follow the migration guide for affected streams.

Can this PR be safely reverted and rolled back?

Breaking change due to changes in schema and state format.

  • YES 💚
  • NO ❌

@darynaishchenko darynaishchenko self-assigned this May 17, 2024

@darynaishchenko darynaishchenko marked this pull request as ready for review May 23, 2024 17:34
@octavia-squidington-iv octavia-squidington-iv requested a review from a team May 23, 2024 17:36
@darynaishchenko darynaishchenko requested a review from a team May 24, 2024 15:46
else:
stream_state = stream_states[0]
kwargs = {"stream_state": stream_state, "stream_slice": stream_slice, "next_page_token": next_page_token}
return [record for record in records if self._filter_interpolator.eval(self.config, record=record, **kwargs)]
Collaborator:
Please update to align with original signature (Iterable)

Collaborator (Author):
updated
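The change the reviewer asked for might look roughly like this: a hedged sketch of returning a lazy Iterable instead of building a list. The predicate here is a stand-in for the real InterpolatedBoolean evaluation, and the function/field names are illustrative, not the connector's actual code:

```python
from typing import Any, Iterable, Mapping

# Hedged sketch: yield records lazily so the method matches the original
# Iterable signature. The date comparison stands in for the real call
# self._filter_interpolator.eval(self.config, record=record, **kwargs).
def filter_records(
    records: Iterable[Mapping[str, Any]],
    stream_state: Mapping[str, Any],
) -> Iterable[Mapping[str, Any]]:
    for record in records:
        if record.get("modify_time", "") >= stream_state.get("modify_time", ""):
            yield record

records = [
    {"id": 1, "modify_time": "2024-05-02"},
    {"id": 2, "modify_time": "2024-04-01"},
]
kept = list(filter_records(records, {"modify_time": "2024-05-01"}))
```

A generator keeps memory flat for large record batches and matches how the rest of the CDK consumes record streams.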

@darynaishchenko (Collaborator, Author) commented:

Regression test results:
test_catalog_are_the_same [failed] – advertiser_id updated: integer → string (breaking change described in the docs).

TestDataIntegrity.test_record_schema_match_without_state [failed] - Value of root['properties']['budget']['type'] changed from "integer" to "number". Value of root['properties']['roas_bid']['type'] changed from "integer" to "number". (Same error for all fields whose schema type is number but whose actual value is an integer.)
Both versions have type number in the schema, but a default type transformer was added in the low-code version, so the value 0 is changed to 0.0. For destinations with transformations (e.g. BigQuery) it's not a breaking change, as the destination already converts these values to a number.
These streams are in the list of breaking changes affected by state changes, so users will do a refresh & clear anyway.
The change from 0 to 0.0 occurred because default schema normalization was added in the low-code version to stay compatible with stream schemas added for API v1.2.0, and in v1.3.0 some fields have a new type. For example, *_id was changed from integer to string, while the v1.2.0 stream schemas use integer as the type.

TestDataIntegrity.test_all_records_are_the_same_without_state [failed] - Same differences with integer/number as above.

Read URLs: some extra requests in the py version due to HttpAvailabilityStrategy.

PS: Reviewers can ask me to send the full HTML report in a Slack DM. Regression tests were run locally, as I needed to change the start date in the config, and I chose testing without state due to the breaking changes.
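The 0 → 0.0 difference described above can be illustrated with a minimal sketch. The normalization logic here is illustrative, not the CDK's actual schema transformer:

```python
# Hedged sketch: a "number"-typed field run through default schema
# normalization emits 0.0 where the raw API value was the integer 0.
# Schema and field names are illustrative.
schema = {"properties": {"budget": {"type": "number"}}}
record = {"budget": 0}

def normalize(record, schema):
    # Coerce each field declared as "number" to float, mimicking what a
    # default type transformer would do.
    out = dict(record)
    for field, spec in schema["properties"].items():
        if spec["type"] == "number" and field in out:
            out[field] = float(out[field])  # 0 -> 0.0
    return out

normalized = normalize(record, schema)
```

This is why the regression diff flags integer/number mismatches even though both schemas declare the fields as number.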

@brianjlai (Contributor) left a comment:

I think the manifest and the schemas overall look good and given the size of the manifest and number of streams, I am going to trust that we've carefully run live tests to verify that the changes are working and the breaking changes are expected. I didn't see anything glaring.

I did however have some questions to clarify my understanding of the custom components, and some suggestions on the code itself, especially around why exactly we need two types of advertiser id (+ids) partition routers.

field as API docs declares it.
Users will need to reset source configuration, refresh the source schema and reset the impacted streams after upgrading.
For more information, see our migration documentation for source TikTok Marketing.
upgradeDeadline: "2024-08-07" # TODO: update this date before merge
Contributor:
@katmarkham what is the actual intended deadline date?

path: "{{ parameters['path'] }}"
http_method: GET
error_handler:
type: CompositeErrorHandler
Contributor:
Nit: can we get rid of the CompositeErrorHandler and just use DefaultErrorHandler at the top level? This is not on you, but we are planning to deprecate this component, and we can save the trouble of fixing this connector later if we get rid of it now.

Collaborator (Author):
updated with DefaultErrorHandler

Custom AdvertiserIdsPartitionRouter and AdvertiserIdPartitionRouter partition routers are used to get advertiser_ids
as slices for streams where it uses as request param.

When user uses sandbox account it's impossible to get advertiser_ids via API.
Contributor:
"When using a sandbox account, it's impossible to get advertiser_ids via API."

Collaborator (Author):
updated comment

Main difference between AdvertiserIdsPartitionRouter and AdvertiserIdPartitionRouter is
that AdvertiserIdPartitionRouter returns multiple advertiser_ids in a one slice when id is not provided,
e.g. {"advertiser_ids": '["11111111", "22222222"]', "parent_slice": {}}.
And AdvertiserIdPartitionRouter returns single slice for every advertiser_id as usual.
Contributor:
I think this sentence is unclear because you might have a typo. Starting with "is that ...":

AdvertiserIdPartitionRouter returns multiple advertiser_ids in a one slice when id is not provided
AdvertiserIdPartitionRouter returns single slice for every advertiser_id as usual.

I think one of these should reference AdvertiserIdsPartitionRouter.

But in the context of these components, can you add to your comment why we even need to return a single slice with multiple advertiser_ids in it? This seems unnecessarily complex, given that the other flow returns one slice per ID, which is the normal convention.

I think we should try to avoid multiple advertiser_ids in a single slice if we can avoid it. Couldn't we just read in the config's advertiser ids and return multiple slices?

Collaborator (Author):
I think one of these should reference AdvertiserIdsPartitionRouter.

updated with correct names.

I think we should try to avoid multiple advertiser_ids in a single slice if we can avoid it. Couldn't we just read in the config's advertiser ids and return multiple slices?

MultipleAdvertiserIdsPartitionRouter is used in only one stream, Advertisers, to fetch more than one advertiser at once and reduce the number of requests. That's how it was implemented in the py version of the connector. All other streams use SingleAdvertiserIdPartitionRouter. advertiser_ids is not a required property in the config, so we can't avoid reading the parent stream.
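The two slicing strategies under discussion can be sketched as follows. The chunk size and field names are illustrative; the real routers are the connector's custom components:

```python
import json
from typing import Any, Iterator, List, Mapping

# Hedged sketch of the two partition-routing strategies discussed above.
def multiple_ids_slices(ids: List[str], step: int = 100) -> Iterator[Mapping[str, Any]]:
    # Bundle several advertiser_ids into one slice to cut the request count
    # (the Advertisers-stream behavior).
    for i in range(0, len(ids), step):
        yield {"advertiser_ids": json.dumps(ids[i : i + step]), "parent_slice": {}}

def single_id_slices(ids: List[str]) -> Iterator[Mapping[str, Any]]:
    # One slice per advertiser_id: the usual convention for all other streams.
    for advertiser_id in ids:
        yield {"advertiser_id": advertiser_id, "parent_slice": {}}

ids = ["11111111", "22222222", "33333333"]
bundled = list(multiple_ids_slices(ids, step=2))
single = list(single_id_slices(ids))
```

With a bundle size of n, the Advertisers stream makes roughly 1/n as many requests as the one-slice-per-id convention would.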

yield StreamSlice(partition={"advertiser_ids": json.dumps(slices[i : min(end, i + step)]), "parent_slice": {}}, cursor_slice={})


class SingleAdvertiserIdPartitionRouter(MultipleAdvertiserIdsPartitionRouter):
Contributor:
Can you incorporate some of the above description into this partition router as well. I think my above question still is relevant. Why can't we return one slice per partition_value_in_config for better consistency?

Collaborator (Author):
added description.

Why can't we return one slice per partition_value_in_config for better consistency?

We return one slice per partition_value_in_config if it's in the provided config or read stream slices if it's not provided.

Contributor:
We return one slice per partition_value_in_config if it's in the provided config or read stream slices if it's not provided.

Makes sense thanks.

MultipleAdvertiserIdsPartitionRouter is used in only one stream, Advertisers, to fetch more than one advertiser at once and reduce the number of requests. That's how it was implemented in the py version of the connector. All other streams use SingleAdvertiserIdPartitionRouter. advertiser_ids is not a required property in the config, so we can't avoid reading the parent stream.

Got it. The missing piece I wanted to understand was the why, and it sounds like reducing the number of requests is the intent. Can you include that in the description so we know why we have a different component?

I'm also curious: in your migration, how much are we reducing requests? I imagine customers don't have that many advertiser_ids, so I'm surprised that putting these into a single slice saves that many requests when it's only used by the advertisers stream.
To be honest, I'm questioning whether this is even worth keeping. If we are only saving a few requests, I think it would be better to just have one cursor, even if that is a minor regression for one stream. It just depends on how much we save.

) -> Iterable[Mapping[str, Any]]:
stream_states = None
if stream_state:
stream_states = [
Contributor:

It's not always clear from the shape of the state object, but I'm pretty certain that we only have one cursor per partition, and that fits with your below logic that we get only the first stream_states[0].

I think we can simplify this logic a bit and it also avoids us having to iterate over the entire state object every time:

stream_state = next((p["cursor"] for p in stream_state["states"] if p["partition"][self._partition_field] == stream_slice[self._partition_field]), {})

kwargs = {"stream_state": stream_state, "stream_slice": stream_slice, "next_page_token": next_page_token}
...

This should hopefully get the first (and only match of partition) and then defaults to {} and the rest can operate on the same flow. Let me know if that makes sense and works with your original intent

Collaborator (Author):
It works as expected. Thanks for the suggestion. Updated the code.
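The agreed lookup can be checked in isolation. A self-contained sketch, with illustrative state contents and partition field:

```python
# Hedged sketch of the reviewer's suggested lookup: pick the single cursor
# whose partition matches the current slice, defaulting to {} when there is
# no match. State contents and field names are illustrative.
stream_state = {
    "states": [
        {"partition": {"advertiser_id": "111"}, "cursor": {"modify_time": "2024-05-01"}},
        {"partition": {"advertiser_id": "222"}, "cursor": {"modify_time": "2024-04-01"}},
    ]
}
partition_field = "advertiser_id"
stream_slice = {"advertiser_id": "222"}

# next() stops at the first match, so we avoid scanning the whole state list.
cursor = next(
    (
        p["cursor"]
        for p in stream_state["states"]
        if p["partition"][partition_field] == stream_slice[partition_field]
    ),
    {},
)

# Unknown partitions fall back to the empty-state default:
missing = next(
    (p["cursor"] for p in stream_state["states"] if p["partition"][partition_field] == "999"),
    {},
)
```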

app_id: "{{ config.get('credentials', config.get('environment', {})).get('app_id', 0) }}"
request_headers: {}
authenticator:
$ref: "#/definitions/authenticator"
Contributor:
I notice that we have a lot of places in the stream definitions themselves that assign authenticator to $ref: "#/definitions/authenticator". Since we only have one type of authenticator, why do we also need to assign it here? It looks like in some of the reusable components, we're already assigning the authenticator

Collaborator (Author):
Thanks, I refactored this part of the code: moved #/definitions/authenticator to the requester and removed the re-assignments.

self, start: datetime.datetime, end: datetime.datetime, step: Union[datetime.timedelta, Duration]
) -> List[StreamSlice]:
start = start.replace(hour=0, minute=0, second=0)
return super()._partition_daterange(start, end, step)
Contributor:
This might result in some duplicative code, but I think we should be cautious not to override the private method _partition_daterange(). Because it's intended to be private, we're more likely to make underlying changes to this w/o considering how dependent connectors use this method.

Instead, can we override the public stream_slices() implementation which is the main method that in turn invokes the logic for _partition_daterange().

So it would look roughly like:

def stream_slices():
  # get best end time
  # reset start to the 0 hour of the provided start
  # copy the logic from the `_partition_daterange`

I don't love the duplicated logic, so it's not a perfect solution, but it does make it more resilient to any upstream changes that could break the connector on a republish.

Collaborator (Author):
Moved the h/m/s replacement to stream_slices.
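The agreed approach (override the public stream_slices() and zero out hour/minute/second there, instead of overriding the private _partition_daterange()) might look roughly like this. The daily step and slice field names are illustrative:

```python
import datetime
from typing import Any, Iterator, Mapping

# Hedged sketch: a standalone stream_slices() that resets the start to the
# 0 hour and partitions the date range itself, rather than overriding the
# CDK's private _partition_daterange(). Field names are illustrative.
def stream_slices(
    start: datetime.datetime,
    end: datetime.datetime,
    step: datetime.timedelta = datetime.timedelta(days=1),
) -> Iterator[Mapping[str, Any]]:
    cursor = start.replace(hour=0, minute=0, second=0, microsecond=0)
    while cursor < end:
        upper = min(cursor + step, end)
        yield {"start_time": cursor.isoformat(), "end_time": upper.isoformat()}
        cursor = upper

slices = list(
    stream_slices(
        datetime.datetime(2024, 5, 1, 13, 30),
        datetime.datetime(2024, 5, 3),
    )
)
```

Duplicating the partitioning loop costs a few lines but removes the dependency on a private CDK method that could change without notice.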

@brianjlai (Contributor) left a comment:
A note on naming, and just one last discussion point on the need for the MultipleAdvertiserIdsPartitionRouter: given it's only used on one stream, depending on how drastically it reduces requests, I think we might want to get rid of it even if that deviates from the original behavior.

What I want to figure out is how much the separate partition router benefits us. Does combining the advertiser ids into a single slice result in them all getting bundled up, so we only have to go through a single full iteration? Versus, if we separate them into individual slices, we'd have to perform one full iteration per advertiser_id slice; for example, with 5 advertiser_ids we'd end up making 5x the requests. If that's the case we can leave it as is.

After we clear that up, this is good to go. Nice work!


from airbyte_cdk.sources.declarative.types import StreamSlice


class MultipleAdvertiserIdsPartitionRouter(SubstreamPartitionRouter):
Contributor:
Let's rename this to MultipleAdvertiserIdsPerPartition and the one below to SingleAdvertiserIdPerPartition. I think we want to make it really clear that what matters is how many advertiser_ids are in one partition. The name SingleAdvertiserIdPartitionRouter makes it sound like we only ever get one advertiser id.

Labels: area/connectors, area/documentation, connectors/source/tiktok-marketing