
tap-marketo seems to suffer from several performance issues #87

Open
haleemur opened this issue Nov 8, 2023 · 0 comments · May be fixed by #88

Comments

haleemur commented Nov 8, 2023

We use tap-marketo, and have noticed several performance issues with it.

The ETL Sub-Process Timeline

To visualize the issue, let me first break down the various stages of the ETL workflow with tap-marketo's leads stream. The extract-load process is essentially the following:

while export_end < job_start:
    request marketo to create the export file and wait for it to be ready (1: dark blue)
    download the generated csv file (2: orange)
    parse the csv file row by row and call singer.write_record (3: red)

finish the tap process, generate output file (5: light blue)
copy file to s3 (6: green)

[Figure: timeline of the ETL sub-processes, color-coded by stage as numbered above]

Description of Visualization

  1. The time required for step 1 is dictated by Marketo; we don't have any options to optimize it from the perspective of the tap.
  2. Step 2 is governed by network bandwidth.
  3. Notice how step 3 (parsing the csv) takes an inordinate amount of time; that is something we should be able to optimize.
  4. Notice how steps 1, 2 & 3 are repeated a second time. It seems to me that the while loop is not terminating when it should.

Optimization Opportunity

Investigating the codebase, I found that parsing the csv has the following optimization opportunities:

  1. Reading the csv involves an extra generator expression with a call to str.replace. This cost increases with the number of downloaded rows.
  2. For each row, we call format_values, which in turn builds available_fields and calls format_value on each field. It stands to reason that we should optimize format_value, as it is executed in the inner-most loop. This cost increases as the product of downloaded rows and fields per row. format_value uses several inefficient constructs, described below.
  3. Determining available_fields in format_values seems inefficient as well, since the fields are not going to change for a given stream. This cost increases with the number of downloaded rows.
  4. The extra loop seems to be due to export_end getting millisecond-truncated: export_start is set to export_end at the end of the while loop, and the comparison export_end < job_start returns true one extra time because job_start is not millisecond-truncated, and is thus almost always a few milliseconds greater than export_end.
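A minimal sketch of the off-by-one in point 4 (variable names mirror the description above; the exact truncation granularity in the tap is assumed):

```python
from datetime import datetime

# job_start keeps its sub-second precision, while export_end is truncated,
# so the loop condition stays true for one extra (empty) export cycle.
job_start = datetime(2023, 11, 8, 12, 0, 0, 123456)
export_end = job_start.replace(microsecond=0)  # simulated truncation

print(export_end < job_start)  # True: triggers one more export

# One possible fix: truncate both sides before comparing.
print(export_end < job_start.replace(microsecond=0))  # False
```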

Opportunity to add more user-configuration

ATTRIBUTION_WINDOW_DAYS = 1 is used to set the start of the extract back by 1 day. While this is a good default for people ingesting data on a daily basis, it is not suitable for extractions running multiple times a day, as we end up re-extracting the same records multiple times, leading to additional data processing within the database.

It would be useful to be able to override this hard-coded value through a parameter.
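A sketch of what such an override could look like (the config key name here is hypothetical, not an existing tap-marketo setting):

```python
from datetime import timedelta

DEFAULT_ATTRIBUTION_WINDOW_DAYS = 1

def get_attribution_window(config):
    # Hypothetical "attribution_window_days" config key; falls back to
    # the current hard-coded default of 1 day when absent.
    days = config.get("attribution_window_days", DEFAULT_ATTRIBUTION_WINDOW_DAYS)
    return timedelta(days=float(days))

# e.g. a pipeline running every few hours could shrink the window:
print(get_attribution_window({"attribution_window_days": 0.125}))  # 3:00:00
```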

Internally at my workplace, we have fixed the above-mentioned issues and have seen our run time drop by over 50%. Should the maintainers of tap-marketo agree, I'm happy to port the improvements over.

Opportunity to optimize format_value

isinstance is not required in this method.

csv.reader (or DictReader) will only return strings in our usage; it could only return floats if QUOTE_NONNUMERIC were specified. Quoting the csv module's documentation:

Each row read from the csv file is returned as a list of strings. No automatic data type conversion is performed unless the QUOTE_NONNUMERIC format option is specified (in which case unquoted fields are transformed into floats).

The warning message feels unnecessary.

Evidently the conversion to integer is correct, and it is Marketo's standard practice to format percent fields as a decimal. The warning message emitted for each percent field in each row imposes an undue, linearly increasing cost.

Conversion to int is about twice as fast with int(float(value)).

This sequence of operations would still work if value were of a non-string type (recall that csv.reader returns strings by default but could also return floats if QUOTE_NONNUMERIC is set), which the current method is unable to handle because the find method only exists for strings.

str.find is slow (or at least slow for our critical loop), and the added if-else adds cost that we should try to avoid.
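The suggested one-path conversion, sketched out (the helper name is mine, for illustration):

```python
def to_int(value):
    # int(float(...)) handles "42", "42.0", and already-float inputs
    # (e.g. when QUOTE_NONNUMERIC is set) in one path, with no
    # str.find check and no if-else branching.
    return int(float(value))

print(to_int("95"))    # 95
print(to_int("95.0"))  # 95
print(to_int(95.0))    # 95
```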

The null-value early return could leverage a set for better lookup performance.

We should just define a global constant (like below) and use that for the lookup.

Even compared with a list of size 3, membership lookup on a set is more efficient. Having it be global means that the reference data is allocated only once per process.

NULL_VALUES = {None, '', 'null'}
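For illustration, the shape the early return could take (this is a hypothetical simplification of format_value, not the tap's actual signature):

```python
# Allocated once per process; set membership is O(1) on average.
NULL_VALUES = {None, '', 'null'}

def format_value(value):
    if value in NULL_VALUES:
        return None
    # ...type-specific formatting would follow here...
    return value

print(format_value('null'))  # None
print(format_value('42'))    # '42'
```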
haleemur linked a pull request Nov 8, 2023 that will close this issue