Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize extract from marketo #88

Open
wants to merge 16 commits into
base: master
Choose a base branch
from

Conversation

haleemur
Copy link

@haleemur haleemur commented Nov 8, 2023

Description of change

  • adds more context to some log messages. this additional context has proved useful in identifying performance issues
  • reduces number of function calls to str.replace & an additional layer of redirection via a generator expression when reading the saved csv file.
  • Go back to using csv.DictReader for better ergonomics.
  • extracts some immutable computation (available_fields & individual fields' output schema["type"]) from within the inner loop to outside of the loop. This reduces the complexity required to process individual rows of the csv. This is done by two new methods: get_stream_fields & get_output_typemap.
  • optimize format_value function (see below)
  • prevent a final loop of data processing due to export_start < job_started because the left side's microsecond- truncated while the right side is not.
  • make attribution_window user-configurable (currently this is hard-coded ATTRIBUTION_WINDOW_DAYS = 1)
  • avoid unnecessary setting of max_bookmark within the loop iterating over the csv file when the marketo client supports corona (max_bookmark is set outside of the loop)

optimize format_value

format_value is used in the inner most loop of the csv file processing. as such it is performance critical.

the existing method performs the following operations that harm performance.

  1. uses isinstance to check for types. this is unnecessary as the input value either None or of type string.
  2. uses str.find in parsing integers
  3. allocates field_type on each function call and determines the value via an if-else block
  4. performs a membership check on a list to determine the output type (rather than an equality check)
  5. emits a log message whenever a marketo-percent field is encountered. this harms performance for any workloads where multiple percent-fields are extracted.
  6. uses multiple nested conditions to determine the right formatting logic to apply
  7. allocates a list & performs a membership check against this list list to determine if the resulting value should be None.

The optimized version

  1. Uses value is None or value == '' or value == 'null' to return early if value should be None.
  2. Eliminates isinstance checks used in conversion to integer & boolean and replaces the logic with appropriate functions that can work with both string & non string types.
  3. In 1 specific case for integer, it avoids calling str.find and additional conditional logic with int(float(value))
  4. The new version also avoids emitting a log event as it seems that the handling of marketo-percentage custom type is correct.
  5. the signature of the new function accepts _type instead of schema and avoids looking up schema["type"]. _type is either a string or a tuple and this allows the method to replace conditions such as "boolean" in field_type with _type == "boolean"

Manual QA steps

Note that we're running this tap in production, and have verified that both the csv processing time is reduced and the unnecessary extract due to microsecond-truncation is avoided

Below i'm copying over some of the internal validation that we did.

I verified performance of the format_value refactor wrt parsing integers

This may be verified by anyone by running the following in a jupyter notebook cell.

def to_int1(x):
    n = x.find('.')
    if n > 0:
        return int(x[:n])
    return int(x)

def to_int2(x):
    return int(float(x))

print("testing to_int1('65.00')")
%timeit to_int1('65.00')
print("testing to_int2('65.00')")
%timeit to_int2('65.00')

in my case, I observed the following

testing to_int1('65.00')
153 ns ± 1.5 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
testing to_int2('65.00')
78.7 ns ± 0.463 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

verified the performance improvement with different output types.

another snippet that can be run in a jupter notebook:

import pendulum
import singer

def format_value1(value, schema):  # 7 ifs + 3 nested-ifs
    if not isinstance(schema["type"], list):
        field_type = [schema["type"]]
    else:
        field_type = schema["type"]

    if value in [None, "", 'null']:
        return None
    elif schema.get("format") == "date-time":
        return pendulum.parse(value).isoformat()
    elif "integer" in field_type:
        if isinstance(value, int):
            return value

        # Custom Marketo percent type fields can have decimals, so we drop them
        decimal_index = value.find('.')
        if decimal_index > 0:
            value = value[:decimal_index]
        return int(value)
    elif "string" in field_type:
        return str(value)
    elif "number" in field_type:
        return float(value)
    elif "boolean" in field_type:
        if isinstance(value, bool):
            return value
        return value.lower() == "true"

    return value

def format_value3(value, _type): # 5 total conditions, 0 allocations
    """
    Format the values read from the csv line based on the stream's schema.

    NOTE: This function gets called on each field of each row, and is thus its performance
    is critical to processing the downloaded csv file.

    returns any of:
        None
        datetime string
        integer
        float
        boolean
        string 
        compoud data structure, dict / list: default last option
    """
    if value is None or value == '' or value == 'null':
        return None
    if "string" == _type:
        return str(value)
    if _type == "integer":
        return int(float(value))
    if _type == "number":
        return float(value)
    if _type == "date-time":
        return pendulum.parse(value).isoformat()
    if _type == "boolean":
        return str(value).lower() == "true"  
    return value
test_cases = {
    'any_return_null_0': (None, {"type": ["null", "string"]}),
    'any_return_null_1': ('', {"type": ["null", "string"]}),
    'any_return_null_2': ('null', {"type": ["null", "string"]}),
    'boolean_return_original_0': (True, {"type": ["null", "boolean"]}),
    'boolean_return_parsed_0': ("True", {"type": ["null", "boolean"]}),
    'string_return_original_0': ("123aljdfakjflkajljlkj23rjlkjajflkj43k1j5l1j5j134lja907b0981o5hb580rjbrb", {"type": ["null", "string"]}),
    'integer_return_original_0': (123, {"type": ["null", "integer"]}),
    'integer_return_parsed_0': ("123.123", {"type": ["null", "integer"]}),
    'integer_return_parsed_1': ("123", {"type": ["null", "integer"]}),
    'number_return_original_0': (123.23, {"type": ["null", "number"]}),
    'number_return_parsed_1': ("12.3", {"type": ["null", "number"]}),
    'number_return_parsed_0': ("123", {"type": ["null", "number"]}),
    'datetime_return_parsred_0': ('2023-10-31T22:29:13.243Z', {"type": ["null", "string"], "format": "date-time"}),
    'object_return_original_0': ({"hello": "world"}, {"type": ["null", "object"]}),
}

for name, case in test_cases.items():
    print(f'format_value1,{name},', end='')
    value, value_type = case
    %timeit format_value1(value, value_type)

    value_type3 = [schema.get("format", t) for t in value_type["type"] if t != "null"][0]
    print(f'format_value3,{name},', end='')
    %timeit format_value3(value, value_type3)

This resulted in the following output for me:

format_value1,any_return_null_0,52.2 ns ± 0.218 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value3,any_return_null_0,22.7 ns ± 0.101 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value1,any_return_null_1,60.5 ns ± 0.297 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value3,any_return_null_1,28 ns ± 3.22 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value1,any_return_null_2,68.5 ns ± 0.116 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value3,any_return_null_2,30.4 ns ± 0.189 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value1,boolean_return_original_0,180 ns ± 1.28 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value3,boolean_return_original_0,63.8 ns ± 0.582 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value1,boolean_return_parsed_0,205 ns ± 1.05 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
format_value3,boolean_return_parsed_0,54.5 ns ± 0.349 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value1,string_return_original_0,134 ns ± 1.43 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value3,string_return_original_0,52.7 ns ± 0.325 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value1,integer_return_original_0,126 ns ± 0.574 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value3,integer_return_original_0,64.6 ns ± 0.38 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value1,integer_return_parsed_0,279 ns ± 0.959 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
format_value3,integer_return_parsed_0,53.4 ns ± 0.2 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value1,integer_return_parsed_1,240 ns ± 1.7 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
format_value3,integer_return_parsed_1,53.7 ns ± 0.135 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value1,number_return_original_0,167 ns ± 0.364 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value3,number_return_original_0,67.8 ns ± 0.158 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value1,number_return_parsed_1,180 ns ± 0.299 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value3,number_return_parsed_1,55.8 ns ± 0.298 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value1,number_return_parsed_0,179 ns ± 0.336 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value3,number_return_parsed_0,54 ns ± 0.161 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value1,datetime_return_parsred_0,4.79 µs ± 28 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
format_value3,datetime_return_parsred_0,53.8 ns ± 0.115 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value1,object_return_original_0,178 ns ± 0.322 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value3,object_return_original_0,65.7 ns ± 0.313 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

Visualizing Performance Gains

Cleaning the output and plotting the speedup yields the following chart (lower is better)

test-case-performance

An interesting observation is that the date-time formatting takes about 5 µs, whereas the rest of the timings of the optimized version are in the range of 100 ns

This suggests that the performance of format_values is dominated by the parsing of datetime fields, and that we should look to optimize pendulum.parse(value).isoformat(). the version of pendulum used is from 2017, and the current beta version of pendulum rewrote several extensions in rust. We should update this dependency to benefit from any performance improvement.

I also tested for the equivalency between the two methods using the above test cases

for name, case in test_cases.items():
    assert (format_value1(*case) is None and format_value2(*case) is None) or (format_value1(*case) == format_value2(*case)), f'equivalence assertion failed for case: {case}'

no assertion error was raised.

Risks

  • i don't think there are any risks. that at this point, these changes have been tested in production in our org

Rollback steps

  • revert this branch

Related Issues:

@haleemur haleemur marked this pull request as draft November 12, 2023 15:05
@haleemur haleemur force-pushed the optimize-extract-from-marketo branch from e2f16f0 to f581d75 Compare November 14, 2023 02:51
@haleemur haleemur changed the title draft: Optimize extract from marketo Optimize extract from marketo Nov 14, 2023
@haleemur haleemur marked this pull request as ready for review November 14, 2023 03:11
@haleemur haleemur mentioned this pull request Nov 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

tap-marketo seems to suffer from several performance issues
1 participant