Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Protobuf Performance Refactor #230

Closed
wants to merge 7 commits into from
Closed

Conversation

craiglabenz
Copy link

@craiglabenz craiglabenz commented May 4, 2021

Aims to speed up naive usage on proto-plus-python as a wrapper around primitive protobuf objects.

This PR replaces attribute misses on proto-plus objects, which wound up in the __getattr__ override to query the hidden protobuf object directly, with descriptors which do the same inside a caching layer to optimize repeated access. This descriptor/caching layer also aims to avoid unnecessary marshaling altogether when circumstances allow, further reducing the runtime of basic tasks.


The following is a performance chart of several simple tasks executed against the same proto definitions, first as raw protobuf objects and again as a proto-plus object. The two definitions are:

Note: All tasks were run 10,000 times.

Generate a(n)... pb2 proto-plus (before refactor) proto-plus (with refactor) refactor impact
empty instance 3ms 10ms 7ms ⬇️ 30% 🟢
populated instance 9ms 168ms 118ms ⬇️ 30% 🟢
populated nested objs 38ms 211ms 163ms ⬇️ 23% 🟢
populated nested objs (1x attribute reads) 39ms 326ms 419ms ⬆️ 29% 🔴
populated nested objs (10x attribute reads) 101ms 2009ms 706ms ⬇️ 65% 🟢
populated w serialization 20ms 195ms 280ms ⬆️ 43% 🔴
populated w lots of fields 25ms 624ms 440ms ⬇️ 30% 🟢
MapFields populated during init 124ms 799ms 626ms ⬇️ 21% 🟢
MapFields populated after init -- 1079ms 369ms ⬇️ 66% 🟢
deserializing objects 116ms 204ms 180ms ⬇️ 12% 🟢

When used as a dependency in python-datastore, the following proto-plus versions have these performance hydrating loaded 500 entities (network I/O is excluded):

datastore version notes before #230 with #230
1.15.3 before proto-plus 360ms N/A
2.0.0 with proto-plus 675ms 800ms
2.1.2 proto-plus usage optimized
via googleapis/python-datastore#155)
550ms 750ms

Resolves #222

@craiglabenz craiglabenz requested a review from a team as a code owner May 4, 2021 22:57
@google-cla google-cla bot added the cla: yes This human has signed the Contributor License Agreement. label May 4, 2021
@codecov
Copy link

codecov bot commented May 4, 2021

Codecov Report

Merging #230 (d723f18) into master (6a43682) will decrease coverage by 5.09%.
The diff coverage is 84.22%.

Impacted file tree graph

@@             Coverage Diff             @@
##            master     #230      +/-   ##
===========================================
- Coverage   100.00%   94.90%   -5.10%     
===========================================
  Files           20       20              
  Lines          862     1158     +296     
  Branches       149      238      +89     
===========================================
+ Hits           862     1099     +237     
- Misses           0       32      +32     
- Partials         0       27      +27     
Impacted Files Coverage Δ
proto/modules.py 100.00% <ø> (ø)
proto/marshal/marshal.py 73.93% <41.42%> (-26.07%) ⬇️
proto/message.py 98.00% <92.75%> (-2.00%) ⬇️
proto/marshal/collections/maps.py 96.55% <94.28%> (-3.45%) ⬇️
proto/fields.py 97.39% <96.12%> (-2.61%) ⬇️
proto/enums.py 100.00% <100.00%> (ø)
proto/marshal/collections/repeated.py 100.00% <100.00%> (ø)
proto/marshal/rules/struct.py 93.33% <0.00%> (-6.67%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e25dd7d...d723f18. Read the comment docs.

@craiglabenz
Copy link
Author

Note for code reviewers: There are a few spots in this PR that are pseudo-experiments, and they're decorated with GitHub comments, but please feel free to critique it all!

Copy link
Author

@craiglabenz craiglabenz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some (hopefully) helpful mile-markers for code review.

value = self._set_coercion(value)
value = self._hydrate_dicts(value)

# Warning: `always_commit` is hacky!
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the comment says, this is hacky. Sadly, it is a lynch-pin for the whole refactor.

def _throw_value_error(proto_type, primitive):
raise ValueError(f"Unacceptable value {type(primitive)} for type {proto_type}")

def validate_primitives(self, field, primitive):
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is unfortunate, but is the only way to recreate immediate validation when attempting something like my_obj.some_string_field = 123.

@@ -269,6 +287,14 @@ def __new__(mcls, name, bases, attrs):
def __prepare__(mcls, name, bases, **kwargs):
return collections.OrderedDict()

def add_to_class(cls, name, value):
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: This whole contribute_to_class pattern is heavily (or, more like entirely) borrowed from Django.

Copy link
Contributor

@tseaver tseaver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ran out of time today to look at more than proto/fields.py. Hope the review is helpful anyway.

proto/fields.py Outdated Show resolved Hide resolved
proto/fields.py Outdated Show resolved Hide resolved
proto/fields.py Outdated Show resolved Hide resolved
proto/fields.py Show resolved Hide resolved
proto/fields.py Outdated Show resolved Hide resolved
proto/fields.py Outdated Show resolved Hide resolved
proto/fields.py Outdated Show resolved Hide resolved
proto/fields.py Outdated Show resolved Hide resolved
proto/fields.py Outdated Show resolved Hide resolved
proto/fields.py Outdated Show resolved Hide resolved
@busunkim96
Copy link
Contributor

@craiglabenz Thank you for putting this together!

@software-dov Could you take a look?

Copy link
Contributor

@software-dov software-dov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass, going to need a few others to fully understand what's going on.

proto/fields.py Outdated Show resolved Hide resolved
proto/fields.py Outdated Show resolved Hide resolved
proto/fields.py Outdated Show resolved Hide resolved
proto/fields.py Outdated
Comment on lines 386 to 388
# For the most part, only primitive values can be returned natively, meaning
# this is either a Message itself, in which case, since we're dealing with the
# underlying pb object, we need to sync all deferred fields.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is a little confusing. The 'either' has no second branch.

proto/marshal/collections/maps.py Outdated Show resolved Hide resolved
proto/marshal/marshal.py Outdated Show resolved Hide resolved
proto/marshal/marshal.py Show resolved Hide resolved
proto/marshal/marshal.py Outdated Show resolved Hide resolved
proto/message.py Outdated Show resolved Hide resolved
proto/message.py Outdated Show resolved Hide resolved
@software-dov
Copy link
Contributor

Can you provide a link to your benchmark code? I've had a hard time creating good benchmarks that mimic real world message manipulation patterns.

Copy link
Contributor

@software-dov software-dov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second review round, looking great!

Comment on lines +414 to +423
class _NoneType:
def __bool__(self):
return False

def __eq__(self, other):
"""All _NoneType instances are equal"""
return isinstance(other, _NoneType)


_none = _NoneType()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outrageous nitpicking: we're not checking for None so much as trying to use a unique canary distinct from regular None. This could be condensed into

_canary = object()
...
value = getattr(instance, self.instance_attr_name, _canary)
if self.field.can_get_natively and value is not _canary:

Comment on lines 35 to 36
max_int_32 = 1<<31 - 1
min_int_32 = max_int_32 * -1
min_int_32 = (max_int_32 * -1) + 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is correct. Isn't the correct min
(max_int_32 * -1) - 1?

proto/marshal/marshal.py Show resolved Hide resolved
proto/fields.py Outdated Show resolved Hide resolved
proto/fields.py Outdated Show resolved Hide resolved
proto/message.py Outdated
Comment on lines 646 to 649
if obj and hasattr(obj, '_update_pb'):
obj._update_pb()
if obj and hasattr(obj, '_update_nested_pb'):
obj._update_nested_pb()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a few readthroughs to understand why a field value may be a message and not have _update_pb or _update_nested_pb. I bet that if we changed the conditional logic a bit it might be more clear.

if obj and isinstance(obj, Message):
    obj._update_pb()
    obj._update_nested_pb()

I still need to mull over this logic a little bit. It seems like we might be able to skip fields under certain circumstances. Also, I'm not sure if the ordering matters. I'll need to do some experiments and come up with examples and or counterexamples.

proto/message.py Show resolved Hide resolved
proto/message.py Show resolved Hide resolved
proto/message.py Outdated Show resolved Hide resolved
"""Loops over any stale fields and syncs them to the underlying pb2 object.
"""
merge_params = {}
for field_name in getattr(self, '_stale_fields', set()):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here: we can just take the existence of _stale_fields as given.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LIkewise here, constructing a set is expensive compared to using an already created (or interned) value.

@craiglabenz
Copy link
Author

Closing this PR because it optimizes use cases that are different from what we see in client libraries like python-datastore and python-firestore.

Specifically, caching attributes is only helpful if they are read > 1 time, yet python-datastore converts all proto-plus instances into dictionary-like objects; meaning each field is read exactly once before all future reads are performed against that dictionary.

If you find this and would like to know more, reach out.

@craiglabenz craiglabenz deleted the 222-pb-performance branch May 25, 2021 16:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes This human has signed the Contributor License Agreement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

proto-plus introduces substantial overhead to interactions with pb.
4 participants