-
Notifications
You must be signed in to change notification settings - Fork 158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Full pickle support #242
base: master
Are you sure you want to change the base?
Full pickle support #242
Conversation
These tests should catch copy-paste bugs, because there is inherently some code duplication involved in dynamically defining from multiple template functions. I tested the tests using 10 examples of mix-ups and all were caught as either `FAIL`, `ERROR` or failure in `import pydot`. (This commit is part of a series of changes in how the `get_*`, `set_*`, `create_*` and `write_*` methods are dynamically created.)
This small refactoring prepares for a later commit that will switch to binding the methods to the class instead of to the instance. The instance attribute `prog` will not be available then anymore, because no instance exists yet at the time the class is created. Therefore, the reference is now replaced by the string `'dot'`. Note that this does not actually change the default value at all. It always got fixed to the value of `prog` at the time of function definition, which was always `'dot'`, as can be seen from line 1713. The default value already never changed along with the instance attribute value anymore after function definition. Side note: - One could argue that `None` would make a better default value here. However, changing that now would be changing the external behavior of the methods themselves, while this series aims to change the way in which they are created with minimal external effects. I plan to propose a switch to `None` together with other changes to `create_*`/`write_*` in another series, though. (This commit is part of a series of changes in how the `get_*`, `set_*`, `create_*` and `write_*` methods are dynamically created.)
Not seeing any customizations that necessitate the use of the magic method `__setattr__()` here, so using `setattr()` should be equivalent. [[1]] [[2]] Timeit is reporting 20 to 80us reductions in object instantation times, all 20% drops relative to pydot 1.4.1. No other changes in external behavior are expected. [1]: https://stackoverflow.com/questions/14756289/setattrobject-name-value-vs-object-setattr-name-value [2]: https://stackoverflow.com/questions/7559170/whats-the-difference-between-setattr-and-object-setattr (This commit is part of a series of changes in how the `get_*`, `set_*`, `create_*` and `write_*` methods are dynamically created.)
Replace lambdas with function definitions. In a later commit, we will switch to binding the function as a method to the class instead of to the instance. In that process, we will be changing some of its attributes, so then the function object needs to be assigned to a name. A regular function `def` takes care of that assignment already. Some other reasons: - Using lambdas as class/instance methods is frowned upon. Error code E731 in flake8, for example. [[1]] [[2]] - Easier to add docstrings to a function later. [[3]] - Lambdas cannot contain statements. Here this stands in the way of replacing `__setitem__()` with a regular assignment later. Lambdas return functions as well, so no changes in behavior are expected [[4]] [[5]]. Timeit is not showing any significant changes in performance. The use of the same name (`func`) for these two short-lived functions triggers pylint error `E0102`/`function-redefined`. Fixing this by giving the functions two different names might actually increase the risk of copy-paste bugs, while the redefining of the functions actually still continues because we are in a loop. It is just that pylint will not see it anymore then. I plan to discuss how this pylint error should be addressed in the pull request containing this commit. For now, a pragma is added to suppress it. [1]: https://realpython.com/python-lambda/#python-classes [2]: https://pycodestyle.pycqa.org/en/latest/intro.html#error-codes [3]: pydot@f374f66 [4]: https://stackoverflow.com/questions/12264834/what-is-the-difference-for-python-between-lambda-and-regular-function [5]: https://stackoverflow.com/questions/44840101/in-python-lambda-expression-is-fast-or-the-normal-function (This commit is part of a series of changes in how the `get_*`, `set_*`, `create_*` and `write_*` methods are dynamically created.)
We now bind the dynamically created methods to their class instead of generating them for each instance separately. Now, they should get created only once, at module load, and from then on get the same treatment as regularly defined methods. This should lead to better support for serialization by pickle, by ensuring that the methods are an integral part of the class, and thereby always available to instances, also after unpickling. Furthermore, this brings significant speedups and reduced memory usage for pydot objects. Notes on pickle: - This commit prevents errors after unpickling like the following: AttributeError: 'Dot' object has no attribute 'write_png' AttributeError: 'Dot' object has no attribute 'get_bgcolor' - The alternative of recreating the methods during unpickling by letting `__setstate__` or `__reduce__` call `__init__()` again was not chosen because it seems fundamentally wrong [[1]] [[2]] [[3]]. - The other alternative of serializing the instance methods along with the rest of the instance was not chosen, because: - It also feels fundamentally wrong. Why serialize methods that should never differ between instances to begin with? - Pickle refuses to work on the instance-bound methods, even after lambdas are changed to functions. This could be seen by temporarily removing `__getstate__`: AttributeError: Can't pickle local object 'Dot.__init__.<locals>.new_method' AttributeError: Can't pickle local object 'Common.create_attribute_methods.<locals>.func' - Even if an alternative to `pickle` like `dill` is used, the serializing of the methods means slower pickling and increased size of the representations. - Moving these methods from the instance to the class also eliminates the need for the current custom `__getstate__` and `__setstate__` methods. They will be removed in a later commit. [1]: https://docs.python.org/3.9/library/pickle.html#pickling-class-instances [2]: https://stackoverflow.com/questions/50308214/python-3-alternatives-for-getinitargs#comment87633434_50308545 [3]: https://nedbatchelder.com/blog/202006/pickles_nine_flaws.html#h_init_isnt_called Notes on performance: - Module load time increased between 1.5 to 2.0ms, which is only 2% on the load time reported by bash's `time`, but 60% on the `self` for `pydot` reported by `python -X importtime`. - Time saved on each object instantiation, according to `timeit`: 0.1 to 0.3ms, depending on the class, but for all it translates into a further 74 to 78% drop relative to pydot 1.4.1. - The more objects in a graph, the sooner there will be a net time saving. I estimate the break-even point to be around 5 to 15 objects. The speedups are most noticeable when building graphs from the ground up programmatically, like `test_names_of_a_thousand_nodes` from the test suite, which cuts its run time by almost 300ms or 92%. - Memory savings reported by `pympler.asizeof()` show the sizes of Dot, Edge and Node instances are cut by 20, 9 and 7 KiB respectively, which translates to 87 to 90% reductions. - Measured using: Celeron N3350, Linux 4.19 amd64, Debian 10.6, CPython 3.7.3, timeit, -X importtime, dill 0.3.2, Pympler 0.9. Further notes: - As part of the dynamic creation of each method, its function is now also renamed (`__name__` and `__qualname__`) to the name it would have had if it had been defined in the class itself. This prevents that a temporary, generic name shows up in tracebacks. - The binding to the class is performed from class decorators. Some alternatives that were considered: - Metaclasses: Got it to work, but more complex than class decorators and may cause metaclass conflicts for the subclassing user because of their inheritance rules. Many authors suggest to use decorators instead of metaclasses where possible. [[4]] [[5]] - `__init_subclass__`: Got it to work from class `Common`, but the problem is that we are not customizing all its subclasses in the same way: Subclass `Dot` needs output format methods, some other subclasses need DOT attribute getters/setters, and some others do not need anything at all. This can be solved by adding arguments to the `__init_subclass__` signature and/or inspection of the subclass, plus the necessary logic to switch between the different cases then. Or, to prevent that, by overhauling the class hierarchy and adding some new classes between `Common` and the subclasses. In both cases, class decorators seem a lot simpler in the end. [4]: https://www.python.org/dev/peps/pep-3129/#rationale [5]: https://realpython.com/python-metaclasses/#is-this-really-necessary - `create_attributes_methods()`, which creates the `get_*`/`set_*` methods, is transformed to a decorator factory: It now takes only the set of DOT attributes and returns the true decorator function that will add exactly that set to the class. It can be used as a parameterized class decorator [[6]] [[7]] or from a metaclass. - Since it does not need to be inherited by subclasses anymore, it is moved from class `Common` to module-level. - An alternative to using a parameterized decorator was to let the decorator determine which set of DOT attributes to apply based on inspection of the class. But switching based on the class name string seems fragile. Switching by type is difficult for custom subclasses because `issubclass()` does not work when classes are still being created. Reading the set from a class attribute can work, as `create_format_methods()` shows, though I wonder if it will work from a metaclass `__new__()` for example, which runs much earlier in the class creation process. Also, it raises questions about the meaning of such attribute in subclasses that already inherit all the methods created for their base class. For example, `Cluster` needs to override the set with its own while `Subgraph` keeps the set of its parent. Also, the sets of DOT attributes are currently kept as global constants and would then either have to be moved to the classes (API change) or kept in two places. So, in this case, the parameterized solution seemed to require the least modifications and provide an easy way to pass custom DOT attributes for further subclassing. [6]: https://stackoverflow.com/questions/681953/how-to-decorate-a-class/44556596#44556596 [7]: https://stackoverflow.com/questions/5929107/decorators-with-parameters (This commit is part of a series of changes in how the `get_*`, `set_*`, `create_*` and `write_*` methods are dynamically created.)
This commit wraps up a series of changes aiming to provide full support for pickle serialization of pydot objects. The last few commits ensured that dynamically defined methods are bound to their classes instead of their instances. With that in place, there is no more need to limit serialization of objects dictionaries to only their `obj_dict` attribute. Therefore, this commit removes the custom `__getstate__` and `__setstate__` methods. This makes pickle return to its normal behavior of serializing the entire dictionary, which means that an unpickled object will now have all the same attributes as its original. This resolves errors like this from unpickled objects: AttributeError: 'Dot' object has no attribute 'prog' AttributeError: 'Dot' object has no attribute 'shape_files' Together with the last few commits, issues like pydot#127, pydot#216 and pydot#217 should now be fixed for pickle and similar tools like dill. Background: Git history shows that in the Initial import of 2007 (pydot 0.9.10, commit c0233c6), `__getstate__` and `__setstate__` selected the entire instance's dictionary and from that deleted the convenience methods. In 2010 (between pydot 1.0.2 and 1.0.4, commit b125c5d) this was changed to selecting only the `obj_dict` attribute of the instance's dictionary, which did not include the convenience methods to start with. That change was obviously related to issue pydot#16 that was closed a few days before it. It seems that the idea was to move towards keeping all instance-related data under a single `obj_dict` attribute and not have any other important data in attributes directly on the instance. However, currently, a newly created `Dot` instance contains, besides the `obj_dict`, the attributes `prog` and `shape_files`. Both are set during instantiation (`Dot.__init__()`), and as such will normally not be recreated on unpickling [[1]]. Still, they seem important enough that they should actually get pickled, which is what this commit accomplishes. The `prog` attribute is just a short string, and `shape_files` just a list of file paths, so it seems unlikely there could ever have been any real concerns about pickling them. Instances of `Graph`, `Subgraph`, `Cluster`, `Node` and `Edge` do not have dictionary entries other than `obj_dict`, at least not on instantiation. [1]: https://docs.python.org/3.9/library/pickle.html#pickling-class-instances Some examples sizes for a `Dot` object with 1 named node serialized by dill, an alternative to pickle: - Before the previous commit, when instances still had all those convenience methods bound to them, but `__getstate__` would only pickle `obj_dict`: 351 bytes. - Same, but then what happens when `__getstate__` would not exist or a `__reduce__` method is added that does not call `__getstate__` and the entire object dictionary gets pickled: 22986 bytes. That shows the weight of all those methods that get packed then. - With this commit, now that the previous commit removed all those methods and we can safely start pickling the entire dictionary again: 413 bytes. The extra 62 bytes compared to the first case reflect the object attributes `prog` and `shapefiles` that get pickled now as well. (Measured using `len(dill.dumps(g))`. Dill was used because pickle was not able to pickle pydot objects before. It can now. For the final result, I checked that pickle gave the same result as dill.) Further notes: - The ChangeLog will mention that pickled objects created with the previous code are incompatible with the new code and vice versa. This is not unheard of with pickle in general [[2]]. Class versioning and conversion [[3]] could solve this, but my guess is that most use cases have no need for this. Otherwise please report or provide a patch. [2]: https://nedbatchelder.com/blog/202006/pickles_nine_flaws.html#h_old_pickles_look_like_old_code [3]: https://docs.python.org/3.9/library/pickle.html#what-can-be-pickled-and-unpickled - Because the Python documentation lists `__getstate__` and `__setstate__` only in relation with pickle [[4]], I expect that this change will have not have any side effects on other uses of pydot. [4]: https://docs.python.org/3.9/genindex-_.html - This commit will cause Pylint 2.6.0 to emit 9 errors like: pydot.py:520:19: E1101: Instance of 'Common' has no 'obj_dict' member (no-member) Actually, this reflects an existing situation not caused by this commit. Pylint just did not report it as strongly before, because it saw `obj_dict` get assigned to from `__setstate__`. Before this commit, there was only this 1 warning: pydot.py:431:8: W0201: Attribute 'obj_dict' defined outside __init__ (attribute-defined-outside-init) I considered adding an `__init__()` method to class `Common` just to set `obj_dict` to an empty dictionary, but decided it would not be good to do that just for Pylint. None of the subclasses currently call `super().__init__()` anyway and the `Common` docstring clearly states that it should not be used directly. There seems to be room for unifying the `__init__()` methods of the various classes and some deduplication to `Common.__init__()`. That would bring a meaningful resolution to these Pylint errors. However, such changes deserve to be discussed separately. - Pickle-related tests were updated and extended. Notes: - The main pickle tests are now performed on a `Dot` instance instead of on a `Graph`, because `Dot` is the only class that has both the dynamically generated `create_*`/`write_*` and `get_*`/`set_*` methods. - Regarding the removal of the use of a tuple in the initialization of one of the edges: Support for that was removed in 2016 (pydot 1.2.4, commit de30c14) and it would lead to a `TypeError` now that we start requesting a string representation. If tuple support is ever restored again, it should have its own test anyway.
9cb6e0d
to
5bd4a38
Compare
Rebased on top of pydot 2.0.0.dev0 commit e48d90c, which is basically pydot 1.4.2 except for the version number change. |
Hi! Please work on merging this to master. Parallelisation is so important with Pydot! It cuts the time significantly. |
@sebastiandohnany Did you test this branch? Any remarks? |
I haven't done any proper testing, that is why I just wrote a message of encouragement – I can try to look at it more extensively later. But for all I know, everything I'm using pydot for (traversing graph, changing properties, etc.) works + usage with multiprocessing is smooth too. As I'm working on animations (uwaicl/epsilon-animate), this cuts my running times greatly, so thank you. |
With "this" you mean working with this branch, right? That is what I meant to ask. Did not run into any problems? |
@sebastiandohnany Reading you last reply again, it seems that you really tried out this branch, so thanks for letting me know your results! About the schedule for merging to master: Because we are now at the start of the pydot 2.0 development cycle, I first wanted to merge the stuff that messes up |
This pull request has conflicts, please resolve those so that the changes can be evaluated. |
This branch aims to provide full support for pickle serialization of pydot objects. It should fix issues #127, #216 and #217 and is aimed for pydot 2.0.0.
An overview of the commits (original commit hashes before rebasing between parentheses):
get_*
/set_*
/create_*
/write_*
) to the class instead of the instance. This prevents that pickle tries and fails to pickle them. It also brings performance improvements to pydot as a whole.obj_dict
attribute of an object dictionary. This will ensure that the entire object dictionary, includingprog
andshape_files
, gets pickled.See the commit messages for details and side notes.
Reviews are welcome, also of the commit messages. Neither the code, nor the commit messages are set in stone yet.
About the pylint errors mentioned in 4b75945 (57a8adc) and 5bd4a38 (9cb6e0d), I welcome opinions not only on if they deserve fixing or not, but also on the question if, when we explicitely decide to ignore a certain pylint error, people appreciate the use of pragmas (
# pylint: disable=
comments), or think that these just mess up the code. Of course, if an error can be fixed that is preferred, but the list is quite long at the moment and I don't know how many false positives will be left in the end. I know there are alternative linters, but then they question remains (e.g. do we want to see# noqa
pragmas for flake8, etc.).Other comments are also welcome of course. I won't merge this PR for another 2 weeks at least.
Each individual commit (before rebasing) passed the test suite on Python 3.7.3 and 2.7.16, Debian buster 10.6 amd64, PyParsing 2.4.7 and Graphviz 2.40.1 (20161225.0304).