Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warn about mutable objects in Request.cb_kwargs when JOBDIR is set #6120

Open
Prometheus3375 opened this issue Oct 24, 2023 · 1 comment
Open

Comments

@Prometheus3375
Copy link

Prometheus3375 commented Oct 24, 2023

Summary

Any serializable object inside Request.cb_kwargs when JOBDIR is set, is not passed to request's callable. Instead, its deep copy is passed. This is the result of the serialization and deserialization process. This behavior should be noted in the documentation.

Suggestion

As I suggested in #6119, the next note (or its variation) can added to jobs documentation:

(after the paragraph about debug setting)
Keep in mind, that deserialization of requests creates deep copies of the objects passed to
dictionary cb_kwargs. Meaning, that any mutable object inside cb_kwargs (directly or as a part
of other object) does not make its way to the callback; the callback receives its exact copy. Any
changes made to the copy does not affect the original object.

Additionally, note can be added to cb_kwargs as well:

A dict with arbitrary data that will be passed as keyword arguments to the Request’s callback.
Caution
One should not pass mutable objects to this dictionary if setting JOBDIR is set. Common mutable objects are lists, dicts, sets and bytearrays. Check pausing and resuming crawls documentation for more information.

Motivation

Making deep copies of cb_kwargs content when JOBDIR is set, is not obvious at a glance. Enabling JOBDIR is easy, and one can set it without realization that previously working code will stop working.

Additional context

In #6119 I also suggested that when JOBDIR is set, a warning can be emitted if there is a common mutable object (such as lists, dicts, sets, bytearrays and array from Python array module) inside cb_kwargs values. However, now I think this is unnecessary and can lead to redundant warnings. Especially in cases when mutable object is passed, but there is no intent in mutating it inside request's callback; for example, dictionaries does not have immutable analogue and types.MappingProxyType is not pickleable.

@Prometheus3375
Copy link
Author

In addition, I would be good to note, that to any custom object which cannot or is not meant to be serializable one should add pickle-specific method that immediately throws an error an pickle attempt.
For example:

class MyClass:
    def __init__(self, /):
        self._a = set()  # mutable object
        self._b = (10 ** i for i in range(10))  # unserializable object

    def __reduce__(self, /):
        raise TypeError(f'instances of {self.__class__} are unpickleable')

W/o __reduce__ pickle will attempt to pickle MyClass instances: it will start with _a and then fail and _b. If _a is increasing over time, then every attempt will take more time. Over time, the slow down will be noticeable, as I described in #6119.

Defining __reduce__ or any other pickle-specific method which is called upon serialization, with error raise forces pickle to fail immediately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants