Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore additional properties from input when deserializing to TypedDict #565

Open
ericbn opened this issue Mar 21, 2023 · 8 comments
Open
Labels
enhancement New feature or request

Comments

@ericbn
Copy link

ericbn commented Mar 21, 2023

Hi 👋

I have the intention to use apischema to map between an input dict and an output dict using a TypedDict that defines the output format and aliases and conversions that need to happen from the input. Basically mapping between what I receive from one API to what I should send to another API. Not sure if that's one use case that fits apischema's purpose.

To achieve this, I'd like to be able to ignore additional properties from the input. I see this works as I expected for dataclasses, but not for TypeDicts:

from dataclasses import dataclass
from typing import TypedDict

from apischema import deserialize


@dataclass
class PersonDataclass:
    first_name: str


class PersonTypedDict(TypedDict):
    first_name: str


data = {'first_name': 'John', 'last_name': 'Smith'}
deserialize(PersonDataclass, data, additional_properties=True)  # PersonDataclass(first_name='John')
deserialize(PersonTypedDict, data, additional_properties=True)  # {'first_name': 'John', 'last_name': 'Smith'}

I expected the output of the second deserialize to be {'first_name': 'John'} instead.

@wyfo wyfo added the enhancement New feature or request label Mar 22, 2023
@wyfo
Copy link
Owner

wyfo commented Mar 23, 2023

Not sure if that's one use case that fits apischema's purpose.

Hi, the purpose of apischema is to be the best possible library 😄

Actually, this behavior of TypedDict was implemented on purpose, because {'first_name': 'John', 'last_name': 'Smith'} can in fact be accepted as a PersonTypedDict. However, with hindsight, I think now this is not very intuitive, and should be changed. As written in PEP 589 specifications: "Extra keys included in TypedDict object construction should also be caught."

Because, the current behavior can still be useful, and in order to mitigate the breaking change, I think I will add a setting to toggle it, something like typed_dict_include_extra_keys.

Waiting for the release, here is a quick workaround:

{
    k: v 
    for k, v in deserialize(PersonTypedDict, data, additional_properties=True).items() 
    if k in PersonTypedDict.__annotations__
}

@ericbn
Copy link
Author

ericbn commented Mar 23, 2023

Thanks for considering this use case.

PEP 589 actually says additional keys in the TypedDict should raise an error by type checkers.

For example, d['x'] = 1 should generate a type check error if 'x' is not a valid key for d (which is assumed to be a TypedDict type).

Extra keys included in TypedDict object construction should also be caught. In this example, the director key is not defined in Movie and is expected to generate an error from a type checker:

m: Movie = dict(
    name='Alien',
    year=1979,
    director='Ridley Scott')  # error: Unexpected key 'director'

The sentence you mentioned says i.e. additional keys in the constructor, as well as in assignments as seen in the previous paragraph, should generate an error from a type checker.

I know it's going to be a breaking change if you change the current behavior of apischema, but just wanted to give you a heads up.

@ericbn
Copy link
Author

ericbn commented Mar 23, 2023

Also, since I'm already taking about breaking changes ☺️, I wonder if it would not be cleared to have additional_properties take different values like "fail", "include" and" ignore" (or "exclude") instead of just True and False... Unless you just want it to either "fail or "ignore", in which case something like ignore_additional_properties would maybe be cleared (since True means to "ignore" instead of ”fail", and since "fail" is the default behavior).

@wyfo wyfo closed this as completed Mar 23, 2023
@wyfo wyfo reopened this Mar 23, 2023
@wyfo
Copy link
Owner

wyfo commented Mar 23, 2023

I've no issue with breaking change, as long as it's well justified (it is the case here), and its impact is low enough; I don't know how many people rely on this behavior, but it should not be a lot, and the global setting makes it easy to upgrade.

Actually, additional_properties deserialization setting doesn't concern only TypedDict, but also dataclasses, NamedTuple, and in fact every object types, which can't include extra keys like TypedDict can. So I don't think additional_properties="include" could be a thing.

But you are right, this setting name is quite ambiguous and should be refined. I prefer allow_additional_properties to ignore_additional_properties though (JSON schema uses the term allow), but I also wonder if the JSON term additional_properties should be kept, or for example use extra like Pydantic, or unknown like Serde. Because apischema has the concept of properties fields, which is also very JSON-oriented, I should maybe keep additional_properties naming, so I would got for allow_additional_properties and typed_dict_allow_extra_keys

By the way, additional_properties is also a serialization setting, and its meaning is completely different, as it's also related to TypedDict extra keys. So I should rename this setting to typed_dict_allow_extra_keys too.

Thank you for bringing this discussion to the table, things seems way clearer now.

(Yes, closing the issue was a misclick)

@ericbn
Copy link
Author

ericbn commented Mar 23, 2023

Cool! I see JSON schema uses the term allow for additional properties that can be included beyond what's declared in the schema, which is something not possible for dataclasses for example. This is the behavior I was calling "include" above.

The way I see it:

dataclass NamedTuple TypedDict
Allows additional properties? ❌ Fails with TypeError: __init__() got an unexpected keyword argument '...' ❌ Fails with TypeError: <lambda>() got an unexpected keyword argument '...' Does not fail at run time, although PEP 589 states that type checkers should fail here too.

And what's the difference between allow_additional_properties and typed_dict_allow_extra_keys? Does it make sense to set both in any scenario?

I see how additional_properties makes sense and has a different meaning in serialize. Maybe reusing the same name in deserialize for a different meaning was the cause of the ambiguity here?

@ericbn
Copy link
Author

ericbn commented Mar 23, 2023

I see how additional_properties differs between serialize and deserialize :

serialize deserialize
additional_properties=False Ignore any additional properties from any type. Fail if there are any additional properties to any type.
additional_properties=True Output additional properties from TypedDict*. Ignore if there are additional properties from any other type. Output additional properties to TypedDict*. Ignore if there are additional properties to any other type.

additional_properties=False is where it's ambiguous. My initial request is to be able allow deserialize to "Ignore any additional properties to any type."

EDIT: Maybe change deserialize to ignore when additional_properties=False, same as serialize?

* According to PEP 589, these should not be allowed anyway.

@wyfo
Copy link
Owner

wyfo commented Mar 23, 2023

  • According to PEP 589, these should not be allowed anyway.

That's not exactly true, because of TypedDict subtyping, hence the old decision about the current behavior.

And what's the difference between allow_additional_properties and typed_dict_allow_extra_keys? Does it make sense to set both in any scenario?

deserialization serialization
allow_additional_properties=False
typed_dict_include_extra_keys=False
fail if there is any additional properties for any object type (allow_additional_properties is a deserialization parameter)
allow_additional_properties=True
typed_dict_include_extra_keys=False
allow (i.e. ignore) additional properties for any object type (allow_additional_properties is a deserialization parameter)
allow_additional_properties=False
typed_dict_include_extra_keys=True
include extra keys for TypedDict, fail for any other object type serialize typed dict extra keys
allow_additional_properties=True
typed_dict_include_extra_keys=True
include extra keys for any object type, allow additional properties for any other object type serialize typed dict extra keys

I have renamed typed_dict_allow_extra_keys into typed_dict_include_extra_keys, as I've realized by writing the table that it's less ambiguous.

Cool! I see JSON schema uses the term allow for additional properties that can be included beyond what's declared in the schema, which is something not possible for dataclasses for example.

Object types like dataclasses can have properties field.

Does the table above seem ok to you?

@ericbn
Copy link
Author

ericbn commented Mar 23, 2023

So allow_additional_properties is meant to be used as described in Additional properties / pattern properties only for deserialization and typed_dict_include_extra_keys is separately meant to be used for both serialization/deserialization of TypedDict extra keys. I'm still getting familiar with the apischema usage and possible use cases, but it makes sense to me. 😄

Thanks a lot for thinking this through.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants