Ignore additional properties from input when deserializing to TypedDict #565

ericbn · 2023-03-21T20:16:11Z

Hi 👋

I have the intention to use apischema to map between an input dict and an output dict using a TypedDict that defines the output format and aliases and conversions that need to happen from the input. Basically mapping between what I receive from one API to what I should send to another API. Not sure if that's one use case that fits apischema's purpose.

To achieve this, I'd like to be able to ignore additional properties from the input. I see this works as I expected for dataclasses, but not for TypeDicts:

from dataclasses import dataclass
from typing import TypedDict

from apischema import deserialize


@dataclass
class PersonDataclass:
    first_name: str


class PersonTypedDict(TypedDict):
    first_name: str


data = {'first_name': 'John', 'last_name': 'Smith'}
deserialize(PersonDataclass, data, additional_properties=True)  # PersonDataclass(first_name='John')
deserialize(PersonTypedDict, data, additional_properties=True)  # {'first_name': 'John', 'last_name': 'Smith'}

I expected the output of the second deserialize to be {'first_name': 'John'} instead.

The text was updated successfully, but these errors were encountered:

wyfo · 2023-03-23T07:58:36Z

Not sure if that's one use case that fits apischema's purpose.

Hi, the purpose of apischema is to be the best possible library 😄

Actually, this behavior of TypedDict was implemented on purpose, because {'first_name': 'John', 'last_name': 'Smith'} can in fact be accepted as a PersonTypedDict. However, with hindsight, I think now this is not very intuitive, and should be changed. As written in PEP 589 specifications: "Extra keys included in TypedDict object construction should also be caught."

Because, the current behavior can still be useful, and in order to mitigate the breaking change, I think I will add a setting to toggle it, something like typed_dict_include_extra_keys.

Waiting for the release, here is a quick workaround:

{
    k: v 
    for k, v in deserialize(PersonTypedDict, data, additional_properties=True).items() 
    if k in PersonTypedDict.__annotations__
}

ericbn · 2023-03-23T12:33:05Z

Thanks for considering this use case.

PEP 589 actually says additional keys in the TypedDict should raise an error by type checkers.

For example, d['x'] = 1 should generate a type check error if 'x' is not a valid key for d (which is assumed to be a TypedDict type).

Extra keys included in TypedDict object construction should also be caught. In this example, the director key is not defined in Movie and is expected to generate an error from a type checker:
m: Movie = dict(
    name='Alien',
    year=1979,
    director='Ridley Scott')  # error: Unexpected key 'director'

The sentence you mentioned says i.e. additional keys in the constructor, as well as in assignments as seen in the previous paragraph, should generate an error from a type checker.

I know it's going to be a breaking change if you change the current behavior of apischema, but just wanted to give you a heads up.

ericbn · 2023-03-23T12:39:50Z

Also, since I'm already taking about breaking changes ☺️, I wonder if it would not be cleared to have additional_properties take different values like "fail", "include" and" ignore" (or "exclude") instead of just True and False... Unless you just want it to either "fail or "ignore", in which case something like ignore_additional_properties would maybe be cleared (since True means to "ignore" instead of ”fail", and since "fail" is the default behavior).

wyfo · 2023-03-23T13:44:15Z

I've no issue with breaking change, as long as it's well justified (it is the case here), and its impact is low enough; I don't know how many people rely on this behavior, but it should not be a lot, and the global setting makes it easy to upgrade.

Actually, additional_properties deserialization setting doesn't concern only TypedDict, but also dataclasses, NamedTuple, and in fact every object types, which can't include extra keys like TypedDict can. So I don't think additional_properties="include" could be a thing.

But you are right, this setting name is quite ambiguous and should be refined. I prefer allow_additional_properties to ignore_additional_properties though (JSON schema uses the term allow), but I also wonder if the JSON term additional_properties should be kept, or for example use extra like Pydantic, or unknown like Serde. Because apischema has the concept of properties fields, which is also very JSON-oriented, I should maybe keep additional_properties naming, so I would got for allow_additional_properties and typed_dict_allow_extra_keys

By the way, additional_properties is also a serialization setting, and its meaning is completely different, as it's also related to TypedDict extra keys. So I should rename this setting to typed_dict_allow_extra_keys too.

Thank you for bringing this discussion to the table, things seems way clearer now.

(Yes, closing the issue was a misclick)

ericbn · 2023-03-23T14:34:53Z

Cool! I see JSON schema uses the term allow for additional properties that can be included beyond what's declared in the schema, which is something not possible for dataclasses for example. This is the behavior I was calling "include" above.

The way I see it:

	dataclass	NamedTuple	TypedDict
Allows additional properties?	❌ Fails with `TypeError: __init__() got an unexpected keyword argument '...'`	❌ Fails with `TypeError: <lambda>() got an unexpected keyword argument '...'`	Does not fail at run time, although PEP 589 states that type checkers should fail here too.

And what's the difference between allow_additional_properties and typed_dict_allow_extra_keys? Does it make sense to set both in any scenario?

I see how additional_properties makes sense and has a different meaning in serialize. Maybe reusing the same name in deserialize for a different meaning was the cause of the ambiguity here?

ericbn · 2023-03-23T15:39:47Z

I see how additional_properties differs between serialize and deserialize :

	`serialize`	`deserialize`
`additional_properties=False`	Ignore any additional properties from any type.	Fail if there are any additional properties to any type.
`additional_properties=True`	Output additional properties from TypedDict^*. Ignore if there are additional properties from any other type.	Output additional properties to TypedDict^*. Ignore if there are additional properties to any other type.

additional_properties=False is where it's ambiguous. My initial request is to be able allow deserialize to "Ignore any additional properties to any type."

EDIT: Maybe change deserialize to ignore when additional_properties=False, same as serialize?

^{* According to PEP 589, these should not be allowed anyway.}

wyfo · 2023-03-23T16:40:19Z

According to PEP 589, these should not be allowed anyway.

That's not exactly true, because of TypedDict subtyping, hence the old decision about the current behavior.

And what's the difference between allow_additional_properties and typed_dict_allow_extra_keys? Does it make sense to set both in any scenario?

	deserialization	serialization
`allow_additional_properties=False` `typed_dict_include_extra_keys=False`	fail if there is any additional properties for any object type	(allow_additional_properties is a deserialization parameter)
`allow_additional_properties=True` `typed_dict_include_extra_keys=False`	allow (i.e. ignore) additional properties for any object type	(allow_additional_properties is a deserialization parameter)
`allow_additional_properties=False` `typed_dict_include_extra_keys=True`	include extra keys for `TypedDict`, fail for any other object type	serialize typed dict extra keys
`allow_additional_properties=True` `typed_dict_include_extra_keys=True`	include extra keys for any object type, allow additional properties for any other object type	serialize typed dict extra keys

I have renamed typed_dict_allow_extra_keys into typed_dict_include_extra_keys, as I've realized by writing the table that it's less ambiguous.

Cool! I see JSON schema uses the term allow for additional properties that can be included beyond what's declared in the schema, which is something not possible for dataclasses for example.

Object types like dataclasses can have properties field.

Does the table above seem ok to you?

ericbn · 2023-03-23T20:59:02Z

So allow_additional_properties is meant to be used as described in Additional properties / pattern properties only for deserialization and typed_dict_include_extra_keys is separately meant to be used for both serialization/deserialization of TypedDict extra keys. I'm still getting familiar with the apischema usage and possible use cases, but it makes sense to me. 😄

Thanks a lot for thinking this through.

wyfo added the enhancement New feature or request label Mar 22, 2023

wyfo closed this as completed Mar 23, 2023

wyfo reopened this Mar 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ignore additional properties from input when deserializing to TypedDict #565

Ignore additional properties from input when deserializing to TypedDict #565

ericbn commented Mar 21, 2023 •

edited

wyfo commented Mar 23, 2023

ericbn commented Mar 23, 2023 •

edited

ericbn commented Mar 23, 2023 •

edited

wyfo commented Mar 23, 2023

ericbn commented Mar 23, 2023

ericbn commented Mar 23, 2023 •

edited

wyfo commented Mar 23, 2023

ericbn commented Mar 23, 2023

Ignore additional properties from input when deserializing to TypedDict #565

Ignore additional properties from input when deserializing to TypedDict #565

Comments

ericbn commented Mar 21, 2023 • edited

wyfo commented Mar 23, 2023

ericbn commented Mar 23, 2023 • edited

ericbn commented Mar 23, 2023 • edited

wyfo commented Mar 23, 2023

ericbn commented Mar 23, 2023

ericbn commented Mar 23, 2023 • edited

wyfo commented Mar 23, 2023

ericbn commented Mar 23, 2023

ericbn commented Mar 21, 2023 •

edited

ericbn commented Mar 23, 2023 •

edited

ericbn commented Mar 23, 2023 •

edited

ericbn commented Mar 23, 2023 •

edited