adding static type checking with mypy #14468

jreback · 2016-10-21T18:56:32Z

http://blog.zulip.org/2016/10/13/static-types-in-python-oh-mypy/

might be interesting if someone is looking for a project :>

max-sixty · 2016-10-21T21:13:29Z

#12412

jorisvandenbossche · 2016-10-22T10:05:08Z

See also wesm/pandas2#18 (a advantage for targetting this for 2.0 is we can directly use the python 3 syntax)

brmc · 2017-03-12T18:01:03Z

continuing the conversation from wesm/pandas2#18

I agree with @shoyer. type hints should be implemented as early as possible. I've taken a couple days off work, so I can get started on annotations for py2 tomorrow. Do you have any preference on how large pull requests should be? if it were up to me, given the relative safety of this sort of modification, I would probably do a PR per sub-module. Let me know if you'd like me to do it differently. After that i can move on to stubs

jreback · 2017-03-12T18:04:51Z

@brmc sounds great!

I think main things to add are support for Series, DataFrame, Index methods (Timestamp, Timedelta, Period would be nice to).

you are wanting to add .pyi files where the module live? (.e.g. pandas/core/)

will need to add a build with mypy as well for validation.

are there other projects that do this (just for seeing examples), if so how are they adding it?

jreback · 2017-03-12T18:07:28Z

A concern of mine is this:

essentially you are taking a file, say pandas/core/series.py and stripping out the impl, leaving the signatures. Then we validate that this works with mypy. I really really want to be sure this stays in sync with the actual signature / impl files. So we need tests that compares these directly (in py3 this is actually not that hard, you can iterate over the definitions and simply compare the signatures). I think if we only compare this in py3 would be ok actually as our definitions should be py2/3 compat anyways. (or do we want / need differing signatures, e.g. a 2 and 3 impl)?

Have others done this?

TomAugspurger · 2017-03-12T19:36:29Z

We'll want to push these upstream to https://github.com/python/typeshed/ once we have reasonable coverage.

TomAugspurger · 2017-03-12T19:41:42Z

Oh, and I hadn't heard of this, but https://github.com/python/mypy/blob/master/mypy/stubgen.py may be a good starting place? I guess it operates one module at a time so

mkdir out
stubgen pandas.core.series
stubgen pandas.core.frame
stubgen pandas.core.index

brmc · 2017-03-13T07:02:23Z

@jreback, to be honest, i've never used stubs as a core part of a project I was working on. I've only used them to supplement underlying dependencies or to workaround bugs in my IDE, so i've never thought too deeply about what consequences they might have

brmc · 2017-03-13T08:04:27Z

ok, I'm about to get started. To be clear, I'm going to only concentrate on method signatures and return types (for now). I'm not going to annotate local variables unless I have some justifiable reason.

Regarding the relevant parts of the contributing docs:

style: for short signatures i'll do the annotations on a single line, for longer ones, i'll annotate each parameter individually
committing: I should use the "CLN" prefix since this issue has the clean label, correct?
docs: Does this change warrant an entry in whatsnew/v0.20.0.txt? if so, where? under Other Enhancements?

I've never contributed to pandas so please let me know if there's anything I'm overlooking

jreback · 2017-03-13T12:27:36Z

contributing docs are: http://pandas-docs.github.io/pandas-docs-travis/contributing.html

style: for short signatures i'll do the annotations on a single line, for longer ones, i'll annotate each parameter individually

sounds good

committing: I should use the "CLN" prefix since this issue has the clean label, correct?

that's fine

docs: Does this change warrant an entry in whatsnew/v0.20.0.txt? if so, where? under Other Enhancements?

yes

brmc · 2017-03-13T15:19:43Z

contributing docs are: http://pandas-docs.github.io/pandas-docs-travis/contributing.html

right. I had already read through it. just double-checking I didn't miss anything major

brmc · 2017-03-25T08:01:17Z

Just to update, I'm still working on this, slowly but surely. I'm pretty much restricted to weekend work, but it'll get there. it's not nearly as trivial of a task as I expected :)

jorisvandenbossche · 2017-03-25T10:46:27Z

@brmc No problem, great to hear you make some progress.
I would just advise to make as quickly as possible a PR (just clearly indicate, eg in the title, that it is work in progress, that is no problem), so you can get early feedback on the direction you are going.

teh · 2017-05-08T19:49:02Z

@brmc - I was going to start on mypy types. Would you mind sharing what you have? I can start from scratch but that seems a bit silly if you already spent time on it :)

TomAugspurger · 2017-05-08T19:56:26Z

@teh that'd be great. The work is started here

I'm going to rebase and push an update to that and then probably merge soon after. We need to get the basic testing infrastructure in place, and then start iterating a handful of files at a time.

teh · 2017-05-08T20:32:31Z

Thanks! Type checking not exactly fast, even with --incremental (30 seconds per run on my machine) 😓

Do you have a preference for inline types over pyi stub files?

TomAugspurger · 2017-05-08T20:36:32Z

Thanks! Type checking not exactly fast, even with --incremental (30 seconds per run on my machine) 😓

I think part of the slowness is from mypy following imports? The commit I just pushed disabled that in setup.cfg. Doing the whole codebase will still take a while though.

Do you have a preference for inline types over pyi stub files?

Inline.

teh · 2017-05-08T21:10:06Z

I'm already running into some interesting problems. E.g.

python/mypy#1996

means that we'd have to add a values = None to the class body of IndexOpsMixin which is a semantic change. The mixin style also doesn't play nicely with is_unique in the same class. There we have:

self.nunique() == len(self)

but we don't know yet at this point that self is a typing.Sized. We could make IndexOpsMixin subclass some ABC thingy but that'd also be a semantic change.

An alternative could be that we ignore the pandas bowels for now and only focus on the user interface (e.g. DataFrame) which is useful for end-users but less so for pandas devs.

Overall already far less trivial than I expected :/

TomAugspurger · 2017-05-08T21:14:42Z

I'm already running into some interesting problems. E.g.

For those types of problems, I'm defining an ABCMixins, like ABCIndexOpsMixin

An alternative could be that we ignore the pandas bowels for now and only focus on the user interface (e.g. DataFrame) which is useful for end-users but less so for pandas devs.

I think you might be right. Do you want to spend a little time exploring this?

jreback · 2017-05-08T22:18:09Z

so would focus solely on the user focused API

nimish · 2017-10-09T10:14:40Z

Is anyone working on this? I have been hacking it by using my own bespoke types but I'd rather use official ones

TomAugspurger · 2017-10-09T10:57:50Z

There's a (stalled) PR at #15866. You might want to look at #15866 (comment)

Please feel free to take it up!

TomAugspurger · 2017-11-20T19:54:49Z

Maybe useful for auto-generating some annotations:

jreback · 2017-11-25T14:01:38Z

nice write up by @shoyer for numpy
https://docs.google.com/document/d/1vpMse4c6DrWH5rq2tQSx3qwP_m_0lyn-Ij4WHqQqRHY/mobilebasic

we have similar but in some sense more complicated issues as we would
care about a dtype per column specification

teh · 2017-11-25T17:24:36Z

I'm currently using company time to develop separate mypy stubs (.pyi files as opposed to inline annotations). I have covered the easy 30% (e.g. .copy()) but indexing is an ongoing, unresolved struggle. We're going to open source if we ever get to a usable state.

I'm not convinced that it'll be possible to capture the full pandas behaviour in mypy-annotations in a meaningful way - meaningful as in don't use Any everywhere.

With that out of the way:

The numpy typing document is really interesting! They seem to be suggesting a dependent typing solution (without calling it that), at least for dimensions.
This mypy ticket could be useful because we have lot of string-typed arguments that are really ADTs (or enums). E.g. kind : {‘quicksort’, ‘mergesort’, ‘heapsort’} in sort. I'm quite unhappy that those are str in my stubs.
Inferring the type of a column in pandas in general seems complex to impossible. E.g. unstack-ing an int64 dtype that introduces NaNs (missing data) will convert the int64 to float64. We can't really prove anything other than the return value being DataFrame. Even with mypy plugins this kind of behaviour will be hard to prove.
Protocols seem like a potential solution for supporting IndexOpsMixin - even if I still don't know yet how to implement indexing itself.

Apologies for the rather long comment!

shoyer · 2017-11-25T20:43:16Z

See also python/typing#478 (comment) for more on use cases for literal values in types for pandas. I agree that this would be very helpful.

For pandas and DataFrame libraries more generally, we could really use a generic version of mypy's TypedDict to indicate dtypes per column. Something like:

class UserDataFrame(pandas.TypedDataFrame):
    name = str
    id = int
    address = str

For indexing, it might be easier to start with .loc and .iloc. The full behavior for getitem/setitem is so complex that even pandas developers don't really understand it (#9595), and I doubt it could ever be fully type checked.

The numpy typing document is really interesting! They seem to be suggesting a dependent typing solution (without calling it that), at least for dimensions.

Thanks! This is probably because my knowledge of formal type systems is pretty limited. If you have any other useful pointers, they would be appreciated!

gwerbin · 2018-04-30T20:24:01Z

@shoyer that's exactly what I'd want out of a Pandas type system. Don't forget that Pandas is a data science tool, and that kind of typing would be a huge benefit to data science work. The rest is "extra" IMO.

shoyer · 2018-04-30T20:40:21Z

@gwerbin sorry, can you clarify exact what I wrote that you'd want out of a pandas type system? :)

gwerbin · 2018-04-30T20:51:25Z

@shoyer the ability to do this:

import pandas
from pandas.api.types import CategoricalDtype
from pathlib import Path
from py._path.local import LocalPath
from typing import Union, Text
from typing_extensions import Protocol

class SupportsRead(Protocol):
    def read(self) -> Union[Text, bytes]:
       ...

RegionDtype = CategoricalDtype('RegionDtype', ['north', 'south', 'east', 'west'])

class MyDataFrameType(pandas.TypedDataFrame):
    name = str
    id = int
    address = str
    region = RegionDtype

PandasReadable = Union[Text, Path, LocalPath, SupportsRead]
# or better yet, having a PandasReadable class accessible in the Pandas hierarchy somewhere

def load_my_data(filepath_or_buffer: PandasReadable=None) -> MyDataFrameType:
    dtypes = OrderedDict([
        ('name', str),
        ('id', int),
        ('address', str),
        ('region', RegionDtype)
    ])
    # or ideally something more elgant like dtypes = MyDataFrameType.dtypes

    return pandas.read_csv(filepath_or_buffer, usecols=list(dtypes.keys()), dtypes=dtypes)

westurner · 2018-08-07T02:28:49Z

PEP-484 Type Annotations tools:

stubgen.py
Dropbox/pyannotate
https://github.com/dropbox/pyannotate
Instagram/MonkeyType
https://github.com/Instagram/MonkeyType

EDIT: https://mypy.readthedocs.io/en/latest/existing_code.html

mfcabrera · 2018-08-16T13:24:30Z

@teh do you have your stubs somewhere? and do you plan to continue this route?. I would interested in helping.

nicoabie · 2018-11-28T17:48:51Z

Is there any way people can contribute to move this forward? I'm planning on start type checking a project and I'd definitely like having support for pandas types as well.
Currently considering mypy and pyre-check

TomAugspurger · 2018-11-28T17:51:03Z

At this point I suspect we'll wait until we drop Python 2 (end of year) so that we can use python3-style type annotations.

…

On Wed, Nov 28, 2018 at 11:48 AM Nicolás Andrés Gallinal < ***@***.***> wrote: Is there any way people can contribute to move this forward? I'm planning on start type checking a project and I'd definitely like having support for pandas types as well. Currently considering mypy and pyre-check — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#14468 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABQHIkFa3lXV3tVJjl6yaWamNByUoCL3ks5uzsyKgaJpZM4KdfO7> .

nicoabie · 2018-11-28T17:56:17Z

Thanks @TomAugspurger, I'd do the same. If help can be provided for the migration I'm sure the community will be there me included.

jorisvandenbossche · 2018-11-28T23:29:22Z

But in principle, some work could already be started using type comments?

TomAugspurger · 2018-11-28T23:39:54Z

Yep (and I have been in places, probably incorrectly). We'd just need to convert those to annotations, though mypy is probably capable of using a mixture? I think the main benefit is that the `typing` module will be available.

…

On Wed, Nov 28, 2018 at 5:29 PM Joris Van den Bossche < ***@***.***> wrote: But in principle, some work could already be started using type comments? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#14468 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABQHIs549UqHwcPQUzTnw0Uq8e0TsqLoks5uzxxXgaJpZM4KdfO7> .

josh-theorem · 2019-08-05T17:01:46Z

So, this is marked done, is it possible to have dataframes with specified column types now?

WillAyd · 2019-08-05T17:12:00Z

@josh-theorem if you are referring to static analysis then no, we don't currently support generic parametrization of Series / Index / DataFrame objects. Not opposed to it in the long run but we have a large part of the code base that just needs to be annotated first.

If you would like to help out there is #26766 which is a pre-cursor to what you are asking for. Could definitely use community support to push that along if you'd like to submit PRs

jreback added CI Continuous Integration Clean labels Oct 21, 2016

jreback added this to the Next Major Release milestone Oct 21, 2016

jorisvandenbossche mentioned this issue Mar 12, 2017

How does the community feel about type hinting? #15661

Closed

jreback mentioned this issue Mar 12, 2017

Type annotations wesm/pandas2#18

Open

jreback added the 2/3 Compat label Mar 12, 2017

shoyer mentioned this issue Oct 27, 2017

Check that literal strings/int/float belong to /is excluded from a set/range of values python/typing#478

Closed

WillAyd mentioned this issue May 11, 2018

MyPy cleanup and absolute imports in pandas.core.dtypes.common #21008

Merged

zooba mentioned this issue Aug 14, 2018

No completions returned from LS in following code using pandas microsoft/PTVS#4590

Closed

rth mentioned this issue Jan 13, 2019

Support type annotations (PEP 484) numpy/numpydoc#196

Open

WillAyd mentioned this issue Mar 23, 2019

MyPy CI Configuration #25844

Merged

1 task

jreback modified the milestones: Contributions Welcome, 0.25.0 Mar 24, 2019

jreback closed this as completed in #25844 Mar 26, 2019

WillAyd added this to To do in Type Hints via automation Mar 26, 2019

WillAyd moved this from To do to Done in Type Hints Mar 26, 2019

simonjayhawkins added the Typing type annotations, mypy/pyright type checking label Aug 5, 2019

stevedodson mentioned this issue Nov 14, 2019

Consider adding typehints into the codebase elastic/eland#19

Closed

tobiasraabe mentioned this issue Jul 1, 2020

More Checks on input data iza-institute-of-labor-economics/gettsim#149

Closed

4 tasks

jeffzi mentioned this issue Aug 9, 2020

pydantic/fastapi style validation unionai-oss/pandera#253

Closed

hoechenberger mentioned this issue Aug 29, 2020

Are type annotations worth having? mne-tools/mne-bids#533

Open

adding static type checking with mypy #14468

adding static type checking with mypy #14468

Comments

jreback commented Oct 21, 2016

max-sixty commented Oct 21, 2016

jorisvandenbossche commented Oct 22, 2016

brmc commented Mar 12, 2017

jreback commented Mar 12, 2017 • edited

jreback commented Mar 12, 2017 • edited

TomAugspurger commented Mar 12, 2017

TomAugspurger commented Mar 12, 2017

brmc commented Mar 13, 2017

brmc commented Mar 13, 2017

jreback commented Mar 13, 2017

brmc commented Mar 13, 2017 • edited

brmc commented Mar 25, 2017 • edited

jorisvandenbossche commented Mar 25, 2017

teh commented May 8, 2017

TomAugspurger commented May 8, 2017

teh commented May 8, 2017

TomAugspurger commented May 8, 2017

teh commented May 8, 2017

TomAugspurger commented May 8, 2017

jreback commented May 8, 2017

nimish commented Oct 9, 2017

TomAugspurger commented Oct 9, 2017

TomAugspurger commented Nov 20, 2017

jreback commented Nov 25, 2017

teh commented Nov 25, 2017

shoyer commented Nov 25, 2017

gwerbin commented Apr 30, 2018

shoyer commented Apr 30, 2018

gwerbin commented Apr 30, 2018 • edited

westurner commented Aug 7, 2018 • edited

mfcabrera commented Aug 16, 2018

nicoabie commented Nov 28, 2018

TomAugspurger commented Nov 28, 2018 via email

nicoabie commented Nov 28, 2018

jorisvandenbossche commented Nov 28, 2018

TomAugspurger commented Nov 28, 2018 via email

josh-theorem commented Aug 5, 2019

WillAyd commented Aug 5, 2019

jreback commented Mar 12, 2017 •

edited

jreback commented Mar 12, 2017 •

edited

brmc commented Mar 13, 2017 •

edited

brmc commented Mar 25, 2017 •

edited

gwerbin commented Apr 30, 2018 •

edited

westurner commented Aug 7, 2018 •

edited