Skip to content

Opt-in speedy support for grouped pandas

Compare
Choose a tag to compare
@machow machow released this 29 Oct 00:30
· 248 commits to master since this release
ca35930

Features

  • Implementation of fast mutate, filter, and summarize using CallTreeLocal (#134). For even just a couple thousand groups, the fast methods are close to optimal hand-written pandas, and the slow versions are almost 1000x slower :o.
  • fixed current grouped pandas mutate to preserve row order (#139)
  • laid down tests of all supported series methods, currently skipping SQL backends (but ready to go!)
  • put up some very basic documentation (#145)
  • wrote an ADR on the rational for fast groupby (#135)

Note that CallTreeLocal has new options, allowing it to look up based on chained attributes (e.g. look for an entry named "dt.year", and override custom function calls.).

I still need to finish support for user defined operations and some light siu refactoring.

Breaking changes

  • Removed the rm_attr argument from CallTreeLocal, since converting subattrs like dt.year will consume dt anyway (can't imagine a situation where we'd want to keep it, and couldn't do that in the translator function)

Demo

from siuba.experimental.pd_groups import fast_mutate, fast_filter, fast_summarize
from siuba import *
from siuba.data import mtcars

g_cars = mtcars.groupby(['cyl', 'gear'])

fast_mutate(g_cars, _.hp - _.hp.mean())