
SMEP D: Panel Grouped Data

josef-pkt edited this page Apr 28, 2013 · 1 revision

Summary

This page collects thoughts on data management and processing for panel or grouped data in statsmodels.

3 general steps:

  1. Index prep/check & conversion to DF
  2. Minor transforms applied to DF. These transforms return np.arrays
  3. Computation with np arrays

Index prep/check & conversion to DF

A typical user call:

panel_operation(array, unit, time)

where

  • array is a numpy array
  • unit is a list/np.array/pd.Series of length array.shape[0]
  • time is a list/np.array/pd.Series of length array.shape[0]

If a results instance is passed, we can check if the original data is a DF whose index we can use instead of building one from scratch.

Under the hood:

idx = make_pandas_index(unit, time)
sanity = check_index(idx) # e.g. duplicates, sorting, with useful error messages for user
df = pd.DataFrame(array, index=idx)

Notes:

  • I can't imagine a reasonable use-case where this operation has to be repeated many times, so overhead should be minimal.
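The index step above could be sketched as follows. The bodies of make_pandas_index() and check_index() are assumptions made for illustration (this page only names the two functions); the sketch assumes a two-level (unit, time) MultiIndex and treats duplicates and unsorted rows as errors.

```python
import numpy as np
import pandas as pd


def make_pandas_index(unit, time):
    """Build a two-level (unit, time) MultiIndex from two equal-length sequences."""
    unit = np.asarray(unit)
    time = np.asarray(time)
    if unit.shape[0] != time.shape[0]:
        raise ValueError("unit and time must have the same length")
    return pd.MultiIndex.from_arrays([unit, time], names=["unit", "time"])


def check_index(idx):
    """Raise a descriptive error if idx has duplicates or is not sorted."""
    if not idx.is_unique:
        dups = idx[idx.duplicated()].unique().tolist()
        raise ValueError("duplicate (unit, time) pairs, e.g. %r" % dups[:5])
    if not idx.is_monotonic_increasing:
        raise ValueError("index must be sorted by unit, then by time")
    return True


# Toy data: 2 units observed over 3 periods, 2 columns.
array = np.arange(12.0).reshape(6, 2)
unit = ["a", "a", "a", "b", "b", "b"]
time = [1, 2, 3, 1, 2, 3]

idx = make_pandas_index(unit, time)
sanity = check_index(idx)
df = pd.DataFrame(array, index=idx)
```

Raising on unsorted input is only one possible design; sorting the data and remembering the permutation would be another.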

Simple transforms are applied to the DF using groupby operations

Then we can apply one of the many simple group transforms that are collected in our library of transforms. For example:

Xl = group_lag(df)

Each of the simple transforms returns a numpy array. For other examples, see: https://gist.github.com/vincentarelbundock/5035397

Complex computations, most linalg operations, and expensive operations should not be applied here.

Notes:

  • These pandas groupby transforms are available only for convenience and to improve the readability of code. This step can be skipped in favor of what I describe in the next section.
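A minimal sketch of what a group_lag() transform might look like, assuming the DF carries a (unit, time) MultiIndex whose first level is the unit and whose rows are sorted by time within each unit. The lag argument and the function body are assumptions; the lag is positional, so gaps in the time dimension are not detected.

```python
import numpy as np
import pandas as pd


def group_lag(df, lag=1):
    """Lag every column within each unit (first index level); return a numpy array."""
    lagged = df.groupby(level=0).shift(lag)
    return lagged.to_numpy()


idx = pd.MultiIndex.from_arrays(
    [["a", "a", "a", "b", "b"], [1, 2, 3, 1, 2]], names=["unit", "time"]
)
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 10.0, 20.0]}, index=idx)

Xl = group_lag(df)
# The first observation of each unit has no predecessor, so it becomes NaN.
```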

Computation with np arrays

Group-by operations on numpy arrays are conducted using slice indexing. For example, we could build an Omega matrix for sandwich estimation, with group-wise residual outer products on the block diagonal, by:

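import numpy as np
import scipy.linalg

slices = get_slices(idx)  # one slice per group
OM = []
for s in slices:
    OM.append(np.outer(array[s], array[s]))  # outer product of the group's residuals
OM = scipy.linalg.block_diag(*OM)  # place the group blocks on the diagonal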

Notes:

  • An example get_slices() function is included in the gist I linked to above.
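For readers without the gist at hand, here is one way get_slices() could be written. This is a sketch under assumptions, not the gist's implementation: it assumes the index is sorted by its first (unit) level so that each group occupies a contiguous run of rows. The resid array is made-up data standing in for residuals.

```python
import numpy as np
import pandas as pd
import scipy.linalg


def get_slices(idx):
    """Return one slice per unit, assuming idx is sorted by its first level."""
    units = np.asarray(idx.get_level_values(0))
    if units.shape[0] == 0:
        return []
    # Row positions where the unit label changes mark the start of a new group.
    change = np.flatnonzero(units[1:] != units[:-1]) + 1
    starts = np.concatenate(([0], change))
    stops = np.concatenate((change, [units.shape[0]]))
    return [slice(int(a), int(b)) for a, b in zip(starts, stops)]


idx = pd.MultiIndex.from_arrays([["a", "a", "b", "b", "b"], [1, 2, 1, 2, 3]])
resid = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # stand-in for residuals

slices = get_slices(idx)
OM = scipy.linalg.block_diag(*[np.outer(resid[s], resid[s]) for s in slices])
```

With two groups of sizes 2 and 3, OM is a 5x5 matrix whose off-block entries are zero.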

Comments

Benefits

  • Single user interface for dealing with hierarchical group indices
  • Not reinventing the wheel in terms of group management (i.e., Wes has thought about this more than we have)

Concerns (and responses)

  • Future proofing (what if pandas breaks backward compatibility for groupbys and indices?)
    • Discipline: Keep pandas transforms very simple. Complex operations should be done using numpy slices to reduce dependency
    • See more below
  • Expensive overhead:
    • Pandas transforms can be slower than pure numpy ones
    • Transforming arrays to DF and back to arrays imposes a small cost

What if pandas breaks our functions?

  • Write a get_slices() that works on numpy arrays instead of pandas indices
  • Write a group_lag() that works on sliced numpy arrays
  • If both of these functions preserve the call structure, nothing else needs to be changed.
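The fallback described above might look like the following numpy-only sketch. The names get_slices_np() and group_lag_np() are hypothetical, chosen here to keep them distinct from the pandas-based versions; the sketch assumes rows are already sorted by group and that lag is a positive integer.

```python
import numpy as np


def get_slices_np(labels):
    """One slice per run of equal labels; assumes rows are sorted by group."""
    labels = np.asarray(labels)
    n = labels.shape[0]
    if n == 0:
        return []
    change = np.flatnonzero(labels[1:] != labels[:-1]) + 1
    bounds = np.concatenate(([0], change, [n]))
    return [slice(int(a), int(b)) for a, b in zip(bounds[:-1], bounds[1:])]


def group_lag_np(array, slices, lag=1):
    """Shift rows down by `lag` (a positive int) within each slice, padding with NaN."""
    array = np.asarray(array, dtype=float)
    out = np.full(array.shape, np.nan)
    for s in slices:
        if s.stop - s.start > lag:
            out[s.start + lag:s.stop] = array[s.start:s.stop - lag]
    return out


labels = np.array([0, 0, 0, 1, 1])
x = np.array([1.0, 2.0, 3.0, 10.0, 20.0])

slices = get_slices_np(labels)
xl = group_lag_np(x, slices)
# The first row of each group has no predecessor, so it stays NaN.
```

Because downstream code only consumes slices and plain arrays, swapping these in would leave the computation step untouched.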