SMEP D: Panel Grouped Data

Summary

This page collects thoughts on data management and processing for panel or grouped data in statsmodels.

3 general steps:

Index prep/check & conversion to DF
Minor transforms applied to DF. These transforms return np.arrays
Computation with np arrays

Index prep/check & conversion to DF

A typical user call:

panel_operation(array, unit, time)

where

array is a numpy array
unit is a list/np.array/pd.Series of length array.shape[0]
time is a list/np.array/pd.Series of length array.shape[0]

If a results instance is passed, we can check if the original data is a DF whose index we can use instead of building one from scratch.

Under the hood:

idx = make_pandas_index(unit, time)
sanity = check_index(idx) # e.g. duplicates, sorting, with useful error messages for user
df = pd.DataFrame(array, index=idx)
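
The three calls above can be sketched as follows. This is only a minimal illustration, not an existing statsmodels API: the bodies of make_pandas_index() and check_index(), the level names "unit" and "time", and the requirement that the index be sorted are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def make_pandas_index(unit, time):
    """Build a two-level (unit, time) MultiIndex from array-likes."""
    unit = np.asarray(unit)
    time = np.asarray(time)
    if unit.shape[0] != time.shape[0]:
        raise ValueError("unit and time must have the same length")
    return pd.MultiIndex.from_arrays([unit, time], names=["unit", "time"])


def check_index(idx):
    """Raise a descriptive error if the index is unusable for panel work."""
    if idx.has_duplicates:
        # report the offending (unit, time) pairs to the user
        dups = idx[idx.duplicated()].unique().tolist()
        raise ValueError(f"duplicate (unit, time) pairs: {dups}")
    if not idx.is_monotonic_increasing:
        raise ValueError("index must be sorted by unit, then time")
    return True


# toy data: two units observed in two periods each
array = np.arange(8.0).reshape(4, 2)
unit = ["a", "a", "b", "b"]
time = [1, 2, 1, 2]

idx = make_pandas_index(unit, time)
sanity = check_index(idx)
df = pd.DataFrame(array, index=idx)
```

Raising inside check_index() rather than returning a flag is one possible design; it keeps the error message close to the cause.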

Notes:

I can't imagine a reasonable use-case where this operation has to be repeated many times, so overhead should be minimal.

Simple transforms are applied to the DF using groupby operations

Then we can apply one of the many simple group transforms that are collected in our library of transforms. For example:

Xl = group_lag(df)

Each of the simple transforms returns a numpy array. For other examples, see: https://gist.github.com/vincentarelbundock/5035397
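
As an illustration of what such a transform might look like, here is a possible group_lag() built on a pandas groupby. It is a sketch and not necessarily the version in the gist: the lag argument and the assumption that the unit is the first index level are mine.

```python
import numpy as np
import pandas as pd


def group_lag(df, lag=1):
    """Lag every column within each unit and return a numpy array."""
    # level=0 is assumed to be the unit level of a (unit, time) MultiIndex
    return df.groupby(level=0).shift(lag).to_numpy()


idx = pd.MultiIndex.from_arrays(
    [["a", "a", "a", "b", "b"], [1, 2, 3, 1, 2]], names=["unit", "time"]
)
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 10.0, 20.0]}, index=idx)

# the first observation of each unit has no predecessor and becomes NaN
Xl = group_lag(df)
```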

Complex computations, most linalg operations, and expensive operations should not be applied here.

Notes:

These pandas groupby transforms are available only for convenience and to improve the readability of code. This step can be skipped in favor of what I describe in the next section.

Computation with np arrays

Group-by operations on numpy arrays are conducted using slice indexing. For example, we could build an Omega matrix for sandwich estimation, with the group-wise outer products of the residuals on the block diagonal, by:

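import numpy as np
import scipy.linalg

slices = get_slices(idx)  # one slice per group
blocks = []
for s in slices:
    # outer product of the residuals belonging to group s
    blocks.append(np.outer(array[s], array[s]))
OM = scipy.linalg.block_diag(*blocks)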

Notes:

An example get_slices() function is included in the gist I linked to above.
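
For readers without the gist at hand, the following is one way get_slices() could be written, together with the block-diagonal construction on toy residuals. It is a sketch under the assumption that idx is a MultiIndex already sorted by unit; the gist's implementation may differ.

```python
import numpy as np
import pandas as pd
import scipy.linalg


def get_slices(idx):
    """Return one slice per unit, assuming idx is sorted by unit."""
    codes = np.asarray(idx.codes[0])
    # positions where the unit code changes mark the group boundaries
    breaks = np.flatnonzero(np.diff(codes)) + 1
    bounds = np.concatenate(([0], breaks, [len(codes)]))
    return [slice(int(a), int(b)) for a, b in zip(bounds[:-1], bounds[1:])]


idx = pd.MultiIndex.from_arrays([["a", "a", "b", "b", "b"], [1, 2, 1, 2, 3]])
resid = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up residuals

slices = get_slices(idx)
blocks = [np.outer(resid[s], resid[s]) for s in slices]
omega = scipy.linalg.block_diag(*blocks)
```

Here omega is 5 x 5 with a 2 x 2 block for unit "a" and a 3 x 3 block for unit "b"; all cross-unit entries are zero.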

Comments

Benefits

Single user interface for dealing with hierarchical group indices
Not reinventing the wheel in terms of group management (i.e., Wes has thought about this more than we have)

Concerns (and responses)

Future proofing (what if pandas breaks backward compatibility for groupbys and indices)
- Discipline: Keep pandas transforms very simple. Complex operations should be done using numpy slices to reduce dependency
- See more below
Expensive overhead:
- Pandas transforms can be slower than pure numpy ones
- Transforming arrays to DF and back to arrays imposes a small cost

What if pandas breaks our functions?

Write a get_slices() that works on numpy arrays instead of pandas indices
Write a group_lag() that works on sliced numpy arrays
If both of these functions preserve the call structure, nothing else needs to be changed.
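
A pandas-free fallback along these lines might look like the sketch below. The signatures are illustrative assumptions (in particular, group_lag() here takes the slices explicitly), and the unit labels are assumed to be sorted so that each group is contiguous.

```python
import numpy as np


def get_slices(unit):
    """One slice per group; unit is a 1-D array already sorted by group."""
    unit = np.asarray(unit)
    # a boundary sits wherever two neighbouring labels differ
    breaks = np.flatnonzero(unit[1:] != unit[:-1]) + 1
    bounds = np.concatenate(([0], breaks, [unit.shape[0]]))
    return [slice(int(a), int(b)) for a, b in zip(bounds[:-1], bounds[1:])]


def group_lag(array, slices, lag=1):
    """Lag rows within each slice (lag >= 1); missing rows become NaN."""
    array = np.asarray(array, dtype=float)
    out = np.full(array.shape, np.nan)
    for s in slices:
        block = array[s]
        if block.shape[0] > lag:
            # out[s] is a view, so this writes into out directly
            out[s][lag:] = block[:-lag]
    return out


unit = np.array(["a", "a", "a", "b", "b"])
x = np.array([[1.0], [2.0], [3.0], [10.0], [20.0]])

slices = get_slices(unit)
lagged = group_lag(x, slices)
```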
