
Pandas dataframe input

Jack Gerrits edited this page May 27, 2020 · 2 revisions

Pull request #2426 introduces a generic extensible framework for VW to understand structured Pandas dataframes.

1. Overview

The class DFtoVW in vowpalwabbit.pyvw takes as input a pandas.DataFrame and special types (SimpleLabel, Feature, Namespace) that specify the desired VW conversion.

These types make extensive use of a class Col that refers to a given column in the user-specified dataframe.

A simpler interface, DFtoVW.from_colnames, can also be used for simple use cases. Its main benefit is that the user does not need to use the specific types.


Below are some example uses of this class. They all rely on the following pandas.DataFrame, called df:

  house_id  need_new_roof  price  sqft   age  year_built
0      id1              0   0.23  0.25  0.05        2006
1      id2              1   0.18  0.15  0.35        1976
2      id3              0   0.53  0.32  0.87        1924
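
For readers who want to follow along, the dataframe above can be rebuilt as follows. This is a minimal sketch that copies the values from the table; the only dependency is pandas.

```python
import pandas as pd

# Rebuild the example dataframe shown in the table above.
df = pd.DataFrame(
    {
        "house_id": ["id1", "id2", "id3"],
        "need_new_roof": [0, 1, 0],
        "price": [0.23, 0.18, 0.53],
        "sqft": [0.25, 0.15, 0.32],
        "age": [0.05, 0.35, 0.87],
        "year_built": [2006, 1976, 1924],
    }
)
print(df)
```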

2. Simple usage using DFtoVW.from_colnames

Let's say we want to build a VW dataset with the target need_new_roof and the features age and year_built:

from vowpalwabbit.pyvw import DFtoVW
conv = DFtoVW.from_colnames(y="need_new_roof", x=["age", "year_built"], df=df)

Then we can use the method process_df:

conv.process_df()

that outputs the following list:

['0 | 0.05 2006', '1 | 0.35 1976', '0 | 0.87 1924']

This list can then directly be consumed by the method pyvw.model.learn.
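
To make the output format concrete, the same strings can be reproduced with plain pandas. The helper simple_vw_lines below is hypothetical and only illustrates the "label | feature values" layout of each line; it is not how DFtoVW is implemented, and it assumes every column converts cleanly to a string.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "need_new_roof": [0, 1, 0],
        "age": [0.05, 0.35, 0.87],
        "year_built": [2006, 1976, 1924],
    }
)


def simple_vw_lines(df, y, x):
    """Hypothetical stand-in that mimics the layout of the list above.

    Each line is the label, a pipe, then the feature values separated
    by spaces. Columns are converted one at a time so that integer
    columns keep their integer formatting.
    """
    features = df[x[0]].astype(str)
    for col in x[1:]:
        features = features + " " + df[col].astype(str)
    return (df[y].astype(str) + " | " + features).tolist()


lines = simple_vw_lines(df, y="need_new_roof", x=["age", "year_built"])
print(lines)
```

Working column by column rather than row by row avoids the type upcasting that a row-wise iteration would apply to mixed integer and float columns.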

3. Advanced usage with the default constructor

The class DFtoVW also allows the following patterns in its default constructor:

  • tag
  • (named) namespaces, with scaling factor
  • (named) features, with constant feature possible

To use these more complex patterns, we need to import the corresponding types:

from vowpalwabbit.pyvw import SimpleLabel, Namespace, Feature, Col

3.1. Named namespace with scaling, and named feature

Let's create a VW dataset that includes a named namespace (with scaling) and a named feature:

conv = DFtoVW(
        df=df,
        label=SimpleLabel(Col("need_new_roof")),
        namespaces=Namespace(name="Imperial", value=0.092, features=Feature(value=Col("sqft"), name="sqm"))
        )
conv.process_df()

which yields:

['0 |Imperial:0.092 sqm:0.25',
 '1 |Imperial:0.092 sqm:0.15',
 '0 |Imperial:0.092 sqm:0.32']

3.2. Multiple namespaces, multiple features, and tag

Let's create a more complex example with a tag and multiple namespaces, each with one or more features:

conv = DFtoVW(
        df=df, 
        label=SimpleLabel(Col("need_new_roof")),
        tag=Col("house_id"),
        namespaces=[
                Namespace(name="Imperial", value=0.092, features=Feature(value=Col("sqft"), name="sqm")),
                Namespace(name="DoubleIt", value=2, features=[Feature(value=Col("price")), Feature(Col("age"))])
                ]
        )
conv.process_df()

which yields:

['0 id1|Imperial:0.092 sqm:0.25 |DoubleIt:2 0.23 0.05',
 '1 id2|Imperial:0.092 sqm:0.15 |DoubleIt:2 0.18 0.35',
 '0 id3|Imperial:0.092 sqm:0.32 |DoubleIt:2 0.53 0.87']
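
The layout of these lines (label, tag, then one "|name:scale" group per namespace) can be sketched in pure Python. The helpers format_namespace and format_line are hypothetical names introduced only for illustration; they mirror the strings shown above and are not part of vowpalwabbit.

```python
def format_namespace(name, scale, features):
    """Format one namespace group.

    `features` is a list of (feature_name, value) pairs; a feature_name
    of None means the value is written without a name.
    """
    parts = [
        f"{value}" if fname is None else f"{fname}:{value}"
        for fname, value in features
    ]
    return f"|{name}:{scale} " + " ".join(parts)


def format_line(label, tag, namespaces):
    """Join the label, the tag, and the namespace groups into one line."""
    groups = " ".join(format_namespace(*ns) for ns in namespaces)
    # The tag is written directly before the first pipe, with no space.
    return f"{label} {tag}{groups}"


# First row of the example dataframe.
line = format_line(
    label=0,
    tag="id1",
    namespaces=[
        ("Imperial", 0.092, [("sqm", 0.25)]),
        ("DoubleIt", 2, [(None, 0.23), (None, 0.05)]),
    ],
)
print(line)
```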

4. Implementation details

  • The class DFtoVW and the specific types are located in vowpalwabbit/pyvw.py. The class only depends on the pandas module.
  • The code includes docstrings.
  • 8 tests are included in tests/test_pyvw.py

5. Extensions

  • This framework does not yet handle multiline examples or more complex label types.
  • To convert a very large dataset that cannot fit in RAM, one can use the pandas import option chunksize and process one chunk at a time. This functionality could be implemented directly in the class using a generator. The generator would then either be consumed by a VW learning interface or be written to an external file (for conversion purposes only).
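
The chunked conversion idea could look roughly like the sketch below. It is not part of the class: iter_vw_lines and to_vw_lines are hypothetical names, and to_vw_lines is a plain-pandas stand-in for whatever per-chunk conversion is used (for example a DFtoVW instance built on the chunk). The CSV is held in memory here only to keep the example self-contained.

```python
import io

import pandas as pd


def to_vw_lines(chunk):
    """Hypothetical stand-in for converting one chunk to VW lines."""
    return (
        chunk["need_new_roof"].astype(str) + " | " + chunk["age"].astype(str)
    ).tolist()


def iter_vw_lines(csv_source, chunksize):
    """Yield VW-formatted lines while holding only one chunk in memory."""
    reader = pd.read_csv(
        csv_source,
        chunksize=chunksize,
        # Parse floats exactly so they print as written in the file.
        float_precision="round_trip",
    )
    for chunk in reader:
        yield from to_vw_lines(chunk)


csv_data = io.StringIO(
    "house_id,need_new_roof,age\n"
    "id1,0,0.05\n"
    "id2,1,0.35\n"
    "id3,0,0.87\n"
)

# The generator can be consumed lazily, e.g. by a learning loop or a file writer.
lines = list(iter_vw_lines(csv_data, chunksize=2))
print(lines)
```

Because the function is a generator, a consumer can pass each line to a learner or write it to a file as it arrives, without materialising the whole converted dataset.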