anhquan0412/fastai-tabular-text-demo

Demo for fastai tabular + text databunch with end-to-end classification/regression training

UPDATE: There is a FASTER way to create and train an end-to-end tabular + text regression model WITH BETTER LOSS, using an entirely different approach (it should work for both classification and regression tasks).

This approach reuses the existing tabular databunch and text databunch along with the existing tabular model and text model (AWD_LSTM), so it is shorter to implement. Because all of these are already part of the fastai library, the implementation is less error-prone and stays up to date with the library. You can follow the discussion on the fastai forum.

The main code for this approach is in fastai_tabtext2.py and this notebook. The dataset is from the Mercari Price Kaggle competition.

If you still want to learn about the fastai Datablock API (which can be helpful if you want to create your own data API that differs from tabular or text), you can take a look at the materials below.


Inspired by Wayde Gilliam's detailed blog post on the fastai Datablock API.

  • Includes several new TabularText classes (inheriting from ItemLists, LabelLists and DataBunch) that handle both tabular data (continuous and categorical) and textual data (converted to numerical ids). All preprocessing steps from the tabular processor (FillMissing, Categorify, Normalize) and the text processor (Tokenizer and Numericalize) are included.

  • Combines an RNN model (e.g. AWD LSTM) with a multi-layer perceptron (MLP) head to train on both text and tabular data. The usual training tools from the fastai learner can be used: fit_one_cycle, learning rate schedules, callbacks, training different layer groups with freeze_to (differential learning rates), and so on.
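To make the preprocessing steps named in the first bullet concrete, here is a minimal pure-Python sketch of what each one does conceptually. It does not use fastai: the function names (`fill_missing`, `normalize`, `categorify`, `numericalize`) are hypothetical, the median fill and the reserved code 0 for missing categories are assumptions about typical behaviour, and whitespace splitting stands in for a real tokenizer.

```python
from statistics import mean, median, pstdev

def fill_missing(values):
    """Replace None with the median of the observed values (assumed fill strategy)."""
    med = median(v for v in values if v is not None)
    return [med if v is None else v for v in values]

def normalize(values):
    """Standardize a continuous column to zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + 1e-7) for v in values]

def categorify(values):
    """Map categories to integer codes; 0 is reserved for missing values."""
    cats = sorted({v for v in values if v is not None})
    codes = {cat: i + 1 for i, cat in enumerate(cats)}
    return [codes.get(v, 0) for v in values], codes

def numericalize(tokens, vocab):
    """Map tokens to their index in vocab; unknown tokens map to index 0."""
    stoi = {tok: i for i, tok in enumerate(vocab)}
    return [stoi.get(t, 0) for t in tokens]

# Toy rows: one continuous column, one categorical column, one text column.
cont = normalize(fill_missing([1.0, None, 3.0]))
brand_codes, brand_map = categorify(["nike", None, "adidas"])
text_ids = numericalize("new red shoes".split(), ["xxunk", "xxpad", "new", "shoes"])
```

Each row then ends up as three numeric pieces (category codes, normalized continuous values, token ids), which is the form a combined tabular + text model needs.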

The code that builds the TabularText databunch has been tested on data from the Kaggle PetFinder competition by comparing its output against the output of the Tabular databunch and the Text databunch. The notebook for that test can be found here.

The code that builds the TabularText learner has been used to train on data from the Mercari Price Kaggle competition. The model trained successfully, with the loss decreasing over the epochs, but the results are only average compared to that competition's leaderboard (the reason is not yet clear and needs more testing). Training notebook

The entire source code is in fastai_tab_text.py. You can create a TabularText databunch by following the same Databunch API described in the fastai docs. You can also look at my training notebook for both databunch creation and learner training.

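  • Example of creating a TabularText databunch from a pandas DataFrame (using the Mercari dataset)
from functools import partial
from fastai.tabular import *
from fastai.text import *
from fastai_tab_text import * # TabularText classes from this repo (assumes fastai_tab_text.py is on the import path)

# train_df, test_df, val_idxs, mercari_path and reset_seed are assumed to be defined beforehand (see the training notebook)
cat_names = ['category1','category2','category3','brand_name','shipping'] # categorical
cont_names = list(set(train_df.columns) - set(cat_names) - {'price','text'}) # continuous
dep_var = 'price' # label
procs = [FillMissing, Categorify, Normalize]
txt_cols = ['text'] # text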

def get_tabulartext_databunch(bs=100,val_idxs=val_idxs,path=mercari_path):
    data_lm = load_data(path, 'data_lm.pkl', bs=bs) # data_lm.pkl from mercari-language-model notebook
    collate_fn = partial(mixed_tabular_pad_collate, pad_idx=1, pad_first=True)
    reset_seed()
    return (TabularTextList.from_df(train_df, cat_names, cont_names, txt_cols, vocab=data_lm.vocab, procs=procs, path=path)
                            .split_by_idx(val_idxs)
                            .label_from_df(cols=dep_var)
                            .add_test(TabularTextList.from_df(test_df, cat_names, cont_names, txt_cols,path=path))
                            .databunch(bs=bs,collate_fn=collate_fn, no_check=False))

data = get_tabulartext_databunch(bs=100)
data.show_batch()
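The `collate_fn` above is built with `pad_idx=1` and `pad_first=True`. As a rough illustration of what such a collate step has to do for mixed samples, the sketch below pads the variable-length token ids to a common length and leaves the fixed-width tabular parts alone. It is a plain-Python approximation, not the repo's `mixed_tabular_pad_collate`: the names `pad_text_batch` and `mixed_collate_sketch`, and the assumed sample layout `((cats, conts, text_ids), label)`, are hypothetical.

```python
def pad_text_batch(seqs, pad_idx=1, pad_first=True):
    """Pad every token-id sequence to the length of the longest one."""
    max_len = max(len(s) for s in seqs)
    padded = []
    for s in seqs:
        padding = [pad_idx] * (max_len - len(s))
        # pad_first=True puts padding before the real tokens, so an RNN
        # reads the actual text last.
        padded.append(padding + list(s) if pad_first else list(s) + padding)
    return padded

def mixed_collate_sketch(samples, pad_idx=1, pad_first=True):
    """Group samples shaped ((cats, conts, text_ids), label) into one batch."""
    cats = [x[0] for x, _ in samples]    # fixed width, no padding needed
    conts = [x[1] for x, _ in samples]   # fixed width, no padding needed
    texts = pad_text_batch([x[2] for x, _ in samples], pad_idx, pad_first)
    labels = [y for _, y in samples]
    return (cats, conts, texts), labels

# Two toy samples with text of different lengths.
batch = [
    (([2, 1], [0.5], [5, 6, 7]), 1.2),
    (([1, 1], [-0.5], [8]), 0.3),
]
(cats, conts, texts), labels = mixed_collate_sketch(batch)
```

In a real pipeline each of these lists would be converted to a tensor; only the text part needs padding because the tabular columns have the same width in every row.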
  • Example of creating a TabularText learner and starting one-cycle training (note: this is a regression problem)
encoder_name = 'bs60-awdlstm-enc-stage2' # encoder from mercari-language-model notebook
def get_tabulartext_learner(data,params):
    learn= tabtext_learner(data,AWD_LSTM,metrics=[root_mean_squared_error],
                               callback_fns=[partial(SaveModelCallback, monitor='root_mean_squared_error',mode='min',every='improvement',name='best_nn')],
                               **params)
    learn.load_encoder(encoder_name)
    return learn

params={
    'layers':[500,400,200], # neural network at model's head
    'ps': [0.001,0.,0.], # dropout for NN at model's head
    'bptt':70,
    'max_len':20*70,
    'drop_mult': 1., # drop_mult: multiply to different dropouts in AWD LSTM
    'lin_ftrs': [300], # linear layer to AWD_LSTM output, before combining to embeddings
    'ps_lin_ftrs': [0.], # dropout for this linear layer at AWD_LSTM output
    # set 'lin_ftrs': None if you want AWD LSTM output (1200) to be combined straight to embeddings
    'emb_drop': 0., # embeddings dropout
    'y_range': [0,6], # restrict y range for regression problem
    'use_bn': True,    
}


learn = get_tabulartext_learner(data,params) # the function defined above takes only (data, params)
print(learn.model)

learn.fit_one_cycle(3,max_lr=1e-02,pct_start=0.3,moms=(0.8,0.7))
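To give an intuition for the `fit_one_cycle` arguments used above (`max_lr=1e-02`, `pct_start=0.3`, `moms=(0.8,0.7)`), the following standalone sketch generates a one-cycle style schedule: the learning rate warms up to `max_lr` during the first 30% of steps while momentum drops, then the learning rate anneals down while momentum rises back. The cosine annealing shape and the `div_factor` and `final_div` values are assumptions about the library's defaults, and `one_cycle_schedule` is a hypothetical helper, not a fastai function.

```python
import math

def annealing_cos(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    cos_out = math.cos(math.pi * pct) + 1
    return end + (start - end) / 2 * cos_out

def one_cycle_schedule(n_steps, max_lr=1e-2, pct_start=0.3, moms=(0.8, 0.7),
                       div_factor=25.0, final_div=None):
    """Return per-step learning rates and momentums for one cycle."""
    # Assumed defaults: start at max_lr/25 and finish far below that.
    final_div = div_factor * 1e4 if final_div is None else final_div
    warm = int(n_steps * pct_start)   # steps spent increasing the learning rate
    cool = n_steps - warm             # steps spent decreasing it
    low_lr = max_lr / div_factor
    lrs, ms = [], []
    for i in range(n_steps):
        if i < warm:
            pct = i / warm
            lrs.append(annealing_cos(low_lr, max_lr, pct))
            ms.append(annealing_cos(moms[0], moms[1], pct))
        else:
            pct = (i - warm) / cool
            lrs.append(annealing_cos(max_lr, max_lr / final_div, pct))
            ms.append(annealing_cos(moms[1], moms[0], pct))
    return lrs, ms

# 100 steps in total: 30 warm-up steps followed by 70 annealing steps.
lrs, ms = one_cycle_schedule(100)
```

Plotting `lrs` and `ms` against the step index shows the two curves moving in opposite directions, which is the behaviour the `pct_start` and `moms` arguments control.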

Requirement: fastai version 1.0.51 (including pytorch 1.0). Visit https://github.com/fastai/fastai#installation for more.

Any feedback is welcome! Follow the discussion on fastai forums: https://forums.fast.ai/t/build-mixed-databunch-and-train-end-to-end-model-for-tabular-categorical-continuous-data-and-text-data/
