BertGAT-for-Spider-Dataset

Chang Shu, Ruitao Yi, Bo Lun

We use the Spider dataset as our main dataset for the large-scale, cross-domain semantic parsing text-to-SQL task. For this task we propose BertGAT, a novel approach. To build the model, we use Bidirectional Encoder Representations from Transformers (BERT) to obtain pre-trained deep bidirectional representations, in place of traditional bidirectional recurrent neural networks. We fine-tune the pre-trained BERT representations so that a single extra output layer is enough to create state-of-the-art models for a wide range of text-to-SQL tasks. We use a syntax tree network as a tree-based SQL generator, and use Graph Attention Networks (GATs) to learn the features of the syntax tree.
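To make the graph attention step concrete, here is a minimal single-head sketch in plain Python, following the standard GAT formulation (shared projection, LeakyReLU-scored attention, softmax over neighbors). It is not the code in this repository, which relies on DGL and PyTorch; the function names, the toy tree, and the weights are assumptions made up for illustration, and the output nonlinearity is left out for brevity.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def matvec(W, h):
    # W has shape (out_dim, in_dim); h has length in_dim.
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def gat_layer(features, neighbors, W, a):
    """Single-head graph attention over a small graph.

    features:  list of node feature vectors
    neighbors: dict mapping node index -> list of neighbor indices
               (include the node itself for a self-loop)
    W:         shared projection matrix, shape (out_dim, in_dim)
    a:         attention vector of length 2 * out_dim
    Returns (new_features, attention_weights).
    """
    z = [matvec(W, h) for h in features]
    new_features, attention = [], []
    for i in range(len(features)):
        nbrs = neighbors[i]
        # Unnormalized score for edge (i, j): LeakyReLU(a . [z_i || z_j])
        scores = [
            leaky_relu(sum(ak * v for ak, v in zip(a, z[i] + z[j])))
            for j in nbrs
        ]
        # Numerically stable softmax over the neighborhood of node i
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        # Attention-weighted sum of the projected neighbor features
        new_features.append([
            sum(al * z[j][k] for al, j in zip(alphas, nbrs))
            for k in range(len(z[i]))
        ])
        attention.append(alphas)
    return new_features, attention


# Toy syntax tree (hypothetical): node 0 is the root, nodes 1 and 2 are its children.
toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_neighbors = {0: [0, 1, 2], 1: [1, 0], 2: [2, 0]}
identity = [[1.0, 0.0], [0.0, 1.0]]
toy_out, toy_attn = gat_layer(toy_features, toy_neighbors, identity, [0.0, 0.0, 1.0, 0.0])
print(toy_attn[0])
```

With an all-zero attention vector every neighbor receives the same weight, so the layer reduces to neighborhood averaging; a learned vector lets each syntax-tree node weight its parent and children differently.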

Environment Setup

The code uses Python 3.6, DGL 0.5.0, and PyTorch 1.4.0 (GPU build).
Install the Python dependencies: pip install -r requirements.txt

Download Data, Embeddings, Scripts, and Pretrained Models

Download the dataset from the Spider task website (link to be updated), and put tables.json, train.json, and dev.json under the data/ directory.
Download the pretrained GloVe embeddings, and save them as glove/glove.%dB.%dd.txt
Download evaluation.py and process_sql.py from the Spider GitHub page.
Download the preprocessed train/dev datasets and pretrained models from here. The download contains generated_datasets/, with:
- generated_data for the original Spider training datasets; pretrained models can be found at generated_data/saved_models
- generated_data_augment for the original Spider + augmented training datasets; pretrained models can be found at generated_data_augment/saved_models
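The GloVe path above is a printf-style template. As a small sketch, assuming the two %d placeholders stand for the corpus size in billions of tokens and the vector dimension (which matches the naming of the public GloVe releases), the path can be expanded and the plain-text file parsed as follows. The helper names are hypothetical, and the values 42 and 300 are only an example; this README does not say which GloVe release the code expects.

```python
def glove_path(tokens_in_billions, dim, template="glove/glove.%dB.%dd.txt"):
    # Assumption: first placeholder = corpus size in billions of tokens,
    # second placeholder = embedding dimension.
    return template % (tokens_in_billions, dim)


def parse_glove_lines(lines):
    """Parse GloVe text format: one token per line, followed by its floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Illustrative values only.
print(glove_path(42, 300))
print(parse_glove_lines(["the 0.1 -0.2", "cat 1 2"]))
```

For a real file, pass an open file handle to parse_glove_lines, for example `with open(glove_path(42, 300), encoding="utf-8") as f: vectors = parse_glove_lines(f)`.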

Generating Train/dev Data for Modules

You can find the preprocessed train/dev data in generated_datasets/.

To generate the data yourself, update the directories marked TODO in preprocess_train_dev_data.py, and run the following command (with either train or dev) to generate the training files for each module:

python preprocess_train_dev_data.py train|dev

Folder/File Description

data/ contains the raw train/dev/test data and the table file.
generated_datasets/ is described above.
models/ contains the code for each module.
evaluation.py is for evaluation. It uses process_sql.py.
train.py is the main file for training. Use train_all.sh to train all the modules (see below).
test.py is the main file for testing. It uses supermodel.py to call the trained modules and generate SQL queries. In practice, use test_gen.sh to generate SQL queries.
generate_wikisql_augment.py is for cross-domain data augmentation.

Training

Run train_all.sh to train all the modules. It looks like:

python train.py \
    --data_root       path/to/generated_data \
    --save_dir        path/to/save/trained/module \
    --history_type    full|no \
    --table_type      std|no \
    --train_component <module_name> \
    --epoch           <num_of_epochs>
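In the command above, `full|no` and `std|no` mean "pick one of these values". A minimal sketch of how the same invocation could be assembled per module from Python is shown below. The flag names are taken from the command above; the module names in MODULES are placeholders, since the real list lives in train_all.sh and models/, and the paths are examples only.

```python
import shlex

HISTORY_TYPES = ("full", "no")
TABLE_TYPES = ("std", "no")

# Placeholder names only; substitute the modules listed in train_all.sh.
MODULES = ["module_a", "module_b"]


def build_train_command(data_root, save_dir, component, epochs,
                        history_type="full", table_type="std"):
    """Return the train.py invocation as an argument list."""
    if history_type not in HISTORY_TYPES:
        raise ValueError("history_type must be one of %s" % (HISTORY_TYPES,))
    if table_type not in TABLE_TYPES:
        raise ValueError("table_type must be one of %s" % (TABLE_TYPES,))
    return [
        "python", "train.py",
        "--data_root", data_root,
        "--save_dir", save_dir,
        "--history_type", history_type,
        "--table_type", table_type,
        "--train_component", component,
        "--epoch", str(epochs),
    ]


for module in MODULES:
    cmd = build_train_command("path/to/generated_data",
                              "path/to/save/trained/module", module, 10)
    # Print a shell-safe version of the command instead of running it.
    print(" ".join(shlex.quote(part) for part in cmd))
```

To launch training for real, each list can be passed to `subprocess.run(cmd, check=True)` from the repository root.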

Testing

Run test_gen.sh to generate SQL queries. test_gen.sh looks like:

SAVE_PATH=generated_datasets/generated_data/saved_models_hs=full_tbl=std
python test.py \
    --test_data_path  path/to/raw/test/data \
    --models          path/to/trained/module \
    --output_path     path/to/print/generated/SQL \
    --history_type    full|no \
    --table_type      std|no

Evaluation

Follow the general evaluation process described on the Spider GitHub page.
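As a rough sketch of that process: the Spider evaluation script compares a gold file against a file of predicted queries. The code below assumes the gold format described on the Spider GitHub page (one gold query and its db_id per line, separated by a tab), the "query" and "db_id" keys used in the Spider JSON files, and the --gold, --pred, --etype, --db, and --table flags documented there. Check all of these against your copy of evaluation.py; the helper names and every path here are placeholders.

```python
def gold_lines(examples):
    """Turn Spider-style examples into gold-file lines: '<SQL>\t<db_id>'."""
    lines = []
    for ex in examples:
        # Collapse whitespace so each query stays on a single line.
        sql = " ".join(ex["query"].split())
        lines.append("%s\t%s" % (sql, ex["db_id"]))
    return lines


def build_eval_command(gold, pred, db_dir, table_file, etype="match"):
    # Assumption: etype is one of "match", "exec", or "all".
    if etype not in ("match", "exec", "all"):
        raise ValueError("unknown evaluation type: %r" % (etype,))
    return ["python", "evaluation.py",
            "--gold", gold, "--pred", pred,
            "--etype", etype, "--db", db_dir, "--table", table_file]


# Hypothetical example record in the shape of an entry from dev.json.
example = {"db_id": "concert_singer", "query": "SELECT count(*)  FROM singer"}
print(gold_lines([example]))
print(build_eval_command("path/to/gold.sql", "path/to/predicted.sql",
                         "path/to/database", "data/tables.json"))
```

The prediction file is expected to line up with the gold file, one predicted query per line in the same order.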
