
# StackOverflow Dataset Generation

The SQL parser only works in a Python2 environment, while NaturalCC is designed for Python3. We have therefore processed the SQL/C#/Python data in a Python2-based environment and saved the results in `stack_overflow.zip`. If you are interested in the data processing itself, you can follow the original stack_overflow repository.

## Step 1. Download StackOverflow C#/SQL/Python datasets

```shell
bash dataset/stack_overflow/download.sh
```

## Step 2. SQL Generation

1. Flatten SQL code/docstring at `~/stack_overflow/flatten/sql`

   ```shell
   python -m dataset.stack_overflow.flatten -l sql
   ```

2. Decompress the pre-tokenized files into `~/stack_overflow/flatten/sql`

   ```shell
   unzip dataset/stack_overflow/sql_tokens.zip -d ~/stack_overflow/flatten/sql
   ```

3. Binarize the SQL dataset

   ```shell
   python -m dataset.stack_overflow.summarization.preprocess -f config/sql
   ```
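As a minimal sketch of what the flatten and binarize steps pass between them — assuming the flattened output is a pair of line-aligned code and docstring files (the file names and contents below are hypothetical, not the repository's actual layout) — such parallel files can be inspected like this:

```python
import pathlib
import tempfile

# Hypothetical illustration: flattened output is assumed to be two
# line-aligned files, one code snippet and one docstring per line.
tmp = pathlib.Path(tempfile.mkdtemp())
(tmp / "train.code").write_text("SELECT id FROM users\nSELECT name FROM orders\n")
(tmp / "train.docstring").write_text("get all user ids\nget order names\n")

code_lines = (tmp / "train.code").read_text().splitlines()
doc_lines = (tmp / "train.docstring").read_text().splitlines()

# Binarization relies on the two files staying aligned line by line.
assert len(code_lines) == len(doc_lines)

for code, doc in zip(code_lines, doc_lines):
    print(f"{doc!r} -> {code!r}")
```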

## Step 3. C# Generation

1. Install antlr4-python3-runtime

   ```shell
   pip install antlr4-python3-runtime==4.5.2
   ```

2. Flatten C# code/docstring at `~/stack_overflow/flatten/csharp`

   ```shell
   python -m dataset.stack_overflow.flatten -l csharp
   ```

3. Tokenize code/docstring into code_token/docstring_token

   ```shell
   python -m dataset.stack_overflow.tokenization -l csharp
   ```

   Since generating code_token/docstring_token is slow, you can instead decompress the pre-tokenized files into `~/stack_overflow/flatten/csharp`:

   ```shell
   unzip dataset/stack_overflow/csharp_tokens.zip -d ~/stack_overflow/flatten/csharp
   ```

4. Binarize the C# dataset

   ```shell
   python -m dataset.stack_overflow.summarization.preprocess -f config/csharp
   ```
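The tokenization step uses an ANTLR-based lexer to split raw C# code into tokens. The real grammar is far richer, but a rough regex-based stand-in (the function here is hypothetical, not the repository's implementation) illustrates the kind of output stored as code_token:

```python
import re

def naive_code_tokens(code: str) -> list[str]:
    # Simplified stand-in for the ANTLR-based lexer: split the source
    # into identifiers, integer literals, and single punctuation marks.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)

print(naive_code_tokens('Console.WriteLine("hi");'))
```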

## Step 4. Python Generation

1. Flatten Python code/docstring at `~/stack_overflow/flatten/python`

   ```shell
   python -m dataset.stack_overflow.flatten -l python
   ```

2. Tokenize code/docstring into code_token/docstring_token

   ```shell
   python -m dataset.stack_overflow.tokenization -l python
   ```

3. Binarize the Python dataset

   ```shell
   python -m dataset.stack_overflow.summarization.preprocess -f config/python
   ```
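For the Python split, the kind of token stream the tokenization step produces can be approximated with the standard library's `tokenize` module (the repository's script may differ in detail; this is only an illustrative sketch):

```python
import io
import tokenize

def python_code_tokens(code: str) -> list[str]:
    # Lex Python source with the stdlib tokenizer, keeping only tokens
    # with visible surface text (drops NEWLINE, INDENT, ENDMARKER, ...).
    toks = tokenize.generate_tokens(io.StringIO(code).readline)
    return [t.string for t in toks if t.string.strip()]

print(python_code_tokens("def add(a, b):\n    return a + b\n"))
```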