
Multi-Source-Weak-Supervision

Learning from Multi-Source Weak Supervision for Deep Text Classification

Code

Environment

Python 3.6

The code can be run in either a CPU or GPU environment.

Training the model and making predictions

To run the model, first unzip the dataset file, then run it in either of the following ways:

(Note: due to GitHub space limitations, we include only three of the datasets here. The complete set can be downloaded from: https://drive.google.com/drive/folders/1MJe1BJYNPudfmpFxCeHwYqXMx53Kv4h_?usp=sharing)

(1) python main_conditional_attn.py --ds {$dataset}

(For example: python main_conditional_attn.py --ds imdb)

usage: main_conditional_attn.py [-h] [--pt_file PT_FILE] --ds
                                {youtube,imdb,yelp,agnews,spouse} [--no_cuda]
                                [--fast_mode] [--seed SEED] [--epoch EPOCH]
                                [--lr LR] [--weight_decay WEIGHT_DECAY]
                                [--hidden HIDDEN] [--c2 C2] [--c3 C3]
                                [--k K] [--x0 X0]
                                [--unlabeled_ratio UNLABELED_RATIO]
                                [--log_prefix LOG_PREFIX] [--ft_log FT_LOG]
                                [--n_high_cov N_HIGH_COV]

(2) sh run.sh
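
For option (1), any of the hyperparameters listed in the usage message can also be set on the command line. A minimal illustration, with arbitrary example values (not the defaults):

python main_conditional_attn.py --ds yelp --seed 42 --epoch 50 --lr 0.001 --no_cuda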

The trained model will be stored in the model folder.

The detailed run output can be found in the log_files folder.

The test accuracy can be found in the ft_logs folder.

Dataset

The following datasets are included:

  • agnews
  • imdb
  • spouse
  • yelp
  • youtube

The required data are stored as *.pt files, and each record includes the following fields:

  • the original document text ('text')
  • the extracted pre-trained Transformer feature ('bert_feature')
  • the ground truth label ('label')
  • the annotated noisy labels ('lf')
  • the simple majority voting label of the annotated noisy labels ('major_label'), computed as sketched below
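
For reference, a minimal sketch of simple majority voting over one record's noisy source labels. It assumes abstaining sources are marked with -1, which may differ from the convention actually used in the data (see the rules-noisy-labels folder):

import torch

def majority_vote(lf_labels, abstain=-1):
    # lf_labels: 1-D tensor with one noisy label per labeling source.
    # Drop sources that abstained, then take the most frequent label.
    votes = lf_labels[lf_labels != abstain]
    if votes.numel() == 0:
        return abstain  # no source fired on this record
    return torch.mode(votes).values.item()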

We use a dictionary to store the training, validation, and test data. The splits are kept the same for all the baselines as well.

*_organized_nb.pt

data_dict = {
    'labeled': {
        'text': ...,          # original document text
        'label': ...,         # ground-truth labels
        'major_label': ...,   # majority vote over the noisy labels
        'lf': ...,            # noisy labels from each labeling source
        'bert_feature': ...,  # pre-trained Transformer features
    },
    'unlabeled': {
        'text': ...,
        'label': ...,
        'major_label': ...,
        'lf': ...,
        'bert_feature': ...,
    },
    'test': {
        'text': ...,
        'label': ...,
        'major_label': ...,
        'lf': ...,
        'bert_feature': ...,
    },
    'validation': {
        'text': ...,
        'label': ...,
        'major_label': ...,
        'lf': ...,
        'bert_feature': ...,
    },
}
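
A minimal sketch of loading and inspecting one of these files with torch. The exact file name depends on the dataset; 'imdb_organized_nb.pt' below is assumed from the *_organized_nb.pt pattern:

import torch

# File name assumed from the *_organized_nb.pt pattern above.
data_dict = torch.load('imdb_organized_nb.pt')

labeled = data_dict['labeled']
print(labeled['text'][0])                # original document text
print(labeled['label'][0])               # ground-truth label
print(labeled['lf'][0])                  # noisy labels from each source
print(labeled['major_label'][0])         # majority vote label
print(labeled['bert_feature'][0].shape)  # Transformer feature dimensions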

Labeling sources with rules and annotated labels

We provide the labeling functions and the labeling results for each dataset in the rules-noisy-labels folder. A detailed description is given in the README inside that folder.
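
For illustration, a hypothetical keyword-style labeling function in the spirit of such rules. The label encoding and the rule itself are invented for this sketch; the actual rules are in the rules-noisy-labels folder:

POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1  # illustrative label encoding

def lf_contains_great(text):
    # Fires POSITIVE when the document mentions "great"; otherwise abstains.
    return POSITIVE if 'great' in text.lower() else ABSTAIN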
