Dataset Preprocessing #10

Open
hamza13-12 opened this issue Mar 25, 2024 · 3 comments

@hamza13-12

Hello. As far as I understand, you are storing the data in a pandas DataFrame with one column corresponding to EEG signals and the other to text, and then converting the EEG signals to text, correct? Could you elaborate on how you built this dataset format so that others can organize the dataset the same way?

@MikeWangWZHL
Owner

Hi! Sorry, I am not sure what you mean by pandas. The data preprocessing scripts can be found in scripts/prepare_dataset.sh;
for example, util/construct_dataset_mat_to_pickle_v1.py converts the ZuCo v1.0 .mat files into a .pickle file, which holds something like a Python dictionary.
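
For readers unfamiliar with pickle files, here is a minimal sketch of writing a Python dictionary to a .pickle file and reading it back. The dictionary keys, the nesting, and the file name are hypothetical stand-ins; the actual structure written by the repo's script is not shown in this thread and may differ.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for a converted dataset. The keys and nesting here are
# assumptions for illustration, not the layout produced by the repo's script.
example = {
    "subject_01": [
        {"content": "a sample sentence", "eeg": [0.1, 0.2, 0.3]},
    ]
}


def save_pickle(obj, path):
    """Serialize a Python object to a .pickle file."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Load a Python object back from a .pickle file."""
    with open(path, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.pickle"  # hypothetical file name
    save_pickle(example, path)
    loaded = load_pickle(path)

# The loaded object is an ordinary dict that can be inspected key by key.
print(type(loaded).__name__, list(loaded.keys()))
```

Unpickling can execute arbitrary code, so only load .pickle files that you created yourself or that come from a source you trust.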

@hamza13-12
Author

Pandas is a data analysis library in Python used to build dataframes. I was actually asking for instructions on how to build the dataset in a format where one column corresponds to EEG signals and another to text, so that I can create seq2seq models that take EEG as input and generate text.

@hamza13-12
Author

hamza13-12 commented Mar 30, 2024

Actually, I figured it out! After creating train_set and dev_set, I just used this snippet of code:

import pandas as pd

def dataset_to_dataframe(dataset):
    # Initialize lists to hold data
    input_embeddings_list = []
    seq_len_list = []
    input_attn_mask_list = []
    input_attn_mask_invert_list = []
    target_strings_list = []
    sent_level_EEG_list = []
    
    # Iterate through the dataset
    for i in range(len(dataset)):
        input_embeddings, seq_len, input_attn_mask, input_attn_mask_invert, target_string, sent_level_EEG = dataset[i]
        
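        # Convert tensors to numpy arrays; detach() and cpu() keep this working
        # for tensors that track gradients or live on a GPU
        input_embeddings_np = input_embeddings.detach().cpu().numpy()
        input_attn_mask_np = input_attn_mask.detach().cpu().numpy()
        input_attn_mask_invert_np = input_attn_mask_invert.detach().cpu().numpy()
        sent_level_EEG_np = sent_level_EEG.detach().cpu().numpy()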
        
        # Append to lists
        input_embeddings_list.append(input_embeddings_np)
        seq_len_list.append(seq_len)
        input_attn_mask_list.append(input_attn_mask_np)
        input_attn_mask_invert_list.append(input_attn_mask_invert_np)
        target_strings_list.append(target_string)
        sent_level_EEG_list.append(sent_level_EEG_np)
    
    # Create DataFrame
    df = pd.DataFrame({
        'Input Embeddings': input_embeddings_list,
        'Sequence Length': seq_len_list,
        'Input Attention Mask': input_attn_mask_list,
        'Input Attention Mask Invert': input_attn_mask_invert_list,
        'Target String': target_strings_list,
        'Sentence Level EEG': sent_level_EEG_list
    })
    
    return df

# Convert datasets to dataframes
train_df = dataset_to_dataframe(train_set)
dev_df = dataset_to_dataframe(dev_set)
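
As a possible follow-up to the snippet above, the sketch below reduces such a dataframe to the two columns asked about earlier (EEG features and text) and saves it to disk. The demo dataframe, its array shapes, and the file name are made up for illustration. Pickle is used here because each cell holds a NumPy array, which a CSV export would turn into text.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-in for train_df: two rows with made-up shapes and values.
demo_df = pd.DataFrame({
    "Input Embeddings": [
        np.zeros((3, 4), dtype=np.float32),
        np.ones((3, 4), dtype=np.float32),
    ],
    "Sequence Length": [2, 3],
    "Target String": ["first sentence", "second sentence"],
})

# Keep only the EEG features and the text: one column each.
pairs_df = demo_df[["Input Embeddings", "Target String"]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pairs.pkl")  # hypothetical file name
    pairs_df.to_pickle(path)               # preserves the arrays in each cell
    restored = pd.read_pickle(path)

print(restored.shape)  # (2, 2)
```

Each row of the restored dataframe then holds one array of EEG features and its target sentence, which is the pairing a seq2seq training loop would iterate over.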

