
Modify Tybalt to handle missing values for incomplete data #156

Open
yagmuronay opened this issue May 24, 2021 · 4 comments

@yagmuronay

Dear Dr. Greene,

Firstly, thank you very much for this well-documented research. This is not a bug report but a feature I need for my thesis: I would like to modify the model so that it can handle missing values in the training and test data, because the data set I want to use contains missing values in every sample. Imputing the missing values before training, e.g. with zeros, is not an option, as this would introduce bias and the value zero already has a meaning in our data set. I therefore started by replacing the reconstruction error with a custom binary cross-entropy function that can handle missing values. I tested this function in a separate notebook and it seemed to work. However, I observe NaN loss values during training, even when using only the reconstruction error. If you have any tips on how to handle missing values in the model, I would be very grateful for your help.

Kind regards,
Yagmur Onay

@cgreene

cgreene commented May 24, 2021

Hi Yagmur,

Do you have details on how you are implementing this? In prior work with a different architecture, we used:

In the event of missing data, the cost calculation was modified to exclude missing data from contributing to the reconstruction cost. A missingness vector m was created for each input vector, with a value of 1 where the data is present and 0 when the data is missing. Both the input sample x and reconstruction z were multiplied by m and the cross entropy error was divided by the sum of the m, the number of non-missing features to get the average cost per feature present (Formula 4). This allowed the DA to learn the structure of the data from present features rather than imputation.

https://www.biorxiv.org/content/10.1101/039800v1.full
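
For illustration, a minimal NumPy sketch of that masked cost (this is not the actual code from the linked paper; the epsilon term is only for numerical stability):

import numpy as np

def masked_reconstruction_cost(x, z, m, eps=1e-7):
    # x: input vector, z: reconstruction, m: 0/1 missingness vector (1 = feature present)
    x, z = x * m, z * m                                # zero out missing entries
    bce = -(x * np.log(z + eps) + (1 - x) * np.log(1 - z + eps))
    return np.sum(bce * m) / np.sum(m)                 # average cost per present feature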

I think we would need more details to provide any guidance.

@yagmuronay

yagmuronay commented May 25, 2021

Dear Dr. Greene,

Thank you for your reply and the paper. I see that in the paper the corrupted values are masked with zeros. In my case, the original data set may itself contain zeros, and missing values are represented separately as numpy.nan, so I cannot simply overwrite the missing values with zeros during preprocessing. I believe that replacing the missing values (numpy.nan) with any value would affect the binary cross-entropy loss, even if I multiply the input vector and the reconstruction by the "missingness vector" m at the end. Please correct me if I am wrong. Instead, I need to omit these values when calculating the loss.

Therefore, what I need is a loss function that creates a mask for the missing values in the original data and applies it to the original and predicted values before the calculation. To sum up, the pipeline we have in mind is as follows:

1- Get the original data, which may already contain missing values (numpy.nan), and preprocess it while omitting the missing values
2- Introduce further missing values into the data at random (e.g. 10% in total; see the sketch after this list)
3- Modify the loss function of Tybalt, defined in the CustomVariationalLayer, i.e. modify the vae_loss() steps as follows:
a) Create a boolean mask marking where the original values are missing*
b) Mask the original and the predicted data with this mask
c) Calculate the cost with the masked input and reconstruction vectors
(*We could extend the mask so that the loss is calculated only on the corrupted values, i.e. values that are not missing in the original data but missing in the training data, to focus on missing-value imputation. This should be easy once the loss function is ready.)
4- Train the model with the corrupted data using the modified loss function
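
As an illustration of step 2, a minimal sketch of how the extra missing values could be introduced (a hypothetical helper, not existing Tybalt code; the 10% rate is just an example):

import numpy as np

def corrupt_at_random(data, missing_fraction=0.10, seed=0):
    # Return a copy of `data` with an extra `missing_fraction` of the entries set to np.nan,
    # chosen uniformly at random among the originally observed values
    rng = np.random.default_rng(seed)
    corrupted = data.astype(float).copy()
    observed = np.flatnonzero(~np.isnan(corrupted))       # flat indices of present values
    n_corrupt = min(int(missing_fraction * corrupted.size), observed.size)
    chosen = rng.choice(observed, size=n_corrupt, replace=False)
    corrupted.ravel()[chosen] = np.nan                    # introduce the new NaNs
    return corrupted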

As far as I can tell, I only need to modify the reconstruction error, keras.metrics.binary_crossentropy(), and not the KL term to achieve this. Therefore I have been working on a custom binary cross-entropy function that masks the original and predicted values where the original data is missing:

import tensorflow as tf
from keras import backend as K  # or tensorflow.keras, depending on the setup

def custom_binary_crossentropy(y_true, y_pred):
    # Boolean mask that is True where the original data is NOT missing
    y_true_not_nan_mask = tf.logical_not(tf.math.is_nan(y_true))
    # Keep only the non-missing entries of the original data
    y_true_masked = tf.boolean_mask(y_true, mask=y_true_not_nan_mask)
    # Keep the corresponding entries of the predicted values
    y_pred_masked = tf.boolean_mask(y_pred, mask=y_true_not_nan_mask)

    # Binary cross-entropy (bce) on the masked values
    term_0 = (1 - y_true_masked) * K.log(1 - y_pred_masked + K.epsilon())  # cancels out when the target is 1
    term_1 = y_true_masked * K.log(y_pred_masked + K.epsilon())            # cancels out when the target is 0
    cross_entropy_loss = -(term_0 + term_1)

    # Mean bce over the non-missing entries only (the missing ones were already dropped by boolean_mask)
    masked_mean_bce_loss = tf.reduce_mean(cross_entropy_loss)

    return masked_mean_bce_loss

After testing this function with the variables in the code snippet below, the loss is 0.659456. However, when I use it instead of keras.metrics.binary_crossentropy(), the loss graph is empty and both axes show unexpected tick values (-0.04, -0.02, 0, 0.02, 0.04). Do I need to make other modifications to the training pipeline or the model? I am also not sure whether I need to keep the mean calculation at the end of the custom loss function. Is vae_loss() calculated on each sample? Thank you very much for your time and suggestions!

import numpy as np
import tensorflow as tf

y_true = tf.constant([
    [0, 1, np.nan, 0],
    [0, 1, 1, 0],
    [np.nan, 1, np.nan, 0],
    [1, 1, 0, np.nan],
])

y_pred = tf.constant([
    [0.1, 0.7, 0.1, 0.3],
    [0.2, 0.6, 0.1, 0],
    [0.1, 0.9, 0.3, 0.2],
    [0.1, 0.4, 0.4, 0.2],
])

loss = custom_binary_crossentropy(y_true, y_pred)
# eval() needs an active TF1 session; with eager execution (TF2) use loss.numpy() instead
print(loss.eval())
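
For context, this is roughly where I am swapping it in; the structure below only approximates Tybalt's CustomVariationalLayer.vae_loss(), and names such as original_dim, z_mean_encoded and z_log_var_encoded are my assumptions, not verbatim code:

def vae_loss(self, x_input, x_decoded):
    # custom masked reconstruction error instead of metrics.binary_crossentropy()
    reconstruction_loss = original_dim * custom_binary_crossentropy(x_input, x_decoded)
    # KL term left unchanged for now
    kl_loss = -0.5 * K.sum(1 + z_log_var_encoded
                           - K.square(z_mean_encoded)
                           - K.exp(z_log_var_encoded), axis=-1)
    return K.mean(reconstruction_loss + kl_loss)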

@gwaybio

gwaybio commented May 26, 2021

Nice explanation @yagmuronay - a couple quick things to consider:

  1. Have you tried adding axis=-1 to the tf.reduce_mean() call? See the "Creating custom losses" section of https://keras.io/api/losses/ (a rough sketch of a per-sample version follows this list).
  2. Are the weird values a result of replacing the binary_crossentropy call in vae_loss(), i.e. metrics.binary_crossentropy(x_input, x_decoded), with your custom masked loss?
  3. Have you considered adding the mask to the KL divergence term as well? This reports on the distribution of the encoder output - if you're not learning how to handle missingness with the reconstruction term, then I might worry about missingness influencing the KL term disproportionately.
  4. vae_loss() is called on each batch of input data, not on each individual sample.
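
For point 1, a rough, untested sketch (not from the Tybalt codebase) of a masked loss that keeps the batch dimension - tf.boolean_mask flattens the tensor, which makes a per-sample axis=-1 reduction impossible:

import tensorflow as tf
from keras import backend as K  # or tensorflow.keras, depending on your setup

def masked_binary_crossentropy(y_true, y_pred):
    # 1 where the feature is present, 0 where y_true is NaN
    mask = tf.cast(tf.logical_not(tf.math.is_nan(y_true)), y_pred.dtype)
    # replace NaNs so the elementwise log terms stay finite
    y_true_filled = tf.where(tf.math.is_nan(y_true), tf.zeros_like(y_true), y_true)

    bce = -(y_true_filled * K.log(y_pred + K.epsilon()) +
            (1 - y_true_filled) * K.log(1 - y_pred + K.epsilon()))

    # per-sample mean over present features only (result shape: [batch_size])
    return K.sum(bce * mask, axis=-1) / K.maximum(K.sum(mask, axis=-1), 1.0)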

@stale

stale bot commented Jan 9, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 9, 2022