Skip to content

A machine learning algorithm based on Gaussian processes for MODIS multilayer cloud and thermodynamic phase classification using CALIOP and CloudSat

License

Notifications You must be signed in to change notification settings

benjamin-hg-marchant/ml-gp-modis-multilayer-clouds

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

A machine learning algorithm based on Gaussian processes for MODIS multilayer cloud and thermodynamic phase classification using CALIOP and CloudSat

Author: Benjamin Marchant
(marchant.benjamin01@gmail.com)
(benjamin.marchant@nasa.gov)
v1.0.0

DOI

This Jupyter notebook presents a machine learning algorithm based on Gaussian processes to detect cloud thermodynamic phase and multilayer clouds for the MODIS (Moderate Resolution Imaging Spectroradiometer) instrument. The algorithm is trained using a co-located dataset between MODIS CALIOP and CloudSat.

Note 1: dataset used to train the algorithm is publicly available on Github.

Note 2: Main author open notebook can be found at the following url.

Aqua

Table of Content:

from sklearn.metrics import accuracy_score, log_loss
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.gaussian_process.kernels import DotProduct, ConstantKernel as C
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.utils import resample
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import roc_auc_score
from sklearn.metrics import log_loss
from pyhdf.SD import SD, SDC 

from pyhdf.SD import SD, SDC 
from matplotlib.pyplot import figure
from random import shuffle 
from scipy.spatial.distance import pdist, cdist

import matplotlib.pyplot as plt
import matplotlib as mpl
import matplotlib.cm as cm
import pandas as pd
import numpy as np
import seaborn as sns; sns.set()
import numpy.ma as ma
import pprint
import warnings

warnings.filterwarnings('ignore')
from IPython.core.display import HTML
HTML("""
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
""")
<style> .output_png { display: table-cell; text-align: center; vertical-align: middle; } </style>

Read MODIS CALIOP CLoudSat Co-located Dataset

To train the algorithm, a co-located MODIS CALIOP and CloudSat Dataset has been created using random sampling of pixels selected during the month of January 2008 (5.000 pixels selected randomly each day -> dataset size 31*5000=155000):

df = pd.read_csv('../data/2008_01_colocated_modis_caliop_cloudsat_random_sample_dataset.csv',index_col=False)

print('dataframe shape',df.shape)
dataframe shape (155000, 113)

The file includes the following information:

column_names_list = df.columns

for name in column_names_list:
    print(name)
modis_multilayer_cloud
modis_multilayer_cloud_qa1
modis_multilayer_cloud_qa2
modis_multilayer_cloud_qa3
modis_multilayer_cloud_qa4
modis_multilayer_cloud_qa5
modis_cloud_phase
latitude
longitude
modis_cloud_top_height_1km
modis_cloud_effective_radius
modis_cloud_effective_radius_16
modis_cloud_effective_radius_37
modis_cloud_optical_thickness
surface_flag
nb_tot_cloud_layers
caliop_1km_nb_cloud_layers
caliop_5km_nb_cloud_layers
cldclass_lidar_nb_cloud_layers
modis_band_1
modis_band_2
modis_band_3
modis_band_4
modis_band_5
modis_band_6
modis_band_7
modis_band_8
modis_band_9
modis_band_10
modis_band_11
modis_band_12
modis_band_13l
modis_band_13h
modis_band_14l
modis_band_14h
modis_band_15
modis_band_16
modis_band_17
modis_band_18
modis_band_19
modis_band_26
modis_band_20
modis_band_21
modis_band_22
modis_band_23
modis_band_24
modis_band_25
modis_band_27
modis_band_28
modis_band_29
modis_band_30
modis_band_31
modis_band_32
modis_band_33
modis_band_34
modis_band_35
modis_band_36
modis_band_1_atm_corr_refl
modis_band_2_atm_corr_refl
modis_band_5_atm_corr_refl
modis_band_6_atm_corr_refl
modis_band_7_atm_corr_refl
modis_band_20_atm_corr_refl
cloud_layer_top_01
cloud_layer_base_01
cloud_layer_optical_depth_01
cloud_layer_phase_01
cloud_layer_source_01
cloud_layer_top_02
cloud_layer_base_02
cloud_layer_optical_depth_02
cloud_layer_phase_02
cloud_layer_source_02
cloud_layer_top_03
cloud_layer_base_03
cloud_layer_optical_depth_03
cloud_layer_phase_03
cloud_layer_source_03
cloud_layer_top_04
cloud_layer_base_04
cloud_layer_optical_depth_04
cloud_layer_phase_04
cloud_layer_source_04
cloud_layer_top_05
cloud_layer_base_05
cloud_layer_optical_depth_05
cloud_layer_phase_05
cloud_layer_source_05
cloud_layer_top_06
cloud_layer_base_06
cloud_layer_optical_depth_06
cloud_layer_phase_06
cloud_layer_source_06
cloud_layer_top_07
cloud_layer_base_07
cloud_layer_optical_depth_07
cloud_layer_phase_07
cloud_layer_source_07
cloud_layer_top_08
cloud_layer_base_08
cloud_layer_optical_depth_08
cloud_layer_phase_08
cloud_layer_source_08
cloud_layer_top_09
cloud_layer_base_09
cloud_layer_optical_depth_09
cloud_layer_phase_09
cloud_layer_source_09
cloud_layer_top_10
cloud_layer_base_10
cloud_layer_optical_depth_10
cloud_layer_phase_10
cloud_layer_source_10
df.head(10)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
modis_multilayer_cloud modis_multilayer_cloud_qa1 modis_multilayer_cloud_qa2 modis_multilayer_cloud_qa3 modis_multilayer_cloud_qa4 modis_multilayer_cloud_qa5 modis_cloud_phase latitude longitude modis_cloud_top_height_1km ... cloud_layer_top_09 cloud_layer_base_09 cloud_layer_optical_depth_09 cloud_layer_phase_09 cloud_layer_source_09 cloud_layer_top_10 cloud_layer_base_10 cloud_layer_optical_depth_10 cloud_layer_phase_10 cloud_layer_source_10
0 5.0 1.0 0.0 0.0 0.0 1.0 2.0 -49.767658 -144.492096 2700.0 ... 0.000000 0.000000 -9999.0 0.0 0.0 0.0 0.0 -9999.0 0.0 0.0
1 1.0 0.0 0.0 0.0 0.0 0.0 2.0 7.892334 87.800476 4750.0 ... 2.049999 1.449999 0.0 3.0 3.0 0.0 0.0 -9999.0 0.0 0.0
2 1.0 0.0 0.0 0.0 0.0 0.0 2.0 -52.644794 29.831434 500.0 ... 0.000000 0.000000 -9999.0 0.0 0.0 0.0 0.0 -9999.0 0.0 0.0
3 1.0 0.0 0.0 0.0 0.0 0.0 2.0 28.681301 -40.599194 50.0 ... 0.000000 0.000000 -9999.0 0.0 0.0 0.0 0.0 -9999.0 0.0 0.0
4 1.0 0.0 0.0 0.0 0.0 0.0 2.0 -67.958023 -82.737038 1250.0 ... 0.000000 0.000000 -9999.0 0.0 0.0 0.0 0.0 -9999.0 0.0 0.0
5 1.0 0.0 0.0 0.0 0.0 0.0 2.0 -60.335007 -138.910355 850.0 ... 0.000000 0.000000 -9999.0 0.0 0.0 0.0 0.0 -9999.0 0.0 0.0
6 8.0 0.0 1.0 1.0 0.0 1.0 3.0 -63.907398 36.836395 7450.0 ... 0.000000 0.000000 -9999.0 0.0 0.0 0.0 0.0 -9999.0 0.0 0.0
7 1.0 0.0 0.0 0.0 0.0 0.0 2.0 -23.554411 -4.174837 0.0 ... 0.000000 0.000000 -9999.0 0.0 0.0 0.0 0.0 -9999.0 0.0 0.0
8 1.0 0.0 0.0 0.0 0.0 0.0 2.0 -16.378271 -30.564116 2800.0 ... 0.000000 0.000000 -9999.0 0.0 0.0 0.0 0.0 -9999.0 0.0 0.0
9 0.0 0.0 0.0 0.0 0.0 0.0 4.0 -6.938173 115.692223 7150.0 ... 0.000000 0.000000 -9999.0 0.0 0.0 0.0 0.0 -9999.0 0.0 0.0

10 rows Ă— 113 columns

Remove rows with no data or clear pixels:

df = df[df.modis_multilayer_cloud != 0]

print('dataframe shape',df.shape)
dataframe shape (111558, 113)
df['latitude'].plot(kind='hist',bins=50)

plt.xlabel('latitude')

plt.show()

png

Remove pixels with latitude > 70:

df = df.loc[ abs( df['latitude'] ) < 70]

print('dataframe shape',df.shape)
dataframe shape (101483, 113)

Gaussian processes are a "Lazy Learning" method (meaning that the training dataset will be used to make each new predictions). So it becomes slower with incrising the training dataset size. To avoid that, let's just take a smaller sample size:

df = df.sample(n=10000, random_state = 42)
#df = df.sample(n=10000)
#df = df.head(n=10000)
df.shape
(10000, 113)

Create cloud phase & multilayer cloud labels

Using the cldclass-lidar product (corresponding to cloud_layer_source = 3) in the dataset, let's define monolayer and multilayer clouds labels that will be used to train the machine learning algorithm:

df['label 1'] = 'unlabeled'
df['label 2'] = 'unlabeled'

#print(df.head())

for index, row in df.iterrows():
    nb_cloud_layer = 0
    cloud_phase_per_layer_list = []
    for layer_idx in range(10):
        if int( row['cloud_layer_source_{:02d}'.format(layer_idx+1)] ) == 3: # cldclass-lidar product
            cloud_phase_per_layer_list.append( int( row['cloud_layer_phase_{:02d}'.format(layer_idx+1)] ) )
            nb_cloud_layer = nb_cloud_layer + 1
    if nb_cloud_layer == 1: # monolayer cloud   
        df.loc[index,'label 1'] = 'monolayer (n=1)'
        if cloud_phase_per_layer_list[0] == 1: df.loc[index,'label 2'] = 'ice'
        if cloud_phase_per_layer_list[0] == 2: df.loc[index,'label 2'] = 'mixed'    
        if cloud_phase_per_layer_list[0] == 3: df.loc[index,'label 2'] = 'liquid'
    if nb_cloud_layer == 2: # multilayer cloud 
        df.loc[index,'label 1'] = 'multilayer (n=2)'
        if cloud_phase_per_layer_list[0] == 1 and cloud_phase_per_layer_list[1] == 1: 
            df.loc[index,'label 2'] = 'ice / ice'
        if cloud_phase_per_layer_list[0] == 1 and cloud_phase_per_layer_list[1] == 3: 
            df.loc[index,'label 2'] = 'ice / liquid'
        if cloud_phase_per_layer_list[0] == 1 and cloud_phase_per_layer_list[1] == 2: 
            df.loc[index,'label 2'] = 'ice / mixed'
        if cloud_phase_per_layer_list[0] == 3 and cloud_phase_per_layer_list[1] == 3: 
            df.loc[index,'label 2'] = 'liquid / liquid'
        if cloud_phase_per_layer_list[0] == 2 and cloud_phase_per_layer_list[1] == 3: 
            df.loc[index,'label 2'] = 'mixed / liquid'    
        if cloud_phase_per_layer_list[0] == 3 and cloud_phase_per_layer_list[1] == 1: 
            df.loc[index,'label 2'] = 'liquid / ice'
    if nb_cloud_layer > 2: # multilayer cloud 
        df.loc[index,'label 1'] = 'multilayer (n>2)'
df['label 1'].value_counts()
monolayer (n=1)     5842
multilayer (n=2)    2661
multilayer (n>2)    1079
unlabeled            418
Name: label 1, dtype: int64
df['label 1'].value_counts().plot(kind='pie', figsize=(8, 8))

plt.show()

png

df['label 1'].value_counts().plot(kind='bar')

plt.show()

png

df['label 2'].value_counts()
liquid             2908
ice                1624
unlabeled          1560
mixed              1310
ice / liquid       1237
ice / mixed         479
ice / ice           452
liquid / liquid     271
mixed / liquid      151
liquid / ice          8
Name: label 2, dtype: int64
df['label 2'].value_counts().plot(kind='bar')

plt.show()

png

df.drop( df[(df['label 1'] == 'unlabeled' ) | (df['label 2'] == 'unlabeled')].index, inplace=True )

print(df.shape)
(8440, 115)
df['label 2'].value_counts().plot(kind='bar')

plt.savefig('output.png', bbox_inches='tight')
plt.show()

png

Data Preparation (scaling and standardization)

Create a function that prepare the data:

def data_preparation(df_train, df_test, max_train_sample_size, label_name, features_train):

    df_train_processed = df_train.copy()
    df_test_processed = df_test.copy()
    
    #----- data class balanced -----#    
    
    print(df_train_processed['label 2'].value_counts())
    
    df_train_processed['label 2'][ df_train_processed['label 2'] != label_name ] = '(not) ' + label_name
 
    df_train_processed['label 2'].value_counts().plot(kind='bar')

    plt.show()  

    #----- data class balanced -----#     
    
    # Separate majority and minority classes
    df_majority = df_train_processed[ df_train_processed['label 2'] != label_name ]
    df_minority = df_train_processed[ df_train_processed['label 2'] == label_name ]
    
    df_minority_len = len(df_minority)
    
    # Downsample majority class
    df_majority_downsampled = resample(df_majority, 
                                     replace=False,    # sample without replacement
                                     n_samples=df_minority_len,     # to match minority class
                                     random_state=123) # reproducible results

    # Combine minority class with downsampled majority class
    df_train_processed = pd.concat([df_majority_downsampled, df_minority])
    
    #print(df_train_processed)
    
    print(df_train_processed['label 2'].value_counts())

    df_train_processed['label 2'].value_counts().plot(kind='bar')

    plt.show() 
    
    #----- data train scaling -----#
    
    X_train = df_train_processed[features_train]

    scaler = preprocessing.StandardScaler().fit(X_train)

    X_train_scaled = scaler.transform(X_train)
    
    df_train_processed[features_train] = X_train_scaled
        
    #----- data test scaling -----#    
    
    X_test_scaled = scaler.transform(df_test_processed[features_train])
    
    df_test_processed[features_train] = X_test_scaled
    
    df_test_processed['label 2'][ df_test_processed['label 2'] != label_name ] = '(not) ' + label_name
    
    #----- data sub sampling -----#
    
    if df_minority_len * 2.0 > max_train_sample_size:

        df_majority = df_train_processed[ df_train_processed['label 2'] != label_name ]
        df_minority = df_train_processed[ df_train_processed['label 2'] == label_name ]    

        df_majority = df_majority.sample(n=int(max_train_sample_size/2.0), random_state=42)
        df_minority = df_minority.sample(n=int(max_train_sample_size/2.0), random_state=42)

        df_train_processed = pd.concat([df_majority, df_minority])

        df_train_processed['label 2'].value_counts().plot(kind='bar')

        plt.show()     

    return scaler, df_train_processed, df_test_processed

Create a learning function

def train_gp_model(label_name, features_train, df_train_processed):
    
    kernel = C(1.0, (0.1, 10.0)) * RBF([0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2], (1e-2, 1e2))

    gp = GaussianProcessClassifier(kernel=kernel, n_restarts_optimizer=20)
    
    gp.fit(df_train_processed[features_train], df_train_processed['label 2'])
    
    return gp

Create an evaluation function

def evaluate_gp_model(label_name, features_train, gp, df_test_processed):
        
    y_true = df_test_processed['label 2']
    y_pred = gp.predict_proba(df_test_processed[features_train])
    
    df_test_processed['label 2'].value_counts().plot(kind='bar')

    plt.show() 

    y_pred_list = []
    y_liq_liq_list = []
    y_liq_notliq_list = []
    y_notliq_liq_list = []
    y_notliq_notliq_list = []
    for i in range(y_pred.shape[0]):
        #print(y_pred[i,0],y_pred[i,1],y_true.iloc[i])
        if y_pred[i,0] >= 0.5:
            y_pred_list.append('(not) ' + label_name)
        else:
            y_pred_list.append(label_name)
        if y_true.iloc[i] == label_name:
            y_liq_liq_list.append(y_pred[i,1])
            y_liq_notliq_list.append(y_pred[i,0])  
        else:
            y_notliq_liq_list.append(y_pred[i,1])
            y_notliq_notliq_list.append(y_pred[i,0])                       
    
    print('accuracy_score:',accuracy_score(y_true, y_pred_list))  
    
    cm_data = confusion_matrix(y_true, y_pred_list) 

    #print(cm_data)
    
    label_list = ['(not) ' + label_name, label_name]
    label_list.sort()

    #df_cm = pd.DataFrame(100.0*cm_data/np.sum(cm_data), index = label_list, columns = label_list)

    #sns.heatmap(df_cm, annot=True)

    #plt.show()
    #plt.close()    
    
    plt.hist(y_liq_liq_list, alpha = 0.5, label=label_name)
    
    plt.hist(y_liq_notliq_list, alpha = 0.5, label='(not) ' + label_name)
    
    plt.title(label_name + ' -- True Label')
    
    plt.xlabel('Predicted Probability')
    plt.legend()
    
    plt.show()   
    
    plt.hist(y_notliq_liq_list, alpha = 0.5, label=label_name)
    
    plt.hist(y_notliq_notliq_list, alpha = 0.5, label='(not) ' + label_name)
    
    plt.title('(not) ' + label_name + ' -- True Label')
    
    plt.xlabel('Predicted Probability')
    
    plt.legend()
    
    plt.show()   
        
    fpr, tpr, _ = roc_curve(y_true, y_pred[:,1], pos_label=label_name)
    
    roc_auc = auc(fpr, tpr)
    plt.title('roc_auc = %0.2f' % roc_auc)
    
    lw = 2
    plt.plot(fpr, tpr, color='darkorange', lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)
    
    plt.show()

    #print('log_loss', log_loss(y_true, y_pred))

    return None

Select MODIS bands that will be used to train the ML model

features_train = ['modis_band_1','modis_band_7','modis_band_20',
                  'modis_band_26','modis_band_28','modis_band_29',
                  'modis_band_31','modis_band_32']

sub_df = df[features_train + ['label 1','label 2']]

Split the dataset between a training and testing dataset:

df_train, df_test = train_test_split(sub_df, test_size = .3, random_state = 42)

print(df_train.shape)
(5908, 10)

Train a model that detect monolayer liquid cloud

%%time

scaler_liquid, df_train_liquid_processed, df_test_liquid_processed = data_preparation(
                                            df_train, 
                                            df_test, 
                                            max_train_sample_size = 100, 
                                            label_name = 'liquid', features_train = features_train)
liquid             2064
ice                1124
mixed               900
ice / liquid        855
ice / mixed         333
ice / ice           331
liquid / liquid     193
mixed / liquid      103
liquid / ice          5
Name: label 2, dtype: int64

png

(not) liquid    2064
liquid          2064
Name: label 2, dtype: int64

png

png

CPU times: user 334 ms, sys: 6.96 ms, total: 341 ms
Wall time: 339 ms
%%time

gp_liquid = train_gp_model('liquid', features_train, df_train_liquid_processed)
CPU times: user 4.8 s, sys: 32.8 ms, total: 4.83 s
Wall time: 2.5 s
%%time

evaluate_gp_model('liquid', features_train, gp_liquid, df_test_liquid_processed)

png

accuracy_score: 0.7729067930489731

png

png

png

CPU times: user 1.01 s, sys: 14.5 ms, total: 1.02 s
Wall time: 817 ms

Train a model that detect monolayer ice cloud

scaler_ice, df_train_ice_processed, df_test_ice_processed = data_preparation(
                                        df_train, 
                                        df_test, 
                                        max_train_sample_size = 100, 
                                        label_name = 'ice', features_train = features_train)
liquid             2064
ice                1124
mixed               900
ice / liquid        855
ice / mixed         333
ice / ice           331
liquid / liquid     193
mixed / liquid      103
liquid / ice          5
Name: label 2, dtype: int64

png

ice          1124
(not) ice    1124
Name: label 2, dtype: int64

png

png

gp_ice = train_gp_model('ice', features_train, df_train_ice_processed)
evaluate_gp_model('ice', features_train, gp_ice, df_test_ice_processed)

png

accuracy_score: 0.7417061611374408

png

png

png

Train a model that detect monolayer mixed cloud

scaler_mixed, df_train_mixed_processed, df_test_mixed_processed = data_preparation(
                                        df_train, 
                                        df_test, 
                                        max_train_sample_size = 100, 
                                        label_name = 'mixed', features_train = features_train)
liquid             2064
ice                1124
mixed               900
ice / liquid        855
ice / mixed         333
ice / ice           331
liquid / liquid     193
mixed / liquid      103
liquid / ice          5
Name: label 2, dtype: int64

png

mixed          900
(not) mixed    900
Name: label 2, dtype: int64

png

png

%%time

features_train = ['modis_band_1','modis_band_7','modis_band_20',
                  'modis_band_26','modis_band_28','modis_band_29',
                  'modis_band_31','modis_band_32']

gp_mixed = train_gp_model('mixed', features_train, df_train_mixed_processed)
CPU times: user 6.12 s, sys: 44.1 ms, total: 6.17 s
Wall time: 3.13 s
evaluate_gp_model('mixed', features_train, gp_mixed, df_test_mixed_processed)

png

accuracy_score: 0.6303317535545023

png

png

png

Train a model that detect multilayer (ice / liquid) cloud

scaler_ice_over_liquid, df_train_ice_over_liquid_processed, df_test_ice_over_liquid_processed = data_preparation(
                                        df_train, 
                                        df_test, 
                                        max_train_sample_size = 500, 
                                        label_name = 'ice / liquid', features_train = features_train)
liquid             2064
ice                1124
mixed               900
ice / liquid        855
ice / mixed         333
ice / ice           331
liquid / liquid     193
mixed / liquid      103
liquid / ice          5
Name: label 2, dtype: int64

png

ice / liquid          855
(not) ice / liquid    855
Name: label 2, dtype: int64

png

png

%%time

features_train = ['modis_band_1','modis_band_7','modis_band_20',
                  'modis_band_26','modis_band_28','modis_band_29',
                  'modis_band_31','modis_band_32']

gp_ice_over_liquid = train_gp_model('ice / liquid', features_train, df_train_ice_over_liquid_processed)
CPU times: user 2min 9s, sys: 8.84 s, total: 2min 18s
Wall time: 1min 9s
evaluate_gp_model('ice / liquid', features_train, gp_ice_over_liquid, df_test_ice_over_liquid_processed)

png

accuracy_score: 0.6184834123222749

png

png

png

Train a model that detect multilayer (ice / mixed) cloud

scaler_ice_over_mixed, df_train_ice_over_mixed_processed, df_test_ice_over_mixed_processed = data_preparation(
                                        df_train, 
                                        df_test, 
                                        max_train_sample_size = 100, 
                                        label_name = 'ice / mixed', features_train = features_train)
liquid             2064
ice                1124
mixed               900
ice / liquid        855
ice / mixed         333
ice / ice           331
liquid / liquid     193
mixed / liquid      103
liquid / ice          5
Name: label 2, dtype: int64

png

(not) ice / mixed    333
ice / mixed          333
Name: label 2, dtype: int64

png

png

%%time

features_train = ['modis_band_1','modis_band_7','modis_band_20',
                  'modis_band_26','modis_band_28','modis_band_29',
                  'modis_band_31','modis_band_32']

gp_ice_over_mixed = train_gp_model('ice / mixed', features_train, df_train_ice_over_mixed_processed)
CPU times: user 5.66 s, sys: 59.6 ms, total: 5.71 s
Wall time: 3.07 s
evaluate_gp_model('ice / mixed', features_train, gp_ice_over_mixed, df_test_ice_over_mixed_processed)

png

accuracy_score: 0.5217219589257504

png

png

png

Train a model that detect multilayer (ice / ice) cloud

scaler_ice_over_ice, df_train_ice_over_ice_processed, df_test_ice_over_ice_processed = data_preparation(
                                        df_train, 
                                        df_test, 
                                        max_train_sample_size = 100, 
                                        label_name = 'ice / ice', features_train = features_train)
liquid             2064
ice                1124
mixed               900
ice / liquid        855
ice / mixed         333
ice / ice           331
liquid / liquid     193
mixed / liquid      103
liquid / ice          5
Name: label 2, dtype: int64

png

ice / ice          331
(not) ice / ice    331
Name: label 2, dtype: int64

png

png

%%time

features_train = ['modis_band_1','modis_band_7','modis_band_20',
                  'modis_band_26','modis_band_28','modis_band_29',
                  'modis_band_31','modis_band_32']

gp_ice_over_ice = train_gp_model('ice / ice', features_train, df_train_ice_over_ice_processed)
CPU times: user 6.48 s, sys: 42.3 ms, total: 6.52 s
Wall time: 3.31 s
evaluate_gp_model('ice / ice', features_train, gp_ice_over_ice, df_test_ice_over_ice_processed)

png

accuracy_score: 0.6212480252764613

png

png

png

Train a model that detect multilayer (liquid / liquid) cloud

scaler_liquid_over_liquid, df_train_liq_over_liq_processed, df_test_liq_over_liq_processed = data_preparation(
                                        df_train, 
                                        df_test, 
                                        max_train_sample_size = 100, 
                                        label_name = 'liquid / liquid', features_train = features_train)
liquid             2064
ice                1124
mixed               900
ice / liquid        855
ice / mixed         333
ice / ice           331
liquid / liquid     193
mixed / liquid      103
liquid / ice          5
Name: label 2, dtype: int64

png

(not) liquid / liquid    193
liquid / liquid          193
Name: label 2, dtype: int64

png

png

%%time

features_train = ['modis_band_1','modis_band_7','modis_band_20',
                  'modis_band_26','modis_band_28','modis_band_29',
                  'modis_band_31','modis_band_32']

gp_liq_over_liq = train_gp_model('liquid / liquid', features_train, df_train_liq_over_liq_processed)
CPU times: user 6.3 s, sys: 47.8 ms, total: 6.35 s
Wall time: 3.23 s
evaluate_gp_model('liquid / liquid', features_train, gp_liq_over_liq, df_test_liq_over_liq_processed)

png

accuracy_score: 0.6015007898894155

png

png

png

How to apply a machine learning model to a single MODIS granule ?

In this section, we will see how to develop an algorihtm to apply previous ML models (based on Gaussian processes) to a single MODIS granule. For that we will use the following products:

  • MODIS MYD021KM L1
  • MODIS MYD06 L2 Cloud Mask (to select only cloudy pixels to apply ML models)
  • MODIS MYD06 L2 Cloud Phase (only to make some comparisons with the ML models)
  • MODIS MYD06 L2 Multilayer Clouds (only to make some comparisons with the ML models)
myd021km_file = SD('../data/MYD021KM.A2008015.1435.006.2012066180438.hdf', SDC.READ)
myd06_file = SD('../data/MYD06_L2.A2008015.1435.006.2013342100940.hdf', SDC.READ)

Create a function to plot MODIS MYD06 L2 Cloud Mask:

def bits_stripping(bit_start,bit_count,value):
    bitmask=pow(2,bit_start+bit_count)-1
    return np.right_shift(np.bitwise_and(value,bitmask),bit_start)

def plot_MODIS_L2_Cloud_Mask_1km(cloud_mask_flag):
    figure(num=None, figsize=(12, 10), dpi=80, facecolor='w', edgecolor='k')
    #cmap =  [(1.0,1.0,1.0)] + [(1.0, 0.0, 0.0)] + [(65.0/255,105.0/255,225.0/255)] + [(0.0,0.0,0.0)]
    #cmap = sns.mpl_palette("Set1", 4)
    cmap = sns.color_palette("RdBu", n_colors=4)
    cmap = mpl.colors.ListedColormap(cmap)
    bounds = [-0.5, 0.5, 1.5, 2.5, 3.5]
    norm = mpl.colors.BoundaryNorm(bounds, cmap.N)
    img = plt.imshow(np.fliplr(cloud_mask_flag), cmap=cmap, norm=norm,interpolation='none', origin='lower')
    cbar_bounds = [-0.5, 0.5, 1.5, 2.5, 3.5]
    cbar_ticks = [ 0, 1, 2, 3]  
    cbar_labels = ['Confident Cloudy', 'Probably Cloudy','Probably Clear ','Confident Clear']  
    cbar = plt.colorbar(img, cmap=cmap, norm=norm, boundaries=cbar_bounds, ticks=cbar_ticks)
    cbar.ax.set_yticklabels(cbar_labels, fontsize=11)
    plt.title('MODIS MYD06 C6 Cloud Mask 1km', fontsize=11)
    l = [int(i) for i in np.linspace(0,data.shape[1],6)]
    plt.xticks(l, [i for i in reversed(l)], rotation=0, fontsize=11 )
    l = [int(i) for i in np.linspace(0,data.shape[0],9)]
    plt.yticks(l, l, rotation=0, fontsize=11 )
    plt.xticks(fontsize=11)
    plt.yticks(fontsize=11)
    plt.show()
data_selected_id = myd06_file.select('Cloud_Mask_1km')
data = data_selected_id.get()   
cloud_mask_flag = bits_stripping(1,2,data[:,:,0])
plot_MODIS_L2_Cloud_Mask_1km(cloud_mask_flag)  

png

Create a function to plot MODIS MYD06 L2 Cloud Phase:

def plot_MODIS_L2_Cloud_Phase_Optical_Properties(data):
    figure(num=None, figsize=(12, 10), dpi=80, facecolor='w', edgecolor='k')
    cmap = sns.color_palette("RdBu_r", n_colors=4)
    cmap = mpl.colors.ListedColormap(cmap)
    bounds = [-0.5, 1.5, 2.5, 3.5, 4.5]
    norm = mpl.colors.BoundaryNorm(bounds, cmap.N)
    img = plt.imshow(np.fliplr(data), cmap=cmap, norm=norm,interpolation='none', origin='lower')
    cbar_bounds = [0.5,1.5, 2.5, 3.5, 4.5]
    cbar_ticks = [1.0,2.0,3.0,4.0]  
    cbar_labels = ['clear','Liquid','Ice','Und.']  
    cbar = plt.colorbar(img, cmap=cmap, norm=norm, boundaries=cbar_bounds, ticks=cbar_ticks)
    cbar.ax.set_yticklabels(cbar_labels, fontsize=10)
    plt.title('MODIS MYD06 C6 \n Cloud Phase Optical Properties', fontsize=11)
    l = [int(i) for i in np.linspace(0,data.shape[1],6)]
    plt.xticks(l, [i for i in reversed(l)], rotation=0, fontsize=11 )
    l = [int(i) for i in np.linspace(0,data.shape[0],9)]
    plt.yticks(l, l, rotation=0, fontsize=11 )
    plt.xticks(fontsize=11)
    plt.yticks(fontsize=11)
    plt.show()
data_selected_id = myd06_file.select('Cloud_Phase_Optical_Properties')
data = data_selected_id.get()
plot_MODIS_L2_Cloud_Phase_Optical_Properties(data)   

png

Create a function to plot MODIS MYD06 L2 Multilayer Clouds:

def plot_multilayer_clouds(data):
    figure(num=None, figsize=(12, 10), dpi=80, facecolor='w', edgecolor='k')
    cmap = sns.color_palette('RdBu_r', n_colors=10)
    cmap = mpl.colors.ListedColormap(cmap)
    bounds = [-0.5, 1.5,2.5,3.5,4.5,5.5,6.5,7.5,8.5,9.5]
    norm = mpl.colors.BoundaryNorm(bounds, cmap.N)
    img = plt.imshow(np.fliplr(data), cmap=cmap, norm=norm,interpolation='none', origin='lower')
    cbar_bounds = [-0.5,0.5, 1.5,2.5,3.5,4.5,5.5,6.5,7.5,8.5,9.5]
    cbar_ticks = [0,1,2,3,4,5,6,7,8,9]               
    cbar = plt.colorbar(img, cmap=cmap, norm=norm, boundaries=cbar_bounds, ticks=cbar_ticks)
    cbar.ax.set_yticklabels(cbar_ticks, fontsize=10)
    plt.title('MODIS MYD06 C6 \n Multilayer Clouds', fontsize=11)
    l = [int(i) for i in np.linspace(0,data.shape[1],6)]
    plt.xticks(l, [i for i in reversed(l)], rotation=0, fontsize=11 )
    l = [int(i) for i in np.linspace(0,data.shape[0],9)]
    plt.yticks(l, l, rotation=0, fontsize=11 )
    plt.xticks(fontsize=11)
    plt.yticks(fontsize=11)
    plt.show()
data_selected_id = myd06_file.select('Cloud_Multi_Layer_Flag')

data = data_selected_id.get()
data_selected_attributes = data_selected_id.attributes()
plot_multilayer_clouds(data)

png

Read and create a function to plot MODIS MYD021KM L1:

EV_250_Aggr1km_RefSB = myd021km_file.select('EV_250_Aggr1km_RefSB')
EV_500_Aggr1km_RefSB = myd021km_file.select('EV_500_Aggr1km_RefSB')
EV_1KM_RefSB = myd021km_file.select('EV_1KM_RefSB')
EV_1KM_Emissive = myd021km_file.select('EV_1KM_Emissive')

First, lets create a dictionary that will store for each MODIS band the corresponding SDS name and index:

modis_band_dic = {}
#print( EV_250_Aggr1km_RefSB.info() )
EV_250_Aggr1km_RefSB_attributes = EV_250_Aggr1km_RefSB.attributes()
EV_250_Aggr1km_RefSB_scales = EV_250_Aggr1km_RefSB_attributes['reflectance_scales']
EV_250_Aggr1km_RefSB_offsets = EV_250_Aggr1km_RefSB_attributes['reflectance_offsets']
#pprint.pprint(EV_250_Aggr1km_RefSB_attributes )

for idx,i in enumerate(EV_250_Aggr1km_RefSB_attributes['band_names'].split(',')):
    print(idx,i)    
    modis_band_dic[i] = [EV_250_Aggr1km_RefSB,idx]
0 1
1 2
#print( EV_500_Aggr1km_RefSB.info() )
EV_500_Aggr1km_RefSB_attributes = EV_500_Aggr1km_RefSB.attributes()
EV_500_Aggr1km_RefSB_scales = EV_500_Aggr1km_RefSB_attributes['reflectance_scales']
EV_500_Aggr1km_RefSB_offsets = EV_500_Aggr1km_RefSB_attributes['reflectance_offsets']
#pprint.pprint(EV_500_Aggr1km_RefSB_attributes )

for idx,i in enumerate(EV_500_Aggr1km_RefSB_attributes['band_names'].split(',')):
    print(idx,i)
    modis_band_dic[i] = [EV_500_Aggr1km_RefSB,idx]
0 3
1 4
2 5
3 6
4 7
#print( EV_1KM_RefSB.info() )
EV_1KM_RefSB_attributes = EV_1KM_RefSB.attributes()
EV_1KM_RefSB_scales = EV_1KM_RefSB_attributes['reflectance_scales']
EV_1KM_RefSB_offsets = EV_1KM_RefSB_attributes['reflectance_offsets']
#pprint.pprint(EV_1KM_RefSB_attributes )

for idx,i in enumerate(EV_1KM_RefSB_attributes['band_names'].split(',')):
    print(idx,i)
    modis_band_dic[i] = [EV_1KM_RefSB,idx]
0 8
1 9
2 10
3 11
4 12
5 13lo
6 13hi
7 14lo
8 14hi
9 15
10 16
11 17
12 18
13 19
14 26
print( EV_1KM_Emissive.info() )
EV_1KM_Emissive_attributes = EV_1KM_Emissive.attributes()
EV_1KM_Emissive_scales = EV_1KM_Emissive_attributes['radiance_scales']
EV_1KM_Emissive_offsets = EV_1KM_Emissive_attributes['radiance_offsets']
#pprint.pprint(EV_1KM_Emissive_attributes )

for idx,i in enumerate(EV_1KM_Emissive_attributes['band_names'].split(',')):
    print(idx,i)
    modis_band_dic[i] = [EV_1KM_Emissive,idx]
('EV_1KM_Emissive', 3, [16, 2030, 1354], 23, 8)
0 20
1 21
2 22
3 23
4 24
5 25
6 27
7 28
8 29
9 30
10 31
11 32
12 33
13 34
14 35
15 36
for key in modis_band_dic:
    print(key, modis_band_dic[key][0].info()[0], modis_band_dic[key][1])
1 EV_250_Aggr1km_RefSB 0
2 EV_250_Aggr1km_RefSB 1
3 EV_500_Aggr1km_RefSB 0
4 EV_500_Aggr1km_RefSB 1
5 EV_500_Aggr1km_RefSB 2
6 EV_500_Aggr1km_RefSB 3
7 EV_500_Aggr1km_RefSB 4
8 EV_1KM_RefSB 0
9 EV_1KM_RefSB 1
10 EV_1KM_RefSB 2
11 EV_1KM_RefSB 3
12 EV_1KM_RefSB 4
13lo EV_1KM_RefSB 5
13hi EV_1KM_RefSB 6
14lo EV_1KM_RefSB 7
14hi EV_1KM_RefSB 8
15 EV_1KM_RefSB 9
16 EV_1KM_RefSB 10
17 EV_1KM_RefSB 11
18 EV_1KM_RefSB 12
19 EV_1KM_RefSB 13
26 EV_1KM_RefSB 14
20 EV_1KM_Emissive 0
21 EV_1KM_Emissive 1
22 EV_1KM_Emissive 2
23 EV_1KM_Emissive 3
24 EV_1KM_Emissive 4
25 EV_1KM_Emissive 5
27 EV_1KM_Emissive 6
28 EV_1KM_Emissive 7
29 EV_1KM_Emissive 8
30 EV_1KM_Emissive 9
31 EV_1KM_Emissive 10
32 EV_1KM_Emissive 11
33 EV_1KM_Emissive 12
34 EV_1KM_Emissive 13
35 EV_1KM_Emissive 14
36 EV_1KM_Emissive 15
def plot_MODIS_L1(MODIS_band, modis_band_dic):

    data_selected_id = modis_band_dic[str(MODIS_band)][0]
    band_idx = modis_band_dic[str(MODIS_band)][1]
    title = 'MODIS Band' + str(MODIS_band)
    
    figure(num=None, figsize=(12, 10), dpi=80, facecolor='w', edgecolor='k')

    data = data_selected_id.get()
    data_selected_band = data[band_idx,:,:]
    data_selected_attributes = data_selected_id.attributes()
    _FillValue = data_selected_attributes['_FillValue']
    _FillValue = 65528 # warning wrong _FillValue stored in attributes    

    if modis_band_dic[str(MODIS_band)][0].info()[0] == 'EV_1KM_Emissive':
    
        radiance_scales = data_selected_attributes['radiance_scales']
        radiance_offsets = data_selected_attributes['radiance_offsets']
        data_selected_band[ data_selected_band == _FillValue ] = 0.0
        data_selected_band = (data_selected_band - radiance_offsets[band_idx]) * radiance_scales[band_idx]    
    
    else:
    
        reflectance_scales = data_selected_attributes['reflectance_scales']
        data_selected_band[ data_selected_band == _FillValue ] = 0.0
        data_selected_band = data_selected_band * reflectance_scales[band_idx]

    cmap = [(0.0,0.0,0.0)] + [(cm.jet(i)) for i in range(1,256)] 
    cmap = mpl.colors.ListedColormap(cmap)
    img = plt.imshow(np.fliplr(data_selected_band), cmap=cmap,interpolation='none', origin='lower')
    plt.title(title, fontsize=11)
    cbar = plt.colorbar()
    cbar.ax.tick_params(labelsize=8)
    l = [int(i) for i in np.linspace(0,data_selected_band.shape[1],6)]
    plt.xticks(l, [i for i in reversed(l)], rotation=0, fontsize=11 )
    l = [int(i) for i in np.linspace(0,data_selected_band.shape[0],9)]
    plt.yticks(l, l, rotation=0, fontsize=11 )
    plt.xticks(fontsize=11)
    plt.yticks(fontsize=11)    
    plt.show()
    
plot_MODIS_L1(6,modis_band_dic)
plot_MODIS_L1(7,modis_band_dic)
plot_MODIS_L1(28,modis_band_dic)
plot_MODIS_L1(29,modis_band_dic)

png

png

png

png

Now let's see how to apply ML models created previously to all cloudy pixels of a single MODIS granule. For that we will vectorize the MODIS cloudy pixels (instead of creating a for loop on every granule pixels which will be quite slow).

First, let's create a function that select only cloudy or probably cloudy pixels:

A = np.argwhere( cloud_mask_flag < 2 ) # cloudy pixels
row = A[:,0]
col = A[:,1]

def select_cloudy_pixel_only(data):
    data_masked = ma.masked_where( cloud_mask_flag > 1, data)
    return np.ma.compressed(data_masked)

Get the MODIS L1 data:

EV_250_Aggr1km_RefSB_data = EV_250_Aggr1km_RefSB.get()
EV_500_Aggr1km_RefSB_data = EV_500_Aggr1km_RefSB.get()
EV_1KM_RefSB_data = EV_1KM_RefSB.get()
EV_1KM_Emissive_data = EV_1KM_Emissive.get()
def vectorize_modis_l1(features_train, modis_band_dic):
    
    X_list = []
    
    for feature in features_train:
    
        modis_band = feature.replace('modis_band_','')  
        modis_band_idx = modis_band_dic[modis_band][1]
        
        if modis_band_dic[modis_band][0].info()[0] == 'EV_250_Aggr1km_RefSB':
            data_band = EV_250_Aggr1km_RefSB_data[modis_band_idx,:,:]
            data_band = (data_band - EV_250_Aggr1km_RefSB_offsets[modis_band_idx]) * EV_250_Aggr1km_RefSB_scales[modis_band_idx]
  
        if modis_band_dic[modis_band][0].info()[0] == 'EV_500_Aggr1km_RefSB':
            data_band = EV_500_Aggr1km_RefSB_data[modis_band_idx,:,:]
            data_band = (data_band - EV_500_Aggr1km_RefSB_offsets[modis_band_idx]) * EV_500_Aggr1km_RefSB_scales[modis_band_idx]
            
        if modis_band_dic[modis_band][0].info()[0] == 'EV_1KM_RefSB':
            data_band = EV_1KM_RefSB_data[modis_band_idx,:,:]
            data_band = (data_band - EV_1KM_RefSB_scales[modis_band_idx]) * EV_1KM_RefSB_scales[modis_band_idx]
            
        if modis_band_dic[modis_band][0].info()[0] == 'EV_1KM_Emissive':
            data_band = EV_1KM_Emissive_data[modis_band_idx,:,:] 
            data_band = (data_band - EV_1KM_Emissive_offsets[modis_band_idx]) * EV_1KM_Emissive_scales[modis_band_idx]

        X_list.append( select_cloudy_pixel_only(data_band) )
        
    return np.stack(X_list, axis=-1)
features_train = ['modis_band_1','modis_band_7','modis_band_20',
                  'modis_band_26','modis_band_28','modis_band_29',
                  'modis_band_31','modis_band_32']

X = vectorize_modis_l1(features_train, modis_band_dic)

print(X.shape)
(2016074, 8)
X_scaled = scaler_liquid.transform(X)
%%time

gp_liquid_y_prob = gp_liquid.predict_proba(X_scaled)
CPU times: user 8.28 s, sys: 3.01 s, total: 11.3 s
Wall time: 9.97 s
def plot_y_prob(data, title):
    figure(num=None, figsize=(12, 10), dpi=80, facecolor='w', edgecolor='k')
    cmap = sns.color_palette("RdBu_r", n_colors=10)
    cmap = mpl.colors.ListedColormap(cmap)
    img = plt.imshow(np.fliplr(data), cmap=cmap,interpolation='none', origin='lower', vmin=0.0, vmax=1.0)
    cbar = plt.colorbar(img, cmap=cmap)
    plt.title(title, fontsize=16)
    l = [int(i) for i in np.linspace(0,data.shape[1],6)]
    plt.xticks(l, [i for i in reversed(l)], rotation=0, fontsize=11 )
    l = [int(i) for i in np.linspace(0,data.shape[0],9)]
    plt.yticks(l, l, rotation=0, fontsize=11 )
    plt.xticks(fontsize=11)
    plt.yticks(fontsize=11)
    plt.show()
l = ['liquid', '(not) liquid']
l.sort()

print(l)
['(not) liquid', 'liquid']
modis_y_prob = np.zeros((cloud_mask_flag.shape)) 
modis_y_prob[row,col] = gp_liquid_y_prob[:,1]
plot_y_prob(modis_y_prob, 'Probability of liquid clouds')

png

X_scaled = scaler_ice.transform(X)

gp_ice_y_prob = gp_ice.predict_proba(X_scaled)

modis_y_prob = np.zeros((cloud_mask_flag.shape)) 
modis_y_prob[row,col] = gp_ice_y_prob[:,1]

plot_y_prob(modis_y_prob, 'Probability of ice clouds')

png

Conclusions

In this notebook, I summarized my first attempt of using machine learning Gaussian processes to develop a cloud phase and multilayer cloud classification algorithm for MODIS. GP give good results on the training and test datasets. Unfortunetly it appears that it is not the best approach to use in production since Gaussian processes approach becomes very slow when the size of the training dataset increases (which will be necessary if I want to add new input features such the surface type, more MODIS bands, etc). So

What are the pro(s) of Gaussian Processes ?

  • Easy to train (not a lot of hyperparamters)

What are the con(s) of Gaussian Processes ?

  • Very slow to make a new prediction ('lazy learning': use the entire training dataset to make a prediction)

What do we learned ?

Even if GP is not the best suited approach, I learned a bunch of things that will be used in the future notebooks:

  • cloud phase and mutlialyer cloud classification is a multi-label problem not a multi-class problem (1 vs all strategy not 1 vs 1 to train ML models).
  • how to scale and homogenize the data.
  • how to vectorize MODIS cloudy pixels to apply a ML model.

What is the next step ?

  • I will try the same approach but using a deep neural network instead, that will be presented in the next jupyter notebook.