Skip to content

Latest commit

History

History

Folders and files

NameName
Last commit message
Last commit date

parent directory

..

Sept 2018 - Jan 2019

Summary

Predict precipitation at ARM TWP-C1 an hour later (tried 6 hours later as well) using in situ measurement data

  • Input data: T_p, rh_p, u_p, v_p, prec_sfc, (t_cos, t_sin)
  • Method: NN (4 hidden layers), RF+NN
  • Problem: it tries to predict most events as no precipitation events
  • Suspect: current input data cannot fully capture the dynamics, and data is imbalanced (precip vs no precip, hour imbalance)

Remarks

  • All .ipynb files are in colab/.
  • From Run 02.*, run version includes a .* at the end.
  • From Run 04.*, run version *.0 is reserved for testground.

Data Extraction

Convert raw data to data of interest in NetCDF

extract_data.py: raw DataSet in .nc / .cdf -> var of interest DataSet in .cdf

Convert data of interest in NetCDF to 2D (flattened) easy-to-read DataFrame-supported .csv

netcdf-flattening.ipynb <- netcdf-flattening.py: var of interest in .cdf -> flattened two-dimensional (pandas) DataFrame in .csv, append the next hour precipitation as labels

netcdf-flattening-6-hour-cumulative-precip.ipynb: ditto, but apeend the next 6-hour cumulative precipitation as labels

RF-1hrlater.ipynb: append RF-predicted class onto dataset in .csv. Because of lack of disk quota, I cannot install more packages in the virtual environment. I have requested for more disk quota. (14 Dec 2018)

Classical Machine Learning

Classification - SVM - RBF kernel

SVM classifies if it is rainy the next hour (the second best classifier)

  1. DATADIR = ARM_1hrlater.csv
  2. Classification Threshold = 0.1
  3. train_size = 0.6
  4. Rainy period ratio = 0.1659/ 0.4869 - blind test accuracy = 0.8341/ 0.5131
  5. test accuracy = 0.8922/ 0.4(bad)
  6. plt.plot = 1D True precipitation plots for both classes separately

SVM classifies if it is rainy the next 6 hours

  1. DATADIR = ARM_6hrcumul.csv
  2. Classification Threshold = 0.3002
  3. train_size = 0.6
  4. Rainy period ratio = 0.4995 - blind test accuracy = 0.5
  5. test accuracy = 0.4672
  6. plt.plot = None

Classification - Random Forest

RF classifies if it is rainy the next hour (the best classifier)

  1. DATADIR = ARM_1hrlater.csv
  2. Classification Threshold = 0.1/ 0/ 0.05
  3. train_size = 0.6
  4. Rainy period ratio = 0.1659/ 0.4869/ 0.3183 - blind test accuracy = 0.8341/ 0.5131/ 0.6817
  5. test accuracy = 0.9/ 0.85/ 0.88 !!!
  6. plt.plot = 1D True precipitation plots for both classes separately

Neural Networks

Abs loss is de-normalized, and is not used as a loss metric. Other regression losses are normalized.

Regression

Na茂ve NN regression

  1. DATADIR = ARM_1hrlater.csv
  2. train_size = 0.75
  3. num_epoch = 100000
  4. n_hid = [n_in = 151, 128, 64, 32, 16, n_out = 1]
  5. run_ID = 01
  6. connections = ['fc'] #, 'bn', 'do'
  7. act_funcs = ['relu', 'leaky_relu']
  8. loss_funcs = ['square', 'quartic']#, 'huber']
  9. learning_rates = [1e-2, 1e-3, 1e-4]#, 1e-5, 1e-6]
  10. plt.plot = True precipitation vs Predicted precipitation
  11. LeakyReLU-sqloss-1e-3 mean abs loss = 1.131 < other config, tends to all collapse to zero due to imbalanced data

NN regression after RF binary classification

  1. DATADIR = ARM_1hrlater_RFclassified.csv; ARM_1hrlater_RFclassified_threshold_0.05.csv
  2. train_size = 0.6 - have to follow RF config in RF-1hrlater.ipynb
  3. num_epoch = 100000
  4. n_hid = [n_in = 151, 128, 64, 32, 16, n_out = 1]
  5. run_ID = 04.1; 05.1
  6. connections = ['fc']#, 'bn', 'do']
  7. act_funcs = ['relu', 'leaky_relu']
  8. loss_funcs = ['square', 'quartic']
  9. learning_rates = [1e-3, 1e-4]
  10. plt.plot = True precipitation vs Predicted precipitation in 2 colours (each for each RF class)
  11. r04.1: threshold = 0, ReLU-sqloss-1e-3 mean abs loss = 0.8082 < other config
  12. r05.1: threshold = 0.05, ReLU-sqloss-1e-3 mean abs loss = 0.9153 < other config

Classification

NN/ Log reg classifies if it is rainy the next 6 hours (overfit, the worst)

  1. DATADIR = ARM_6hrcumul.csv
  2. Classification Threshold = 0.31
  3. train_size = 0.6
  4. n_hid = [n_in = 151, n_out = 1]
  5. num_epoch = 3000
  6. run_ID = 02.0; 02.1; 02.2
  7. connections = ['fc']
  8. act_funcs = ['log_reg']#['leaky_relu','relu']
  9. loss_funcs = ['xent','hinge']#,'square']
  10. learning_rates = [1e-3]#, 1e-5, 1e-6]
  11. plt.plot = True precipitation vs Probability of Raining
  12. r02.0: n_hid = [n_in = 151, 16, 4, n_out = 1] ReLU-Hinge accuracy = 0.5219 > other config
  13. r02.1: n_hid = [n_in = 151, 16, 4, n_out = 1] accuracy < 0.5 sucks
  14. r02.2: Hinge==linSVM accuracy = 0.5525 > 0.5473 = xEnt==LogReg accuracy

Log reg (r03.0)/ linear SVM (r03.0)/ simple 1-hid-layer NN (r03.1) classifies if it is rainy the next hour

  1. DATADIR = ARM_1hrlater.csv
  2. Classification Threshold = 0.1
  3. train_size = 0.6
  4. n_hid = [n_in = 151, (5), n_out = 1] - the hid layer exists in some runs only
  5. num_epoch = 3000
  6. run_ID = 03.0; 03.1
  7. connections = ['fc']
  8. act_funcs = ['leaky_relu','relu'] #'lr-svmlin']
  9. loss_funcs = ['xent','hinge']#,'square'] - xent and hinge corresponds to logistic reg and linear SVM resp. when no hid layers (r03.0)
  10. learning_rates = [1e-3]#, 1e-5, 1e-6]
  11. plt.plot = 1D True precipitation plots for both classes separately
  12. r03.0: Hinge==linSVM accuracy = 0.8938 > 0.8871 = xEnt==LogReg accuracy
  13. r03.1: 1hd-ReLU-Hinge accuracy = 0.8756 > other 1hd config