rochiecuevas/DS2014PHY
Machine learning using data from the DS2014PHY IRRI rice samples

Background

The data are publicly available in the supplementary section of the article "Multivariate-based classification of predicting cooking quality ideotypes in rice (Oryza sativa L.) indica germplasm" by Rosa Paula Cuevas (me), Cyril John Domingo, and Nese Sreenivasulu (2018), published in Rice 11: 56.

In that paper, R was used for the data analyses. I used Ward's cluster analysis to classify the rice varieties into quality types, then fitted a multinomial logistic regression model to differentiate the quality types based on the non-collinear variables that characterise the rice samples. Finally, a random forest algorithm was applied to determine which variables were most important in classifying the samples.

Further data exploration

I am now exploring the dataset using different approaches implemented in Python. The results, of course, differ from the paper's because here I classify the samples with a deep neural network.

There are 25 continuous variables in the dataset:

| Variable | Meaning | Description |
| --- | --- | --- |
| AC | Amylose content (%) | Predicts hardness and stickiness of cooked rice based on the relative concentration of amylose (starch type with straight chains) |
| GT_DSC | Gelatinisation temperature (ºC) | Indicates the temperature range at which rice begins to cook, based on the melting of amylopectin (crystalline starch type with hyperbranched chains) |
| PC | Protein content (%) | Indicates the relative amount of protein inside the rice endosperm based on Kjeldahl N measurements |
| HRD | Hardness (g) | Force required to bite into a sample, simulated by compression |
| ADH | Adhesiveness (g·sec) | Degree of stickiness of a sample, simulated by the work required to separate the probe from the base platform |
| COH | Cohesiveness | Capacity of a sample to remain intact rather than break during compression |
| SPR | Springiness | Capacity of a sample to return to its original shape after compression |
| SMMAX | Maximum storage modulus (Pa) | Maximum elastic response of a sample (solid-like behaviour) |
| TEMP_SMMAX | Temperature at maximum storage modulus (ºC) | Temperature reading when a sample exhibits maximum solid-like behaviour |
| TD_SMMAX | Tan delta at maximum storage modulus | Ratio of loss to storage modulus at the maximum storage modulus |
| LM_SMMAX | Loss modulus at maximum storage modulus (Pa) | Viscous response of a sample at the maximum storage modulus |
| TEMP_GELPT | Temperature at gel point (ºC) | Temperature reading when the loss and storage moduli are equal (tan delta = 1) |
| TROUGH_SM | Lowest storage modulus (Pa) | Lowest storage modulus value after reaching the maximum |
| SLOPE1_SM | Increasing storage modulus | Measured from the gel point to SMMAX |
| SLOPE2_SM | Decreasing storage modulus | Measured from the highest to the lowest storage modulus after SMMAX |
| SLOPE3_LM | Increasing loss modulus | Measured from the gel point to the maximum loss modulus |
| SLOPE4_LM | Decreasing loss modulus | Measured from the maximum loss modulus until it levels off |
| PV | Peak viscosity (RVU) | Highest viscosity recorded as the sample is cooked |
| TV | Trough viscosity (RVU) | Lowest viscosity recorded while the sample is held at a high temperature |
| FV | Final viscosity (RVU) | Last viscosity reading as the sample is cooled |
| BD | Breakdown (RVU) | Difference between peak viscosity and trough viscosity |
| SB | Setback (RVU) | Difference between final viscosity and peak viscosity |
| LO | Lift-off (RVU) | Difference between final viscosity and trough viscosity |
| PASTEMP_RECALC | Pasting temperature (ºC) | Temperature at which a sample starts thickening as the temperature is increased |
| PT | Pasting time (min) | Time taken to reach peak viscosity |

First, I calculate Pearson correlation coefficients and identify the variable pairs with high coefficients (r > 0.70). From these pairs, I pick variables to exclude from the analysis.
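A minimal sketch of this filtering step with pandas, assuming the data sit in a DataFrame. The column names come from the table above, but the values here are fabricated for illustration, with BD constructed to correlate strongly with PV:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset: real column names, fabricated values.
# BD is constructed to correlate strongly with PV.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(60, 4)), columns=["AC", "PV", "TV", "FV"])
df["BD"] = 0.9 * df["PV"] + 0.1 * rng.normal(size=len(df))

# Absolute Pearson correlation matrix
corr = df.corr(method="pearson").abs()

# Keep only the upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Pairs with |r| > 0.70 are candidates: one variable of each pair is dropped
high_pairs = [
    (a, b, round(upper.loc[a, b], 2))
    for a in upper.index
    for b in upper.columns
    if pd.notna(upper.loc[a, b]) and upper.loc[a, b] > 0.70
]
print(high_pairs)
```

On the real data, the same scan would flag every collinear pair, after which one member of each pair is dropped before clustering.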

Second, I conduct K-means clustering. To determine the number of clusters, I use the elbow method (the sum of squared distances as a function of the number of clusters), the silhouette method, and a dendrogram. These methods indicate that a five-cluster solution is the best; hence, the subsequent deep learning algorithm is based on five clusters.
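The cluster-number search can be sketched with scikit-learn as follows (the dendrogram step is omitted). The data here are a synthetic stand-in generated with five well-separated groups; the real analysis would use the scaled DS2014PHY variables instead:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: five well-separated groups of three variables
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=3 * c, scale=0.5, size=(30, 3)) for c in range(5)])
X = StandardScaler().fit_transform(X)

inertias, silhouettes = {}, {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_                         # elbow method: look for the bend
    silhouettes[k] = silhouette_score(X, km.labels_)  # silhouette: higher is better

best_k = max(silhouettes, key=silhouettes.get)
print(best_k)
```

The elbow curve is read visually (the inertia always decreases as k grows, so one looks for the bend), while the silhouette score gives a single number to maximise; agreement between the two, as here, supports the chosen cluster count.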

Neural networks are programs that learn from observational data by loosely imitating the way the brain's neurons connect and process information. Deep learning, in turn, is a set of techniques for training such networks. I used these techniques to classify the samples.

I divide the samples into a training set and a test set. The data are then scaled so that all variables fall within a comparable range. The input layer is composed of the 18 retained variables, the hidden layer has 100 units with ReLU as the activation function, and the output layer has five units with softmax as its activation function.
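This pipeline can be sketched with scikit-learn's MLPClassifier, which matches the described architecture (one hidden layer of 100 ReLU units, softmax over the classes); the original work may well use a different framework, and the 18-variable data with five labels below are synthetic stand-ins:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in: 18 predictors, five quality-type labels, 40 samples each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=3 * c, scale=1.0, size=(40, 18)) for c in range(5)])
y = np.repeat(np.arange(5), 40)

# Hold out a test set, then scale both splits with statistics from training only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# One hidden layer of 100 ReLU units; the multiclass output applies softmax
clf = MLPClassifier(hidden_layer_sizes=(100,), activation="relu",
                    max_iter=1000, random_state=0).fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(acc)
```

Fitting the scaler on the training split alone, then applying it to the test split, keeps the held-out data from leaking into the model.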

The deep learning model's performance is evaluated on the held-out portion of the DS2014PHY data.
