rochiecuevas/DS2014PHY
Machine learning using data from the DS2014PHY IRRI rice samples

Background

The data are publicly available in the supplementary section of the article "Multivariate-based classification of predicting cooking quality ideotypes in rice (Oryza sativa L.) indica germplasm" by Rosa Paula Cuevas (me), Cyril John Domingo, and Nese Sreenivasulu (2018), published in Rice 11: 56.

In that paper, R was used for the data analyses. I used Ward's cluster analysis to classify the rice varieties into quality types, then fitted a multinomial logistic regression model to differentiate the quality types based on the non-collinear variables that characterise the rice samples. Finally, a random forest algorithm was applied to determine which variables were most important in classifying the samples.

Further data exploration

I am now exploring the dataset using different approaches implemented in Python. The results, of course, differ from the paper's because here I classify the samples with a deep neural network.

There are 25 continuous variables in the dataset:

| Variable | Meaning | Description |
| --- | --- | --- |
| AC | Amylose content (%) | Predicts hardness and stickiness of cooked rice based on the relative concentration of amylose (starch type with straight chains) |
| GT_DSC | Gelatinisation temperature (ºC) | Indicates the temperature range at which rice begins to cook, based on the melting of amylopectin (crystalline starch type with hyperbranched chains) |
| PC | Protein content (%) | Indicates the relative amount of protein inside the rice endosperm based on Kjeldahl N measurements |
| HRD | Hardness (g) | Force required to bite into a sample, simulated by compression |
| ADH | Adhesiveness (g·sec) | Degree of stickiness of a sample, simulated by the work required to separate the probe from the base platform |
| COH | Cohesiveness | Capacity of a sample to remain intact rather than break during compression |
| SPR | Springiness | Capacity of a sample to return to its original shape after compression |
| SMMAX | Maximum storage modulus (Pa) | Maximum elastic response of a sample (solid-like behaviour) |
| TEMP_SMMAX | Temperature at maximum storage modulus (ºC) | Temperature reading when a sample exhibits maximum solid-like behaviour |
| TD_SMMAX | Tan delta at maximum storage modulus | Ratio of loss to storage modulus at the maximum storage modulus |
| LM_SMMAX | Loss modulus at maximum storage modulus (Pa) | Viscous response of a sample at the maximum storage modulus |
| TEMP_GELPT | Temperature at gel point (ºC) | Temperature reading when the loss and storage moduli are equal (tan delta = 1) |
| TROUGH_SM | Lowest storage modulus (Pa) | Lowest storage modulus value after reaching the maximum |
| SLOPE1_SM | Increasing storage modulus | Measured from the gel point to SMMAX |
| SLOPE2_SM | Decreasing storage modulus | Measured from the highest to the lowest storage modulus after SMMAX |
| SLOPE3_LM | Increasing loss modulus | Measured from the gel point to the maximum loss modulus |
| SLOPE4_LM | Decreasing loss modulus | Measured from the maximum loss modulus until it levels off |
| PV | Peak viscosity (RVU) | Highest viscosity recorded as the sample is cooked |
| TV | Trough viscosity (RVU) | Lowest viscosity recorded while the sample is held at a high temperature |
| FV | Final viscosity (RVU) | Last viscosity reading as the sample is cooled |
| BD | Breakdown (RVU) | Difference between peak viscosity and trough viscosity |
| SB | Setback (RVU) | Difference between final viscosity and peak viscosity |
| LO | Lift-off (RVU) | Difference between final viscosity and trough viscosity |
| PASTEMP_RECALC | Pasting temperature (ºC) | Temperature at which a sample starts thickening as the temperature is increased |
| PT | Pasting time (min) | Time taken to reach peak viscosity |

First, I calculate Pearson correlation coefficients and identify the variable pairs with high coefficients (r > 0.70). From these pairs, I pick variables to exclude from the analysis.
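A minimal sketch of this filtering step with pandas, assuming the data sit in a DataFrame. The column names come from the table above, but the values here are fabricated for illustration, with BD constructed to correlate strongly with PV:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset: real column names, fabricated values.
# BD is constructed to correlate strongly with PV.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(60, 4)), columns=["AC", "PV", "TV", "FV"])
df["BD"] = 0.9 * df["PV"] + 0.1 * rng.normal(size=len(df))

# Absolute Pearson correlation matrix
corr = df.corr(method="pearson").abs()

# Keep only the upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Pairs with |r| > 0.70 are candidates: one variable of each pair is dropped
high_pairs = [
    (a, b, round(upper.loc[a, b], 2))
    for a in upper.index
    for b in upper.columns
    if pd.notna(upper.loc[a, b]) and upper.loc[a, b] > 0.70
]
print(high_pairs)
```

On the real data, the same scan would flag every collinear pair, after which one member of each pair is dropped before clustering.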

Second, I conduct K-means clustering. To determine the number of clusters, I use the elbow method (the sum of squared distances as a function of the number of clusters), the silhouette method, and a dendrogram. These methods indicate that a five-cluster solution is the best; hence, the subsequent deep learning algorithm is based on five clusters.
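The cluster-number search can be sketched with scikit-learn as follows (the dendrogram step is omitted). The data here are a synthetic stand-in generated with five well-separated groups; the real analysis would use the scaled DS2014PHY variables instead:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: five well-separated groups of three variables
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=3 * c, scale=0.5, size=(30, 3)) for c in range(5)])
X = StandardScaler().fit_transform(X)

inertias, silhouettes = {}, {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_                         # elbow method: look for the bend
    silhouettes[k] = silhouette_score(X, km.labels_)  # silhouette: higher is better

best_k = max(silhouettes, key=silhouettes.get)
print(best_k)
```

The elbow curve is read visually (the inertia always decreases as k grows, so one looks for the bend), while the silhouette score gives a single number to maximise; agreement between the two, as here, supports the chosen cluster count.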

Neural networks are programs that learn from observational data by loosely imitating the way the brain's neurons connect and process information. Deep learning, in turn, is a set of techniques for training such networks. I used these techniques to classify the samples.

I divide the samples into a training set and a test set. The data are then scaled so that all variables fall within a comparable range. The input layer is composed of the 18 retained variables, the hidden layer has 100 units with ReLU as the activation function, and the output layer has five units with softmax as its activation function.
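This pipeline can be sketched with scikit-learn's MLPClassifier, which matches the described architecture (one hidden layer of 100 ReLU units, softmax over the classes); the original work may well use a different framework, and the 18-variable data with five labels below are synthetic stand-ins:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in: 18 predictors, five quality-type labels, 40 samples each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=3 * c, scale=1.0, size=(40, 18)) for c in range(5)])
y = np.repeat(np.arange(5), 40)

# Hold out a test set, then scale both splits with statistics from training only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# One hidden layer of 100 ReLU units; the multiclass output applies softmax
clf = MLPClassifier(hidden_layer_sizes=(100,), activation="relu",
                    max_iter=1000, random_state=0).fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(acc)
```

Fitting the scaler on the training split alone, then applying it to the test split, keeps the held-out data from leaking into the model.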

The deep learning model's performance is evaluated on the held-out portion of the DS2014PHY data.
