Alternative Methods to Breeding Value Prediction in Loblolly Pine


  • Phenotypic variation in forest trees may be partitioned into genetic and environmental components, which are then used to estimate the heritability of a trait as the proportion of total phenotypic variation attributed to genetic variation.

  • Applied tree breeding programs can use matrices of relationships, based either on recorded pedigrees in structured breeding populations or on genotypes of molecular genetic markers, to model genetic covariation among related individuals and predict genetic values for individuals for whom no phenotypic measurements are available.

  • This study tests the hypothesis that genetic covariation among individuals of similar genetic value will be reflected in shared patterns of gene expression. We collected gene expression data by high-throughput sequencing of RNA isolated from pooled seedlings of parents with known genetic value, and compared alternative approaches to data analysis to test this hypothesis.
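The variance partition described in the first point above can be written compactly. This is the standard quantitative-genetics decomposition in its simplest form, assuming no genotype-by-environment interaction or covariance; the symbols are introduced here for illustration and are not taken from the analysis scripts:

$$
\sigma^2_P = \sigma^2_G + \sigma^2_E, \qquad H^2 = \frac{\sigma^2_G}{\sigma^2_P}
$$

Here $\sigma^2_P$ is the total phenotypic variance, $\sigma^2_G$ the genetic variance, $\sigma^2_E$ the environmental variance, and $H^2$ the heritability.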

Background of samples

  • All information about samples is located in the RNA Seq Data repo

Step 1 - Transcript normalization and SNP filtering

The counts were normalized in multiple ways; however, the following approach was used for prediction:

  1. Using technical replicate counts and asreml to normalize for batch, index, lane, and pedigree:


  2. For more normalization schemes using DESeq2, edgeR, and sommer on the biological and technical replicates, see repo folder step3.normalization

  3. SNPs were filtered in multiple ways:


  4. The final data sets used for prediction were restructured to be in identical order:
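As a minimal sketch of what "identical order" means in practice, the snippet below keeps only the families present in every table and emits the rows of each table in the same sorted order. The function name `align_to_common_order`, the family IDs, and the toy values are all hypothetical; the repository's own scripts may do this differently.

```python
def align_to_common_order(*tables):
    """Return the shared family IDs (sorted) and each table's rows in that order.

    Each table is a dict mapping a family ID to that family's record.
    """
    common = set(tables[0])
    for table in tables[1:]:
        common &= set(table)          # keep only IDs present in every table
    order = sorted(common)            # one fixed order shared by all tables
    aligned = [[table[fam] for fam in order] for table in tables]
    return order, aligned


# Hypothetical toy data: transcript counts, SNP genotypes, and phenotypes,
# each stored in a different family order.
counts = {"famB": [5.0, 1.0], "famA": [3.0, 2.0], "famC": [0.0, 4.0]}
snps = {"famC": [0, 1], "famA": [2, 0], "famB": [1, 1]}
phenos = {"famA": 10.2, "famB": 9.7, "famC": 11.1}

order, (count_rows, snp_rows, pheno_rows) = align_to_common_order(counts, snps, phenos)
```

After this step, row `i` of every table refers to the same family, which is what the downstream prediction code relies on.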


Step 2 - Prediction of EW families with LGEP

  • The EW vs. LGEP

Organizing test and train data sets

  • EW and LGEP families were subset into train and test objects


Conduct prediction on EW

Family mean estimates of counts and SNPs were used for prediction with OmicKriging and glmnet (lasso/ridge):

EW predictions

Step 5 - Prediction of 70-fold CV

Construct 70 test groups

Instead of predicting across batches, here we split the complete data set into 7 folds for cross-validation, repeated 10 times, giving 70 test groups. The CV groups were split so that each test fold contained individuals spread across the phenotypic range:

create 70 fold
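One way to build test groups that span the phenotypic range is sketched below. It assumes 56 families (the number implied by the leave-one-out step) and uses rank blocking: families are sorted by phenotype, cut into consecutive blocks of 7, and each block contributes one family to each fold. This blocking scheme, the function name `stratified_folds`, and the toy phenotypes are illustrative assumptions, not necessarily what the linked script does.

```python
import random


def stratified_folds(phenotypes, n_folds=7, n_repeats=10, seed=0):
    """Build n_folds * n_repeats test groups spread across the phenotypic range.

    phenotypes maps a family ID to its phenotypic value.
    """
    rng = random.Random(seed)
    order = sorted(phenotypes, key=phenotypes.get)   # families ranked by phenotype
    groups = []
    for _ in range(n_repeats):
        folds = [[] for _ in range(n_folds)]
        # Walk the ranking in blocks of n_folds neighbours and send each
        # member of a block to a different, randomly chosen fold.
        for start in range(0, len(order), n_folds):
            block = order[start:start + n_folds]
            fold_ids = list(range(n_folds))
            rng.shuffle(fold_ids)
            for fam, fold in zip(block, fold_ids):
                folds[fold].append(fam)
        groups.extend(folds)
    return groups


# Hypothetical phenotypes for 56 families.
phenotypes = {f"fam{i:02d}": float(i) for i in range(56)}
test_groups = stratified_folds(phenotypes)
```

With 56 families, each of the 70 test groups holds 8 families, one from each block of the ranking, and the remaining 48 families form the training set for that group.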

Conduct prediction on each of the test folds using all data


Visualize predictions

70 Fold CV visualization

Step 6 - Prediction using LOO

Each family was predicted in turn from the remaining families, giving a maximum training size of 55 to predict the 56th family, using OmicKriging (OK), lasso, and ridge. Script:

LOO Script
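The leave-one-out loop itself can be sketched as follows. To keep the example self-contained, a single-feature least-squares line stands in for the actual predictors (OmicKriging, lasso, ridge); that substitution, the function names, and the toy data are assumptions made for illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def leave_one_out(xs, ys):
    """Predict each observation from a model trained on all the others."""
    predictions = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]     # the 55 training families
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        predictions.append(intercept + slope * xs[i])   # the held-out family
    return predictions


# Hypothetical data: 56 families, one feature, noise-free linear phenotype.
xs = [float(i) for i in range(56)]
ys = [2.0 * x + 1.0 for x in xs]
loo_predictions = leave_one_out(xs, ys)
```

Swapping `fit_line` for any other fit-and-predict routine leaves the loop unchanged, which is why the same scheme can be run for each of the three methods.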

Visualize prediction of LOO

LOO markdown

The part below is defunct; the scripts are still there but are no longer used.

Estimate anova scores for features using LGEP and then conduct prediction on EW

Generate anova scores using LGEP as training

Using the biological replicate data sets, ANOVA scores were estimated for each feature (SNP/transcript):
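For orientation, a per-feature ANOVA score can be computed as a one-way F statistic. The sketch below assumes that replicates are grouped by family and that the score is the plain F ratio; the grouping, the function name `anova_f`, and the toy values are assumptions, and the defunct scripts may have used a different model or reported p-values instead.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of observations.

    Requires at least two groups and some within-group variation.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


# Hypothetical feature values: three families with three replicates each.
shifted_feature = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
flat_feature = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]

f_shifted = anova_f(shifted_feature)
f_flat = anova_f(flat_feature)
```

A feature whose family means differ gets a larger score than one whose family means coincide, so the scores can be used to rank or threshold features before prediction.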


Conduct prediction on EW

Family mean estimates of counts and SNPs were used for prediction with OmicKriging and glmnet (lasso/ridge):

EW predictions

The part below is defunct; the scripts are still there but are no longer used.

First construct the 70 test groups, estimate ANOVA scores, and then conduct predictions

Conduct prediction on each of the test folds across pvals

Just as when predicting on the EW families, predictions were carried out for each of the 70 unique test groups:

predict 70-fold

Visualize prediction of 70-fold

70-fold-cv markdown


Predicting NCSU Coastal Breeding Values using RNA Seq data






