Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Color nodes according to correlation between relative abundance and external variables #340

Open
jpcasasruiz opened this issue Jun 29, 2022 · 5 comments

Comments

@jpcasasruiz
Copy link

Hi there,
first of all, thanks for taxa and metacoder. These are extremely useful packages.

I was wondering if it would be possible to color the nodes and edges of a taxonomic tree according the correlation between the relative abundance of each taxa and an external variable (e.g. any environmental variable such as pH or Salinity). Thanks!

@zachary-foster
Copy link
Contributor

Thanks!

Yea, it should be, but there is no function to do that for you yet. You would have to fit a model for each taxon's abundance vs that external variable and add the correlation and p-value as columns in a table with per-taxon data in the taxmap object and then color by the correlation column, while setting all correlations with a p value > 0.05 to 0. Make sense?

@zachary-foster
Copy link
Contributor

oops, did not mean to close.

@zachary-foster
Copy link
Contributor

Apparently Ctrl + Enter closes issues now, haha

@jpcasasruiz
Copy link
Author

Thanks for your fast response!
That is what I had in mind, but I am not familiar enough with the taxmap object as to do it. Let me know if you ever give it a try. I think this would be a cool addition to future versions of metacoder.

@zachary-foster
Copy link
Contributor

Here is a prototype of the technique that could be used to make a function in the future.

Load libraries

library(metacoder)
## This is metacoder version 0.3.6 (stable)
library(tidyr)
library(readr)
library(readxl)
library(dplyr)
## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(purrr)

I will use the TARA expedition dataset since that is the first that
comes to mind that had sample data with a continuous variable
(latitude).

Parsing taxonomic data

The data set at the below URL was downloaded and uncompressed:

http://taraoceans.sb-roscoff.fr/EukDiv/data/Database_W5_OTU_occurences.tsv.zip

raw_data <- readr::read_tsv("data/Database_W5_OTU_occurences.tsv")
obj <- parse_tax_data(raw_data, class_cols = "lineage", class_sep = "\\|", sep_is_regex = TRUE)

Getting sample data

The sample data was downloaded from the URL below:

http://taraoceans.sb-roscoff.fr/EukDiv/data/Database_W1_Sample_parameters.xls

sample_data <- read_excel("data/Database_W1_Sample_parameters.xls")

Caluculate read abundance per taxon

The input data included read abundance for each sample-OTU combination,
but we need the abundances associated with each taxon for graphing and
regression. There will usually be multiple OTUs assigned to the same
taxon, especially for coarse taxonomic ranks (e.g. the root will have
all OTU indexes), so the abundances at those indexes are are summed to
provide the total abundance for each taxon.

obj$data$otu_prop <- calc_obs_props(obj, data = "tax_data", cols = sample_data[["PANGAEA ACCESSION NUMBER"]])
## Calculating proportions from counts for 334 columns for 140707 observations.
obj$data$tax_abund <- calc_taxon_abund(obj, data = "otu_prop",
                                       cols = sample_data[["PANGAEA ACCESSION NUMBER"]])
## Summing per-taxon counts from 334 columns for 9512 taxa

Looking for correlations between latitude and taxon abundance

I will be using simple linear regression to demonstrate how this might
work for plotting with metacoder, but it is likely the linear
regression is not the correct method for this kind of proportional data.
This code does the bare minimum as an example and should not be used as
is for research.
Once I learn what an appropriate method is I will try
to remember to update this post.

The first step is to get a table for each taxon in a format that typical
regression functions like lm expect. This would be a table with one
row per sample and columns for abundance and another for the continuous
variable of interest from the sample metadata. Since the taxonomic
abundance matrix in the taxmap object has the abundance data in rows
and the samples in columns, a single row corresponding to each taxon
must be transposed and combined with the sample IDs from the column
names. This can then be joined with the sample metadata based on sample
IDs to generate the input for lm or similar functions. The output of
lm can then be returned for each taxon and reformatted into a table
with one row per taxon that includes columns for taxon ID, p-value, and
correlation strength. Here is a function to do the test for each taxon
(row):

run_one_test <- function(tax_prop_row) {
  sample_ids <- sample_data[["PANGAEA ACCESSION NUMBER"]]
  props <- unlist(tax_prop_row[1, sample_ids])
  test_data <- tibble(sample_id = names(props), prop = props) %>%
    left_join(sample_data, c("sample_id" = "PANGAEA ACCESSION NUMBER"))
  lm_result = summary(lm(prop ~ `LATITUDE (Decimal Degrees)`, data = test_data))
  output <- tibble(
    taxon_id = tax_prop_row$taxon_id,
    coeff = lm_result$coefficients[2, 1],
    pvalue = lm_result$coefficients[2, 4]
  )
  return(output)
}

And here is how to run that function for each row and format the results
as a table:

obj$data$tax_lm <- obj$data$tax_abund %>%
  group_by(taxon_id) %>%
  group_split() %>%
  map_dfr(run_one_test)
obj
## <Taxmap>
##   9512 taxa: aab. Eukaryota ... obw. Metridia+gerlachei
##   9512 edges: NA->aab, NA->aac, NA->aad ... nxm->obv, nom->obw
##   4 data sets:
##     tax_data:
##       # A tibble: 140,707 × 349
##         taxon_id md5sum     cid  totab TARA_…¹ TARA_…² TARA_…³ TARA_…⁴
##         <chr>    <chr>    <dbl>  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##       1 isw      43fc0f… 127368 1.44e8  258322   73563  262269  331687
##       2 cpo      7e5e35…  86387 6.12e7    6476  141987  122993  379254
##       3 deo      fe8c24… 122743 3.56e7  643876      30  877579     355
##       # … with 140,704 more rows, 341 more variables:
##       #   TARA_N000000839 <dbl>, TARA_N000000845 <dbl>,
##       #   TARA_N000000837 <dbl>, TARA_N000000829 <dbl>,
##       #   TARA_N000000833 <dbl>, TARA_N000000841 <dbl>,
##       #   TARA_A100000402 <dbl>, TARA_A100001646 <dbl>,
##       #   TARA_A100000394 <dbl>, TARA_A100000393 <dbl>, …, and
##       #   abbreviated variable names ¹​TARA_N000000077, …
##     otu_prop:
##       # A tibble: 140,707 × 335
##         taxon_id TARA_X00000…¹ TARA_…² TARA_…³ TARA_…⁴ TARA_…⁵ TARA_…⁶
##         <chr>            <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##       1 isw           0.123    3.22e-1 0.0209  0.400   5.89e-1 0.0327 
##       2 cpo           0.00735  1.33e-1 0.210   0.00913 9.90e-2 0.216  
##       3 deo           0.000325 3.07e-4 0.00137 0.175   5.33e-4 0.00230
##       # … with 140,704 more rows, 328 more variables:
##       #   TARA_X000000741 <dbl>, TARA_A200000148 <dbl>,
##       #   TARA_A200000158 <dbl>, TARA_A200000102 <dbl>,
##       #   TARA_A200000170 <dbl>, TARA_A200000110 <dbl>,
##       #   TARA_A200000123 <dbl>, TARA_A200000139 <dbl>,
##       #   TARA_X000001056 <dbl>, TARA_X000001010 <dbl>, …, and
##       #   abbreviated variable names ¹​TARA_X000000361, …
##     tax_abund:
##       # A tibble: 9,512 × 335
##         taxon_id TARA_X00000…¹ TARA_…² TARA_…³ TARA_…⁴ TARA_…⁵ TARA_…⁶
##         <chr>            <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##       1 aab          0.996     9.68e-1  0.733  0.984   9.76e-1 0.811  
##       2 aac          0.00360   3.21e-2  0.222  0.0140  1.13e-2 0.176  
##       3 aad          0.0000433 6.11e-5  0.0298 0.00112 5.63e-4 0.00811
##       # … with 9,509 more rows, 328 more variables:
##       #   TARA_X000000741 <dbl>, TARA_A200000148 <dbl>,
##       #   TARA_A200000158 <dbl>, TARA_A200000102 <dbl>,
##       #   TARA_A200000170 <dbl>, TARA_A200000110 <dbl>,
##       #   TARA_A200000123 <dbl>, TARA_A200000139 <dbl>,
##       #   TARA_X000001056 <dbl>, TARA_X000001010 <dbl>, …, and
##       #   abbreviated variable names ¹​TARA_X000000361, …
##     tax_lm:
##       # A tibble: 9,512 × 3
##         taxon_id       coeff pvalue
##         <chr>          <dbl>  <dbl>
##       1 aab      -0.00000805  0.971
##       2 aac       0.0000573   0.782
##       3 aad      -0.0000372   0.260
##       # … with 9,509 more rows
##   0 functions:

It would also be useful to have the per-taxon mean proportion for
plotting and filtering, so we can add that as another table:

obj$data$tax_mean_prop <- calc_group_mean(obj, data = "tax_abund",
                                          cols = sample_data[["PANGAEA ACCESSION NUMBER"]],
                                          groups = "mean_prop")

Now we have the results of the per-taxon regression in a format that can
be plotted.

obj %>%
  filter_taxa(taxon_names == "Bacteria", subtaxa = TRUE, reassign_obs = FALSE) %>%
  filter_taxa(n_supertaxa < 3, taxon_names != "NA", reassign_obs = FALSE) %>%
  # filter_taxa(mean_prop > 0.00001, reassign_obs = FALSE) %>%
  filter_taxa(pvalue < 0.05, supertaxa = TRUE, reassign_obs = FALSE) %>%
  heat_tree(node_label = taxon_names,
            node_size = mean_prop, 
            node_color = ifelse(pvalue < 0.05, coeff, 0), 
            node_color_interval = c(-0.00001, 0.00001), 
            node_color_range = c("cyan", "gray", "tan"), 
            node_size_range = c(0.01, 0.04),
            node_size_axis_label = "Mean taxon read proportion",
            node_color_axis_label = "Regression coeffecient",
            layout = "davidson-harel",
            initial_layout = "reingold-tilford")

unnamed-chunk-6-1

Its clearly not the best dataset/variable as far as interesting results go, but it still demonstrates the idea.
Once I have a chance to look into which statistical methods would be
most appropriate I might turn this process into a function to make it
easier.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants