Confusion about filter_taxa and filter_obs #332

levlitichev · 2022-02-10T19:58:21Z

Hello,

Thank you for developing MetaCoder. I am enjoying the package so far. I am trying to filter my dataset before plotting heat trees, but instead of throwing counts away for low-abundant taxa, I'd like to reassign those counts to the lowest available supertaxa.

Here is my dataset:

library(metacoder)
#> This is metacoder verison 0.3.5 (stable)

(toy.df <- data.frame(
  sample1=c(1,2,4,10,0,1,18,3),
  sample2=c(0,1,0,9,1,0,4,2),
  sample3=c(0,2,3,13,0,1,17,2),
  taxonomy=c("K","K;P1","K;P2","K;P1;C1","K;P1;C2","K;P2;C3","K;P2;C4","K;P3;C5")))
#>   sample1 sample2 sample3 taxonomy
#> 1       1       0       0        K
#> 2       2       1       2     K;P1
#> 3       4       0       3     K;P2
#> 4      10       9      13  K;P1;C1
#> 5       0       1       0  K;P1;C2
#> 6       1       0       1  K;P2;C3
#> 7      18       4      17  K;P2;C4
#> 8       3       2       2  K;P3;C5

So, classes 1 and 2 (C1, C2) belong to phylum 1 (P1); classes 3 and 4 (C3, C4) belong to phylum 2; and class 5 (C5) belongs to phylum 3 (P3). I only want to display classes 1 and 4 (and their supertaxa) in my heat tree. But instead of throwing out counts for the other classes (C2, C3, C5), I want their counts to be assigned to the lowest available supertaxon. This is my desired output:

(desired.df <- data.frame(
  sample1=c(4,2,5,10,18),
  sample2=c(2,2,0,9,4),
  sample3=c(2,2,4,13,17),
  taxonomy=c("K","K;P1","K;P2","K;P1;C1","K;P2;C4")))
#>   sample1 sample2 sample3 taxonomy
#> 1       4       2       2        K
#> 2       2       2       2     K;P1
#> 3       5       0       4     K;P2
#> 4      10       9      13  K;P1;C1
#> 5      18       4      17  K;P2;C4

I want the counts for C2 to be assigned to P1, the counts for C3 to be assigned to P2, and the counts for C5 to be assigned to K. (I feel like this is the best way to simplify my heat tree without throwing away information, but please let me know if I'm thinking about that incorrectly.)

I've tried different things unsuccessfully:

toy.taxmap <- parse_tax_data(toy.df, class_cols = "taxonomy", class_sep = ";")

# wrong because I lose supertaxa of C1 and C4
toy.taxmap %>%
  filter_taxa(taxon_names %in% c("C1", "C4")) %>%
  get_dataset("tax_data")
#> # A tibble: 2 × 5
#>   taxon_id sample1 sample2 sample3 taxonomy
#>   <chr>      <dbl>   <dbl>   <dbl> <chr>   
#> 1 f             10       9      13 K;P1;C1 
#> 2 i             18       4      17 K;P2;C4

# wrong because no filtering seems to occur
toy.taxmap %>%
  filter_taxa(taxon_names %in% c("C1", "C4"), supertaxa=T) %>%
  get_dataset("tax_data")
#> # A tibble: 8 × 5
#>   taxon_id sample1 sample2 sample3 taxonomy
#>   <chr>      <dbl>   <dbl>   <dbl> <chr>   
#> 1 b              1       0       0 K       
#> 2 c              2       1       2 K;P1    
#> 3 d              4       0       3 K;P2    
#> 4 f             10       9      13 K;P1;C1 
#> 5 c              0       1       0 K;P1;C2 
#> 6 d              1       0       1 K;P2;C3 
#> 7 i             18       4      17 K;P2;C4 
#> 8 b              3       2       2 K;P3;C5

# wrong because I lose supertaxa of C1 and C4
toy.taxmap %>%
  filter_obs(data="tax_data", taxonomy %in% c("K;P1;C1","K;P2;C4"),
             drop_taxa=T, reassign_obs=T) %>%
  get_dataset("tax_data")
#> # A tibble: 2 × 5
#>   taxon_id sample1 sample2 sample3 taxonomy
#>   <chr>      <dbl>   <dbl>   <dbl> <chr>   
#> 1 f             10       9      13 K;P1;C1 
#> 2 i             18       4      17 K;P2;C4

I think the FAQs describe a very similar situation to mine, but I wasn't able to use those examples to do what I wanted.

Thanks in advance for your help.

^{Created on 2022-02-10 by the reprex package (v2.0.1)}

zachary-foster · 2022-02-11T22:58:15Z

Thanks for the example code! If you use calc_taxon_abund, this will happen automatically. Right now you are filtering ASV/OTUs (I assume) by taxon, but it does not really make sense to add the counts from one ASV to another, even if that was identified as a supertaxon. Rather, you should add up the counts by taxon and then filter the per-taxon counts. That way, removing the counts for a genus, for example, would not change the counts for the family the genus is a part of, since those counts were already used to make the family-level counts. Anyway, you need per-taxon data to plot with heat_tree and right now it appears that you have per-ASV/organism data. Does that make sense? I can provide an example if needed.

levlitichev · 2022-02-15T21:52:34Z

Thanks for your reply. Aggregating to taxons first and then filtering makes sense. My only reservation is that eventually I want to show relative abundance, but it doesn't make sense to compute relative abundance after aggregating to taxons. Right?

zachary-foster · 2022-02-15T22:16:55Z

You could calculate the relative abundances before aggregating to the taxon level and it will work. You can use calc_obs_props for that. You are right that trying to calculate relative abundance on the taxon counts would not work.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Confusion about filter_taxa and filter_obs #332

Confusion about filter_taxa and filter_obs #332

levlitichev commented Feb 10, 2022

zachary-foster commented Feb 11, 2022

levlitichev commented Feb 15, 2022

zachary-foster commented Feb 15, 2022

Confusion about filter_taxa and filter_obs #332

Confusion about filter_taxa and filter_obs #332

Comments

levlitichev commented Feb 10, 2022

zachary-foster commented Feb 11, 2022

levlitichev commented Feb 15, 2022

zachary-foster commented Feb 15, 2022