NCBIAccession collapse + vroom patch
Removed the step in the PREDICT pipeline that collapses the accessions, and re-added it at the very end of the pipeline, where it collapses records that are otherwise identical. Also added a col_type specification for two very sparse columns, PMID and PublicationYear, that vroom seems to read as logical instead of double, which led to the loss of PMID and PublicationYear in the final copy. Users also need to be wary of this when they read the data in.
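For readers loading the data themselves, the following is a minimal sketch of the same precaution, assuming vroom is installed. The toy file and its values are hypothetical stand-ins for the real VIRION files; only the column names PMID and PublicationYear come from the commit. The commit spells the argument `col_type`; the documented name in vroom and readr is `col_types`, and the shorter spelling presumably works through R's partial argument matching.

```r
library(vroom)

# Hypothetical toy file: PMID and PublicationYear are empty in almost every
# row, mimicking the very sparse columns described in the commit message.
tmp <- tempfile(fileext = ".csv")
toy <- data.frame(
  Host = paste0("host", 1:500),
  PMID = c(rep(NA, 499), 12345678),
  PublicationYear = c(rep(NA, 499), 2019)
)
write.csv(toy, tmp, row.names = FALSE)

# Declare the two sparse columns as double instead of relying on type
# guessing; all other columns are still guessed.
virion <- vroom(
  tmp,
  delim = ",",
  col_types = cols(PMID = col_double(), PublicationYear = col_double())
)

str(virion$PMID)
```

With the types declared up front, the populated values at the end of the file are kept as numbers regardless of which rows the reader happens to sample when guessing.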
Colin J. Carlson committed Aug 1, 2021
1 parent 88d3692 commit 240650c
Showing 18 changed files with 43,395 additions and 34,025 deletions.
5 changes: 2 additions & 3 deletions Code/02_2a_Digest PREDICT.R
@@ -21,9 +21,8 @@ predict.raw %>%
 Host = `Species Scientific Name Based on Field Morphology`,
 Accession = `Genbank #`) %>%

-
-group_by_at(vars(-Accession)) %>%
-summarize(Accession = str_c(Accession, collapse = ", ")) %>%
+# group_by_at(vars(-Accession)) %>%
+# summarize(Accession = str_c(Accession, collapse = ", ")) %>%
 unique() %>%

 # The below step deals with flagged host names and "cf." names equally
4 changes: 2 additions & 2 deletions Code/02_2c_Digest PREDICT PCR.R
@@ -42,8 +42,8 @@ predict.raw %<>% select(Host,
 rename(NCBIAccession = "GenbankAccessionNumber") %>%

 # Collapse the Genbank info
-group_by_at(vars(-NCBIAccession)) %>%
-summarize(NCBIAccession = str_c(NCBIAccession, collapse = ", ")) %>%
+# group_by_at(vars(-NCBIAccession)) %>%
+# summarize(NCBIAccession = str_c(NCBIAccession, collapse = ", ")) %>%
 unique() %>%

 # Clean up the host info
8 changes: 4 additions & 4 deletions Code/03_Merge clean files.R
@@ -5,10 +5,10 @@ library(tidyverse); library(magrittr); library(vroom)

 source("./Code/001_TaxizeFunctions.R")

-gb <- vroom("Intermediate/Formatted/GenbankFormatted.csv.gz")
-clo <- read_csv("Intermediate/Formatted/CloverFormatted.csv")
-pred <- read_csv("Intermediate/Formatted/PREDICTAllFormatted.csv")
-globi <- read_csv("Intermediate/Formatted/GLOBIFormatted.csv")
+gb <- vroom("Intermediate/Formatted/GenbankFormatted.csv.gz", col_type = cols(PMID = col_double(), PublicationYear = col_double()))
+clo <- read_csv("Intermediate/Formatted/CloverFormatted.csv", col_type = cols(PMID = col_double(), PublicationYear = col_double()))
+pred <- read_csv("Intermediate/Formatted/PREDICTAllFormatted.csv", col_type = cols(PMID = col_double(), PublicationYear = col_double()))
+globi <- read_csv("Intermediate/Formatted/GLOBIFormatted.csv", col_type = cols(PMID = col_double(), PublicationYear = col_double()))

 if(class(clo$NCBIAccession)=='numeric') {clo %<>% mutate(NCBIAccession = as.character(NCBIAccession))}

6 changes: 5 additions & 1 deletion Code/04_High level VIRION checks.R
@@ -1,5 +1,5 @@

-virion <- vroom("Intermediate/Formatted/VIRIONUnprocessed.csv.gz")
+virion <- vroom("Intermediate/Formatted/VIRIONUnprocessed.csv.gz", col_type = cols(PMID = col_double(), PublicationYear = col_double()))

 # # Is there anything that's not vertebrate in here?
 #
@@ -76,4 +76,8 @@ virion %<>% select(-c(HostSynonyms))
 virion %<>% distinct()
 virion %<>% mutate(across(everything(), ~replace_na(.x, '')))

+virion %<>%
+group_by_at(vars(-NCBIAccession)) %>%
+summarize(NCBIAccession = str_c(NCBIAccession, collapse = ", "))
+
 vroom_write(virion, "Virion/Virion.csv.gz")
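The collapse step added above can be illustrated on a toy table. This is only a sketch, assuming dplyr and stringr are installed: the hosts, viruses, and accession strings are hypothetical, and `.groups = "drop"` is an addition not present in the commit that returns an ungrouped result.

```r
library(dplyr)
library(stringr)

# Hypothetical records: the first two rows are identical except for
# their accession number.
toy <- tibble(
  Host = c("Homo sapiens", "Homo sapiens", "Mus musculus"),
  Virus = c("virus a", "virus a", "virus b"),
  NCBIAccession = c("AB000001", "AB000002", "AB000003")
)

# Group by every column except NCBIAccession, then paste the accessions
# of each group into one comma-separated string.
collapsed <- toy %>%
  group_by_at(vars(-NCBIAccession)) %>%
  summarize(NCBIAccession = str_c(NCBIAccession, collapse = ", "),
            .groups = "drop")

print(collapsed)
```

Running the collapse at the end of the pipeline means that rows are merged only when every other column matches after all cleaning steps have been applied.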
2 changes: 1 addition & 1 deletion Code/05_Dissolve VIRION.R
@@ -5,7 +5,7 @@ library(magrittr)
 library(tidyverse)
 library(vroom)

-virion <- vroom("Virion/Virion.csv.gz")
+virion <- vroom("Virion/Virion.csv.gz", col_type = cols(PMID = col_double(), PublicationYear = col_double()))

 fixer <- function(x) {toString(unique(unlist(x)))}

8,716 changes: 5,920 additions & 2,796 deletions Intermediate/Formatted/PREDICTAllFormatted.csv

Large diffs are not rendered by default.

8,339 changes: 5,721 additions & 2,618 deletions Intermediate/Formatted/PREDICTMainFormatted.csv

Large diffs are not rendered by default.

337 changes: 179 additions & 158 deletions Intermediate/Formatted/PREDICTPCRFormatted.csv

Large diffs are not rendered by default.

Binary file modified Intermediate/Formatted/VIRIONUnprocessed.csv.gz
Binary file not shown.
8,571 changes: 5,837 additions & 2,734 deletions Intermediate/Unformatted/PREDICTMainUnformatted.csv

Large diffs are not rendered by default.

337 changes: 179 additions & 158 deletions Intermediate/Unformatted/PREDICTPCRUnformatted.csv

Large diffs are not rendered by default.

Binary file modified Virion/Detection.csv.gz
Binary file not shown.
49,882 changes: 24,940 additions & 24,942 deletions Virion/Edgelist.csv

Large diffs are not rendered by default.

Binary file modified Virion/Provenance.csv.gz
Binary file not shown.
808 changes: 404 additions & 404 deletions Virion/TaxonomyHost.csv

Large diffs are not rendered by default.
