Klebsiella genome metadata collection schema plus guidance, examples and collection template


klebgenomics/Klebsiella-genome-metadata

Klebsiella-genome-metadata

Klebsiella genome metadata scheme, plus guidance, examples and submission template.

This is a community-driven data curation effort to facilitate use and reuse of public genome collections for maximum knowledge gain. These efforts are focussed on Klebsiella pneumoniae and closely related organisms in the K. pneumoniae Species Complex (KpSC) and are coordinated by the KlebNET-GSP project team. The data will be collated and made publicly available via this repository and the PathogenWatch website, which hosts public KpSC genome collections and reports associated genotypes.

Our goal is to collect information that has broad utility for research focussed on KpSC and that can be readily harmonised for easy and effective reuse. We aim to capture information that is not currently well represented in the public data repositories. Notably, the National Center for Biotechnology Information (NCBI) already allows submission of detailed Antimicrobial Susceptibility Testing (AST) information that is directly applicable to the KpSC, and AST data is therefore excluded from our data curation effort. If you have generated and are able to share AST data for KpSC isolates, please consider submitting it to NCBI.

Our scheme includes two types of data fields:

1. Isolate metadata fields capture information about the individual KpSC isolates and their associated genome sequences, as well as the sample sources and/or hosts from which the isolates were collected.

2. Sampling fields capture information about how and why isolates were collected and/or chosen for sequencing. These data are essential to understand the underlying biases in genome collections, and to make decisions about the inclusion or exclusion of isolates for comparative and aggregate analyses.

The submission template is available here. Detailed instructions and guidance for data submission can be found below.

Contents

1. Data submission
2. Isolate metadata fields
3. Sampling fields
    i. Term definitions for 'purpose of sampling'
    ii. Examples of how to describe study designs using the sampling fields
4. Queries and suggestions
5. License

Data submission

The data submission template is available here. Please MAKE A COPY to add your own data. You cannot enter data directly into the master copy of the template. Once completed, email or share your copy to klebsiella.genome.metadata@gmail.com.

The full list of data fields, value formats and options are shown in the tables below.

Fields with restricted vocabularies

Some fields have restricted vocabularies and/or require selection from a list of predefined data values. In most cases the list of possible values can be accessed and searched via a drop-down list within the submission template (also shown in the tables below, marked 'Choose from list') and only values matching those in the list will be accepted. However, in a minority of cases the possible set of values is derived from an established ontology that is too large for inclusion within the submission template. These fields are marked 'Controlled vocabulary', with a link to the appropriate ontology, e.g. the NCBI taxonomy database or the MeSH disease ontology.

Fields with a list of suggested values

In some cases it is desirable to have a restricted vocabulary to support data harmonisation, but there are no appropriate predefined ontologies and too many foreseeable options to create a definitive list. In these cases, we provide a list of suggested values that we expect to capture the vast majority of scenarios, but also provide the option to enter alternative values via free text. These fields are marked in the tables below as 'Choose common values from the list, or if none are appropriate, enter free text'. The submission template includes a drop-down list of the suggested values, but will allow other values to be entered (these free text entries will be marked with warnings).
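The difference between the two kinds of field can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not the logic of the submission template itself: the `Vocabulary` class and `check_value` function are hypothetical names, and the two example vocabularies are copied from the 'Infection' and 'target epi' rows of the tables below.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vocabulary:
    """Allowed values for one field (hypothetical helper, not part of the template)."""
    values: frozenset
    allow_free_text: bool  # True for 'suggested values' fields, False for 'Choose from list'


def check_value(value, vocab):
    """Return 'ok', 'warning' (free text accepted with a warning) or 'error' (rejected)."""
    if value in vocab.values:
        return "ok"
    return "warning" if vocab.allow_free_text else "error"


# Restricted vocabulary: only listed values are accepted.
INFECTION = Vocabulary(
    frozenset({"infection", "colonisation", "unknown"}),
    allow_free_text=False,
)

# Suggested values: other entries are allowed but flagged.
TARGET_EPI = Vocabulary(
    frozenset({
        "Host infection",
        "Host colonisation",
        "Environmental",
        "Host infection & colonisation",
        "Host infection, colonisation & environmental",
    }),
    allow_free_text=True,
)

print(check_value("colonisation", INFECTION))             # listed value
print(check_value("carriage", INFECTION))                 # not listed, restricted field
print(check_value("Environmental & colonisation", TARGET_EPI))  # not listed, free text allowed
```

Fields marked 'Controlled vocabulary' are not covered by this sketch, because their values come from an external ontology rather than a short list.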

Isolate metadata

These data describe individual bacterial isolates and their associated sequence data. Please complete one row per isolate.

Variable fields, and guidance for completing them, are shown in the table below.

For text fields, please DO NOT enter 'Unknown', 'missing' or similar. If you have no data to enter, leave the field blank.

| Status | Variable | Definition; Guidance | Value format |
|---|---|---|---|
| REQUIRED if published | References | PubMed ID for associated publication reporting genome data; DOI is acceptable for preprints only. Multiple references can be provided as a list (comma-separated). If there are no associated publications, leave blank. | {text} |
| RECOMMENDED; REQUIRED if no Assembly accession provided | Run accession | Sequence archive run accession (sequence read accession); SRRxxx, ERRxxx. If multiple sequences for same ISOLATE, a list of accessions can be given (comma-separated). | {text} |
| REQUIRED | Project accession | BioProject accession; PRJxxx. If multiple projects for same ISOLATE, a list of accessions can be given (comma-separated). | {text} |
| REQUIRED | Sample accession | BioSample accession; SAMxxx | {text} |
| RECOMMENDED; REQUIRED if no Assembly accession provided | Experiment accession | Sequence archive experiment accession; SRXxxx, ERXxxx. If multiple experiments for same ISOLATE, a list of accessions can be given here (comma-separated). | {text} |
| optional | Secondary sample accession | NCBI BioSample; ERSxxx | {text} |
| optional; REQUIRED if no Run accession provided | Assembly accession | GenBank assembly accession; GCA_xxx. The accession for the entire assembly, including chromosome and plasmids. | {text} |
| optional | Secondary assembly accession | GenBank WGS master record accession | {text} |
| REQUIRED | Genome source | Type of sequence from which this genome was derived; Indicate if the sequence represents a single cultured isolate whole genome sequence (WGS) or is derived from a mixed sequence / metagenome assembled genome (MAG). Choose from list. | Isolate WGS \| MAG \| unknown |
| REQUIRED | Isolate name | A name that you choose for the isolate. It can have any format, but we suggest that you make it concise, unique and consistent within your lab, and as informative as possible. Every Isolate name from a single Submitter must be unique. | {text} |
| REQUIRED | Collection year | The year that the isolate was collected; YYYY | {int} |
| REQUIRED | Collection month | The month that the isolate was collected; MM | {int} |
| REQUIRED | Collection day | The day that the isolate was collected within the month specified in 'Collection month'; DD | {int} |
| REQUIRED | Country | Country of isolate collection. Controlled vocabulary, choose from the list of values as defined at https://www.insdc.org/submitting-standards/country-qualifier-vocabulary/ | {term} |
| REQUIRED | Isolate source | Short free text description of the sample source from which the Klebsiella was isolated, e.g. 'human blood', 'animal feed' or 'river water grab sample'. | {text} |
| REQUIRED | Source type | Controlled vocabulary describing the source of the isolate. Choose from the list. Enables high level grouping of isolates. | human \| animal \| food \| environmental \| other \| missing \| restricted access \| not applicable \| not collected \| not provided |
| REQUIRED | Host | Scientific name of the host from which the isolate was collected. Controlled vocabulary as defined at https://www.ncbi.nlm.nih.gov/taxonomy. If not host associated, specify 'not host associated', ensure the source is appropriately described under 'Isolate source', and consider submitting detailed source information to NCBI via the One Health Enteric metadata template. | {term} |
| RECOMMENDED unless lat/long given | City or region | City or region of isolate collection. | {text} |
| RECOMMENDED unless city/region given | lat_lon | The geographical coordinates of the location where the sample was collected. Specify as degrees latitude and longitude in format "d[d.dddd] N\|S d[dd.dddd] W\|E", e.g. 38.98 N 77.11 W. | {float}{float} |
| optional | Isolate alias | Other IDs associated with this isolate. Multiple IDs can be given (comma-separated). | {text} |
| optional | Travel associated | For isolates collected from human hosts, indicate if associated with recent travel. Leave blank if travel status is unknown. | travel associated \| NOT travel associated |
| optional | Travel country | If travel associated, indicate the travel country by choosing from the list of values as defined at https://www.insdc.org/submitting-standards/country-qualifier-vocabulary/. Leave blank if unknown. | {term} |
| REQUIRED if host-associated | Host tissue sampled | Name of body site or specimen type from which the sample was obtained, such as a specific organ, tissue or clinical specimen. Choose common values from the list, or if none are appropriate, enter free text. | blood \| cerebrospinal fluid (CSF) \| urine \| sputum \| bronchoalveolar lavage (BAL) \| other respiratory \| wound \| skin \| feces \| rectal swab \| throat swab \| cecal swab \| {text} |
| REQUIRED if host-associated | Infection | For host-associated isolates, indicate if infecting or colonising isolate, or if the infection status is unknown. Choose from list. | infection \| colonisation \| unknown |
| REQUIRED if infection | Host disease | For host-associated infecting isolates, provide the name of the relevant disease e.g. pneumonia, bacteremia. Controlled vocabulary as defined at https://meshb.nlm.nih.gov/treeView | {term} |
| optional | Infection outcome | For host-associated and infecting isolates, indicate the broad infection outcome at 28 days post-infection. Choose from list. | death within 28 days \| alive at 28 days \| restricted access \| unknown |
| optional | Infection severity | For host-associated infecting isolates, if severity information could be made available (upon request), indicate the type of information here. If none is available or none can be shared with the community, leave blank. | {text} |
| optional | Host age group | For human-associated isolates, indicate the age range of the host. Choose from list. | 0-30 days \| 1-12 months \| 1-5 years \| 5-18 years \| 18-60 years \| >60 years \| restricted access \| not collected \| not applicable \| missing |
| optional | Host sex | For host-associated isolates, indicate the biological sex of the host. Choose from list. | male \| female \| restricted access \| not collected \| not applicable \| missing |
| REQUIRED | Repeat isolate | If other ISOLATES are sequenced from the same host infection or colonisation episode, and this entry is NOT the primary isolate in the series, indicate the primary isolate (isolate name), otherwise leave blank. | {text} |
| REQUIRED | Duplicate sequence | If multiple ASSEMBLIES of the same isolate, and this entry is NOT the primary sequence in the series, indicate the primary isolate (isolate name), otherwise leave blank. | {text} |
| REQUIRED | Collected by | Name of persons or institute who collected the sample. | {text} |
| REQUIRED | Lab contact | Contact email address for the person providing metadata. This information will be made available only to the KlebNET-GSP team. | {text} |
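Before sending a completed copy of the template, it may help to run a quick local check of the accession and coordinate columns. The Python sketch below is one possible way to do that, under several assumptions: each isolate row has been loaded as a dict of strings keyed by the variable names in the table, only the accession prefixes stated in the table are checked (not the digits that follow), and the function names and example accessions are made up for illustration. It is not an official validator.

```python
import re

# Prefixes taken from the table above; anything after the prefix is not checked.
ACCESSION_PREFIXES = {
    "Run accession": ("SRR", "ERR"),
    "Project accession": ("PRJ",),
    "Sample accession": ("SAM",),
    "Experiment accession": ("SRX", "ERX"),
    "Assembly accession": ("GCA_",),
}
# Fields for which the table allows a comma-separated list.
LIST_FIELDS = {"Run accession", "Project accession", "Experiment accession"}

# "d[d.dddd] N|S d[dd.dddd] W|E", e.g. "38.98 N 77.11 W"
LAT_LON = re.compile(r"^(\d{1,2}(?:\.\d+)?) ([NS]) (\d{1,3}(?:\.\d+)?) ([WE])$")


def is_blank(row, field):
    """True if the field is absent or contains only whitespace."""
    return not row.get(field, "").strip()


def check_accessions(row):
    """Return a list of problems found in the accession columns of one row."""
    problems = []
    for field, prefixes in ACCESSION_PREFIXES.items():
        if is_blank(row, field):
            continue
        items = row[field].split(",") if field in LIST_FIELDS else [row[field]]
        for item in (i.strip() for i in items):
            if not item.startswith(prefixes):
                problems.append(f"{field}: unexpected accession '{item}'")
    for field in ("Project accession", "Sample accession"):
        if is_blank(row, field):
            problems.append(f"{field} is REQUIRED")
    # Run and Experiment accessions become REQUIRED when no assembly is given.
    if is_blank(row, "Assembly accession"):
        for field in ("Run accession", "Experiment accession"):
            if is_blank(row, field):
                problems.append(f"{field} is REQUIRED when no Assembly accession is provided")
    return problems


def parse_lat_lon(text):
    """Return (latitude, longitude) as signed floats, or None if the format is wrong."""
    match = LAT_LON.match(text.strip())
    if match is None:
        return None
    lat = float(match.group(1)) * (1 if match.group(2) == "N" else -1)
    lon = float(match.group(3)) * (1 if match.group(4) == "E" else -1)
    if abs(lat) > 90 or abs(lon) > 180:
        return None
    return lat, lon


# Placeholder accessions, invented for this example only.
example_row = {
    "Run accession": "SRR0000001, SRR0000002",
    "Experiment accession": "SRX0000001",
    "Project accession": "PRJNA000000",
    "Sample accession": "SAMN00000000",
    "lat_lon": "38.98 N 77.11 W",
}
print(check_accessions(example_row))
print(parse_lat_lon(example_row["lat_lon"]))
```

Southern latitudes and western longitudes are returned as negative numbers, which is a convention chosen for this sketch; the template itself expects the "N|S" and "W|E" letters.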

Sampling fields

These contextual data describe the purpose of sampling, and the sampling strategy for the collection from which each isolate is derived. Please complete one row per isolate.

Variable fields, and guidance for completing them, are summarised in the table below. Definitions and detailed examples are also shown below the table.

| Status | Variable | Definition | Guidance | Value format |
|---|---|---|---|---|
| REQUIRED | purpose of sampling | Primary purpose for sampling bacterial isolates | Indicate the primary purpose for the collection and sequencing of these isolates (e.g. routine diagnostics, outbreak investigation, research). Choose from the list, or if none of the values are appropriate, enter free text. Definitions are shown below the table. | Routine diagnostics and / or infection control \| Routine surveillance \| Outbreak investigation / outbreak-initiated surveillance \| Research \| {text} |
| REQUIRED | study population | Population from whom bacterial isolates were sampled | Give details about the population of hosts or environments represented in the sample (e.g. Hospital patients, Neonates, Hospital wastewater). This information is essential to inform the inclusion and exclusion of studies for aggregate or comparative epidemiological analyses. Choose common values from the list, or if none of the values are appropriate, enter free text. Multiple values can be specified (comma-separated). | Hospital patients \| Intensive Care Unit (ICU) patients \| Primary care patients \| Community participants \| Neonates \| Clinical environment: sinks and drains \| Clinical environment: surfaces \| Medical devices \| Hospital wastewater \| Wastewater (not hospital) \| Fresh water \| Seawater \| Soil \| Rhizosphere \| Plants \| Livestock \| Companion animals \| Captive animals \| Wild animals \| Food \| {text} |
| REQUIRED | target epi | Broad epidemiological category of the study | Indicate the broad epidemiological category of the study (e.g. Host colonisation, Host infection, Environmental). This information is useful to inform aggregate or comparative analyses of disease-associated vs non-disease-associated isolates. Choose from the list, or if none of the values are appropriate, enter free text. | Host infection \| Host colonisation \| Environmental \| Host infection & colonisation \| Host infection, colonisation & environmental \| {text} |
| REQUIRED if target epi includes 'Host infection' | selected by clinical phenotype | Flag to indicate whether isolates were selected for inclusion on the basis of host clinical phenotype | Indicate whether isolates were selected for inclusion on the basis of host clinical phenotype (e.g. blood stream infection, liver abscess, severe infection) or if no selection was applied. Choose from the list. This information is essential to inform studies focussed on specific infection types or disease severity e.g. to determine serotype distributions among invasive infection isolates or compare rates of drug resistance among blood stream infections. The specific phenotype used for selection can be indicated in the 'selected clinical phenotype' field. | selected by clinical phenotype \| NOT selected by clinical phenotype |
| REQUIRED if selected by clinical phenotype = 'selected by clinical phenotype' | selected clinical phenotype | Clinical phenotype used to select isolates for inclusion | Indicate the specific clinical phenotype that was used to select samples for collection and/or sequencing. Choose common values from the list, or if none of the values are appropriate, enter free text. Multiple values can be specified (comma-separated). | liver abscess \| invasive infection \| blood stream infection \| respiratory infection \| urinary tract infection \| hospital acquired infection \| community acquired infection \| severe disease \| {text} |
| REQUIRED | selected by organism trait | Flag to indicate whether isolates were selected for inclusion on the basis of microbial trait | Indicate if samples were selected for inclusion on the basis of a microbial phenotype or genotype (e.g. specific drug resistance or serotype, presence of a specific gene) or if no selection was applied. Choose from the list. This information is essential to inform studies aiming to estimate the prevalence of microbial phenotypes / genotypes by study populations, geographies etc., e.g. to estimate national prevalence of ceftriaxone or carbapenem resistant isolates. The specific phenotype or genotype used for selection can be indicated in the 'selected organism trait' field. | selected by organism trait \| NOT selected by organism trait |
| REQUIRED if selected by organism trait = 'selected by organism trait' | selected organism trait | Microbial trait used to select isolates for inclusion | Indicate the specific microbial phenotype or genotype that was used to select isolates for collection and/or sequencing. Choose common values from the list, or if none of the values are appropriate, enter free text. Multiple values can be specified (comma-separated). | ceftriaxone resistance \| carbapenem resistance \| drug resistance (not ceftriaxone or carbapenem) \| ESBL producers \| carbapenemase producers \| OXA positive \| NDM positive \| KPC positive \| iuc (aerobactin) positive \| iro (salmochelin) positive \| rmpA positive \| peg-344 positive \| string-test positive \| hypermucoviscous by low-speed centrifugation \| hypermucoviscous by percoll-gradient sedimentation \| 7-gene multi-locus sequence type \| serotype \| {text} |
| RECOMMENDED | sampling period start | Start date for the sampling period | Indicate when the sample collection began (YYYY, or YYYY-MM or YYYY-MM-DD). This information is useful for understanding the temporal coverage of data to inform trend analysis. | {ISO format} |
| RECOMMENDED | sampling period end | End date for the sampling period | Indicate when the sample collection ended (YYYY, or YYYY-MM or YYYY-MM-DD). This information is useful for understanding the temporal coverage of data to inform trend analysis. If collection and sequencing are on-going, leave blank. | {ISO format} |
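Several sampling fields are only required when another field takes a particular value, and the two sampling period fields accept dates at three levels of precision. The following Python sketch shows one way these rules could be checked on a single row. It assumes the row is a dict of strings keyed by the variable names in the table; `parse_partial_date` and `check_sampling` are hypothetical names, and the example row is invented for illustration.

```python
import re
from datetime import date

# YYYY, YYYY-MM or YYYY-MM-DD
PARTIAL_DATE = re.compile(r"^(\d{4})(?:-(\d{2}))?(?:-(\d{2}))?$")

ALWAYS_REQUIRED = (
    "purpose of sampling",
    "study population",
    "target epi",
    "selected by organism trait",
)


def parse_partial_date(text):
    """Return (year, month, day), with None for parts that were not given."""
    match = PARTIAL_DATE.match(text.strip())
    if match is None:
        raise ValueError(f"not YYYY, YYYY-MM or YYYY-MM-DD: {text!r}")
    year, month, day = (int(g) if g else None for g in match.groups())
    # date() raises ValueError for impossible values such as month 13 or 30 February.
    date(year, month or 1, day or 1)
    return year, month, day


def is_blank(row, field):
    """True if the field is absent or contains only whitespace."""
    return not row.get(field, "").strip()


def check_sampling(row):
    """Return a list of problems found in the sampling fields of one row."""
    problems = [f"{f} is REQUIRED" for f in ALWAYS_REQUIRED if is_blank(row, f)]

    # The flag is required whenever target epi includes 'Host infection'.
    if "host infection" in row.get("target epi", "").lower():
        if is_blank(row, "selected by clinical phenotype"):
            problems.append("selected by clinical phenotype is REQUIRED for host infection studies")

    # Each 'selected ...' detail field is required when its flag is set.
    for flag, detail in (
        ("selected by clinical phenotype", "selected clinical phenotype"),
        ("selected by organism trait", "selected organism trait"),
    ):
        if row.get(flag, "").strip() == flag and is_blank(row, detail):
            problems.append(f"{detail} is REQUIRED when {flag} = '{flag}'")

    for field in ("sampling period start", "sampling period end"):
        if not is_blank(row, field):
            try:
                parse_partial_date(row[field])
            except ValueError as err:
                problems.append(f"{field}: {err}")
    return problems


# Invented example row, not one of the study designs described in this document.
example_row = {
    "purpose of sampling": "Research",
    "study population": "Livestock",
    "target epi": "Host colonisation",
    "selected by organism trait": "NOT selected by organism trait",
    "sampling period start": "2017-06",
}
print(check_sampling(example_row))
```

The sketch only checks that each date is well formed. It does not compare the start and end of the sampling period, because dates given at different precisions (for example YYYY against YYYY-MM-DD) cannot be ordered unambiguously.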

Term definitions for 'purpose of sampling'

Routine diagnostics and / or infection control

Samples collected through the routine and ongoing activities of clinical or veterinary microbiology laboratories for the purposes of clinical diagnosis and/or infection control. May include isolates confirmed as infecting agents and/or those considered as asymptomatic or environmental colonisers e.g. isolates identified from hospital sinks or patient screening swabs as part of routine infection prevention and control procedures.

Routine surveillance

Samples collected through the routine and ongoing activities of other laboratories (not clinical or veterinary microbiology laboratories) and/or collected for purposes other than clinical diagnostics and infection control e.g. laboratories processing samples from non-healthcare environmental sources or food products.

Outbreak investigation / outbreak-initiated surveillance

Samples collected as part of a response to a specific outbreak e.g. within a hospital or other healthcare setting (human or veterinary). May include isolates confirmed as infecting agents and/or those considered as asymptomatic colonisers (e.g. from screening swabs) and/or those from environmental sources (e.g. hospital sinks, drains etc.)

Research

Samples collected for specific research purposes (excluding outbreak investigation / outbreak-initiated surveillance) that would not have otherwise been collected via routine diagnostics, infection control or surveillance activities as described above.

Examples of how to describe study designs using the sampling fields

Below we describe various hypothetical study designs and show how the sampling fields would be populated for each.

Neonatal sepsis study

K. pneumoniae were isolated from the blood of neonates via routine diagnostic procedures. All isolates collected between 1 Jan 2019 and 31 Dec 2020 were stocked and subjected to whole genome sequencing.

Neonatal sepsis study flow diagram

Ceftriaxone-resistant infection study

K. pneumoniae identified via routine diagnostic procedures from hospitalised patients in a tertiary care centre between February 2016 and February 2018 were collected. Isolates resistant to ceftriaxone were selected for sequencing.

Ceftriaxone resistant infection study flow diagram

CPE outbreak study

In May 2019 there was a sudden increase in CPE infections in the ICU of a large tertiary care centre. Enhanced infection prevention and control procedures were activated from 18 May until 31 August, when the outbreak was declared contained: rectal screening swabs were collected on patient admission and every 3 days thereafter, in addition to sink and drain screening swabs. All swabs were cultured on selective media, and presumptive carbapenem-resistant K. pneumoniae were sequenced alongside all carbapenem-resistant K. pneumoniae identified from ICU patients via routine diagnostic procedures.

CPE outbreak study flow diagram

CR-hvKp study

Carbapenem-resistant K. pneumoniae were isolated from liver abscess patients as part of a research study focussed on diabetic patients, between 1 June 2018 and 30 June 2020. Strains carrying K. pneumoniae carbapenemase genes were detected by PCR, and the string test was used to determine hypermucoidy. String test positive isolates harbouring blaKPC were subjected to whole genome sequencing.

CR-hvKp study flow diagram

Pig gut carriage study

Veterinary researchers collected 100 faecal samples from each of six pig farms in June 2017. K. pneumoniae were isolated by culture on SCAI media and subjected to whole genome sequencing as part of a One Health research project.

Pig carriage study flow diagram

(Note that the specific hosts, i.e. pigs, should be indicated in the isolate metadata field 'Host', rather than in the sampling fields.)

Water surveillance study

K. pneumoniae were isolated from fresh and wastewaters in a metropolitan centre as part of routine water surveillance conducted by the Environmental Protection Authority. Since 2021 all isolates have been stocked and 100 isolates have been randomly selected for sequencing each year. Sampling and sequencing are ongoing.

Water surveillance study flow diagram

Queries and suggestions

We welcome queries and suggestions from the community on any aspect of the scheme. In particular, please tell us if you think we have missed key data fields or options, or if the guidance is unclear. You can contact us via the issue tracker.

License

These resources are freely available for reuse and adaptation under the GNU General Public License v3. We encourage the development of similar schemes for other organisms.
