Update SAUtils for Cosmic #63

Open
wants to merge 19 commits into
base: main

@heseber heseber commented Oct 29, 2021

Current COSMIC releases use a slightly modified format for the TSV files. Some column headers have changed (e.g., Mutation ID -> GENOMIC_MUTATION_ID), and a new Tier column was added.
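To illustrate the header change, here is a minimal Python sketch (not the SAUtils implementation) that resolves column indexes by trying each known header name in turn. Only the `Mutation ID` -> `GENOMIC_MUTATION_ID` rename and the `Tier` column come from the description above; the alias table, the `resolve_columns` helper, and treating `Tier` as optional are assumptions made for illustration.

```python
# Hypothetical alias table: newest header name first, older names after it.
COLUMN_ALIASES = {
    "mutation_id": ("GENOMIC_MUTATION_ID", "Mutation ID"),
    "tier": ("Tier",),
}
# Assumption: older releases have no Tier column, so it is optional here.
OPTIONAL_FIELDS = {"tier"}


def resolve_columns(header):
    """Map logical field names to column indexes in a TSV header row."""
    index_by_name = {name: i for i, name in enumerate(header)}
    resolved = {}
    for field, aliases in COLUMN_ALIASES.items():
        # Pick the first alias that is present in this file's header.
        match = next((a for a in aliases if a in index_by_name), None)
        if match is None:
            if field in OPTIONAL_FIELDS:
                resolved[field] = None
                continue
            raise ValueError(f"no column found for {field!r}; tried {aliases}")
        resolved[field] = index_by_name[match]
    return resolved
```

With a lookup like this, the same parser could read both old and new releases, and a missing required column fails loudly instead of silently producing empty IDs.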

Counts for CancerType and CancerSite were based on study identifiers, not on tumor identifiers. This does not really make sense, because we want to know the number of tumors per cancerType and cancerSite, not the number of studies (a study can have many different tumors of the same type and site). Furthermore, the study id column is often empty if a PubMed id is specified instead.

This pull request updates the SAUtils for Cosmic to make it work with current Cosmic releases and also changes the counts to refer to tumors instead of studies. This is only for short variants, not for structural variants and fusions.
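The difference between the two counting schemes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code in this pull request: the column names `ID_STUDY`, `ID_tumour`, and `Primary site` are assumed names for the study id, tumour id, and site columns, `count_distinct` is a hypothetical helper, and it assumes a count means the number of distinct identifiers per group.

```python
from collections import defaultdict


def count_distinct(rows, id_key, group_key):
    """Count distinct identifiers (id_key) per group value (group_key)."""
    ids_per_group = defaultdict(set)
    for row in rows:
        identifier = row.get(id_key, "")
        group = row.get(group_key, "")
        if not identifier or not group:
            continue  # nothing to count for this row
        ids_per_group[group].add(identifier)
    return {group: len(ids) for group, ids in ids_per_group.items()}


# Made-up rows: two tumours have an empty study id (e.g. only a PubMed id).
rows = [
    {"ID_STUDY": "", "ID_tumour": "T1", "Primary site": "ovary"},
    {"ID_STUDY": "", "ID_tumour": "T2", "Primary site": "ovary"},
    {"ID_STUDY": "S1", "ID_tumour": "T3", "Primary site": "ovary"},
    {"ID_STUDY": "S1", "ID_tumour": "T4", "Primary site": "large_intestine"},
]

per_tumour = count_distinct(rows, "ID_tumour", "Primary site")
per_study = count_distinct(rows, "ID_STUDY", "Primary site")
```

On these made-up rows, keying on the tumour id counts three ovary tumours, while keying on the study id counts a single study and ignores the two rows whose study id is empty.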

Here is an example output (just the "cosmic" section, and after pretty-printing with jq, of course):

"cosmic": [
            {
              "id": "COSV55892885",
              "refAllele": "A",
              "altAllele": "T",
              "gene": "PIK3CA",
              "sampleCount": 5,
              "cancerTypesAndCounts": [
                {
                  "cancerType": "carcinoma",
                  "count": 5
                }
              ],
              "cancerSitesAndCounts": [
                {
                  "cancerSite": "ovary",
                  "count": 3
                },
                {
                  "cancerSite": "large intestine",
                  "count": 2
                }
              ],
              "tiersAndCounts": [
                {
                  "tier": "1",
                  "count": 5
                }
              ]
            }
]

New Cosmic releases use a different column name for the Cosmic ID.
This is now fixed.

The cancer type count and cancer site count were based on study id.
This does not make sense, because we want to have the number of
tumours where a mutation is found, not the number of studies.
For now, I just replaced the study id column tag with the name of the
tumour id column. Additional refactoring would be needed to rename
all references from _study_ to _tumour_.

Do not drop counts of cancer type, cancer site, and tier completely if
just some of the tumors do not have them annotated.

Histologies and Sites were returned as IList<string> but could never
have more than one list item. This was now simplified to having just
a single string for each of Histology, Site and Tier.

Replace underscores with spaces in tumor sites and types
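
The last three commit messages can be pictured with a short Python sketch. It is a hedged illustration, not the C# code in SAUtils: the `tally_annotations` helper and the `histology`, `site`, and `tier` keys are hypothetical, and each tumour is assumed to carry at most one string per field. Each field is tallied on its own, so a tumour that lacks one annotation still contributes to the others, and underscores are replaced with spaces in sites and types only.

```python
from collections import Counter


def tally_annotations(tumours):
    """Tally histology, site and tier independently for a list of tumours."""
    tallies = {"histology": Counter(), "site": Counter(), "tier": Counter()}
    for tumour in tumours:
        for field, counter in tallies.items():
            value = tumour.get(field)
            if not value:
                # This tumour lacks this annotation; the other fields still count.
                continue
            if field != "tier":
                # Sites and types use underscores in the source data.
                value = value.replace("_", " ")
            counter[value] += 1
    return tallies
```

A tumour with a missing site therefore lowers only the site counts; the cancer type and tier counts for the same variant are unaffected.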