Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for homologous recombination deficiency scores in Dx.R #159

Open
lima1 opened this issue Dec 19, 2020 · 10 comments
Open

Support for homologous recombination deficiency scores in Dx.R #159

lima1 opened this issue Dec 19, 2020 · 10 comments

Comments

@lima1
Copy link
Owner

lima1 commented Dec 19, 2020

Probably wrapper around https://github.com/sztup/scarHRD

@lbeltrame
Copy link

lbeltrame commented Jan 14, 2021

I actually "reverse-engineered" the code (because it is really hard to read and understand what does it do) when implementing it internally, using PureCN data as a source to calculate HRD scores. If you're interested I might write down the notes (I'd have put up a PR, but I rewrote the implementation in Python as I was more familiar with it).

@lima1
Copy link
Owner Author

lima1 commented Jan 14, 2021

Hi @lbeltrame, that would be awesome. More than happy to add it as Python code for now.

@lbeltrame
Copy link

lbeltrame commented Jan 19, 2021

Here are some notes on how scarHRD and how I adapted it:

Preprocessing (before the HRD calculation)

  1. Get the variants off PureCN and remove flagged variants (FLAGGED == TRUE)
  2. Construct the "major allele CN" by subtracting ML.M.SEGMENT to ML.C. (ML.CN.MAJOR = ML.C - ML.M.SEGMENT)

Segment generation

Contrary to "popular" belief, this does not use the segmentation output. It generates "segments" of identical major and minor CN. It works like this:

  1. Group data by chromosome
  2. Group consecutive stretches of (major, minor) CN together
  3. Aggregate the stretches in a single segment with the same sample ID, the same chromosome, the first occurence of start coordinate, the last occurrence of end coordinate, and the median of the log.ratio
  4. Calculate the length of the segment (will be used later)

LOH score

Start with a LOH score of 0

  1. Group segments (those generated before) by chromosome (but do only autosomes)
  2. Exclude data from chromosomes where minor CN is consistently 0 (LOH of the whole chromosome)
  3. Set all segments with major allele CN > 1 as equal 1 (they count all the same)
  4. Re-aggregate the segments (see point 3 of Segment generation)
  5. Select those segments where major CN is 1 and minor CN is 0
  6. Select those segments where the length of the segments is at least 15Mbp
  7. Sum the number of segments wihch are left after the selection in 6. in that chromosome to the LOH score

LST score (large scale transitions)

This was pretty hard to figure out.

Start with LST score of 0.

  1. Group segments (those generated before) by chromosome (but do only autosomes)
  2. Subset the data for those segments that wholly fall into p or q arm (you'll need centromere locations for this) and not those across
  3. Handle p and q data separately (because the following steps change start or end depending on the arm)
  4. Re-aggregate the segments (see point 3 of Segment generation) for each arm
  5. Set the last (p arm) or the first (q arm) start position as the corresponding centromere coordinate (start and end, respectively)
  6. Recompute segment lengths
  7. Identify all segments that are below 3 Mbp
  8. Iterate through these segments
    1. Remove the segment from the data (and from the list at point 7)
    2. Re-aggregate the segments (see point 3 of Segment generation)
    3. Recompute segment lengths
    4. Repeat from step 1 until all the segments < 3mbp have been removed
  9. Select all pairs of adjacent segments that are >= 10 Mbp in length and spaced less than 3 Mbp
  10. Count these segments and add the value to the LST score

TAI (Telomeric allelic imbalance)

This is another confusing one, and I'm not sure I understood the logic completely (but my implementation produces identical results to scarHRD).

Start with a TAI score of 0.

  1. Group segments (those generated before) by chromosome (but do only autosomes; the following operations are applied to each group)
  2. Discard all segments smaller than 1Mbp (to reduce noise)
  3. Re-aggregate the segments (see point 3 of Segment generation)
  4. If there is only one segment covering the entire chromosome, pass to the next chromosome
  5. Get the absolute minimum major copy number observed in the segments for that chromosome
  6. For each segment, consider the major and minor CN:
    1. If the minimum major CN is equal to 1 or the minimum major CN has an even number of copies, check if there is a difference between major and minor CN: if there is, mark the segment has having "Interstitial" allelic imbalance
    2. If the condition at point 1 is not satisfied, we have an odd number of copies of major allele CN, consider the contribution of the minor allele to the imbalance: if the minor allele CN in that segment is not 0 and the sum of major and minor allele CN for that segment is equal to the minimum copy number observed, mark the segment as having "Interstitial" allelic imbalance
    3. If neither 1 or 2 holds, mark the segment as having no imbalance
  7. Get the tagged segment from point 6 closest to the 5' telomere and the segment closest to the 3' telomere
    1. If the segment closest to the 5' telomere ends before the centromere, mark the segment as having "telomeric allelic imbalance" and add 1 to the TAI score
    2. If the segment closest to the 3' telomere starts before the centromere, mark the segment as having "telomeric allelic imbalance" and add 1 to the TAI score

Wrapping up

  1. Collect LST, TAI and LOH scores
  2. Calculate HRD score as LST+TAI+LOH (this is the latest version described in the papers)

Currently the script I use depends on pandas (which pulls in numpy as well) and requires a file with centromere coordinates. What would be the minimum Python version you consider acceptable (consider it will likely only work with Python 3.4 or later)?

@lima1
Copy link
Owner Author

lima1 commented Feb 10, 2021

Thanks so much Luca, that's super helpful. And sorry for the delayed response. I think for now I'll take whatever you have. At some point when I need it I might reimplement in R to make the installation easier, but as a prototype that's awesome for now.

@lbeltrame
Copy link

Super! I'll clean it up tomorrow and attach it here. Do you have a data file with centromere information already inside PureCN? That's needed for the TAI calculation (you need to know you're not crossing the centromere).

@lima1
Copy link
Owner Author

lima1 commented Feb 18, 2021

Yes, I have the centromeres. Currently as serialized RDA file in https://github.com/lima1/PureCN/tree/master/data though.

@lbeltrame
Copy link

I'll see whether https://github.com/ofajardo/pyreadr can be used to handle this (I wanted to get rid of a dependency on a random file that might or might not be in some person's disk anyway).

@ShvartsmanIrina
Copy link

Hi @lbeltrame , do you have a script for HRD scoring? I would be very grateful for help!

@lbeltrame
Copy link

No, not at the moment, unfortunately (I switched institutions in the meantime).

@ShvartsmanIrina
Copy link

@lbeltrame, thank you very much for your answer! I understand that you are a professional in this field, I come from the field of microbiome. Maybe, as an experienced professional, you can advise me on a tool or workflow for calculating HRD score from the results of unpaired tumor samples?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants