Fragment-level analysis #68

Open · Phlya opened this issue May 18, 2018 · 41 comments

Phlya commented May 18, 2018

Discussion to address open2c/distiller-nf#96.
Since there is pairsamtools restrict, we can use that as a starting point for adding fragment-level stats. But after some thinking, it seems more complicated to implement properly than I thought.

So, how about we add another column (pair_frags_type?) to the pairs file with the classification of each pair according to this scheme? Then the actual counting could become part of the regular stats. If we include it there, distiller should call restrict before stats (and I think it should be included, since people expect this kind of stats).

Here is a list of the stats that need to be implemented, based on hiclib; a minimal classification sketch follows the list. What did I forget?

  • Same-fragment pairs
    • Dangling ends: pairs oriented towards each other that map to the same restriction fragment.
    • Self-circles: the same, except the reads face away from each other.
    • Self-ligations: the same, but both reads oriented in the same direction (as far as I understand, this was not implemented in hiclib, but it seems logical to include it).
    • hiclib also has a category "error" here - what does that mean? @mimakaev
  • Undigested pairs: pairs that probably came from undigested or simply re-ligated fragments. They face towards each other and map to different fragments, but the whole molecule is shorter than a certain threshold, roughly the upper end of the library's fragment size distribution. I'm presuming this is "extraDandlingEnds" in hiclib?
    • Can we estimate the fragment size from the pairs themselves, by the way? E.g. as the longest dangling end? Something to think about.
  • Anything not matching the above - "Distal"?
  • Did I forget something?
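
To make the classification above concrete, here is a minimal sketch of what such a classifier could look like. It assumes upper-triangular pairs (pos1 <= pos2, as pairtools emits them) with restriction fragment indices already assigned to both sides (e.g. by pairsamtools restrict); the argument names and the max_molecule_size threshold are illustrative assumptions, not an existing API:

def classify_pair(chrom1, pos1, strand1, rfrag1,
                  chrom2, pos2, strand2, rfrag2,
                  max_molecule_size=1000):
    """Hypothetical pair_frags_type classifier; names are illustrative."""
    if chrom1 != chrom2:
        return 'distal'
    if rfrag1 == rfrag2:
        # Both sides map to the same restriction fragment.
        if strand1 == strand2:
            return 'self_ligation'  # tandem (++/--) orientation
        if (strand1, strand2) == ('+', '-'):
            return 'dangling_end'   # reads face towards each other
        return 'self_circle'        # reads face away from each other
    # Different fragments, facing each other, short apparent molecule:
    # likely an undigested or simply re-ligated fragment.
    if (strand1, strand2) == ('+', '-') and pos2 - pos1 <= max_molecule_size:
        return 'undigested'
    return 'distal'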

Phlya commented May 18, 2018

As an alternative to all this, how about providing a way to easily plot pair frequency by distance, split by read direction?


nvictus commented May 18, 2018

Here's an example using the dask-adapter for pairix in bioframe.

It certainly doesn't require either dask or pairix, but the counts would need to be accumulated chunkwise; a rough sketch of that is below. The functions to compute "contact areas" and pairs scalings were recently moved to cooltools, but they are quite short and reusable: https://github.com/mirnylab/cooltools/blob/master/cooltools/expected.py#L53
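
To illustrate the chunkwise idea without dask or pairix, here is a rough plain-pandas sketch (the file name is a placeholder, and the standard .pairs column order readID/chrom1/pos1/chrom2/pos2/strand1/strand2 is assumed):

import numpy as np
import pandas as pd

# Log-spaced distance bins from 1 bp to 1 Gb.
bins = np.logspace(0, 9, 91)
counts = {o: np.zeros(len(bins) - 1) for o in ('+-', '-+', '++', '--')}

# Stream the pairs file in chunks; lines starting with '#' are the header.
reader = pd.read_csv('sample.pairs.gz', sep='\t', comment='#',
                     header=None, chunksize=1_000_000)
for chunk in reader:
    cis = chunk[chunk[1] == chunk[3]]   # cis pairs: same chromosome
    dist = (cis[4] - cis[2]).abs()      # |pos2 - pos1|
    orient = cis[5] + cis[6]            # strand pair, e.g. '+-'
    for o in counts:
        counts[o] += np.histogram(dist[orient == o], bins=bins)[0]

The per-orientation counts can then be normalized by the contact areas.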


Phlya commented May 18, 2018

I think we just need to tweak the output of stats a bit so that we don't need to do anything on the raw pairs - it already saves the number of contacts at different distances with different orientations. I can make a plot like this from stats:

[image: pair frequency vs distance by read orientation]

I suppose it lacks normalisation over the areas for each bin...
(This is all for mouse, so no mid-chromosome centromeres; whether it's feasible to exclude pairs spanning centromeres directly in stats, I am not sure.)

Also, if there are any ideas what went wrong with the "Bad" sample, I would be happy to hear your thoughts...


Phlya commented May 19, 2018

OK, using @nvictus's suggestion, I can quite easily make this using the area functions from cooltools and the counts from stats.

[image: area-normalized scaling by read orientation]

Still not sure how to interpret the difference between the samples, though (the bad one was not sequenced deeply, so the curves are not smooth).

Code, just in case it's useful.

import numpy as np
import pandas as pd
from cooltools import expected

def get_areas(d, chroms):
    # chroms: DataFrame with 'chrom' and 'length' columns, e.g.
    # pd.read_table('chrfile.txt', names=['chrom', 'length']).
    bins = np.append(np.unique(d['Start'].values), [np.inf])
    areas = np.zeros(len(bins) - 1)
    # Sum the "contact area" of each distance bin over all chromosomes.
    for i in range(len(chroms)):
        areas += expected.contact_areas(bins, (0, chroms['length'][i]),
                                        (0, chroms['length'][i]))
    return areas

def get_scaling_by_orientation(stats):
    # stats: pandas Series of pairtools stats values, named 'Value'.
    d = stats.loc[stats.index.str.startswith('dist_freq')]
    d = d.reset_index()
    # Keys look like 'dist_freq/<start>-<end>/<orientation>'.
    d[['Distance', 'Orientation']] = d['index'].str.split('/', expand=True).iloc[:, 1:]
    d = d.drop('index', axis=1)
    # The last bin is open-ended ('<start>+'), so its end becomes infinity.
    d[['Start', 'End']] = d['Distance'].str.split('-', expand=True).apply(
        lambda x: [int(i.replace('+', '')) if i is not None else np.inf for i in x])
    # Use the geometric mean of the bin edges as the bin's distance.
    d['Distance'] = np.exp((np.log(d['Start']) + np.log(d['End'])) / 2)
    d = d[['Orientation', 'Distance', 'Start', 'End', 'Value']]
    return d

def plot_scaling_by_orientation(stats, ax, chroms):
    d = get_scaling_by_orientation(stats)
    areas = get_areas(d, chroms)
    for orientation in '+-', '-+', '++', '--':
        sd = d[d['Orientation'] == orientation].copy()
        sd['Value'] /= areas  # normalize counts by contact area per bin
        ax.plot(sd['Distance'], sd['Value'] / d['Value'].sum(), label=orientation)
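
For completeness, a hypothetical way to call this (the stats and chromosome-sizes file names are placeholders; the default two-column key/value output of pairtools stats is assumed):

import matplotlib.pyplot as plt

# Load the two-column key/value stats output as a Series named 'Value'.
stats = pd.read_csv('sample.stats', sep='\t', header=None,
                    names=['key', 'Value'], index_col='key')['Value']
chroms = pd.read_table('mm9.chrom.sizes', names=['chrom', 'length'])

fig, ax = plt.subplots()
plot_scaling_by_orientation(stats, ax, chroms)
ax.set(xscale='log', yscale='log', xlabel='distance, bp',
       ylabel='normalized contact frequency')
ax.legend()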


gfudenberg commented May 20, 2018 via email


Phlya commented May 20, 2018

Thanks @gfudenberg!
Yes, same cell type. This is just an example; I've recently had problems like that with all libraries...
Yeah, there is a lot of trans (60-70%), and yeah, the scaling is flat on the right, so clearly a lot of random ligations or something like that. What I mean is that there are also differences on the left side, and I thought maybe those could hint at what exactly is going wrong, because right now I just have no clue...


gfudenberg commented May 20, 2018 via email


Phlya commented May 20, 2018

I realize that; I just thought that since you guys have been analyzing other people's data, you might have encountered something like this.

Do you mean 3 kb? Digestion works fine, though... The size shift after Hi-C with blunt-end ligation is sometimes hard to see on a gel, so last week I did 3C the same way as I do Hi-C, to check that the reagents are working fine, and the gel looks great.
[image: gel of the 3C control]
(Perhaps this is not the best place for an extended discussion like this...)


gfudenberg commented May 20, 2018 via email


nvictus commented May 20, 2018

There's an enrichment in the supposed "self-circle" fraction < 10kb with a strange "double hump". I'm not sure how unusual it is, but I would have guessed some issue with fill-in or ligation. Your gel suggests that ligation is not the problem though. Maybe it's downstream stuff like library prep or PCR.

Have you looked at @gibcus's bootcamp slides? There might be some QC ideas to try there, like digesting PCR products. He might be able to chime in too.


gibcus commented May 20, 2018 via email


Phlya commented May 20, 2018

Hi @nvictus, @gibcus,

Thanks for your ideas! Yeah, it's possible that the fill-in is inefficient, but I am not sure how to test that...

The trans fraction is very high. Crosslinking is not the problem, because this has happened with cells crosslinked by different people and even in different labs.


Phlya commented May 21, 2018

@gfudenberg you have a good eye! The small peak at 3 kb is real and present in a few of the other new bad datasets, but never in the old good ones. Here is an extremely bad dataset where it's really obvious:
[image: scaling curve with a pronounced peak at ~3 kb]

And from looking at a few of them, the height of the peak clearly correlates with how flat the scaling is at the right end! WTF is this?!


golobor commented May 21, 2018 via email


Phlya commented May 21, 2018

Yeah, there are differences - here are the values as fractions of total reads for this extremely bad one.

[images: fragment-level stats as fractions of total reads]


golobor commented May 21, 2018 via email


Phlya commented May 21, 2018

No, I haven't checked, actually... Why do you think it might be adaptors? distiller can do that for me, right, if I specify do_fastqc: True?


golobor commented May 21, 2018 via email


Phlya commented May 21, 2018

No adaptors, but some weird stuff is going on!!! Unequal levels of A and T across the whole read! And other things...
If you want to have a look, here is a zip with the FastQC output; serum-RI3t is good, RI4t is bad (the first plot at the top of this thread is from these).

fastqc.zip


golobor commented May 21, 2018 via email


Phlya commented May 21, 2018

(I have to say, they were sequenced on different machines)


Phlya commented May 21, 2018

Also, the GATC signature at the beginning of the reads is much more pronounced.


Phlya commented May 21, 2018

So no ideas from this?..


golobor commented May 21, 2018 via email


Phlya commented May 21, 2018

If you mean there is something wrong on the technical side of the sequencing: some bad samples were sequenced in one place, and some in another...

But I might ask them for their opinion on what could cause such issues.


golobor commented May 21, 2018 via email


Phlya commented May 21, 2018

a) I haven't run fastqc on all of them
b) yes

I can try to compare them more comprehensively tomorrow


golobor commented May 21, 2018 via email


Phlya commented May 21, 2018

Both good and bad data have been produced in two different facilities, and at one point a few months ago all libraries, wherever we sequenced them, started having similar issues.


Phlya commented May 22, 2018

So, I ran FastQC on many of the good and bad libraries, and the only mostly consistent problem with the bad ones is a very strong over-representation of a particular group of k-mers at a particular position in the forward reads. The sequence looks like part of the adaptor; why it is found so often at one particular position, and only in the forward read, is a complete enigma to me. But the actual counts are pretty low, and overall adaptor sequences are not very frequent, so I don't think this can be the source of these problems. (A quick way to tally where a given k-mer occurs is sketched below the screenshot.)
[image: FastQC over-represented k-mer profile along the read]
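
In case it helps with poking at this, a quick sketch that tallies the position at which a given k-mer starts within the reads of a gzipped FASTQ (the k-mer and file name are placeholders):

import gzip
from collections import Counter

KMER = 'AGATCGGAAG'  # placeholder: substitute the over-represented k-mer
positions = Counter()

with gzip.open('sample_R1.fastq.gz', 'rt') as f:
    for i, line in enumerate(f):
        if i % 4 == 1:  # sequence lines are every 4th, starting from the 2nd
            start = line.find(KMER)
            if start != -1:
                positions[start] += 1

for pos, n in sorted(positions.items()):
    print(pos, n)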


golobor commented May 22, 2018 via email


Phlya commented May 22, 2018

Well, it seems it's quite commonly present in the reverse reads, but not in the forward reads - whatever that means, and however that is possible... Here is an example.
[images: FastQC per-base content for forward and reverse reads]


Phlya commented May 22, 2018

And yeah, a similar effect for G vs C is also present in the reverse reads.


golobor commented May 22, 2018 via email


Phlya commented May 22, 2018

I will check that.

Well, as I said, I think it's unlikely that two facilities would stop producing good data at the same time, with similar symptoms...


Phlya commented May 22, 2018

It seems like the quality (the shallowness of the scaling) does correlate with this bias, yes. But it's hard to say definitively.


golobor commented May 22, 2018 via email


Phlya commented May 22, 2018

And actually I was wrong when I said it's not present in the forward reads: the reverse reads are certainly more affected, but in at least one run most samples have it in the forward reads too, to varying extents, which seems to correlate with just how bad they are.


Phlya commented May 22, 2018

Yeah, I noticed that too, and it's present (again, to varying extents) in a few samples I checked, but I'm not sure how to look into them - what exactly should I look for?


Phlya commented May 22, 2018

Also, how about we switch to e.g. G-chat, @golobor? I doubt everyone is interested in this!


wbszhu commented Jun 23, 2021

Hi,
Is there any way we can get this extra column (pair_frags_type)? I would really find this kind of stats useful.
Thanks a lot.
Luzhang
