
Aggregating results across pySCENIC runs #169

Closed
diegoalexespi opened this issue May 17, 2020 · 6 comments
Labels: results Question about pySCENIC results

Comments

@diegoalexespi

Hi!

I've seen a couple threads (#136 #157) that mention running the pySCENIC pipeline multiple times and then aggregating the regulons across runs.

My current approach has been to take the regulons after the pruning step and perform an inner-style merge, keeping only the TF-target pairs that are common across all runs (a minimal sketch of what I mean is below). However, I wanted to know whether the authors or others have found other approaches useful?
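A minimal sketch of the inner merge, assuming each run has already been flattened to a two-column TF/target table (file and column names are just placeholders):

import pandas as pd
from functools import reduce

# Each run's pruned regulons flattened to a two-column TF/target table.
runs = [
    pd.read_csv(f'run_{i}_pairs.csv')[['TF', 'target']].drop_duplicates()
    for i in range(1, 4)
]

# Inner-merge on both columns: keep only pairs present in every run.
common = reduce(lambda a, b: a.merge(b, on=['TF', 'target']), runs)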

Thanks!

@diegoalexespi diegoalexespi added the results Question about pySCENIC results label May 17, 2020
@cflerin
Contributor

cflerin commented May 18, 2020

Hi @d93espinoza ,

The response to #157 is generally what I would do, and I expanded on this a little in the documentation here, point 2. The Nextflow pipeline does this for you semi-automatically, but it sounds like you've already done the multiple runs.

The approach I've taken is a bit different from yours, though. There are two items of interest: the regulons themselves (TFs) and the target genes of each regulon. First, I do an outer join, keeping all regulons/targets. Then, I count the number of times each regulon/target occurred. This gives a distribution and some idea of how much variability there is, and you can then select a threshold as needed. For example, across 100 runs you'll find some regulons that occur 100% of the time, and others that occur only once (obviously these are low confidence). Generally I've taken a threshold of ~80% occurrence, but I still find it useful to know which other regulons were found but didn't pass the threshold (in case there is a specific regulon you expect to find but don't). The same logic applies to the target genes.
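To make that concrete, a rough sketch of the counting step (the file layout and threshold are just for illustration; assume each run has been flattened to a TF/target pair table):

import pandas as pd

n_runs = 100
runs = [
    pd.read_csv(f'run_{i}_pairs.csv')[['TF', 'target']].drop_duplicates()
    for i in range(1, n_runs + 1)
]

# Outer-style union: stack all runs, then count occurrences per pair.
counts = (
    pd.concat(runs)
    .groupby(['TF', 'target'])
    .size()
    .rename('n_runs')
    .reset_index()
)

# Look at the distribution first, then threshold (e.g. ~80% occurrence).
print(counts['n_runs'].describe())
kept = counts[counts['n_runs'] >= 0.8 * n_runs]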

Hope that helps, I'm interested to hear other approaches also.

@diegoalexespi
Author

Thank you for your response! Looking at the distributions as you mentioned is what I will do before choosing a threshold. I'd also be curious to hear other approaches, but this certainly covers what I was looking for!

@cflerin cflerin pinned this issue Feb 18, 2021
@cflerin cflerin closed this as completed Feb 18, 2021
@fbrundu

fbrundu commented Apr 12, 2021

Hi all, I have another question regarding aggregation. Since this thread is pinned, I thought I would post it here, but let me know if I should open a separate issue.

My question is about how to correctly aggregate the results produced with the pySCENIC Docker image.

Currently, I run the grn and ctx steps with:

for i in `seq 5`
do 
    docker run -it --rm \
        -v /home/ubuntu:/scenicdata \
        -p 8787:8787 \
        aertslab/pyscenic:0.11.0 pyscenic grn \
            --num_workers 8 \
            --cell_id_attribute cell \
            --gene_attribute gene \
            -o /scenicdata/data.adj.$i.tsv \
            /scenicdata/data.counts.loom \
            /scenicdata/hs_hgnc_curated_tfs.txt
    docker run -it --rm \
        -v /home/ubuntu:/scenicdata \
        -p 8788:8787 \
        aertslab/pyscenic:0.11.0 pyscenic ctx \
            /scenicdata/data.adj.$i.tsv \
            /scenicdata/hg38__refseq-r80__10kb_up_and_down_tss.mc9nr.feather \
            /scenicdata/hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr.feather \
            --annotations_fname /scenicdata/motifs-v9-nr.hgnc-m0.001-o0.0.tbl \
            --expression_mtx_fname /scenicdata/data.counts.loom \
            --mode "dask_multiprocessing" \
            --output /scenicdata/data.regulons.$i.csv \
            --num_workers 8 \
            --gene_attribute gene \
            --cell_id_attribute cell
done

At this point, what would be the cleanest way to integrate the different regulon files? In this notebook, the following function is used (adapted here for hg38):

from cytoolz import compose
import numpy as np
import operator as op
from pyscenic.transform import df2regulons

def derive_regulons(
    motifs,
    db_names=(
        'hg38__refseq-r80__10kb_up_and_down_tss.mc9nr',
        'hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr'
    )
):
    motifs.columns = motifs.columns.droplevel(0)

    def contains(*elems):
        def f(context):
            return any(elem in context for elem in elems)
        return f

    # For the creation of regulons we only keep enriched motifs from the
    # specified ranking databases and the activating modules. We also remove
    # the enriched motifs for modules created with the 'weight>50.0%' method
    # (these modules are no longer part of the default settings of
    # modules_from_adjacencies).
    motifs = motifs[
        np.fromiter(
            map(compose(op.not_, contains('weight>50.0%')), motifs.Context),
            dtype=bool
        )
        & np.fromiter(map(contains(*db_names), motifs.Context), dtype=bool)
        & np.fromiter(map(contains('activating'), motifs.Context), dtype=bool)
    ]

    # We build regulons only using enriched motifs with a NES of 3.0 or higher;
    # we only take directly annotated TFs or TFs annotated for an orthologous
    # gene into account; and we only keep regulons with at least 10 genes.
    regulons = list(
        filter(lambda r: len(r) >= 10,
            df2regulons(motifs[
                (motifs['NES'] >= 3.0) & (
                    (motifs['Annotation'] == 'gene is directly annotated') | (
                        motifs['Annotation'].str.startswith(
                            'gene is orthologous to'
                        )
                        & motifs['Annotation'].str.endswith(
                            'which is directly annotated for motif'
                        )
                    )
                )
            ])
        )
    )

    # Rename regulons, i.e. remove the suffix.
    return list(map(lambda r: r.rename(r.transcription_factor), regulons))
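For reference, the function above would then be used roughly like this (file name hypothetical):

from pyscenic.utils import load_motifs

motifs = load_motifs('data.regulons.1.csv')
regulons = derive_regulons(motifs)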

However, is this the same process that is used by the aucell step within the pySCENIC Docker image?

To elaborate further: according to the current version of the documentation, the Python code to get regulons from grn results is the following:

from dask.diagnostics import ProgressBar
from pyscenic.utils import modules_from_adjacencies
from pyscenic.prune import prune2df, df2regulons

modules = list(modules_from_adjacencies(adjacencies, ex_matrix))

# Calculate a list of enriched motifs and the corresponding target genes for all modules.
with ProgressBar():
    df = prune2df(dbs, modules, MOTIF_ANNOTATIONS_FNAME)

# Create regulons from this table of enriched motifs.
regulons = df2regulons(df)

Is the output of the Docker ctx step already composed of pruned regulons? If so, the dataframe of regulons can be generated simply with:

import pandas as pd
from pyscenic.utils import load_motifs
from pyscenic.transform import df2regulons

motifs = load_motifs('ctx_output.csv')
regulons = df2regulons(motifs)

# One column per regulon, listing its target genes.
r_df = None
for r in regulons:
    targets = list(r.gene2weight.keys())
    r_df = pd.concat([r_df, pd.DataFrame({r.name: targets})], axis=1)
r_df.fillna('').to_csv('data.regulons.csv')

Subsequent selection of regulons could then be based on how many times a target gene appears across the different iterations.

However, in this case the format of the regulons file is different from what aucell expects, so can the last step only be executed within Python? E.g.:

from pyscenic.aucell import aucell

auc_mtx = aucell(ex_matrix, regulons, num_workers=4)

What would be the best way to integrate the n runs of grn and ctx, select top target genes for each regulon to present to pyscenic aucell, and generate a dataframe where each column represents a regulon (with target genes as values)?

Thanks,
Francesco

@cflerin
Contributor

cflerin commented Apr 15, 2021

Hi @fbrundu ,

For integrating multiple runs of SCENIC, there is a semi-automated implementation in our collection of Nextflow pipelines (VSN Pipelines), so it might be easiest to use this approach (see here for details).

But given that you've already started with the grn+ctx steps, you could integrate these "manually" using the existing scripts we have. Take a look at aggregate_multi_runs_regulons.py and aggregate_multi_runs_features.py, although the latter starts from the AUCell loom file output.

But for your approach above, you could also take the ctx output data.regulons.$i.csv and load it into pySCENIC with:

from pyscenic.cli.utils import load_signatures
sig = load_signatures('data.regulons.$i.csv')

Access all of the TFs (in one run) with [x.transcription_factor for x in sig], and the genes in each regulon with sig[0].genes. You'll have to aggregate TFs and genes across all runs, deciding on an occurrence threshold to keep. Then you can create each of your aggregated regulons with:

from pyscenic.genesig import Regulon

Regulon(name='Aggregated regulon', gene2weight={'target1': 1.0, 'target2': 1.0}, transcription_factor='TF1', gene2occurrence={'target1': 1, 'target2': 1 })

You can set each gene2weight value to 1.0, and each gene2occurrence value to the number of times you see that target gene across your runs.
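Putting those pieces together, a sketch of the whole aggregation (the number of runs, file names, and the 80% threshold are illustrative):

from collections import Counter
from pyscenic.cli.utils import load_signatures
from pyscenic.genesig import Regulon

n_runs = 10
runs = [load_signatures(f'data.regulons.{i}.csv') for i in range(1, n_runs + 1)]

# Count how many runs each (TF, target) pair appears in.
pair_counts = Counter(
    (r.transcription_factor, g) for run in runs for r in run for g in r.genes
)

# Keep pairs seen in >= 80% of runs, grouped per TF.
tf2targets = {}
for (tf, gene), n in pair_counts.items():
    if n >= 0.8 * n_runs:
        tf2targets.setdefault(tf, {})[gene] = n

# Build one aggregated regulon per TF (unit weights, occurrence counts kept).
agg_regulons = [
    Regulon(
        name=tf,
        gene2weight={g: 1.0 for g in targets},
        transcription_factor=tf,
        gene2occurrence=targets,
    )
    for tf, targets in tf2targets.items()
]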

@fbrundu

fbrundu commented Apr 16, 2021

Hi @cflerin ,

Thank you for the information. I had looked at the Nextflow pipelines, but since I have no experience with Nextflow, I decided not to use them for now.

I resorted to an approach similar to the one you outlined (given that I had already run the grn+ctx steps).

Briefly, I filtered regulons and target genes and generated a GMT file as input for the AUCell step (assuming that, due to the integration, gene weights would be difficult to compute). Code below (3 runs):

from collections import Counter
import pandas as pd
from pyscenic.utils import load_motifs
from pyscenic.transform import df2regulons

# Load motifs
motifs = []
for i in range(1, 4):
    motifs.append(load_motifs(f'{prefix}.regulons.{i}.csv'))

# Collect regulons: one column per regulon, listing its target genes
regulons = []
for regulons_l in map(df2regulons, motifs):
    single = None
    for r in regulons_l:
        targets = list(r.gene2weight.keys())
        single = pd.concat([single, pd.DataFrame({r.name: targets})], axis=1)
    regulons.append(single.fillna(''))

# Get regulons identified in at least 2 runs (out of 3)
columns = Counter([
    e for l in map(lambda x: list(x.columns), regulons) for e in l
])
columns = [e for e in columns if columns[e] > 1]

# For each regulon, get target genes identified in at least 2 runs
stable_regulons = None
for c in columns:
    targets = []
    for r in regulons:
        if c in r.columns:
            targets.extend(filter(lambda x: x != '', r[c]))
    targets = Counter(targets)
    targets = [e for e in targets if targets[e] > 1]
    stable_regulons = pd.concat([
        stable_regulons, pd.DataFrame({c: targets})
    ], axis=1)

# List of target genes per stable regulon
stable_regulons.to_csv(f'{prefix}.regulons.final.csv', index=False)

# Input data for AUCell
with open(f'{prefix}.regulons.final.gmt', 'wt') as gmt:
    for c in stable_regulons.columns:
        genes = '\t'.join(stable_regulons.loc[:, c].dropna())
        gmt.write(f'{c}\tRegulon_{c}\t{genes}\n')

I am not sure if gene2occurrence (the number of times a target gene appears in my runs) has any influence on the results of AUCell. In my code above, I only passed the list of genes for each regulon.
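For completeness, the AUCell step can then be run in Python from the GMT file, roughly like this (I pass the tab separators explicitly to match the file written above):

from pyscenic.aucell import aucell
from pyscenic.genesig import GeneSignature

# Load the stable regulons back as plain (unit-weight) gene signatures.
signatures = GeneSignature.from_gmt(
    f'{prefix}.regulons.final.gmt',
    field_separator='\t',
    gene_separator='\t',
)

auc_mtx = aucell(ex_matrix, signatures, num_workers=4)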

Thanks

@lucygarner

Hi @cflerin,

For multi-run analysis, do you calculate an AUCell score on the "final" regulons, i.e. after filtering for regulons that occur above a threshold and also for targets that are found within that regulon, e.g., 80% of the time?

Best wishes,
Lucy
