
Duplicate edges in Hetionet #13

Open
veleritas opened this issue Mar 17, 2017 · 6 comments
Comments

@veleritas

Hi Daniel,

Just wanted to note that there are still duplicate edges in Hetionet in the newest integrate.ipynb. Specifically, the following two types of relationships raise duplicate-edge errors when the notebook is run:

Disease-gene differential expression edges

commit = '1a11633b5e0095454453335be82012a9f0f482e4'
url = rawgit('dhimmel', 'stargeo', commit, 'data/diffex.tsv')
stargeo_df = pandas.read_table(url)
# Filter to at most 250 up and 250 down-regulated genes per disease
stargeo_df = stargeo_df.groupby(['slim_id', 'direction']).apply(
    lambda df: df.nsmallest(250, 'p_adjusted')).reset_index(drop=True)
stargeo_df.head(2)

for row in stargeo_df.itertuples():
    source_id = 'Disease', row.slim_id
    target_id = 'Gene', row.entrez_gene_id
    kind = row.direction + 'regulates'
    data = {
        'source': 'STARGEO',
        'log2_fold_change': round(row.log2_fold_change, 5),
        'unbiased': True,
        'license': 'CC0 1.0'
    }
    graph.add_edge(source_id, target_id, kind, 'both', data)
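One defensive option (a hypothetical sketch with toy data, not part of the notebook) is to collapse duplicate (disease, direction, gene) rows before the edge loop, keeping the most significant row, so graph.add_edge never sees the same edge twice:

```python
import pandas as pd

# Toy frame mimicking stargeo_df's relevant columns (hypothetical values)
stargeo_df = pd.DataFrame({
    'slim_id': ['DOID:1', 'DOID:1', 'DOID:2'],
    'direction': ['up', 'up', 'down'],
    'entrez_gene_id': [100, 100, 200],
    'p_adjusted': [0.01, 0.02, 0.03],
})

# Sort by significance, then keep the first (best) row per edge triple
deduped = (stargeo_df
           .sort_values('p_adjusted')
           .drop_duplicates(['slim_id', 'direction', 'entrez_gene_id']))
print(len(deduped))  # 2
```

The same pattern would apply to the LINCS frame below, keyed on (perturbagen, direction, entrez_gene_id).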

LINCS Compound-gene dysregulation edges

url = rawgit('dhimmel', 'lincs', commit, 'data/consensi/signif/dysreg-drugbank.tsv')
l1000_df = pandas.read_table(url)
l1000_df = l1000_df.query("perturbagen in @compound_df.drugbank_id and entrez_gene_id in @coding_genes")
l1000_df = filter_l1000_df(l1000_df, n=125)
l1000_df.tail(2)

mapper = {'up': 'upregulates', 'down': 'downregulates'}
for row in l1000_df.itertuples():
    source_id = 'Compound', row.perturbagen
    target_id = 'Gene', row.entrez_gene_id
    data = {
        'source': 'LINCS L1000',
        'z_score': round(row.z_score, 3),
        'method': row.status,
        'unbiased': True,
    }
    kind = mapper[row.direction]
    graph.add_edge(source_id, target_id, kind, 'both', data)

Also, is the metapath generation supposed to be exponential in the number of metaedges in the network? I noticed that if I don't include these metaedges, but include everything else, then the number of metapaths drops from 1200 to only 130:

['Compound', 'Disease', 'palliates', 'both']
['Compound', 'Gene', 'downregulates', 'both']
['Compound', 'Gene', 'upregulates', 'both']
['Disease', 'Gene', 'downregulates', 'both']
['Disease', 'Gene', 'upregulates', 'both']

The four regulation metaedges were not included because of the edge import errors, and the palliates one because I excluded it for testing purposes.

@veleritas
Author

At the moment I'm bypassing the error by enclosing the add_edge() call in a try/except block, and it seems to work fine. Including the up/down-regulation edges increased the number of metapaths to ~900, so the growth does seem to be exponential.
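That workaround can be sketched like this (the helper name is mine; it assumes graph.add_edge raises an AssertionError on a duplicate edge, as discussed later in the thread):

```python
def add_edge_skip_duplicates(graph, source_id, target_id, kind, direction, data):
    """Hypothetical helper: swallow the duplicate-edge AssertionError that
    Graph.add_edge raises when the edge already exists, keeping the first copy."""
    try:
        graph.add_edge(source_id, target_id, kind, direction, data)
    except AssertionError:
        pass  # edge already present; silently skip the repeat
```

The downside, as noted below, is that this hides whatever upstream change is producing the duplicates in the first place.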

@dhimmel
Owner

dhimmel commented Mar 18, 2017

Specifically, the following two types of relationships give duplicate edge errors

@veleritas you're getting the AssertionError: edge already exists? I just reinstalled my integrate conda environment and tried out the two metaedges that were giving you trouble. I didn't get any errors. One possibility is that you ran those notebook cells multiple times: every repeat execution of a cell containing graph.add_edge will now cause an error.

At the moment I'm bypassing the error by enclosing the add_edge() call in a try except block and it seems to work fine.

Hopefully we can diagnose your issue, so you can remove the error handling here.

Also, is the metapath generation supposed to be exponential in the number of metaedges in the network? I noticed that if I don't include these metaedges, but include everything else, then the number of metapaths drops from 1200 to only 130

It's a combinatorial explosion! Not sure if that counts as exponential. The reason the 5 edges you mention have such a huge effect on the total number of possible metapaths is that they connect genes, compounds, and diseases -- which also have lots of other metaedges. In the future, I could see some heuristic method that only computed DWPCs for metapaths that were likely to provide novel information.
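The effect is easy to reproduce on a toy metagraph (a sketch with made-up metaedges, not Hetionet's actual metagraph): counting all metapaths up to length 4, before and after adding a single Compound-Gene metaedge, shows how one hub edge multiplies the total.

```python
from collections import defaultdict

def count_metapaths(metaedges, max_length=4):
    """Count walks (metapaths) of length 1..max_length over undirected
    metaedges; each metaedge may be traversed in either direction."""
    adj = defaultdict(list)
    for source, target, kind in metaedges:
        adj[source].append(target)
        adj[target].append(source)
    total = 0
    frontier = {node: 1 for node in adj}  # length-0 walks ending at each type
    for _ in range(max_length):
        step = defaultdict(int)
        for node, ways in frontier.items():
            for neighbor in adj[node]:
                step[neighbor] += ways
        total += sum(step.values())
        frontier = step
    return total

base = [('Compound', 'Disease', 'treats'),
        ('Disease', 'Gene', 'associates')]
extra = base + [('Compound', 'Gene', 'upregulates')]
print(count_metapaths(base), count_metapaths(extra))  # 30 90
```

One added metaedge triples the count here; in a metagraph already dense with Gene, Compound, and Disease metaedges, the multiplier is far larger.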

@veleritas
Author

So I went back to see if I could pin down why we seem to be getting different results. On a fresh Ubuntu 16.04 instance, I have confirmed that integrate.ipynb runs just fine, without the edge-exists AssertionError, using the Anaconda environment specified by https://github.com/dhimmel/integrate/blob/master/environment.yml

(I am using Anaconda 4.3.1 for these tests).

However, if you update the packages in the integrate environment with conda update --all, then the integrate notebook breaks on the two edge types I mentioned in the first comment. It seems weird to me that updating Python dependencies would break the integrate code at this point in time, but it seems like this should qualify as a bug?

Here's the environment.yml file dump after the conda update command:

name: integrate
channels:
- defaults
dependencies:
- bleach=1.5.0=py35_0
- cycler=0.10.0=py35_0
- dbus=1.10.10=0
- decorator=4.0.11=py35_0
- entrypoints=0.2.2=py35_1
- et_xmlfile=1.0.1=py35_0
- expat=2.1.0=0
- fontconfig=2.12.1=3
- freetype=2.5.5=2
- glib=2.50.2=1
- gst-plugins-base=1.8.0=0
- gstreamer=1.8.0=0
- html5lib=0.999=py35_0
- icu=54.1=0
- ipykernel=4.5.2=py35_0
- ipython=5.3.0=py35_0
- ipython_genutils=0.1.0=py35_0
- ipywidgets=6.0.0=py35_0
- jdcal=1.3=py35_0
- jinja2=2.9.5=py35_0
- jpeg=9b=0
- jsonschema=2.5.1=py35_0
- jupyter=1.0.0=py35_1
- jupyter_client=5.0.0=py35_0
- jupyter_console=5.1.0=py35_0
- jupyter_core=4.3.0=py35_0
- libffi=3.2.1=1
- libgcc=5.2.0=0
- libgfortran=3.0.0=1
- libiconv=1.14=0
- libpng=1.6.27=0
- libsodium=1.0.10=0
- libxcb=1.12=1
- libxml2=2.9.4=0
- markupsafe=0.23=py35_2
- matplotlib=2.0.0=np112py35_0
- mistune=0.7.4=py35_0
- mkl=2017.0.1=0
- nbconvert=5.1.1=py35_0
- nbformat=4.3.0=py35_0
- notebook=4.4.1=py35_0
- numexpr=2.6.2=np112py35_0
- numpy=1.12.1=py35_0
- openssl=1.0.2k=1
- pandas=0.19.2=np112py35_1
- pandocfilters=1.4.1=py35_0
- path.py=10.1=py35_0
- pcre=8.39=1
- pexpect=4.2.1=py35_0
- pickleshare=0.7.4=py35_0
- pip=9.0.1=py35_1
- prompt_toolkit=1.0.13=py35_0
- ptyprocess=0.5.1=py35_0
- pygments=2.2.0=py35_0
- pyparsing=2.1.4=py35_0
- pyqt=5.6.0=py35_2
- python=3.5.3=1
- python-dateutil=2.6.0=py35_0
- pytz=2016.10=py35_0
- pyzmq=16.0.2=py35_0
- qt=5.6.2=3
- qtconsole=4.2.1=py35_1
- readline=6.2=2
- requests=2.13.0=py35_0
- scipy=0.19.0=np112py35_0
- seaborn=0.7.1=py35_0
- setuptools=27.2.0=py35_0
- simplegeneric=0.8.1=py35_1
- sip=4.18=py35_0
- six=1.10.0=py35_0
- sqlite=3.13.0=0
- terminado=0.6=py35_0
- testpath=0.3=py35_0
- tk=8.5.18=0
- tornado=4.4.2=py35_0
- traitlets=4.3.2=py35_0
- wcwidth=0.1.7=py35_0
- wheel=0.29.0=py35_0
- widgetsnbextension=2.0.0=py35_0
- xlsxwriter=0.9.6=py35_0
- xz=5.2.2=1
- zeromq=4.1.5=0
- zlib=1.2.8=3
- pip:
  - et-xmlfile==1.0.1
  - hetio==0.2.3
  - ipython-genutils==0.1.0
  - jupyter-client==5.0.0
  - jupyter-console==5.1.0
  - jupyter-core==4.3.0
  - prompt-toolkit==1.0.13
  - py2neo==2.0.8
  - tqdm==4.11.2
prefix: /home/ubuntu/anaconda3/envs/integrate

@dhimmel
Owner

dhimmel commented Mar 20, 2017

My guess is that some pandas behavior has changed.

Can you see which rows are duplicated using the following:

l1000_df[l1000_df.duplicated(['perturbagen', 'entrez_gene_id'], keep=False)]
stargeo_df[stargeo_df.duplicated(['slim_id', 'entrez_gene_id'], keep=False)]
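For reference, DataFrame.duplicated with keep=False flags every member of a duplicated group, so the expressions above return all offending rows rather than only the later repeats (the DrugBank ID and values below are hypothetical toy data):

```python
import pandas as pd

df = pd.DataFrame({
    'perturbagen': ['DB01234', 'DB01234', 'DB05678'],
    'entrez_gene_id': [100, 100, 200],
    'z_score': [2.1, -1.9, 3.0],
})

# keep=False marks both copies of the duplicated (perturbagen, gene) pair
dupes = df[df.duplicated(['perturbagen', 'entrez_gene_id'], keep=False)]
print(len(dupes))  # 2
```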

It seems weird to me that updating Python dependencies would break the integrate code at this point in time, but it seems like this should classify as a bug?

Version changes frequently break things! If you want to update a dependency for an existing codebase, I'd do it one at a time and carefully. I wouldn't recommend conda update --all in these instances. Different codebases have different compatibility needs. For example, dhimmel/hetio targets python 3.4+, but for a scripted analysis like dhimmel/integrate it usually makes sense to pick a single environment and stick to it.

That being said, I'm happy to implement a forward compatible syntax if we can figure out what the bug is.

@veleritas
Author

I can try to figure out what changed to cause these duplicate edges, but that will probably take a few days as I work through other priorities.

@dhimmel
Owner

dhimmel commented Mar 21, 2017

I can try to figure out what changed to cause these duplicate edges, but that will probably take a few days as I work through other priorities.

Up to you. The motivation to diagnose it rather than use error handling is the possibility that it's part of a bigger problem... but if you're getting the expected number of edges, it's probably not a huge issue.
