Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem when running LMM tutorial #109

Open
flashton2003 opened this issue Jan 23, 2020 · 12 comments
Open

Problem when running LMM tutorial #109

flashton2003 opened this issue Jan 23, 2020 · 12 comments
Assignees

Comments

@flashton2003
Copy link

Hello,

We are running through the tutorial for the linear mixed model and are having some issues.

We downloaded the data from FigShare, ran fsm-lite (can't see how to get the fsm-lite version, but was fresh conda install).

fsm-lite -l ../2020.01.21/fsm_file_list.txt -s 6 -S 610 -v -t 2020.01.21.fsm_kmers.tmp | gzip -c - > ../2020.01.21/2020.01.21.fsm_kmers.txt.gz

There were 16630178 kmers.

Generated the phylogeny distance from the tree in the figshare:

python scripts/phylogeny_distance.py --lmm core_genome_aln.tree > phylogeny_K.tsv

Then, ran pyseer:

pyseer --lmm --phenotypes ../resistances.pheno --kmers 2020.01.21.fsm_kmers.txt.gz --similarity phylogeny_K.tsv --output-patterns kmer_patterns.txt --cpu 4 > penicillin_kmers.txt

Which reported this

Detected binary phenotype
Setting up LMM
Similarity matrix has dimension (616, 616)
Analysing 603 samples found in both phenotype and similarity matrix
h^2 = 0.90
16630178 loaded variants
1028222 filtered variants
15601956 tested variants
15601956 printed variants

Counted the patterns to get the significance threshold, which was 5.52E-08. Then filtered for significant k-mers

cat <(head -1 penicillin_kmers.txt) <(awk '$4<5.52E-08 {print $0}' penicillin_kmers.txt) > significant_kmers.txt

Which reports no significant k-mers.

Also, the qq_plot looks like this:

qq

We've run through it all a couple of times in case we made a blunder somewhere, can't figure out where we've gone wrong.

Any help much appreciated!

@mgalardini
Copy link
Owner

Hi @flashton2003, what I can see so far is a slight difference in the number of input kmers:

16630178 loaded variants
1028222 filtered variants
15601956 tested variants
15601956 printed variants

versus the tutorial:

15167239 loaded variants
1042215 filtered variants
14125024 tested variants
14124993 printed variants

I'm unsure that really matters though. I will try to find some time to run again the tutorial on my end and see if I can diagnose this difference. For what it's worth I don't see any mistake on your part, and recent re-runs of pyseer on my own data gave the same results as runs with previous versions.

@johnlees
Copy link
Collaborator

johnlees commented Jan 23, 2020 via email

@flashton2003
Copy link
Author

In terms of the environment etc, sorry should have mentioned, it's running in a fresh conda env. PySeer v1.3.4.

Other packages installed in the env

#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       0_gnu    conda-forge
bedops                    2.4.37               hc9558a2_0    bioconda
bedtools                  2.29.2               hc088bd4_0    bioconda
binutils_impl_linux-64    2.33.1               he1b5a44_7    conda-forge
binutils_linux-64         2.33.1              h9595d00_16    conda-forge
boost                     1.68.0          py36h8619c78_1001    conda-forge
boost-cpp                 1.68.0            h11c811c_1000    conda-forge
bwa                       0.7.17               hed695b0_7    bioconda
bzip2                     1.0.8                h516909a_2    conda-forge
ca-certificates           2019.11.28           hecc5488_0    conda-forge
capnproto                 0.6.1                hfc679d8_1    conda-forge
certifi                   2019.11.28               py36_0    conda-forge
curl                      7.65.3               hf8cf82a_0    conda-forge
cycler                    0.10.0                     py_2    conda-forge
dbus                      1.13.6               he372182_0    conda-forge
dendropy                  4.4.0                      py_1    bioconda
expat                     2.2.5             he1b5a44_1004    conda-forge
fontconfig                2.13.1            he4413a7_1000    conda-forge
freetype                  2.10.0               he983fc9_1    conda-forge
fsm-lite                  1.0                  hc9558a2_1    bioconda
gcc_impl_linux-64         7.3.0                hd420e75_4    conda-forge
gcc_linux-64              7.3.0               h553295d_16    conda-forge
gettext                   0.19.8.1          hc5be6a0_1002    conda-forge
glib                      2.58.3          py36h6f030ca_1002    conda-forge
glmnet_py                 1.0.0            py36he991be0_1    conda-forge
gsl                       2.5                  h294904e_1    conda-forge
gst-plugins-base          1.14.5               h0935bb2_0    conda-forge
gstreamer                 1.14.5               h36ae1b5_0    conda-forge
gxx_impl_linux-64         7.3.0                hdf63c60_4    conda-forge
gxx_linux-64              7.3.0               h553295d_16    conda-forge
htslib                    1.9                  ha228f0b_7    bioconda
icu                       58.2              hf484d3e_1000    conda-forge
joblib                    0.14.1                     py_0    conda-forge
jpeg                      9c                h14c3975_1001    conda-forge
kiwisolver                1.1.0            py36hc9558a2_0    conda-forge
krb5                      1.16.4               h2fd8d38_0    conda-forge
ld_impl_linux-64          2.33.1               h53a641e_7    conda-forge
libblas                   3.8.0               14_openblas    conda-forge
libcblas                  3.8.0               14_openblas    conda-forge
libclang                  9.0.1           default_hde54327_0    conda-forge
libcurl                   7.65.3               hda55be3_0    conda-forge
libdeflate                1.0                  h14c3975_1    bioconda
libedit                   3.1.20170329      hf8c457e_1001    conda-forge
libffi                    3.2.1             he1b5a44_1006    conda-forge
libgcc-ng                 9.2.0                h24d8f2e_2    conda-forge
libgfortran-ng            7.3.0                hdf63c60_4    conda-forge
libgomp                   9.2.0                h24d8f2e_2    conda-forge
libiconv                  1.15              h516909a_1005    conda-forge
liblapack                 3.8.0               14_openblas    conda-forge
libllvm9                  9.0.1                hc9558a2_0    conda-forge
libopenblas               0.3.7                h5ec1e0e_6    conda-forge
libpng                    1.6.37               hed695b0_0    conda-forge
libssh2                   1.8.2                h22169c7_2    conda-forge
libstdcxx-ng              9.2.0                hdf63c60_2    conda-forge
libuuid                   2.32.1            h14c3975_1000    conda-forge
libxcb                    1.13              h14c3975_1002    conda-forge
libxkbcommon              0.9.1                hebb1f50_0    conda-forge
libxml2                   2.9.9                hea5a465_1  
mantis                    0.2                  h8b12597_2    bioconda
mash                      2.2.2                h3d38be6_1    bioconda
matplotlib                3.1.1            py36h5429711_0  
ncurses                   6.1               hf484d3e_1002    conda-forge
newick_utils              1.6                  h470a237_2    bioconda/label/cf201901
nspr                      4.24                 he1b5a44_0    conda-forge
nss                       3.47                 he751ad9_0    conda-forge
numpy                     1.17.3           py36h95a1406_0    conda-forge
openssl                   1.1.1d               h516909a_0    conda-forge
pandas                    0.25.3           py36hb3f55d8_0    conda-forge
patsy                     0.5.1                      py_0    conda-forge
pcre                      8.43                 he1b5a44_0    conda-forge
perl                      5.26.2            h516909a_1006    conda-forge
pip                       19.3.1                   py36_0    conda-forge
pthread-stubs             0.4               h14c3975_1001    conda-forge
pybedtools                0.8.1            py36he513fc3_0    bioconda
pyparsing                 2.4.6                      py_0    conda-forge
pyqt                      5.9.2            py36hcca6a23_4    conda-forge
pysam                     0.15.3           py36hda2845c_1    bioconda
pyseer                    1.3.4                      py_0    bioconda
python                    3.6.7             h357f687_1006    conda-forge
python-dateutil           2.8.1                      py_0    conda-forge
pytz                      2019.3                     py_0    conda-forge
qt                        5.9.7                h52cfd70_2    conda-forge
readline                  8.0                  hf8c457e_0    conda-forge
samtools                  1.9                 h10a08f8_12    bioconda
scikit-learn              0.22.1           py36hcdab131_1    conda-forge
scipy                     1.4.1            py36h921218d_0    conda-forge
sdsl-lite                 2.1.1             hc9558a2_1001    conda-forge
setuptools                45.0.0                   py36_1    conda-forge
sip                       4.19.8           py36hf484d3e_0  
six                       1.13.0                   py36_0    conda-forge
sqlite                    3.30.1               hcee41ef_0    conda-forge
squeakr                   0.6                  h8bd17aa_0    bioconda
statsmodels               0.10.2           py36hc1659b7_0    conda-forge
tk                        8.6.10               hed695b0_0    conda-forge
tornado                   6.0.3            py36h516909a_0    conda-forge
tqdm                      4.7.2                    py36_0    bioconda
unitig-caller             1.0.0            py36h76f5088_0    bioconda
unitig-counter            1.0.5                hdbcaa40_0    bioconda
wheel                     0.33.6                   py36_0    conda-forge
xorg-libxau               1.0.9                h14c3975_0    conda-forge
xorg-libxdmcp             1.1.3                h516909a_0    conda-forge
xz                        5.2.4             h14c3975_1001    conda-forge
zlib                      1.2.11            h516909a_1006    conda-forge

@johnlees
Copy link
Collaborator

@flashton2003 great, thanks for the info. I'm away for the next week, but will try and replicate this as soon as I get a chance

Must be an issue with some of the numerical packages, but hopefully we can work it out and stop it in future

@mgalardini
Copy link
Owner

Hi @flashton2003, could you clone the repository, activate your environment and try to run the unit tests? That could help us diagnose the problem or its source a bit better.

The tests can be run like this, from the root of the repo:

pytest -v tests/
cd tests/
bash run_tests.sh

I will see if I can reproduce this behavior by mimicking your environment.

@flashton2003
Copy link
Author

Hi @mgalardini, ok I've done that.

First thing, not quite sure if this is the intended output of pytest command?

going into /home/ubuntu/pyseer-master/tests
no test dir found testing here: /home/ubuntu/pyseer-master
*******************************************************************************

Then, all the tests passed, this was the output

Comparing results and error messages to baseline 1
Changes in stderr messages
Comparing results and error messages to baseline 2
Changes in stderr messages
Comparing results and error messages to baseline 3
Changes in stderr messages
Comparing results and error messages to baseline 4
Comparing results and error messages to baseline 5
af	1.0000
beta	1.0000
beta-std-err	1.0000
intercept	1.0000
PC1	1.0000
PC2	1.0000
PC3	1.0000
filter-pvalue	1.0000
lrt-pvalue	1.0000
Changes in stderr messages
Comparing results and error messages to baseline 6
af	1.0000
beta	1.0000
beta-std-err	1.0000
intercept	1.0000
PC1	1.0000
PC2	1.0000
PC3	1.0000
filter-pvalue	1.0000
lrt-pvalue	1.0000
Changes in stderr messages
Comparing results and error messages to baseline 7
Changes in stderr messages
Comparing results and error messages to baseline 8
Changes in stderr messages
Comparing results and error messages to baseline 9
af	1.0000
beta	1.0000
beta-std-err	1.0000
intercept	1.0000
PC1	1.0000
PC2	1.0000
PC3	1.0000
filter-pvalue	1.0000
lrt-pvalue	1.0000
Changes in stderr messages
Comparing results and error messages to baseline 10
Changes in stderr messages
Comparing results and error messages to baseline 11
Changes in stderr messages
Comparing results and error messages to baseline 12
Changes in stderr messages
Comparing results and error messages to baseline 13
Changes in stderr messages
Comparing results and error messages to baseline 14
Changes in stderr messages
Comparing results and error messages to baseline 15
af	1.0000
beta	1.0000
beta-std-err	1.0000
intercept	1.0000
PC1	1.0000
PC2	1.0000
PC3	1.0000
filter-pvalue	1.0000
lrt-pvalue	1.0000
Changes in stderr messages
Comparing results and error messages to baseline 16
af	1.0000
beta	1.0000
beta-std-err	1.0000
intercept	1.0000
PC1	1.0000
PC2	1.0000
PC3	1.0000
filter-pvalue	1.0000
lrt-pvalue	1.0000
Changes in stderr messages
Comparing results and error messages to baseline 17
af	1.0000
beta	0.9999
beta-std-err	1.0000
intercept	0.9999
PC1	0.9989
PC2	0.9995
PC3	0.9997
filter-pvalue	1.0000
lrt-pvalue	0.9998
Changes in stderr messages
Comparing results and error messages to baseline 18
af	1.0000
beta	1.0000
beta-std-err	1.0000
intercept	1.0000
PC1	1.0000
PC2	1.0000
PC3	1.0000
filter-pvalue	1.0000
lrt-pvalue	1.0000
Changes in stderr messages
Comparing results and error messages to baseline 19
af	1.0000
beta	1.0000
beta-std-err	1.0000
intercept	1.0000
PC1	1.0000
PC2	1.0000
PC3	1.0000
filter-pvalue	1.0000
lrt-pvalue	1.0000
Changes in stderr messages
Comparing results and error messages to baseline 20
af	1.0000
beta	1.0000
beta-std-err	1.0000
variant_h2	1.0000
filter-pvalue	1.0000
lrt-pvalue	1.0000
Changes in stderr messages
Comparing results and error messages to baseline 21
af	1.0000
beta	1.0000
beta-std-err	1.0000
variant_h2	1.0000
filter-pvalue	1.0000
lrt-pvalue	1.0000
Changes in stderr messages
Comparing results and error messages to baseline 22
af	1.0000
beta	1.0000
beta-std-err	1.0000
variant_h2	1.0000
filter-pvalue	1.0000
lrt-pvalue	1.0000
Changes in stderr messages
Comparing results and error messages to baseline 23
af	1.0000
beta	1.0000
beta-std-err	1.0000
variant_h2	1.0000
filter-pvalue	1.0000
lrt-pvalue	1.0000
Changes in stderr messages
Comparing results and error messages to baseline 24
Comparing results and error messages to baseline 25
Changes in stderr messages
Comparing results and error messages to baseline 26
af	1.0000
beta	1.0000
beta-std-err	1.0000
variant_h2	1.0000
filter-pvalue	1.0000
lrt-pvalue	1.0000
Changes in stderr messages
Comparing results and error messages to baseline 27
af	1.0000
beta	1.0000
beta-std-err	1.0000
variant_h2	1.0000
filter-pvalue	1.0000
lrt-pvalue	1.0000
Changes in stderr messages
Comparing results and error messages to baseline 28
af	1.0000
beta	1.0000
beta-std-err	1.0000
Skipping column intercept because of all-NaNs
filter-pvalue	1.0000
/home/ubuntu/miniconda3/envs/GWAS/lib/python3.6/site-packages/pandas/core/series.py:856: RuntimeWarning: invalid value encountered in log10
  result = getattr(ufunc, method)(*inputs, **kwargs)
lrt-pvalue	1.0000
Changes in stderr messages
Comparing results and error messages to baseline 29
af	1.0000
beta	1.0000
beta-std-err	1.0000
intercept	1.0000
filter-pvalue	1.0000
lrt-pvalue	1.0000
Changes in stderr messages
Comparing results and error messages to baseline 30
Changes in stderr messages
Comparing results and error messages to baseline 31
af	1.0000
beta	0.9993
filter-pvalue	1.0000
Skipping column lrt-pvalue because of all-NaNs
Changes in stderr messages
Comparing results and error messages to baseline 32
af	1.0000
beta	1.0000
filter-pvalue	1.0000
Skipping column lrt-pvalue because of all-NaNs
Changes in stderr messages
Comparing results and error messages to baseline 33
Changes in stderr messages
Comparing results and error messages to baseline 34
af	1.0000
beta	0.9993
filter-pvalue	1.0000
Skipping column lrt-pvalue because of all-NaNs
Changes in stderr messages
Comparing results and error messages to baseline 35
Changes in stderr messages
All good!

@johnlees
Copy link
Collaborator

The elastic net tests (31-34) might have a problem, and a couple of others have warnings, but the LMM tests all passed (though as you point out the unit tests didn't run, I think @mgalardini would have more idea what went wrong there).

Would you be able to run the original command again with this clone using python pyseer-runner.py instead of just pyseer, and see if the results look the same? Just the first few lines would be ok, if you didn't want to wait for the whole thing to run again?

@flashton2003
Copy link
Author

I've re-run with python pyseer-runner.py and the first few lines are the same as just using the pyseer executable.

@mgalardini
Copy link
Owner

Hi @flashton2003, maybe pytest is not the best idea to make sure you are running the tests correctly; from the repository's root try python setup.py test. I think the best course of action is to pick a k-mer that is supposed to have the lowest p-value and have you try it. Once we have an isolated example we can start drilling down with some debugging messages. I'll reach out offline to setup this test

@jsan4christ
Copy link

Hi @mgalardini and @johnlees,

Thanks for all the insights shared here,

I'm also having trouble with inconsistent results running the tutorial, see below:

$ pyseer --phenotypes resistances.pheno --vcf snps.vcf.gz --load-m mash_mds.pkl --lineage --print-samples > penicillin_SNPs.txt
Read 603 phenotypes
Detected binary phenotype
Loaded projection with dimension (603, 269)
Analysing 603 samples found in both phenotype and structure matrix
Writing lineage effects to lineage_effects.txt
No observations of 26_23_G in selected samples
No observations of 26_845_A in selected samples
.
.
.
No observations of 26_2219469_C in selected samples
No observations of 26_2219533_C in selected samples
No observations of 26_2219786_G in selected samples
No observations of 26_2220808_A in selected samples
198248 loaded variants
108545 filtered variants
89703 tested variants
89553 printed variants

$ pyseer --phenotypes resistances.pheno --vcf snps.vcf.gz --load-m mash_mds.pkl --lineage --print-samples > penicillin_SNPs.txt
Read 603 phenotypes
Detected binary phenotype
Loaded projection with dimension (603, 269)
Analysing 603 samples found in both phenotype and structure matrix
Writing lineage effects to lineage_effects.txt
No observations of 26_23_G in selected samples
No observations of 26_845_A in selected samples
No observations of 26_1110_C in selected samples
.
.
.
No observations of 26_2219533_C in selected samples
No observations of 26_2219786_G in selected samples
No observations of 26_2220808_A in selected samples
198248 loaded variants
108545 filtered variants
89703 tested variants
89553 printed variants

After updating numpy:

No observations of 26_2219533_C in selected samples
No observations of 26_2219786_G in selected samples
No observations of 26_2220808_A in selected samples
198248 loaded variants
108545 filtered variants
89703 tested variants
89553 printed variants

real 25m48.461s
user 24m56.073s
sys 0m16.701s

$ time pyseer --phenotypes resistances.pheno --vcf snps.vcf.gz --load-m mash_mds.pkl --min-af 0.02 --max-af 0.98 > penicillin_SNPs.txt
Read 603 phenotypes
Detected binary phenotype
Loaded projection with dimension (603, 269)
Analysing 603 samples found in both phenotype and structure matrix
No observations of 26_23_G in selected samples
No observations of 26_845_A in selected samples
No observations of 26_1110_C in selected samples
No observations of 26_3106_C in selected samples
.
.
.
No observations of 26_2219533_C in selected samples
No observations of 26_2219786_G in selected samples
No observations of 26_2220808_A in selected samples
198248 loaded variants
132535 filtered variants
65713 tested variants
65660 printed variants

real 16m57.107s
user 26m4.676s
sys 0m24.717s

time python $(which phylogeny_distance.py) --lmm core_genome_aln.tree > phylogeny_K.tsv

or

ls -1 assemblies/*.contigs_velvet.fa | xargs -n 1 basename | cut -d '.' -f1 > samples.txt

time similarity_pyseer --vcf snps.vcf.gz samples.txt > gg.snps.txt
Reading in variants
No observations of 26_16_G_A in selected samples
No observations of 26_23_G in selected samples
No observations of 26_31_G_T in selected samples
.
.
.
No observations of 26_2219786_G in selected samples
No observations of 26_2220808_A in selected samples
Calculating sample similarity

real 4m37.942s
user 4m18.043s
sys 0m1.941s

$ time pyseer --lmm --phenotypes resistances.pheno --kmers fsm_kmers.txt.gz --similarity phylogeny_K.tsv --output-patterns kmer_patterns.txt --cpu 2 > penicillin_kmers.txt
Read 603 phenotypes
Detected binary phenotype
Setting up LMM
Similarity matrix has dimension (616, 616)
Analysing 603 samples found in both phenotype and similarity matrix
h^2 = 0.90
14700703 loaded variants
14700703 filtered variants
0 tested variants
0 printed variants

real 401m47.563s
user 122m13.091s
sys 12m27.899s

$ time pyseer --lmm --phenotypes resistances.pheno --kmers fsm_kmers.txt.gz --similarity gg.snps.txt --output-patterns kmer_patterns2.txt --cpu 4 > penicillin_kmers2.txt
Read 603 phenotypes
Detected binary phenotype
Setting up LMM
Similarity matrix has dimension (616, 616)
Analysing 603 samples found in both phenotype and similarity matrix
h^2 = 0.91
14700703 loaded variants
14700703 filtered variants
0 tested variants
0 printed variants

real 119m38.755s
user 122m24.156s
sys 13m15.104s

I also cant seem to get unitig-counter to install on mac:

$ conda install unitig-counter unitig-caller
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.

PackagesNotFoundError: The following packages are not available from current channels:

  • unitig-caller
  • unitig-counter

Current channels:

To search for alternate channels that may provide the conda package you're
looking for, navigate to

https://anaconda.org

and use the search bar at the top of the page.

I would appreciate any ideas on how to work around these.

@johnlees
Copy link
Collaborator

  • The fixed effect models look ok to me – what is the problem you are having?
  • The mixed effect model looks like it's probably a label mismatch causing all the variants to be filtered out
  • The unitig packages are linux only unfortunately

@jsan4christ
Copy link

Thanks John, turns out that the sample names (ids) were in decimal point format with some ending with zeros. The trailing zeros would be dropped in some cases creating inconsistency. Adding a character at the end of the names so they are always treated as text helped.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants