Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pca sklearn failure in latest development version #45

Open
chapmanb opened this issue May 30, 2018 · 10 comments
Open

pca sklearn failure in latest development version #45

chapmanb opened this issue May 30, 2018 · 10 comments

Comments

@chapmanb
Copy link

Brent;
We've been trying to use the latest development in bcbio (to get the initial hg38 support) and running into a failure during pca. Is there a minimum number of input samples needed to run pca, or am I doing something else wrong?

2018-05-30 23:07:51 r3467 peddy.pca[22570] INFO loaded and subsetted thousand-genomes genotypes (shape: (2504, 1)) in 0.8 seconds
Traceback (most recent call last):
  File "/g/data3/gx8/local/development/bcbio/anaconda/bin/peddy", line 11, in <module>
    load_entry_point('peddy==0.4.0', 'console_scripts', 'peddy')()
  File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/peddy/cli.py", line 204, in peddy
    in ("ped_check", "het_check", "sex_check")]):
  File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/peddy/cli.py", line 43, in run
    prefix=prefix, **kwargs)
  File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/peddy/peddy.py", line 871, in het_check
    pca_df, background_pca_df = pca(pca_plot, sitesfile, gt_types, sites)
  File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/peddy/pca.py", line 61, in pca
    clf.fit(genos1kg, background_target)
  File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/sklearn/pipeline.py", line 248, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/sklearn/pipeline.py", line 213, in _fit
    **fit_params_steps[name])
  File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/memory.py", line 362, in __call__
    return self.func(*args, **kwargs)
  File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/sklearn/pipeline.py", line 581, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/sklearn/decomposition/pca.py", line 348, in fit_transform
    U, S, V = self._fit(X)
  File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/sklearn/decomposition/pca.py", line 394, in _fit
    return self._fit_truncated(X, n_components, svd_solver)
  File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/sklearn/decomposition/pca.py", line 468, in _fit_truncated
    % (n_components, n_features, svd_solver))
ValueError: n_components=4 must be between 1 and n_features=1 with svd_solver='randomized'

Ideally we'd love to be able to get ancestry prediction for single sample inputs. Thanks for any pointers/suggestions.

@brentp
Copy link
Owner

brentp commented May 30, 2018

hmm. maybe when there's a single sample we need to use X[:, np.newaxis] or something as the number of features should be ~ 20K.

@brentp
Copy link
Owner

brentp commented May 30, 2018

I'll try to carve out some time on friday to work on peddy and get these resolved. thanks for reporting.

@brentp
Copy link
Owner

brentp commented May 30, 2018

I can run with 1 sample on master. It seems like maybe you're only finding a single snp out of the full set.
E.g. your log info says (shape: (2504, 1)) . Are you using the right genome-build for your test-case?

@brentp
Copy link
Owner

brentp commented May 30, 2018

I see the problem that's causing this. Will release 0.4.2 shortly.

@brentp
Copy link
Owner

brentp commented May 30, 2018

actually, I don't think that should affect what you are reporting here. You can try updating to cyvcf2 v0.9.0 to see if that resolves it, but I don't think it will.

@chapmanb
Copy link
Author

Brent;
Thanks for the help with this. I agree it's probably also not the parsing. It's possible there was 1 overlapping SNP; bcbio ends up throwing a lot at peddy: panels, somatic calls (when there is no germline, hoping for germline leakage). Is it possible for peddy to skip pca here if there are not enough sites? It's a bit hard to tell up from what and will not work, so I've been just doing garbage in/garbage out and handling outputs that aren't useful. Would skipping here be possible? Thanks again for taking a look at all this so quickly.

@brentp
Copy link
Owner

brentp commented May 31, 2018

It could, but it uses those sites for the heterozygosity stuff too, so the only thing left would be the sex plots.

@chapmanb
Copy link
Author

That would be helpful for me. I know the output is not especially useful, but in bcbio we often get not good enough data that goes in and would be great to have peddy exit cleanly just without the analysis we can't do. Right now I'm catching exceptions and doing other ugly things when it fails, so would love to improve on that. Thanks for considering this.

@brentp
Copy link
Owner

brentp commented May 31, 2018

I'm wary of writing software that catches all possible problems and gives a 0 exit code.
I'm not completely against what you're saying, but I'm not sure how to implement. My though would be that if you send peddy--which has a list of sites that it uses, a VCF that contains none (or 1) of those sites, it's reasonable to raise an error.
I suppose the alternative is to output an informative error message and still exit 0.

@chapmanb
Copy link
Author

Brent;
I definitely hear what you're saying about design. Raising a consistent error is also cool with me, it's just hard to tell up front if a dataset will have very few overlaps with the sites (without calculating it myself) so would be awesome if peddy told me this. Right now we essentially catch the symptom (index errors):

https://github.com/bcbio/bcbio-nextgen/blob/ebf096ea43c6464403bd656a1081bc322746a0aa/bcbio/variation/peddy.py#L91

which I'd like to improve on. Thanks again for this discussion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants