Possible incompatibility with underlying sklearn.utils.sparsefuncs_fast._inplace_csr_row_normalize_l2 #22

Open
astrophys opened this issue Mar 27, 2019 · 1 comment


A colleague is using PlasFlow to analyze all contigs >1000 bp in her dataset. After filtering with filter_sequences_by_length.pl, she has a total of 2,964,210 contigs. We are using plasflow-1.1, python-3.5 and sklearn-0.18.1 on CentOS 6.9. PlasFlow was installed via Anaconda.

Running:

PlasFlow.py --input all.contigs.1000.fasta --output output.plasflow.all.contigs.csv --threshold 0.7

Yields:
Stdout:

Importing sequences
Imported  2964210  sequences
Calculating kmer frequencies using kmer 5
Due to large number of sequences in the input file, it is splitted to smaller chunks (maximum size: 25000 sequences)
processing chunk: 1
.
.
.
processing chunk: 119
Transforming kmer frequencies

Stderr:

/opt/plasflow/1.1/envs/plasflow/lib/python3.5/site-packages/sklearn/feature_extraction/text.py:1059: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  if hasattr(X, 'dtype') and np.issubdtype(X.dtype, np.float):
Traceback (most recent call last):
  File "/opt/plasflow/1.1//envs/plasflow/bin/PlasFlow.py", line 346, in <module>
    vote_proba = vote_class.predict_proba(inputfile)
  File "/opt/plasflow/1.1//envs/plasflow/bin/PlasFlow.py", line 300, in predict_proba
    self.probas_ = [clf.predict_proba_tf(X) for clf in self.clfs]
  File "/opt/plasflow/1.1//envs/plasflow/bin/PlasFlow.py", line 300, in <listcomp>
    self.probas_ = [clf.predict_proba_tf(X) for clf in self.clfs]
  File "/opt/plasflow/1.1//envs/plasflow/bin/PlasFlow.py", line 252, in predict_proba_tf
    self.calculate_freq(data)
  File "/opt/plasflow/1.1//envs/plasflow/bin/PlasFlow.py", line 243, in calculate_freq
    test_tfidf = transformer.fit_transform(kmer_count)
  File "/opt/plasflow/1.1/envs/plasflow/lib/python3.5/site-packages/sklearn/base.py", line 494, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/opt/plasflow/1.1/envs/plasflow/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 1084, in transform
    X = normalize(X, norm=self.norm, copy=False)
  File "/opt/plasflow/1.1/envs/plasflow/lib/python3.5/site-packages/sklearn/preprocessing/data.py", line 1352, in normalize
    inplace_csr_row_normalize_l2(X)
  File "sklearn/utils/sparsefuncs_fast.pyx", line 359, in sklearn.utils.sparsefuncs_fast.inplace_csr_row_normalize_l2 (sklearn/utils/sparsefuncs_fast.c:12648)
  File "sklearn/utils/sparsefuncs_fast.pyx", line 362, in sklearn.utils.sparsefuncs_fast._inplace_csr_row_normalize_l2 (sklearn/utils/sparsefuncs_fast.c:13750)
ValueError: Buffer dtype mismatch, expected 'int' but got 'long'

This issue leads me to think this is due to passing too large a matrix to the underlying C function, sklearn.utils.sparsefuncs_fast._inplace_csr_row_normalize_l2. Following a path of links led me to this commit, which makes me think that this may be fixed in a more recent version of scikit-learn. The input data, all.contigs.1000.fasta, is 12 GB in size.
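To make the "too large a matrix" hypothesis concrete, here is a rough back-of-the-envelope check. It assumes (this is not confirmed by the PlasFlow source) that the k-mer count matrix has one row per contig and up to 4^5 = 1024 columns for "kmer 5", and that the error arises because the sparse matrix's index arrays are promoted from 32-bit to 64-bit integers once the number of stored entries exceeds the int32 range. The function names are hypothetical helpers for illustration only.

```python
# Upper-bound estimate of stored entries in the k-mer count matrix.
# Assumption: every contig contains every possible 5-mer (worst case),
# and the feature count is 4**5; PlasFlow's real feature count may differ.

INT32_MAX = 2 ** 31 - 1  # largest value a signed 32-bit index can hold


def max_nnz(n_rows, n_features):
    """Worst-case number of non-zero entries for a rows x features matrix."""
    return n_rows * n_features


def needs_int64_indices(n_rows, n_features):
    """True if the worst-case entry count no longer fits in int32 indices."""
    return max_nnz(n_rows, n_features) > INT32_MAX


def max_rows_for_int32(n_features):
    """Largest row count whose worst-case entry count still fits in int32."""
    return INT32_MAX // n_features


n_contigs = 2964210   # from the PlasFlow stdout above
n_features = 4 ** 5   # assumed upper bound for k = 5

print("worst-case entries:", max_nnz(n_contigs, n_features))
print("exceeds int32:", needs_int64_indices(n_contigs, n_features))
print("exceeds int32 with half the input:",
      needs_int64_indices(n_contigs // 2, n_features))
print("max rows under this bound:", max_rows_for_int32(n_features))
```

Under these assumptions the full input exceeds the int32 limit while half of it does not, which would be consistent with the 'int' versus 'long' mismatch in the traceback.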

Questions:

  1. Is my assessment of this issue correct?
  2. Is there a workaround for this issue?
  3. Is the input data too big?

Thanks.

smaegol (Owner) commented Mar 27, 2019

Hi, thanks for submitting this issue. I will take a closer look and think about a fix. However, I think the answer to the 3rd question is yes, and limiting the number of input sequences (for example, splitting the input in half) should help for now.
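One way to split the input as suggested is sketched below. This is a minimal standalone script, not part of PlasFlow; the function name `split_fasta` and the `.partN.fasta` naming scheme are hypothetical. It avoids f-strings so it also runs on the Python 3.5 mentioned in the report.

```python
def split_fasta(path, max_records, prefix):
    """Split a FASTA file into parts holding at most `max_records` records.

    Returns the list of part file names, in order. Records are kept whole:
    a new part is only started at a header line (one beginning with '>').
    """
    out_paths = []
    out = None
    count = 0
    try:
        with open(path) as handle:
            for line in handle:
                if line.startswith(">"):
                    # Start a new part on the first record or when the
                    # current part is full.
                    if out is None or count == max_records:
                        if out is not None:
                            out.close()
                        name = "{}.part{}.fasta".format(prefix, len(out_paths) + 1)
                        out = open(name, "w")
                        out_paths.append(name)
                        count = 0
                    count += 1
                # Lines before the first header (if any) are skipped.
                if out is not None:
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return out_paths


if __name__ == "__main__":
    import sys
    # Example: python split_fasta.py all.contigs.1000.fasta 1500000 all.contigs.1000
    if len(sys.argv) == 4:
        for part in split_fasta(sys.argv[1], int(sys.argv[2]), sys.argv[3]):
            print(part)
```

Each part can then be passed to PlasFlow.py with the same options as in the original command and the resulting tables concatenated. Note that the traceback shows the TF-IDF transformer being fit on the input itself (`transformer.fit_transform(kmer_count)`), so results from split runs may not be identical to those of a single run.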
