Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Creating CSR array with duplicates fails #136

Open
rth opened this issue Nov 11, 2018 · 6 comments
Open

Creating CSR array with duplicates fails #136

rth opened this issue Nov 11, 2018 · 6 comments

Comments

@rth
Copy link
Contributor

rth commented Nov 11, 2018

Currently, CsMat::new appears to fail on input that has duplicate values (i.e. when in a CSR matrix multiple indices for a row are the same). They are summed by default, if I read the following comment correctly,

By convention, duplicate locations are summed up when converting into CsMat

An example from #135,

use sprs::CsMat;

let indptr = vec![0, 5, 8];
let indices = vec![761698, 328290, 828689, 761698, 780185, 901149, 780185, 144749, 258307];
let data = vec![1, 1, 1, 1, 1, 1, 1, 1, 1];
let a = CsMat::new((2, 1000000), indptr, indices, data);

this produces,

thread 'sparse::csmat::test::test_example' panicked at 'Indices length and inpdtr's nnz do not match: 9 != 8', src/sparse/csmat.rs:1095:13

using the code from that PR.

In this case, the initial len(indices) is indeed 9 but it becomes 8 after the duplicates are summed.

FWIW scipy.sparse.csr_matrix in Python (see docs and implementation ) handles this case as follows,

>>> from scipy.sparse import csr_matrix
>>> indptr = [0, 5, 8]
>>> indices = [761698, 328290, 828689, 761698, 780185, 901149, 780185, 144749, 258307]
>>> data = [1, 1, 1, 1, 1, 1, 1, 1, 1]
>>> X = csr_matrix((data, indices, indptr))
>>> len(indices)
9
>>> len(X.indices)
8
>>> X.indptr
array([0, 4, 7], dtype=int32)
>>> X.indices
array([328290, 761698, 780185, 828689, 144749, 780185, 901149],
      dtype=int32)
>>> X.data
array([1, 2, 1, 1, 1, 1, 1])

I can contribute this example as a test if necessary.

@rth
Copy link
Contributor Author

rth commented Nov 11, 2018

After more investigation and reading the docstring more carefully, maybe it's not a bug for CsMat::new but is explicitly unsupported.

Currently, the result is indeed not consistent for this input, where nnz (determined from the last element of indptr) is 8 while indices.len() is 9.

Maybe there is a difference in convention with scipy. Anyway, replacing indptr by,

let indptr = vec![0, 5, 9];

in the above example, I now get,

panicked at 'called `Result::unwrap()` on an `Err` value: NonSortedIndices', libcore/result.rs:945:5

which seems to happen in here.

At the same time, if one has such data, the question is how one is expected to handle it. I could do it manually, but I would need to first sort the indices, and currently that would mean manually copy-pasting this code from sprs, as there is no stand alone API entry for it.

@vbarrielle
Copy link
Collaborator

Hello and thanks for this report.

It was indeed a design choice that the basic constructor for CsMat would not do too much magic. Looks like the proper choice would be to add a constructor that is able to produce this summation of repeated indices and sorting of indices. I'll look into that.

I'm however a bit curious about your input data. As I understand it, the last value in your indices value is to be ignored (in scipy)? Why is it there?

@rth
Copy link
Contributor Author

rth commented Nov 23, 2018

It was indeed a design choice that the basic constructor for CsMat would not do too much magic

Sounds like a reasonable thing to do, and now I do wonder about the performance cost of always doing this duplicate summing in scipy, even when it's not necessary.

Looks like the proper choice would be to add a constructor that is able to produce this summation of repeated indices and sorting of indices.

Even a util function something like sum_duplicates(indices, indptr, values) -> (indices, values) could be a start.

As I understand it, the last value in your indices value is to be ignored (in scipy)?

All values in indices are used. There is some difference in the indptr convention about the last element I think -- and I also wonder why.

@rth
Copy link
Contributor Author

rth commented Nov 23, 2018

Also, I would be happy to contribute this feature in general.

@vbarrielle
Copy link
Collaborator

Looks like the proper choice would be to add a constructor that is able to produce this summation of repeated indices and sorting of indices.

Even a util function something like sum_duplicates(indices, indptr, values) -> (indices, values) could be a start.

Also, I would be happy to contribute this feature in general.

Yes I also think this is the way to go. If you want to contribute this feature, please do. May I suggest that
the util function takes its indices and values parameters as &mut pointer? That way integrating this
function into a new CsMat constructor would be smoother, and there's no loss in flexibility, one can always clone before passing to this function.

As I understand it, the last value in your indices value is to be ignored (in scipy)?

All values in indices are used. There is some difference in the indptr convention about the last element I think -- and I also wonder why.

I really think the last value has been ignored by scipy. If you look at the sum of the data array you pass, it is 9, but X.data.sum() == 8. So it looks like scipy will only consider your indices and data arrays up to the nnz value specified by the last element of indptr.

@rth
Copy link
Contributor Author

rth commented Nov 23, 2018

If you look at the sum of the data array you pass, it is 9, but X.data.sum() == 8

yes, my indptr was incorrect, it should have been indptr = [0, 5, 9] which explains why the last element of indices was ignored.

May I suggest that
the util function takes its indices and values parameters as &mut pointer?

Sure, will do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants