
Performance bug when creating SparseMatrixCSC #735

Closed
TomDeWeer opened this issue Dec 24, 2019 · 5 comments
Comments

@TomDeWeer

TomDeWeer commented Dec 24, 2019

Dear all,

I'm trying to import a very large scipy.sparse.csc_matrix into Julia but I'm running into a performance problem. I have tried to create a minimal working example to show what I mean.

First of all, I made the following module in Julia (file PerformanceBug.jl):

module PerformanceBug
using PyCall
using SparseArrays
using LinearAlgebra
export transformToJulia
    function transformToJulia(m, n, colPtr, rowVal, nzVal)
        colPtrJ = PyArray(PyObject(colPtr))
        rowValJ = PyArray(PyObject(rowVal))
        nzValJ = PyArray(PyObject(nzVal))
        colPtrJ = vec(Int64[i+1 for i in colPtrJ])
        rowValJ = vec(Int64[i+1 for i in rowValJ])
        nzValJ = vec(Array(nzValJ))
        A = SparseMatrixCSC{Float64, Int64}(m, n, colPtrJ, rowValJ, nzValJ)
        # Until here everything is fast.
        return A # Executing this statement takes ages and consumes all RAM.
    end
end

Returning the matrix A in the last statement of the function consumes all my RAM and takes ages.

I use the above Julia module from within Python via the following script (executed via pycharm):

import julia, scipy.sparse, os
# setting up Julia from python, called in pycharm
juliaScriptsDir = # path to directory of PerformanceBug.jl
juliaDir = r"C:\Program Files\JuliaPro\Julia-1.2.0\bin"
juliaEXE = r"C:\Program Files\JuliaPro\Julia-1.2.0\bin\julia.exe"
julia.install(julia=juliaEXE)
oldDir = os.getcwd()
os.chdir(juliaDir)
j = julia.Julia(runtime=juliaEXE, compiled_modules=True)
os.chdir(oldDir)
j.include(os.path.join(juliaScriptsDir, "PerformanceBug.jl"))
os.chdir(juliaDir)
from julia import Main
# Creating diagonal matrix in python
n = 1000000
A = scipy.sparse.eye(n).tocsc()
# Conversion to Julia
Main.m = A.shape[0]
Main.n = A.shape[1]
Main.colPtr = A.indptr
Main.rowVal = A.indices
Main.nzVal = A.data
j.eval("A = PerformanceBug.transformToJulia(m, n, colPtr, rowVal, nzVal)")

Does anybody have a clue as to why Julia suddenly has trouble creating a simple sparse matrix? Working purely in Julia poses no performance issues at all.

Kind regards,
Tom
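(Editor's note: a small, pure-Python illustration, not part of the original report, of the three arrays that scipy's CSC format exposes and that transformToJulia receives. It uses only scipy and numpy.)

```python
import numpy as np
import scipy.sparse

A = scipy.sparse.eye(4).tocsc()  # 4x4 identity in CSC layout

# indptr[j]:indptr[j+1] delimits column j; indices holds 0-based row numbers.
print(A.indptr)   # [0 1 2 3 4]
print(A.indices)  # [0 1 2 3]
print(A.data)     # [1. 1. 1. 1.]

# Julia's SparseMatrixCSC stores the same three arrays, but 1-based,
# which is why transformToJulia adds 1 to every index.
colPtr = A.indptr + 1
rowVal = A.indices + 1
assert list(colPtr) == [1, 2, 3, 4, 5]
assert list(rowVal) == [1, 2, 3, 4]
```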

@stevengj
Member

Duplicate of #204. PyCall doesn't know about the scipy sparse formats.

I'm not sure that a dependence on scipy belongs in PyCall — but you could certainly create a "SciPySparse" package on top of PyCall that knows how to convert between Julia and Scipy sparse formats.

@stevengj
Member

In particular, PyCall automatically tries to convert the return value back to a Python object, which is why you are running into #204 — since it doesn't know about SciPy sparse formats, it does so by converting to a dense matrix.

You can fix this by just doing:

return PyCall.pyjlwrap_new(A)

which will return a thin Python wrapper around the native Julia A rather than trying to convert A to a native Python array.

By the way, these lines look wrong to me:

colPtrJ = PyArray(PyObject(colPtr))
rowValJ = PyArray(PyObject(rowVal))
nzValJ = PyArray(PyObject(nzVal))

I don't understand why you need any conversion here at all — the arrays have already been converted to Julia arrays when the function is called. (You can use pyfunction to suppress these conversions.)

This is type-unstable:

colPtrJ = vec(Int64[i+1 for i in colPtrJ])

since you are changing the type of colPtrJ. And why are you calling vec?
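(Editor's note, an addition to the thread: one plausible reason the report converts element-by-element is that scipy typically stores CSC index arrays in a 32-bit dtype when the matrix is small enough, while the Julia code builds a SparseMatrixCSC{Float64, Int64}, so a widening conversion is needed either way. A quick check, assuming only scipy and numpy:)

```python
import numpy as np
import scipy.sparse

A = scipy.sparse.eye(5).tocsc()
# scipy chooses the narrowest index dtype that fits, commonly int32 here,
# so the indices must be widened to 64-bit integers on the Julia side.
print(A.indptr.dtype)
assert A.indptr.dtype in (np.dtype(np.int32), np.dtype(np.int64))
```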

@stevengj
Member

stevengj commented Jan 22, 2020

It seems like you can do something like

function transformToJulia(m, n, colPtr, rowVal, nzVal)
    A = SparseMatrixCSC{Float64, Int}(m, n, Int[i+1 for i in colPtr], Int[i+1 for i in rowVal], Vector{Float64}(nzVal))
    return PyCall.pyjlwrap_new(A)
end

To make it even more efficient, you can tell it to do copy-free passing of the array arguments on the Python side:

transformToJulia = j.pyfunction(j.eval("PerformanceBug.transformToJulia"), j.Int, j.Int, j.PyArray, j.PyArray, j.PyArray)

and then call transformToJulia(A.shape[0], A.shape[1], A.indptr, A.indices, A.data)

@stevengj
Member

stevengj commented Jan 22, 2020

Here is an even simpler function allowing you to pass the SciPy sparse array directly:

using PyCall, SparseArrays

function scipyCSC_to_julia(A)
    m, n = A.shape
    colPtr = Int[i+1 for i in PyArray(A."indptr")]
    rowVal = Int[i+1 for i in PyArray(A."indices")]
    nzVal = Vector{Float64}(PyArray(A."data"))
    B = SparseMatrixCSC{Float64,Int}(m, n, colPtr, rowVal, nzVal)
    return PyCall.pyjlwrap_new(B)
end
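(Editor's note: a pure-Python sketch, added to the thread, of the index arithmetic that scipyCSC_to_julia performs, round-tripped through scipy as a sanity check. The matrix and names here are illustrative only.)

```python
import numpy as np
import scipy.sparse

A = scipy.sparse.random(6, 5, density=0.4, format="csc", random_state=0)

# Shift to 1-based indices and widen, as the Julia function does...
colPtr = A.indptr.astype(np.int64) + 1
rowVal = A.indices.astype(np.int64) + 1
nzVal = A.data.astype(np.float64)

# ...then shift back and rebuild; the matrices must match exactly.
B = scipy.sparse.csc_matrix((nzVal, rowVal - 1, colPtr - 1), shape=A.shape)
assert (A != B).nnz == 0
```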

@TomDeWeer
Author

TomDeWeer commented Jan 23, 2020

Hi Steven, thanks for your answer! I didn't realize PyCall was trying to convert every function output to Python. Adding the PyCall.pyjlwrap_new(A) statement fixed the performance problem!

I am confused as to how you communicate with Julia without using global Julia variables. Surely, calling (in Python)

Awrap = j.eval("scipyCSC_to_julia(A)")

returns an error because Julia has no variable A in its workspace.

EDIT: Using j.PerformanceBug.scipyCSC_to_julia(A) works. The whole global variable stuff is (fortunately) avoided with this syntax. I guess this is all contained in the documentation, but maybe a small demo on how to get PyCall and Python working together could prevent others from bumping into similar problems? Thanks for the help anyways!
