Pivot seems to not respect lazy evaluation #163

Open
alberto-i opened this issue Apr 10, 2023 · 2 comments

@alberto-i

Hello, is this the expected behavior?

I'm running the code below, which compares a composition of groupBy, select and inflate against a single pivot call; both return the same result. Building the groupBy pipeline takes 0.235 ms, while the pivot call takes 146.8 ms, roughly 625 times slower (about 62,000%). The first call to "toArray" then takes 51.27 ms for the groupBy pipeline and 34.456 ms for pivot, so the groupBy version is about 48% slower at that step.

The dataset is a 1.5 MB CSV file containing 27k rows.

const dataForge = require('data-forge');
require('data-forge-fs');

let start = process.hrtime();

const elapsed_time = function(note) {
    const precision = 3; // 3 decimal places
    const [seconds, nanoseconds] = process.hrtime(start); // read the clock once so both parts agree
    const elapsed = nanoseconds / 1000000; // divide by a million to convert nanoseconds to milliseconds
    console.log(seconds + " s, " + elapsed.toFixed(precision) + " ms - " + note); // print time + message
    start = process.hrtime(); // reset the timer
}

const df = dataForge
    .readFileSync('./data.csv')
    .parseCSV({ dynamicTyping: true })
    .withIndex((row) => `${row.meeting_id}_${row.item_id}_${row.user_id}_${row.source_id}`)

elapsed_time('parsecsv')

const sintetico = df
    .groupBy((row) => `${row.meeting_id}_${row.item_id}_${row.vote}`)
    .select((group) => ({
        meeting_id: group.first().meeting_id,
        item_id: group.first().item_id,
        vote: group.first().vote,
        stock: group.deflate(row => row.stock).sum(),
    }))
    .inflate()

elapsed_time('groupBy, select, inflate')

const sinteticoPivot = df.pivot(['meeting_id', 'item_id', 'vote'], {
    stock: dataForge.Series.sum
})

elapsed_time('pivot')

const data = sintetico.head(5).toArray()

elapsed_time('groupBy, select, inflate => toArray')

const data2 = sintetico.head(5).toArray()

elapsed_time('groupBy, select, inflate => toArray again')

const data3 = sinteticoPivot.head(5).toArray()

elapsed_time('pivot => toArray')

const data4 = sinteticoPivot.head(5).toArray()

elapsed_time('pivot => toArray again')

These are the outputs:

0 s, 183.236 ms - parsecsv
0 s, 0.235 ms - groupBy, select, inflate
0 s, 146.789 ms - pivot
0 s, 51.270 ms - groupBy, select, inflate => toArray
0 s, 1.200 ms - groupBy, select, inflate => toArray again
0 s, 34.456 ms - pivot => toArray
0 s, 13.261 ms - pivot => toArray again
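
The ratios quoted in the description can be recomputed from the timings listed above. This is only a small arithmetic sketch; the names `timingsMs`, `ratio` and `percentSlower` are made up for illustration and are not part of data-forge.

```javascript
// Timings copied from the output above, in milliseconds.
const timingsMs = {
    groupByBuild: 0.235,
    pivotBuild: 146.789,
    groupByToArray: 51.270,
    pivotToArray: 34.456,
};

// How many times longer the slower step takes.
function ratio(slowerMs, fasterMs) {
    return slowerMs / fasterMs;
}

// The same comparison expressed as "percent slower".
function percentSlower(slowerMs, fasterMs) {
    return (slowerMs / fasterMs - 1) * 100;
}

const buildRatio = ratio(timingsMs.pivotBuild, timingsMs.groupByBuild);
const buildPercentSlower = percentSlower(timingsMs.pivotBuild, timingsMs.groupByBuild);
const toArrayRatio = ratio(timingsMs.groupByToArray, timingsMs.pivotToArray);

console.log("pivot build vs groupBy build: " + buildRatio.toFixed(1) + "x");
console.log("pivot build is " + buildPercentSlower.toFixed(0) + "% slower");
console.log("groupBy toArray vs pivot toArray: " + toArrayRatio.toFixed(2) + "x");
```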

Is this intended? Should I dig deeper to fix it and make a pull request?

Thanks,

@alberto-i
Author

The performance problem seems to be in the orderBy block inside pivot. My groupBy version does no ordering.

With the orderBy block in the pivot method bypassed, these are the timings:

0 s, 157.930 ms - parsecsv
0 s, 0.131 ms - groupBy, select, inflate
0 s, 0.715 ms - pivot
0 s, 65.004 ms - groupBy, select, inflate => toArray
0 s, 2.217 ms - groupBy, select, inflate => toArray again
0 s, 98.662 ms - pivot => toArray
0 s, 6.909 ms - pivot => toArray again
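
Both sets of timings mix two different costs: building a pipeline and materializing it. A minimal sketch of a timer that reports them separately follows. The `timeIt` helper and the `squares` generator are hypothetical stand-ins and do not use data-forge; `process.hrtime.bigint()` is the standard Node.js high-resolution clock.

```javascript
// Hypothetical helper: runs fn once and reports how long it took.
function timeIt(label, fn) {
    const startNs = process.hrtime.bigint();
    const result = fn();
    const elapsedMs = Number(process.hrtime.bigint() - startNs) / 1e6;
    console.log(elapsedMs.toFixed(3) + " ms - " + label);
    return { result, elapsedMs };
}

// Stand-in for a lazy pipeline: nothing is computed until it is iterated.
function* squares(count) {
    for (let i = 0; i < count; i++) {
        yield i * i;
    }
}

// Building the generator does no work yet, so this should be near zero.
const built = timeIt("build (lazy)", () => squares(100000));

// Iterating it is where the real cost shows up.
const materialized = timeIt("materialize", () => Array.from(built.result));
```

Measured this way, a step that is cheap to build but expensive to materialize is easy to tell apart from one that does its work up front.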

@ashleydavis
Member

orderBy will force the entire data set to be evaluated, so it could be quite slow.

If you can improve its performance, I'm happy to accept a PR.
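
To illustrate why a sort step defeats lazy evaluation, here is a minimal sketch using plain JavaScript generators. It is not data-forge's implementation; `source`, `lazySelect`, `orderBy` and `take` are hypothetical helpers written only for this example. A per-row transform can stop after the first few rows, whereas a sort has to read every row before it can emit the first one.

```javascript
// Counts how many rows have actually been pulled from the source.
let pulled = 0;

// Hypothetical row source; the row shape is made up for illustration.
function* source(count) {
    for (let i = 0; i < count; i++) {
        pulled++;
        yield { id: i, stock: (i * 7) % 13 };
    }
}

// Lazy transform: handles one row at a time, only when asked.
function* lazySelect(rows, selector) {
    for (const row of rows) {
        yield selector(row);
    }
}

// Sorting cannot stream: every row must be buffered before the first is emitted.
function* orderBy(rows, keySelector) {
    const buffered = Array.from(rows);
    buffered.sort((a, b) => keySelector(a) - keySelector(b));
    yield* buffered;
}

// Collects at most `count` items, then stops pulling.
function take(rows, count) {
    const out = [];
    if (count <= 0) return out;
    for (const row of rows) {
        out.push(row);
        if (out.length >= count) break;
    }
    return out;
}

pulled = 0;
take(lazySelect(source(27000), (row) => row.stock), 5);
const pulledLazy = pulled; // only the rows that were needed

pulled = 0;
take(orderBy(source(27000), (row) => row.stock), 5);
const pulledSorted = pulled; // the whole source

console.log("rows pulled without sort: " + pulledLazy);
console.log("rows pulled with sort: " + pulledSorted);
```

In this sketch, asking for five rows pulls five rows from the source without the sort and all 27000 with it, which is the kind of cost described above.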
