This repository has been archived by the owner on Feb 2, 2024. It is now read-only.

[Question] Can groupby agg scale in HPAT? #180

Open
bigwater opened this issue Sep 30, 2019 · 1 comment

Comments

bigwater commented Sep 30, 2019

Hi,

I am trying to use HPAT to accelerate data science workloads, especially the ETL process.

The data frame I am using contains 21,721,922 rows and 45 columns. All the data entries use float64 dtype. There is no missing data after cleaning.

I put the following code into an HPAT-decorated function. It groups the data frame by "YEAR" and computes the mean of "INCTOT" for each year. I am measuring the execution time of the groupby-agg operation.

    t0 = time.time()
    tmp1 = df.groupby('YEAR')['INCTOT'].mean()
    tt = time.time() - t0
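
For reference, a single `time.time()` measurement can be noisy, and the first call of a compiled function may include one-time setup work. Below is a minimal sketch of a repeat-and-take-the-best timing harness in plain Python. The helper name `time_best_of` and the `repeats` parameter are hypothetical and not part of HPAT or pandas; the sketch assumes the code under test can be wrapped in a zero-argument callable.

```python
import time


def time_best_of(fn, repeats=5):
    """Call fn() once as a warm-up, then return the fastest of
    `repeats` timed calls, in seconds. (Hypothetical helper.)"""
    fn()  # warm-up call; its time is discarded
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        best = min(best, time.perf_counter() - t0)
    return best


# Stand-in workload; in practice this would wrap the groupby call.
elapsed = time_best_of(lambda: sum(range(100_000)))
```

Taking the minimum over several runs reduces the influence of scheduler noise, although it does not change the overall trend in the measurements reported here.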

I am using a server with two Intel(R) Xeon(R) E5-2699 v4 CPUs, which has 44 cores in total.

The results look like this:

The baseline uses pandas only, without HPAT.

| Number of cores | groupby-agg time (s) |
|-----------------|----------------------|
| baseline        | 0.227021694          |
| 1               | 1.437                |
| 2               | 1.39                 |
| 3               | 1.398                |
| 4               | 1.427                |
| 11              | 1.51                 |
| 22              | 1.794                |
| 44              | 2.838                |

We observe that the time spent on groupby-agg increases as the number of processes increases. Since groupby-agg is a simple map-reduce pattern that should parallelize well, this result surprises me.
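
To make the map-reduce expectation concrete, here is a minimal sketch in plain Python of a grouped mean computed from per-chunk partial sums and counts. It only illustrates the pattern and says nothing about how HPAT implements groupby internally; the function names (`partial_sums`, `combine`, `grouped_mean`) and the toy data are hypothetical, and the chunks are processed sequentially here rather than in parallel.

```python
from collections import defaultdict


def partial_sums(rows):
    """Map step: per-key [sum, count] for one chunk of (key, value) pairs."""
    acc = defaultdict(lambda: [0.0, 0])
    for key, value in rows:
        acc[key][0] += value
        acc[key][1] += 1
    return acc


def combine(partials):
    """Reduce step: merge per-chunk [sum, count] pairs, then divide once."""
    total = defaultdict(lambda: [0.0, 0])
    for part in partials:
        for key, (s, n) in part.items():
            total[key][0] += s
            total[key][1] += n
    return {key: s / n for key, (s, n) in total.items()}


def grouped_mean(rows, n_chunks):
    """Split rows into about n_chunks chunks and combine their partial results."""
    rows = list(rows)
    size = max(1, -(-len(rows) // n_chunks))  # ceiling division
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    return combine(partial_sums(chunk) for chunk in chunks)


# Toy (YEAR, INCTOT)-style pairs; the values are made up for illustration.
data = [(1970, 10.0), (1980, 20.0), (1970, 30.0), (1980, 40.0), (1990, 50.0)]
result = grouped_mean(data, n_chunks=2)
```

Each chunk only has to produce one sum and one count per key, so the data exchanged in the reduce step is tiny compared with the input. That is why I expected the operation to scale with the number of processes.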

Second, even on a single core, HPAT is slower than pandas (1.437 s versus about 0.227 s).

The groupby-count results for my dataset are shown below. Each year contains plenty of rows, so there should be sufficient parallelism.

| YEAR | count   |
|------|---------|
| 1970 | 1486744 |
| 1980 | 8746006 |
| 1990 | 1906165 |
| 2000 | 2199860 |
| 2010 | 2494822 |

Am I missing something? Could you suggest how I can accelerate the groupby-agg operation using HPAT?

Thank you so much.

Best regards,
Hongyuan Liu

ghost commented Oct 1, 2019

Thank you @bigwater! We're currently working on groupby.
