This repository has been archived by the owner on May 9, 2024. It is now read-only.

device-linear-multifrag execution mode #648

Open
akroviakov opened this issue Aug 24, 2023 · 16 comments
Labels
enhancement New feature or request

Comments

@akroviakov
Contributor

As of now, HDK's heterogeneity looks like this:

  • We have X fragments; when we schedule them on a GPU, it receives X kernels and X fragments and executes the kernels sequentially.

The currently disabled multifragment option does the following:

  • We have X fragments; when we schedule them on a GPU, it receives one kernel and X fragments and executes that single kernel.

Why the current multifragment mode is good:

  • We avoid the repeated administrative work of preparing X kernels, scheduling them, and retrieving results.
  • Fine-grained control (allocate/free) over the memory of individual fragments.

Why it is not that good:

  • Currently we place fragments in GPU memory arbitrarily. This is handled inside the GPU kernel code, which increases its complexity for the programmer: we cannot simply index into a multifragmented column and have to do extra checks per tuple (e.g., JoinColumnIterator& operator++() and the related for loops; see the sketch below). This likely costs some performance in the executePlan stage, which is often the costliest one.
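To make the contrast concrete, here is a minimal sketch (the structure and names are made up for illustration, not HDK's actual types): with multiple fragments every access first has to locate the owning fragment, while a linear buffer is a plain indexed load.

```cpp
#include <cstdint>

// Hypothetical layout: one column split across several device-resident fragments.
struct FragmentedColumn {
  const int8_t* const* frag_ptrs;   // base pointer of each fragment
  const int64_t* frag_num_rows;     // row count of each fragment
  int32_t num_frags;
};

// Multi-fragment access: find the owning fragment first (per-tuple branching,
// the kind of work JoinColumnIterator-style code has to do for every row).
inline int64_t load_value_multifrag(const FragmentedColumn& col, int64_t row) {
  for (int32_t f = 0; f < col.num_frags; ++f) {
    if (row < col.frag_num_rows[f]) {
      return reinterpret_cast<const int64_t*>(col.frag_ptrs[f])[row];
    }
    row -= col.frag_num_rows[f];
  }
  return 0;  // out of range
}

// Linear access: the same column assembled into one contiguous buffer.
inline int64_t load_value_linear(const int8_t* col_buf, int64_t row) {
  return reinterpret_cast<const int64_t*>(col_buf)[row];
}
```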

Basic idea:
The best case for the kernel is when all columns are linear (simple loops over dense data). All GPU-assigned fragments have to be materialized anyway, so why not place them linearly?
We can add a new execution mode where fragments (regardless of their position in host memory) are assembled linearly on the GPU. This way the fragments become transparent to the GPU kernel, which treats them as one big fragment. We still know the address of each fragment and can copy it back to the CPU, but we can no longer free individual fragments, only the whole linear buffer. Such behavior is no different from simply using a big fragment size.
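A rough host-side sketch of what this assembly could look like, assuming CUDA-style allocation/copy calls and a made-up fragment descriptor: GPU-assigned fragments of one column are copied back-to-back into a single device buffer, and the per-fragment offsets are recorded so each fragment can still be located and copied back to the CPU.

```cpp
#include <cuda_runtime.h>
#include <cstdint>
#include <vector>

struct HostFragment {
  const int8_t* data;   // fragment bytes in host memory
  size_t size_bytes;
};

// Assemble all GPU-assigned fragments of a column into one linear device buffer.
// The returned pointer is what the kernel sees as "one big fragment"; frag_offsets
// lets us locate individual fragments later, although they can only be freed
// together with the whole buffer.
int8_t* linearize_on_device(const std::vector<HostFragment>& frags,
                            std::vector<size_t>& frag_offsets) {
  size_t total = 0;
  for (const auto& f : frags) {
    total += f.size_bytes;
  }
  int8_t* dev_buf = nullptr;
  cudaMalloc(reinterpret_cast<void**>(&dev_buf), total);

  size_t offset = 0;
  frag_offsets.clear();
  for (const auto& f : frags) {
    frag_offsets.push_back(offset);
    cudaMemcpy(dev_buf + offset, f.data, f.size_bytes, cudaMemcpyHostToDevice);
    offset += f.size_bytes;
  }
  return dev_buf;
}
```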

Why is it supposed to be better?

  • Cheap initial data locality
  • Easier indexing into a column -> easier and (likely) faster kernel code

What's the catch?

  • We lose fine-grained memory control on the GPU. But do we always care?
  • What if the next step needs more fragments? We know which fragments are still on the GPU and where they are, so we could avoid using host-device bandwidth by memmove/memcpy'ing on the device itself and fetching only the missing ones (see the sketch below). If that is not possible, we pay the fetching price. Maybe this is a subject for more sophisticated data structures.
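A sketch of that device-side reuse, assuming we track each resident fragment's offset in the linear buffer (CUDA calls for illustration; cudaMemcpy does not define overlapping copies, hence the staging buffer):

```cpp
#include <cuda_runtime.h>
#include <cstdint>

// Move a fragment that is already resident on the device to its new position in
// the linear buffer, spending device bandwidth instead of the PCIe link.
void compact_resident_fragment(int8_t* dev_buf, size_t old_offset,
                               size_t new_offset, size_t size_bytes) {
  if (old_offset == new_offset) {
    return;
  }
  // The old and new ranges may overlap when fragments shift towards the front,
  // so go through a staging buffer.
  int8_t* staging = nullptr;
  cudaMalloc(reinterpret_cast<void**>(&staging), size_bytes);
  cudaMemcpy(staging, dev_buf + old_offset, size_bytes, cudaMemcpyDeviceToDevice);
  cudaMemcpy(dev_buf + new_offset, staging, size_bytes, cudaMemcpyDeviceToDevice);
  cudaFree(staging);
}

// Only fragments that are not resident yet pay the host-to-device fetch price.
void fetch_missing_fragment(int8_t* dev_buf, size_t offset,
                            const int8_t* host_frag, size_t size_bytes) {
  cudaMemcpy(dev_buf + offset, host_frag, size_bytes, cudaMemcpyHostToDevice);
}
```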

This won't bring any benefit on CPU: linearizing fragments there is expensive and redundant (memory-wise), and likely pointless anyway, since each thread processes one fragment (which is already linear) at a time.

When does this execution mode make sense:

  • The query is just one step (or several, with the later ones being more compute-intensive).
  • Proportions do not change between steps (this assumes a simple CPU+GPU scenario; the more devices we have, the more likely we are to shuffle fragments between them anyway).
  • Pure plan execution speed matters and the workload is not severely bandwidth-bound.

Fetching chunks takes noticeably less time than executePlan, so possible savings in executePlan might justify a small increase in latency caused by chunk placement.

Summary:
GPUs will be able to treat any number of fragments as one big fragment of arbitrary size, which aligns with the goal of heterogeneity and with the GPU's preferred programming model. But there is a price: memory management granularity and bandwidth. The question is whether the price is justified. After all, it will be just an option, so if bandwidth is the bottleneck, one can simply switch back to the current multifragment execution. But if one knows that a workload needs only a handful of columns, why not get the most out of the hardware?

The purpose of the related work is to see whether this brings any noticeable benefit at all; if not, we will at least know where not to look.

Any tips, suggestions, and critique are welcome.

@akroviakov akroviakov added the enhancement New feature or request label Aug 24, 2023
@ienkovich
Contributor

The basic idea looks very reasonable. But before implementation, I'd try to measure the possible gain by running some queries on GPU with different fragment sizes. By changing the fragment size you can run the same query with multi-fragment kernels and with single-fragment kernels (by setting a big enough fragment size, and maybe a config flag?). Then you'd be able to decide whether the implementation complexity is worth it.

@kurapov-peter
Contributor

@akroviakov, could you please post the preliminary data here as well?

@akroviakov
Contributor Author

To see the benefit and compare it to the current multifrag, one must be able to (1) linearize on GPU one-by-one and (2) modify some kernel (I'm thinking of the HT builder) to make it somewhat "linear". I'm currently working on linearizing on GPU, which needs something like getOneTableColumnFragmentTo(int8_t* dest, ..) to omit materialization on CPU and copy directly to the desired location.
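A hypothetical shape of such a helper (the signature follows the name above; the chunk view type is invented for illustration): it writes the fragment's chunks straight into a caller-provided destination, e.g. a pinned staging buffer, instead of materializing a separate CPU copy first.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Illustrative chunk view; a fragment may consist of several such pieces.
struct ChunkView {
  const int8_t* data;
  size_t size_bytes;
};

// Hypothetical helper: copy one column fragment directly to `dest` (a pinned
// staging buffer or any other destination), skipping the intermediate CPU
// materialization the current fetch path performs.
void getOneTableColumnFragmentTo(int8_t* dest,
                                 const std::vector<ChunkView>& fragment_chunks) {
  size_t offset = 0;
  for (const auto& chunk : fragment_chunks) {
    std::memcpy(dest + offset, chunk.data, chunk.size_bytes);
    offset += chunk.size_bytes;
  }
}
```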

@ienkovich
Contributor

To see the benefit and compare to the current multifrag, one must be able to (1) linearize on GPU one-by-one and (2) modify some kernel

Can you run the same query on the same data imported with different fragment sizes? E.g., you could import 100M rows with fragment_size=1'000'000 and with fragment_size=100'000'000. The first would use a multi-fragment kernel with 100 fragments, and the other could be run with a single fragment.

@akroviakov
Contributor Author

Quick data for a simple group-by over 10'000'000 rows:

Current multifrag, 2nd run, 20 frags:
          77ms start(16ms) executePlan Execute.cpp:3540
Current one frag, 2nd run:
          76ms start(13ms) executePlan Execute.cpp:3576

The difference due purely to memory layout is indeed not noticeable.
Does that mean it is simply better to focus on enabling the current multifragment mode? Btw, why is it currently disabled?
I was thinking about this execution mode with the HT builder in mind, which is not codegen'ed and looks a bit messy, so would you say it is just because of its implementation, not because of memory layout assumptions?

@kurapov-peter
Contributor

I'm rooting for the approach. I see no reason to linearize the data on the CPU. I also think kernel data sizes should be arbitrary. Here are some thoughts on the topic.

Interleaved execution

The multi-fragment execution approach potentially allows data copying to be interleaved with execution. Having all the data linearized assumes (not necessarily, but we imply it when we say the kernel will be simplified) that the iteration space for threading is now the whole linearized memory chunk, instead of a single fragment as it is now. The proposed approach can still employ the technique, but at a coarser granularity: interleaving would happen between kernels. The tradeoff affects latency and is probably not a significant concern, since any latency-sensitive (small-data) query is perfectly fine on the CPU.
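For illustration, coarse-grained interleaving with per-fragment kernels might look roughly like this (CUDA streams, a placeholder kernel, and preallocated per-fragment device buffers are assumptions; host buffers would need to be pinned for the copies to actually overlap):

```cpp
#include <cuda_runtime.h>
#include <cstdint>

// Placeholder for the generated per-fragment query kernel.
__global__ void process_fragment(const int8_t* frag, int64_t num_rows) {}

// While fragment i executes in one stream, fragment i+1 is copied in the other.
// A single linearized buffer would have to finish its whole transfer before its
// one kernel starts, unless the transfer itself is chunked in a similar way.
void run_interleaved(const int8_t* const* host_frags, const int64_t* frag_rows,
                     const size_t* frag_bytes, int num_frags,
                     int8_t** dev_frags /* preallocated device buffers */) {
  cudaStream_t streams[2];
  cudaStreamCreate(&streams[0]);
  cudaStreamCreate(&streams[1]);
  for (int i = 0; i < num_frags; ++i) {
    cudaStream_t s = streams[i % 2];
    cudaMemcpyAsync(dev_frags[i], host_frags[i], frag_bytes[i],
                    cudaMemcpyHostToDevice, s);
    process_fragment<<<256, 256, 0, s>>>(dev_frags[i], frag_rows[i]);
  }
  cudaDeviceSynchronize();
  cudaStreamDestroy(streams[0]);
  cudaStreamDestroy(streams[1]);
}
```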

Memory management on re-fragmenting

First, let's assume the data is small enough (for the other case, see below). In such a case there can be a situation where we need to move an additional fragment to or from the device. A naïve approach would reallocate the device buffer and move the data again to preserve linearity. My take on it is twofold:

  1. A GPU proportion small enough not to fill GPU memory is likely not worth rearranging, as the bookkeeping would outweigh the benefit.
  2. If we do benefit from rearranging the data in such a case: first, moving data off the GPU is not a problem, as we can keep track of the fragments with metadata. Second, moving fragments to the GPU does not necessarily mean reallocation. One could either use the multi-fragment approach at a larger scale (chained linear chunks) or use a heap manager to do allocations dynamically (if the cost of memory provider allocations is tolerable; this can be managed by heuristics).

For larger datasets, the GPU proportion would rather determine how many kernels we schedule to the GPU, while each of them may occupy "all" (probably half, to allow interleaved execution?) of the available memory. In this case, there is no need for fine-grained fragment management.

@kurapov-peter
Contributor

Quick data for a simple group-by over 10'000'000 rows:

Current multifrag, 2nd run, 20 frags:
          77ms start(16ms) executePlan Execute.cpp:3540
Current one frag, 2nd run:
          76ms start(13ms) executePlan Execute.cpp:3576

The difference due purely to memory layout is indeed not noticeable. Does that mean it is simply better to focus on enabling the current multifragment mode? Btw, why is it currently disabled? I was thinking about this execution mode with the HT builder in mind, which is not codegen'ed and looks a bit messy, so would you say it is just because of its implementation, not because of memory layout assumptions?

This is an expected result: for a simple group-by query you replace a global loop with more work for each thread, essentially saving a few always-taken branch instructions. The benefit of simplicity, as we discussed, may come from those checks (or rather their absence) in the HT build.

@ienkovich
Contributor

This is an expected result: for a simple group-by query you replace a global loop with more work for each thread, essentially saving a few always-taken branch instructions. The benefit of simplicity, as we discussed, may come from those checks (or rather their absence) in the HT build.

Does it mean linear-multifrag can be beneficial only when we have proper runtime/codegen to remove multifrag costs we don't need when running on a single linear chunk? Then I'd suggest working on that first, to get an opportunity to measure the real benefit of big linear chunks through experiments with fragment size.

A separate issue, I guess, is the efficiency of data fetches to GPU devices. Currently, we can have unnecessary copies of fragment data to CPU memory prior to copying it to the GPU, even when we fetch a single fragment. In many cases a fragment is not zero-copy accessible on the CPU, and therefore we linearize it just to copy it to the GPU. If we could instead copy the zero-copy-accessible fragment parts directly to the GPU, we could save some time and CPU memory. After that, the solution might be extended to multi-frag linearized fetches.

@kurapov-peter
Contributor

Does it mean linear-multifrag can be beneficial only when we have proper runtime/codegen to remove multifrag costs we don't need when running on a single linear chunk?

I think so. I don't see much reason for a linear chunk to be significantly superior in most common cases. It does save us some overhead on branching, but that's about it. Joins, however, are a different story.

The issue with data fetch is an evident one, I agree.

@alexbaden
Contributor

A few thoughts...

For GPU execution, we already support multi-fragment kernels. Therefore, we could have arbitrarily sized fragments on GPU now (yes, there is some outer-loop overhead, but that seems negligible, and let's ignore copy overheads for now), except that we need per-fragment metadata to properly size data structures. Though, because we do not have any concept of sharding, we actually don't size data structures for anything more than the full data. So, multi-frag execution of arbitrary size on GPU should be quite easily doable today.

As an aside, linearization cost is very high for varlen data. JoinColumnIterator was created to support joins on varlen data across multiple fragments. It was never finished, though, because the hash table payload really needs the ability to index. At the end of the day, though, it is just a multi-index, and I cannot imagine the overhead of adding to two integers is much higher than the overhead of adding to one.

On CPU, we do not support multi-fragment kernels because it seems redundant. On GPU you have kernel launch overhead; on CPU, not so much. That being said, it probably would not be too hard to add multi-fragment capability. But I think that is probably misguided. What you really want is to be able to assign each kernel some set of rows to compute - those rows may come from 10 fragments if the kernel is simple, or from less than one fragment if the kernel is complex (i.e., sub-fragments). The metadata problem is as above, right? You only care about whole-table stats, not per-fragment stats, at the end of the day. So, if you could (1) arbitrarily slice storage and assign it to kernels and (2) parametrize the kernels to generically take a set of rows to process, you could have both multi-fragment kernels and sub-fragments with just one code path (and again, likely minimal overhead - but it would be interesting to test).
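A small sketch of what (2) could mean, with invented names: the scheduler slices the table's total row count into ranges of whatever size fits the kernel's cost, and the kernels only ever see row ranges, so one code path covers both multi-fragment and sub-fragment work assignments.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// A kernel's work assignment: an arbitrary range of global row indices over the
// (virtually contiguous) table, independent of fragment boundaries.
struct RowRange {
  int64_t begin_row;  // inclusive
  int64_t end_row;    // exclusive
};

// Slice the table into per-kernel ranges: large ranges (several fragments' worth)
// for cheap kernels, small ranges (sub-fragment) for expensive ones.
std::vector<RowRange> slice_rows(int64_t total_rows, int64_t rows_per_kernel) {
  std::vector<RowRange> ranges;
  for (int64_t begin = 0; begin < total_rows; begin += rows_per_kernel) {
    ranges.push_back({begin, std::min(begin + rows_per_kernel, total_rows)});
  }
  return ranges;
}
```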

There is one place where per-fragment metadata can be incredibly useful, though, and that is fragment skipping. If we ingest data in date order, for example (lots of data is generated in date order - think of point-of-sale records), and then query on a date range, we can avoid even loading a lot of data if we have per-fragment metadata. Since we have to compute metadata anyway, it may make sense to keep the concept of a fragment in storage and use the per-fragment metadata for fragment skipping on filter queries. I don't know how prevalent this is in our use cases or benchmarks, though.
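For illustration, fragment skipping with per-fragment min/max stats might look roughly like this (hypothetical stats structure, not the actual metadata classes):

```cpp
#include <cstdint>
#include <vector>

// Per-fragment min/max for the filtered column, e.g. a date stored as int64.
struct FragmentStats {
  int64_t min_val;
  int64_t max_val;
};

// For a filter like "x BETWEEN lo AND hi", keep only fragments whose value range
// intersects the predicate; the rest never have to be loaded at all.
std::vector<int> fragments_to_scan(const std::vector<FragmentStats>& stats,
                                   int64_t lo, int64_t hi) {
  std::vector<int> keep;
  for (int i = 0; i < static_cast<int>(stats.size()); ++i) {
    if (stats[i].max_val >= lo && stats[i].min_val <= hi) {
      keep.push_back(i);
    }
  }
  return keep;
}
```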

@akroviakov
Contributor Author

As an aside, linearization cost is very high for varlen data.

In relation to the idea: if it has to be sent to GPU anyways and we know the total size of what we are sending, we can assemble data linearly without any additional cost, can't we?

overhead of adding to two integers is much higher than the overhead of adding to one.

We iterate there using operator++, which checks whether we are inside the chunk for each row of each column. Isn't such a branch a performance penalty on GPU? Maybe this concerns only this unfinished iterator.

you could (1) arbitrarily slice storage

That is the biggest issue: we have columns that are chunked (ArrowStorage ChunkedArray), and we read CSVs in 20 MB blocks. That means that, e.g., some partition columns of an 8 GB (20M rows) Taxi dataset get sliced into ~407 arrays (of 48.5-50k rows each). With Parquet it is a bit better, but still ~153 arrays (~131k rows per array). Getting arbitrary slices (i.e., fragments) would in the end mean redundant linearized materialization of these arrays. Currently we materialize fragments that require linearization (so most of the fragments). Any fragment load onto the GPU also materializes on the CPU if it requires linearization (memory levels).

The big plan is to move away from fragments and operate directly on these Arrow chunks, which are zero-copy for the CPU, so there is no actual need to materialize data on the CPU. But on the GPU we have to materialize data anyway, so why not walk over the chunks and assemble them directly into a linear GPU buffer (i.e., a set of rows of arbitrary size)? We can stick to fixed-size types for now in this execution mode. This way the CPU (with multifragment or "multichunk") still has good granularity, and the GPU can easily slide through its set of rows. Both the CPU and the GPU would then be in their optimal setting for a step/pipeline execution, regardless of the fragment/chunk size.

(2) parametrize the kernels to generically take a set of rows

This is supposed to be part of implementing this execution mode for GPU.

it may make sense to leave the concept of fragment in storage and use the per-fragment metadata for fragment skipping on filter queries

Aren't we already doing that when deciding which fragments to skip at dispatch time, somewhere in createKernels()? And fragments already have a nice mergeStats() function, which can later easily be adapted to Arrow chunks and won't go away.

@ienkovich
Contributor

As an aside, linearization cost is very high for varlen data.

It's true, we have to adjust offsets for varlen data. But we pay this price anyway, even on CPU now, because Arrow chunks don't match fragments and we make an adjusted copy of the offset buffer on fetch.

On CPU, we do not support multi-fragment kernels because it seems redundant.

Actually, we do support it. When the aggregation result is so big that the reduction kills any gain from multiple kernels, we run a single multi-fragment kernel on CPU.

So, if you could (1) arbitrarily slice storage and assign it to kernels and (2) parametrize the kernels to generically take a set of rows to process, you could have both multi-fragment kernels and sub-fragments with just one code path

Yes, this is the goal. An additional challenge here is to use zero-copy fetch as much as possible, i.e. follow Arrow chunks for imported tables, or ResultSets when working with execution results. Following existing Arrow chunks in ArrowStorage can be quite tricky, though. Different columns might have different chunks in our Arrow data because some columns keep the original split while others were linearized during nulls/data transformations. The resulting batches would actually depend on the set of rows we want to use in a query. The easiest solution is to always use the most granular split we have across all columns. I guess it should be OK for CPU unless the chunks become extremely small.
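One possible reading of "the most granular split", sketched with invented names: take the union of all columns' chunk boundaries, so that slicing every column at those row offsets keeps each resulting batch zero-copy for every column.

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Given each column's chunk lengths (in rows), return the union of chunk
// boundaries as global row offsets. Slicing all columns at these offsets never
// cuts through any column's chunk, at the cost of producing smaller batches.
std::vector<int64_t> most_granular_boundaries(
    const std::vector<std::vector<int64_t>>& per_column_chunk_lengths) {
  std::set<int64_t> cuts;
  for (const auto& lengths : per_column_chunk_lengths) {
    int64_t offset = 0;
    for (int64_t len : lengths) {
      offset += len;
      cuts.insert(offset);
    }
  }
  return {cuts.begin(), cuts.end()};
}
```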

@alexbaden
Contributor

In relation to the idea: if it has to be sent to GPU anyways and we know the total size of what we are sending, we can assemble data linearly without any additional cost, can't we?

Well, one issue I can think of is that you lose your notion of where you are in the original dataset in "storage". Consider SELECT * FROM t WHERE x > 10. We only need to do computation on column x - the other columns of t are immaterial to the query. So, we would place column x on the GPU and then place the fragment offsets that pass the filter in the output buffer, materializing lazily as the user iterates (this is especially important for SELECT * FROM t WHERE x > 10 LIMIT 10 queries, where you may be able to skip the load entirely once enough rows pass while iterating the result set). That works nicely now because we can easily map the fragment ID and offset-in-fragment to the list of fragments when iterating results, but it would require more computation if we assembled data linearly on the way to the GPU.

@alexbaden
Contributor

Yes, this is the goal. An additional challenge here is to use zero-copy fetch as much as possible, i.e. follow Arrow chunks for imported tables, or ResultSets when working with execution results. Following existing Arrow chunks in ArrowStorage can be quite tricky, though. Different columns might have different chunks in our Arrow data because some columns keep the original split while others were linearized during nulls/data transformations. The resulting batches would actually depend on the set of rows we want to use in a query. The easiest solution is to always use the most granular split we have across all columns. I guess it should be OK for CPU unless the chunks become extremely small.

I don't follow - why can't we just maintain Arrow chunks of whatever size is optimal and slice them globally, using some notion of a "virtual fragment", when we fetch data for the query?

@akroviakov
Contributor Author

I agree that having a linear buffer for a column on the GPU, where we place loaded fragments one after another, doesn't make a lot of sense when there is a LIMIT in the query; linearity indeed doesn't play any role there, and the linear-on-device mode wouldn't make any difference.

you lose your notion of where you are in the original dataset in "storage".

I was thinking about sticking to the current multifrag infrastructure for fragment bookkeeping, but adding a LINEAR flag to the kernel that tells the GPU to treat the fragments as one. So, given the index of a qualifying tuple, we can easily get the fragment ID and offset, because we know where we started, the fragments were loaded in their sequential order, and only the last fragment can be underfilled.
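The recovery described above then reduces to a division and a modulo (illustrative helper), assuming fragments were appended in load order and all but the last hold the same number of rows:

```cpp
#include <cstdint>
#include <utility>

// Map a global row index in the linear device buffer back to (fragment id,
// offset within fragment), given a fixed per-fragment row count.
std::pair<int32_t, int64_t> to_frag_and_offset(int64_t global_row,
                                               int64_t rows_per_frag) {
  const int32_t frag_id = static_cast<int32_t>(global_row / rows_per_frag);
  const int64_t offset = global_row % rows_per_frag;
  return {frag_id, offset};
}
```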

I would also like to bring up another aspect of linearity on GPU: GPU sharing between different HDK instances in heterogeneous mode. Heterogeneous mode implies that we want more fragments than there are logical CPU cores in order to actually share the workload, right? So the more logical cores a CPU has, the more we put the GPU at a disadvantage, as it now has more outer-loop iterations. This seems negligible, but only up to a point: for example, on a system with 64 logical cores and 128 fragments, 3 out of 4 taxi queries (40M rows) run up to 2x slower on GPU than with 16 fragments.

Now consider the following scenario: same dataset, HDK instance (1) is running many queries on GPU (even if partially), and HDK instance (2) also wants to run queries on GPU. As expected, the larger the data proportion we run on GPU in (1), the slower we are (launchGpuCode) in (2) - but only if we have many fragments. With a small fragment count there is no slowdown, but as the fragment count grows, the queries become up to 4x slower.

Do you think this is caused by what we are discussing here in the context of linearity on device, or could it be something else?

@alexbaden
Contributor

I don't know why launchGpuCode would be slower with more fragments than with fewer on the same data. The join column iterator is all pre-processing, so it should not affect GPU code (perhaps except for the lookup - but I am pretty sure the lookup is still a global index, which causes other problems). It would be good to understand where the current overheads are coming from - perhaps just ordering the data blocks passed to the GPU, or changing the algorithm for balancing fragments across GPUs, would provide the same benefits as linearization without having to re-construct (and then put back together in ResultSet) the fragment ordering?
