
Premise Selection on GPUs #53

Open

yangky11 opened this issue Jan 29, 2024 · 1 comment

Labels: bug (Something isn't working)

Comments

@yangky11 (Member)
We currently select premises only on CPUs, but we want it to also work on GPUs.

// TODO: We should remove this line when everything can work well on CUDA.

@Peiyang-Song. A recent discussion on Zulip may be relevant: https://leanprover.zulipchat.com/#narrow/stream/219941-Machine-Learning-for-Theorem-Proving/topic/LeanCopilot/near/418208166

@yangky11 yangky11 added the bug Something isn't working label Jan 29, 2024
@yangky11 yangky11 changed the title Premise Selection on GPU Premise Selection on GPUs Jan 29, 2024
@namibj commented Feb 21, 2024

I believe premise selection should work on CUDA if these lines are changed from:

LeanCopilot/cpp/ct2.cpp, lines 335 to 336 (at 38a5a48):

const int *p_topk_indices = topk_indices.data<int>();
const float *p_topk_values = topk_values.data<float>();

to:

auto p_topk_indices = topk_indices.to_vector<int>();
auto p_topk_values = topk_values.to_vector<float>();
// Possibly the following line is also required; I'd have tested this if I had a system
// the CUDA support ran on. Sadly, something appears to have accidentally assumed a newer
// GPU than mine, even though mine is technically supported according to the documentation:
// encoding and generation fail with the prebuilt binary, and I can't even seem to compile
// with CUDA support.
ctranslate2::synchronize_stream(device);

It is my expectation that running without the synchronize_stream call, if it turns out to be necessary, would cause spurious stale or even garbage values to be read in the subsequent loop, because the device-to-host memcpy would then be issued asynchronously and would race the execution of that loop.
Since they are copied later, p_topk_values would be more prone to containing wrong or stale values. I'd expect a simple loop checking that they are sorted as the TopK operator mandates, with a query embedding just pulled out of nowhere (the loss function for the encoder fine-tuning should have arranged for the premise embeddings to cover the embedding space's hypersphere fairly evenly), to surface the issue if it exists; a sketch of such a check follows below.
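
As a rough illustration of that check (a hypothetical helper, not code from the repo; it assumes the TopK values are laid out on the host as num_queries × k row-major floats):

#include <cassert>
#include <vector>

// Verify that each query's top-k scores are in the non-increasing order the TopK
// operator guarantees; stale or garbage values caused by a missing synchronization
// would usually break this invariant.
void check_topk_sorted(const std::vector<float> &topk_values, int num_queries, int k) {
  for (int q = 0; q < num_queries; ++q)
    for (int i = 1; i < k; ++i)
      // Within one query, entry i-1 must score at least as high as entry i.
      assert(topk_values[q * k + i - 1] >= topk_values[q * k + i]);
}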

For better throughput, retrieval should be batched. As-is, the bulk of the work performs a single multiply-add per scalar loaded from memory; a Ryzen 9 5950X, for example, would, if I'm not mistaken, be bound by DRAM read bandwidth up to a batch size of about 80. Exploiting that would, however, require loop tiling into both the register file and L1 cache, since with a 1472-dimensional embedding and f32 values, 80 queries already barely fit in L2 cache.
The approach I'd take is to process all queries on every involved core, chunk the premise list, and spread the chunks across the cores; a rough sketch follows below.
My understanding is that GPUs suffer even more from this need for a large enough batch size before they stop being memory-bound.
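
Purely as an illustrative sketch of that chunking scheme (the names queries, premises, scores, score_chunk, and score_all are my own placeholders, not LeanCopilot's actual API, and the register/L1 tiling mentioned above is omitted):

#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Each thread scores ALL queries against ITS OWN chunk of the premise list, so a premise
// embedding is read from DRAM once per batch instead of once per query.
static void score_chunk(const std::vector<float> &queries,   // num_queries x dim, row-major
                        const std::vector<float> &premises,  // num_premises x dim, row-major
                        std::vector<float> &scores,          // num_queries x num_premises
                        std::size_t dim, std::size_t num_queries, std::size_t num_premises,
                        std::size_t begin, std::size_t end) {
  for (std::size_t p = begin; p < end; ++p)
    for (std::size_t q = 0; q < num_queries; ++q) {
      float dot = 0.0f;  // inner product of one query with one premise embedding
      for (std::size_t d = 0; d < dim; ++d)
        dot += queries[q * dim + d] * premises[p * dim + d];
      scores[q * num_premises + p] = dot;
    }
}

void score_all(const std::vector<float> &queries, const std::vector<float> &premises,
               std::vector<float> &scores, std::size_t dim, std::size_t num_queries,
               std::size_t num_premises, std::size_t num_threads) {
  std::vector<std::thread> workers;
  const std::size_t chunk = (num_premises + num_threads - 1) / num_threads;
  for (std::size_t t = 0; t < num_threads; ++t) {
    const std::size_t begin = t * chunk;
    const std::size_t end = std::min(num_premises, begin + chunk);
    if (begin >= end) break;
    // Each worker owns a disjoint premise range, so no synchronization on scores is needed.
    workers.emplace_back(score_chunk, std::cref(queries), std::cref(premises),
                         std::ref(scores), dim, num_queries, num_premises, begin, end);
  }
  for (auto &w : workers) w.join();
}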

And yes, I do mean that retrieving 100 queries, with a small enough k in the TopK and sufficiently optimized code, would be expected to take only about twice as long as retrieving a single query. It's that bad.
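
For a rough back-of-envelope behind the batch-size figure above (my own numbers, not measured; assuming dual-channel DDR4-3200 at about 51.2 GB/s and 16 Zen 3 cores each retiring two 8-wide f32 FMAs per cycle at roughly 4 GHz):

16 cores × 4 GHz × 16 FMA/cycle ≈ 1024 GFMA/s of compute
51.2 GB/s ÷ 4 B per f32 ≈ 12.8 Gfloat/s loaded from DRAM
1024 / 12.8 ≈ 80 FMAs available per loaded float

So below a batch of roughly 80 queries the cores mostly wait on DRAM, which is why a single query costs nearly as much memory traffic as a whole batch.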
