
Premise Selection on GPUs #53

Open

yangky11 opened this issue Jan 29, 2024 · 1 comment

Labels: bug (Something isn't working)

Comments

@yangky11 (Member)
We currently select premises only on CPUs, but we want it to also work on GPUs.

// TODO: We should remove this line when everything can work well on CUDA.

@Peiyang-Song. A recent discussion on Zulip may be relevant: https://leanprover.zulipchat.com/#narrow/stream/219941-Machine-Learning-for-Theorem-Proving/topic/LeanCopilot/near/418208166

@yangky11 yangky11 added the bug Something isn't working label Jan 29, 2024
@yangky11 yangky11 changed the title Premise Selection on GPU Premise Selection on GPUs Jan 29, 2024
@namibj commented Feb 21, 2024

I believe premise selection should work on CUDA if these lines are changed from:

LeanCopilot/cpp/ct2.cpp, lines 335 to 336 (at 38a5a48):

const int *p_topk_indices = topk_indices.data<int>();
const float *p_topk_values = topk_values.data<float>();

to:

auto p_topk_indices = topk_indices.to_vector<int>();
auto p_topk_values = topk_values.to_vector<float>();
// Possibly the following line is also required; I'd have tested this if I had a system
// the CUDA support ran on. Sadly, something appears to have accidentally assumed a newer
// GPU than mine, even though mine is technically supported according to the documentation:
// encoding and generation fail with the prebuilt binary, and I can't even seem to compile
// with CUDA support.
ctranslate2::synchronize_stream(device);

It is my expectation that running without the synchronize_stream call, if it turns out to be necessary, would cause spurious stale or even garbage values to be read in the subsequent loop, because the device-to-host memcpy would then be issued asynchronously and would race the execution of that loop.
Since they are copied later, p_topk_values would be more prone to containing wrong or stale values. I'd expect a simple loop checking that they are sorted as the TopK operator mandates, with a query embedding just pulled out of nowhere (the loss function for the encoder fine-tuning should have arranged for the premise embeddings to cover the embedding space's hypersphere fairly evenly), to surface the issue if it exists; a sketch of such a check follows below.
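
As a rough illustration of that check (a hypothetical helper, not code from the repo; it assumes the TopK values are laid out on the host as num_queries × k row-major floats):

#include <cassert>
#include <vector>

// Verify that each query's top-k scores are in the non-increasing order the TopK
// operator guarantees; stale or garbage values caused by a missing synchronization
// would usually break this invariant.
void check_topk_sorted(const std::vector<float> &topk_values, int num_queries, int k) {
  for (int q = 0; q < num_queries; ++q)
    for (int i = 1; i < k; ++i)
      // Within one query, entry i-1 must score at least as high as entry i.
      assert(topk_values[q * k + i - 1] >= topk_values[q * k + i]);
}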

For better throughput, retrieval should be batched. As-is, the bulk of the work performs a single multiply-add per scalar loaded from memory; a Ryzen 9 5950X, for example, would, if I'm not mistaken, be bound by DRAM read bandwidth up to a batch size of about 80. Exploiting that would, however, require loop tiling into both the register file and L1 cache, since with a 1472-dimensional embedding and f32 values, 80 queries already barely fit in L2 cache.
The approach I'd take is to process all queries on every involved core, chunk the premise list, and spread the chunks across the cores; a rough sketch follows below.
My understanding is that GPUs suffer even more from this need for a large enough batch size before they stop being memory-bound.
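
Purely as an illustrative sketch of that chunking scheme (the names queries, premises, scores, score_chunk, and score_all are my own placeholders, not LeanCopilot's actual API, and the register/L1 tiling mentioned above is omitted):

#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Each thread scores ALL queries against ITS OWN chunk of the premise list, so a premise
// embedding is read from DRAM once per batch instead of once per query.
static void score_chunk(const std::vector<float> &queries,   // num_queries x dim, row-major
                        const std::vector<float> &premises,  // num_premises x dim, row-major
                        std::vector<float> &scores,          // num_queries x num_premises
                        std::size_t dim, std::size_t num_queries, std::size_t num_premises,
                        std::size_t begin, std::size_t end) {
  for (std::size_t p = begin; p < end; ++p)
    for (std::size_t q = 0; q < num_queries; ++q) {
      float dot = 0.0f;  // inner product of one query with one premise embedding
      for (std::size_t d = 0; d < dim; ++d)
        dot += queries[q * dim + d] * premises[p * dim + d];
      scores[q * num_premises + p] = dot;
    }
}

void score_all(const std::vector<float> &queries, const std::vector<float> &premises,
               std::vector<float> &scores, std::size_t dim, std::size_t num_queries,
               std::size_t num_premises, std::size_t num_threads) {
  std::vector<std::thread> workers;
  const std::size_t chunk = (num_premises + num_threads - 1) / num_threads;
  for (std::size_t t = 0; t < num_threads; ++t) {
    const std::size_t begin = t * chunk;
    const std::size_t end = std::min(num_premises, begin + chunk);
    if (begin >= end) break;
    // Each worker owns a disjoint premise range, so no synchronization on scores is needed.
    workers.emplace_back(score_chunk, std::cref(queries), std::cref(premises),
                         std::ref(scores), dim, num_queries, num_premises, begin, end);
  }
  for (auto &w : workers) w.join();
}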

And yes, I do mean that retrieving 100 queries, with a small enough k in the TopK and sufficiently optimized code, would be expected to take only about twice as long as retrieving a single query. It's that bad.
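
For a rough back-of-envelope behind the batch-size figure above (my own numbers, not measured; assuming dual-channel DDR4-3200 at about 51.2 GB/s and 16 Zen 3 cores each retiring two 8-wide f32 FMAs per cycle at roughly 4 GHz):

16 cores × 4 GHz × 16 FMA/cycle ≈ 1024 GFMA/s of compute
51.2 GB/s ÷ 4 B per f32 ≈ 12.8 Gfloat/s loaded from DRAM
1024 / 12.8 ≈ 80 FMAs available per loaded float

So below a batch of roughly 80 queries the cores mostly wait on DRAM, which is why a single query costs nearly as much memory traffic as a whole batch.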
