
added gpu benchmarking script #192

Open · jcaip wants to merge 2 commits into main

Conversation

@jcaip (Contributor) commented Apr 30, 2024

Add combined GPU sparsity benchmarking script.

This is really a combination of two scripts: https://gist.github.com/cpuhrsch/7fec60079cbe2daeff59c0577f933320 for BSR benchmarking, and https://github.com/pytorch/pytorch/blob/8db72a430d0c3a7d3388749d5d438fb805f53407/benchmarks/sparse/benchmark_semi_structured_sparsity.py for semi-structured sparsity benchmarking.

We're planning on releasing superblock soon, so I want to point the benchmarks here, the idea being that we can farm out consumer-card benchmarks for block sparsity like we did with #174.

For the superblock benchmarks, run:

python benchmarks/benchmark_gpu_sparsity.py --mode sam-vitb-shapes --sparsity block-sparse --sparsity-level 0.8 --block-size 64 --dtype fp32
python benchmarks/benchmark_gpu_sparsity.py --mode sam-vitb-shapes --sparsity block-sparse --sparsity-level 0.9 --block-size 64 --dtype fp32
python benchmarks/benchmark_gpu_sparsity.py --mode sam-vitb-shapes --sparsity block-sparse --sparsity-level 0.8 --block-size 32 --dtype fp32
python benchmarks/benchmark_gpu_sparsity.py --mode sam-vitb-shapes --sparsity block-sparse --sparsity-level 0.9 --block-size 32 --dtype fp32

@facebook-github-bot added the CLA Signed label on Apr 30, 2024. (This label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed.)
if args.save:
    save_file = f"{args.mode}_{args.dtype}_{args.backend}.csv"
    df.to_csv(save_file)
    print(f"Finished benchmark: {args.mode} saved results to {save_file}")
Member:

wanna also recommend people post their results on a central issue here?

import torch.utils.benchmark as benchmark
import torch.nn.functional as F
from torch import nn
from torch.sparse import SparseSemiStructuredTensor, to_sparse_semi_structured
Member:

does this require nightlies?

Contributor Author (@jcaip):

We should probably log torch.__version__, but this doesn't require nightlies. Is there a way we can track the torchao version as well?
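A minimal sketch of the version logging suggested here. torchao may not expose __version__ on all builds, so this falls back to importlib.metadata (standard library); the function name is hypothetical:

import importlib.metadata

import torch

def log_versions():
    # torch always exposes __version__.
    print(f"torch: {torch.__version__}")
    # torchao version via package metadata; assumes torchao was installed as a package.
    try:
        print(f"torchao: {importlib.metadata.version('torchao')}")
    except importlib.metadata.PackageNotFoundError:
        print("torchao: not installed as a package")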

    return sparse_weight


def benchmark_in_us(f, *args, **kwargs):
Contributor:

There are a couple of caveats to this function. I somewhat trust it less than using cuda synchronize and a for loop. Also, I'd report the standard deviation as well: if you have 5us but it's +/- 20us, something went wrong. blocked_autorange is supposed to help with that, but it's better to verify and print it.
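For illustration, a minimal sketch of the synchronize-and-loop style of timing this comment describes, using CUDA events and reporting mean and standard deviation in microseconds (warmup and iteration counts here are arbitrary choices):

import statistics

import torch

def benchmark_in_us_manual(f, *args, n_warmup=10, n_iters=100, **kwargs):
    # Warm up so compilation and caching effects don't pollute the timings.
    for _ in range(n_warmup):
        f(*args, **kwargs)
    torch.cuda.synchronize()
    times_us = []
    for _ in range(n_iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        f(*args, **kwargs)
        end.record()
        torch.cuda.synchronize()
        times_us.append(start.elapsed_time(end) * 1e3)  # elapsed_time is in ms
    return statistics.mean(times_us), statistics.stdev(times_us)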

Contributor Author (@jcaip):

Yeah, I think blocked_autorange is not great - for these benchmarks, a lot of the time it's just running once, likely as @HDCharles highlighted here.

I think we can use adaptive_autorange instead, wrapped with torch.cuda.synchronize(), to minimize the variability.
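A sketch of what that could look like, assuming the existing benchmark_in_us signature; the explicit torch.cuda.synchronize() in the timed statement keeps kernel launches from just queuing up asynchronously:

import torch
import torch.utils.benchmark as benchmark

def benchmark_in_us(f, *args, **kwargs):
    # adaptive_autorange keeps sampling until the measurement stabilizes,
    # unlike blocked_autorange, which can settle after very few runs.
    measurement = benchmark.Timer(
        stmt="f(*args, **kwargs); torch.cuda.synchronize()",
        globals={"f": f, "args": args, "kwargs": kwargs, "torch": torch},
    ).adaptive_autorange()
    # Measurement times are in seconds; convert the median to microseconds.
    return measurement.median * 1e6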

def run_gpu_sparse_benchmark(m, k, n, args):
    dtype = DTYPE_LOOKUP[args.dtype]

    x = torch.randn(n, k).to(dtype).cuda()
Contributor:

Since we don't care about accuracy here, I assume (subnormal-number performance aside) you could also try torch.empty(n, k, dtype=dtype, device='cuda'), which might be faster to allocate and doesn't require calling randn. Especially if you run a lot of benchmarks in a row, the wait can become annoying.

Contributor Author (@jcaip):

I think we want to avoid this, because it will bias the numbers: https://www.thonking.ai/p/strangely-matrix-multiplications

Contributor:

Yes, zeros will be an issue, and also subnormal numbers and such. But you're right that we can't rely on empty not to give us all zeros.

Hm, I guess if this ever really becomes a bottleneck, we can write a simpler random number generator (like arange plus a mod with a prime number, etc.).
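A hypothetical sketch of that arange-plus-prime-modulus generator; the fill values land in roughly [-1, 1), avoiding both all-zero and subnormal inputs:

import torch

def cheap_pseudo_random(n, k, dtype, device="cuda", prime=251):
    # Deterministic, cheap fill: consecutive integers modulo a prime cycle
    # through varied values without a real RNG.
    vals = torch.arange(n * k, device=device) % prime
    # Scale/shift into roughly [-1, 1) to mimic randn's typical range.
    return (vals.to(dtype) / prime * 2 - 1).reshape(n, k)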

elif args.eval_fn == "mm":
    dense_output = torch.mm(x, A)
    sparse_output = torch.mm(x, A_sparse)
    correct = torch.allclose(dense_output, sparse_output, rtol=1e-3, atol=1e-3)
Contributor:

Alright, so we do care about correctness. It seems like maybe something to turn on/off. Morally this should be covered by unit tests, but I also never mind more sources of verification.

Contributor:

You can use torch.testing.assert_allclose to have it raise an exception with location and error.
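For example (note that in recent PyTorch, torch.testing.assert_allclose is deprecated in favor of torch.testing.assert_close; the stand-in tensors here are just for illustration):

import torch

dense_output = torch.randn(32, 32)
sparse_output = dense_output + 1e-5  # stand-in for the sparse kernel's output

# Raises an AssertionError reporting the number of mismatched elements and
# the greatest absolute/relative differences if the tensors diverge.
torch.testing.assert_close(sparse_output, dense_output, rtol=1e-3, atol=1e-3)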

Contributor Author (@jcaip):

I think when I first wrote this script, we didn't have tests. Let's just remove this correctness checking, since we have better testing now.

}


if __name__ == "__main__":
Contributor:

There are pros and cons to creating new processes for each benchmark, but it seems like in general this script will need a default setting that runs all relevant or interesting configurations. If someone runs python benchmark_gpu_sparsity.py and then posts the result, is that enough to be useful?

pytorch-bot bot commented May 15, 2024

🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/192

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit 20edc7c with merge base e3ed90f.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
