Excuse me, is the performance evaluation of small/skinny matrix by the following program correct? #629

Open
ProgrammerWLY opened this issue May 2, 2022 · 1 comment

Comments

@ProgrammerWLY

		if ( bli_does_notrans( transa ) )
			bli_obj_create( dt, m, k, rs_a, cs_a, &a );
		else
			bli_obj_create( dt, k, m, cs_a, rs_a, &a );

		if ( bli_does_notrans( transb ) )
			bli_obj_create( dt, k, n, rs_b, cs_b, &b );
		else
			bli_obj_create( dt, n, k, cs_b, rs_b, &b );
		if ( bli_does_notrans( transc ) )
			bli_obj_create( dt, m, n, rs_c, cs_c, &c );
		else
			bli_obj_create( dt, n, m, cs_c, rs_c, &c );

		bli_randm( &a );
		bli_randm( &b );
		bli_randm( &c );

		// warm up
		for(i = 0 ; i < 2; i++)
			bli_gemm( &alpha, &a, &b, &beta, &c);

		// loop = 50
		start = dclock();
		for(i = 0 ; i < loop; i++)
			bli_gemm( &alpha, &a, &b, &beta, &c);
		cost = (dclock() - start)/loop;

		printf("blis sup : M=%d , N= %d, K =%d, Gflops= %lf, effic = %lf%%\n", 
			m, n, k, ops/cost, ops/cost /8.8 * 100);

		bli_obj_free( &a );
		bli_obj_free( &b );
		bli_obj_free( &c );
@fgvanzee
Member

fgvanzee commented May 2, 2022

A few comments.

  1. We don't typically advocate for honoring a trans_t parameter on the output matrix.
  2. This code leaves out the setting of alpha and beta (and other details).
  3. Whether small/skinny execution takes place (vs. the conventional code path) is decided by comparing the problem dimensions to hardware-specific thresholds. Thus, I can't really say from the above code that sup would execute at all.
  4. Your code calculates the average execution time. There's nothing inherently wrong with this. That said, we almost always report the fastest of n_repeat executions. (Note: Typically for us, n_repeat == 3, although there is nothing special about that number aside from being low enough that the suite of experiments finishes in a reasonable time.)
  5. BLIS does not export a dclock() function, so I'm assuming you are defining that on your own (or obtaining it elsewhere). Instead, we have bli_clock() and its helper function, bli_clock_min_diff(), which we use for reporting the fastest of multiple trials, as mentioned above. (See test/test_gemm.c for an example of how this is used in a simple setting.)
  6. We don't usually perform any "warm up" executions, in part because we report the fastest rather than the average time, but also because there is no guarantee that the warmed-up data will still reside in the core's local caches by the time the measured tests commence. Why is this? The OS scheduler may have since migrated the process to another core. You can prevent such migration, however, by binding threads to cores, as described in our Multithreading.md document. (That said, while there is no guarantee that the process will stay in one place, in practice it's unlikely to move in that short a time.)
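
For illustration, the "fastest of n_repeat" approach from points 4 and 5 could be sketched roughly as follows. This is a standalone sketch and not BLIS code: now_seconds(), naive_gemm(), fastest_gemm_time(), and gemm_gflops() are hypothetical names made up for this example, and the naive triple loop merely stands in for the bli_gemm() call. In a real BLIS test driver, bli_clock() and bli_clock_min_diff() would take the place of the timer and the running-minimum update (see test/test_gemm.c). The flop count of 2*m*n*k is the usual convention for gemm and is assumed here, since the original snippet does not show how ops is computed.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime() */

#include <assert.h>
#include <float.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical wall-clock timer; stands in for something like bli_clock(). */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1.0e-9 * (double)ts.tv_nsec;
}

/* Naive column-major C := C + A*B, used only as a workload to time. */
static void naive_gemm(size_t m, size_t n, size_t k,
                       const double *a, const double *b, double *c)
{
    for (size_t j = 0; j < n; ++j)
        for (size_t p = 0; p < k; ++p)
            for (size_t i = 0; i < m; ++i)
                c[i + j * m] += a[i + p * m] * b[p + j * k];
}

/* Run the workload n_repeat times and return the fastest single time,
   rather than the average over all runs. */
static double fastest_gemm_time(size_t m, size_t n, size_t k,
                                const double *a, const double *b, double *c,
                                int n_repeat)
{
    assert(n_repeat > 0);

    double t_min = DBL_MAX;
    for (int r = 0; r < n_repeat; ++r) {
        double t0 = now_seconds();
        naive_gemm(m, n, k, a, b, c);
        double dt = now_seconds() - t0;
        if (dt < t_min)
            t_min = dt;  /* keep the best (smallest) time seen so far */
    }
    return t_min;
}

/* Convert a time in seconds to GFLOPS, assuming 2*m*n*k flops per gemm. */
static double gemm_gflops(size_t m, size_t n, size_t k, double seconds)
{
    return 2.0 * (double)m * (double)n * (double)k / (seconds * 1.0e9);
}
```

A caller would allocate and fill a, b, and c, call fastest_gemm_time(m, n, k, a, b, c, 3), and pass the result to gemm_gflops(). Because the minimum is reported, a slow first iteration (cold caches) does not skew the result, which is one reason a separate warm-up loop becomes less important. Note that C accumulates across repeats in this sketch; a driver that needs identical inputs on every run would restore C from a saved copy before each timed call.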
