This post will discuss the process I went through to optimize a sparse matrix vector multiply (SPMV) that I use in my multibody dynamics code. I wanted to document the process and information that I gathered so that it may help others, and so that I could refer back to it.
Here is how I plan on going about this: first, I will determine the peak performance I could hope to achieve on my test machine. I know I will never reach theoretical peak performance, so comparing my code's performance against the theoretical maximum doesn't make sense; instead, I will rely on clpeak as my control. Then I will take my SPMV operation and try to get it to run as close as possible to what the benchmark tells me is achievable.
I plan on detailing every kernel modification I made even when it makes the code perform slower. I am by no means an expert with OpenCL (currently at 3 days of experience), but I do know a bit about writing parallel code.
The specifications for the test rig are as follows:
1 x Supermicro 1042-LTF SuperServer
4 x AMD Opteron 6274 2.2GHz 16-core processor
L1 cache (exclusive)
L2 cache / # cores
L3 cache / # cores
128GB DDR3 ECC Registered
1333 MHz x 64 bits / 8 = ~10.41 GB/s memory bandwidth per channel
I wanted to get a better understanding of the peak performance I could hope for when using OpenCL on this machine. I found a nice little benchmark called clpeak written by Krishnaraj Bhat. He has a blog post here discussing it in more detail.
On to the results!
If I take my test machine’s specifications and multiply them out I get the following values:
563.2 DP Gflops for 4xAMD Opteron 6274
166.56 GB/s peak memory bandwidth based on 1333 Mhz memory speed
Comparing that to the results from clpeak:
246.49 DP Gflops
51.59 GB/s memory bandwidth
This means that I am reaching 43.76% of peak flop rate and 30.97% peak memory bandwidth, not too bad!
Problem I want to solve
My problem involves performing a sparse matrix vector multiply (SPMV) between a list of contact Jacobian matrices (one for each contact) and a vector of values. The vector represents my Lagrange multipliers, but for the purposes of this discussion that is not important.
I will time my code by running the kernel 100 times and averaging the results. Because the first kernel call is usually expensive, I run the kernel once outside the timing loop to remove any bias.
Each contact i has two jacobians (one for each body, A, B, in the contact) stored in dense form and organized as follows:
Jacobian Matrix (D_i)
I need to multiply:
Output_A = (D_i_A)^T * x
Output_B = (D_i_B)^T * x
for each contact. Output_A and Output_B each have 6 values (3 for translation, 3 for rotation); x has 3 values (one for each constraint).
The total number of floating point operations is 60: 18 multiplications and 12 additions, performed twice (once for A, once for B).
I need to load 2 x 18 floats for the jacobians, load 3 floats for x, and store 2 x 6 floats: a total of 156 bytes loaded and 48 bytes stored.
Taking these numbers at face value, ignoring latency and other computations, and using the benchmark numbers from clpeak, I should be able to process 4.11E+09 contacts per second if I am compute bound or 2.72E+08 contacts per second if I am memory bound. I will never reach these performance numbers, but still, it is an interesting exercise.
I can pack the two D matrices however I want. I cannot change how x is stored because it is used in my linear solver.
Output_A and Output_B can be changed.
For each test I will compute with 1,024,000 contacts.
[Results table: % max GFlops, % max GB/s]
I wrote an identical implementation in OpenMP for comparison.
[Results table: % max GFlops, % max GB/s]
Basic implementation using floats. Too many function arguments, and not very fun to write.
Memory is organized on a per-contact basis. For the jacobians, memory for each row of D^T is offset by the number of contacts.
Switch up the way the memory is laid out: rather than indexing by the contact number for each constraint, transpose the data. This does mean that the order needs to be changed when storing the data.
Based on the clpeak benchmark, using float4 should give me better performance.
Note that here I am using float3, which is stored as a float4 in memory.
Preload all of the data into registers
Store values to float3; this allows the math to be written more succinctly.
Use a host pointer in the buffer calls. This allows memory already allocated on the host to be used instead of re-allocating. Also, if using a GPU or accelerator, this will copy memory to the device when used (this can be slow).
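This presumably means creating the buffers with the `CL_MEM_USE_HOST_PTR` flag. A sketch (not runnable standalone, since it needs a live OpenCL context, and the variable names here are hypothetical):

```c
/* 'context' is an existing cl_context; 'host_jacobians' is memory already
   allocated (and filled) on the host side. */
cl_int err;
cl_mem d_jacobians = clCreateBuffer(
    context,
    CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,   /* reuse the host allocation */
    num_contacts * 2 * 18 * sizeof(cl_float),
    host_jacobians,                           /* no separate host-side copy... */
    &err);
/* ...but on a discrete GPU or accelerator the data may still be transferred
   to the device when the buffer is first used, which can be slow. */
```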
float8 math and store.
Weirdly, this code causes a segmentation fault. After a bit of digging around, it turns out there is a bug in the AVX implementation for float8 (link). Adding the -fdisable-avx flag allows the code to run. Interestingly, performance does not suffer.
Knowing a bit more about my problem, I can say that the translational entries in JxB are the negatives of the entries in JxA. This is due to the way that contact jacobians are computed. The same does not hold true for the rotational components of the jacobians.
With version 2.0 I explicitly compute the three vectors in the Jacobian from the contact normal. This decreases the amount of memory I need to read in. Note that I still need to disable AVX because of the float8 store; the next version will see what happens with AVX enabled and float4 stores.
Back to using float4, meaning that I can re-enable AVX. Sadly, performance doesn't change much.