This post discusses the process I went through to optimize a sparse matrix vector multiply (SPMV) that I use in my multibody dynamics code. I wanted to document the process and the information I gathered so that it may help others, and so that I can refer back to it later.
Here is how I plan on going about this: first, I want to determine the peak performance I could hope to achieve on my testing machine. I know I will never reach peak performance, so comparing my code's performance to the theoretical maximum doesn't make sense; instead, I will rely on clpeak as my control. Then I will take my SPMV operation and try to get it to run as close as possible to what the benchmark tells me is achievable.
I plan on detailing every kernel modification I made, even when it made the code perform slower. I am by no means an expert with OpenCL (currently at 3 days of experience), but I do know a bit about writing parallel code.
Testing Setup
Hardware:
The specifications for the test rig are as follows:
1 x Supermicro 1042-LTF SuperServer
4 x AMD Opteron 6274 2.2GHz 16 core processor
| Specification | Value |
|---|---|
| Processor | AMD 6274 |
| Nickname | Interlagos |
| Clock (GHz) | 2.2 |
| Sockets/Node | 4 |
| Cores/Socket | 16 |
| NUMA/Socket | 2 |
| DP GFlops/Socket | 140.8 |
| Memory/Socket | 32 GB |
| Bandwidth/Socket | 102.4 GB/s |
| DDR3 | 1333 MHz |
| L1 cache (excl.) | 16 KB |
| L2 cache / # cores | 2 MB / 2 |
| L3 cache / # cores | 8 MB / 8 |
128 GB DDR3 ECC Registered (1333 MHz)
Memory bandwidth per channel: 1333 MHz x 64 bits / 8 = ~10.41 GB/s
I wanted to get a better understanding of the peak performance I could hope for when using OpenCL on this machine. I found a nice little benchmark called clpeak written by Krishnaraj Bhat. He has a blog post here discussing it in more detail.
On to the results!
If I take my test machine’s specifications and multiply them out I get the following values:
563.2 DP GFlops for 4 x AMD Opteron 6274
166.56 GB/s peak memory bandwidth based on 1333 MHz memory speed (~10.41 GB/s per channel x 4 channels per socket x 4 sockets)
Comparing that to the results from clpeak:
246.49 DP GFlops
51.59 GB/s memory bandwidth
This means that clpeak reaches 43.76% of the peak flop rate and 30.97% of the peak memory bandwidth, not too bad!
Problem I want to solve
My problem involves performing a sparse matrix vector multiply (SPMV) between a list of contact jacobian matrices (one for each contact) and a vector of values. The vector represents my Lagrange multipliers, but that is not important for the purposes of this discussion.
I will time my code by running the kernel 100 times and averaging the results. Because the first kernel call is usually expensive, I will run the code once outside of the timing loop to remove any bias.
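In pseudo-form, the timing loop looks something like the sketch below (simplified, with error checking omitted; omp_get_wtime stands in for whatever timer is actually used):

```c
// Simplified sketch of the timing loop: one warm-up launch, then 100 timed
// launches; returns the average time per launch in seconds.
#include <CL/cl.h>
#include <omp.h>

double time_kernel(cl_command_queue queue, cl_kernel kernel, size_t global_size)
{
    const int runs = 100;

    // Warm-up launch to absorb the expensive first call.
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);
    clFinish(queue);

    double start = omp_get_wtime();
    for (int r = 0; r < runs; r++) {
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);
    }
    clFinish(queue);   // wait for all launches before stopping the clock
    return (omp_get_wtime() - start) / runs;
}
```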
Each contact i has two jacobians (one for each body, A and B, in the contact), stored in dense form and organized as follows:
Jacobian Matrix (D_i)

| Constraint Row | Entries |
|---|---|
| Normal Constraint | x, y, z, u, v, w |
| Tangent Constraint | x, y, z, u, v, w |
| Tangent Constraint | x, y, z, u, v, w |
For each contact I need to compute:
Output_A = (D_i_A)^T * x
Output_B = (D_i_B)^T * x
Output_A and Output_B each have 6 values (3 for translation, 3 for rotation), and x has 3 values (one for each constraint row).
The total number of floating point operations is 60: 18 multiplications and 12 additions, performed twice (once for A, once for B).
I need to load 2 x 18 floats for the jacobians, load 3 floats for x, and store 2 x 6 floats, for a total of 156 bytes loaded and 48 bytes stored.
Taking these numbers at face value, ignoring latency and other computations, and using the benchmark numbers from clpeak, I should be able to process 4.11E+09 contacts per second if I am compute bound or 2.72E+08 contacts per second if I am memory bound. I will never reach these performance numbers, but it is still an interesting exercise.
Constraints
I can pack the two D matrices however I want. I cannot change how x is stored because it is used in my linear solver.
Output_A and Output_B can be changed.
Performance Table:
For each test I will compute with 1,024,000 contacts.
OpenCL:
| Version | Time [ms] | GFlops | Bandwidth [GB/s] | % max GFlops | % max GB/s |
|---------|-----------|--------|------------------|--------------|------------|
| 1.0 | 17.947 | 3.423 | 10.840 | 1.36 | 20.60 |
| 1.1 | 17.185 | 3.575 | 11.321 | 1.45 | 21.94 |
| 1.2 | 6.984 | 8.764 | 37.145 | 3.57 | 72.00 |
| 1.3 | 6.856 | 8.964 | 37.778 | 3.64 | 73.23 |
| 1.4 | 6.79 | 9.055 | 38.066 | 3.67 | 73.79 |
| 1.5 | 6.088 | 10.093 | 41.352 | 4.09 | 80.16 |
| 1.6 | 6.244 | 9.841 | 40.556 | 3.99 | 78.61 |
| 1.7 | 6.107 | 10.069 | 41.277 | 4.08 | 80.01 |
| 1.8 | 6.063 | 10.144 | 41.515 | 4.12 | 80.47 |
| 1.9 | 5.06 | 12.145 | 42.229 | 4.93 | 81.86 |
| 2.0 | 4.382 | 29.450 | 41.796 | 11.95 | 81.02 |
| 2.1 | 4.345 | 29.727 | 42.188 | 12.05 | 81.78 |
OpenMP
I wrote an identical implementation in OpenMP for comparison.
| Version | Time [ms] | GFlops | Bandwidth [GB/s] | % max GFlops | % max GB/s |
|---------|-----------|--------|------------------|--------------|------------|
| 1.0 | 17.081 | 3.611 | 11.435 | 1.46 | 22.17 |
| 1.4 | 7.187 | 8.672 | 27.461 | 3.52 | 53.23 |
| 2.0 | 5.312 | 11.846 | 35.304 | 4.81 | 68.43 |
Kernel Versions
Version 1.0
Basic implementation using scalar floats. It has too many function arguments and was not very fun to write.
Memory is organized on a per-contact basis. For the jacobians, the memory for each row in D^T is offset by the number of contacts.
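A condensed sketch along these lines captures the idea; the argument list is shortened and the exact layouts (in particular that of x) are illustrative rather than the real ones:

```c
// Sketch of a v1.0-style scalar kernel. JA/JB hold the 3x6 jacobians for
// bodies A and B; each of the 18 scalars of a contact's jacobian is strided
// by num_contacts, and the outputs are strided the same way.
__kernel void spmv_v1_0(__global const float* JA,
                        __global const float* JB,
                        __global const float* x,     // 3 multipliers per contact
                        __global float* outA,        // 6 outputs per contact
                        __global float* outB,
                        const int num_contacts)
{
    int i = get_global_id(0);
    if (i >= num_contacts) return;

    for (int j = 0; j < 6; j++) {        // rows of D^T
        float a = 0.0f;
        float b = 0.0f;
        for (int k = 0; k < 3; k++) {    // constraint rows of D
            float xk = x[i * 3 + k];     // illustrative layout for x
            a += JA[(j * 3 + k) * num_contacts + i] * xk;
            b += JB[(j * 3 + k) * num_contacts + i] * xk;
        }
        outA[j * num_contacts + i] = a;
        outB[j * num_contacts + i] = b;
    }
}
```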
Version 1.1
Switch up the way the memory is laid out: rather than indexing by the contact number for each constraint, transpose the data. This does mean that the order needs to be changed when the data is stored.
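In terms of indexing, the change is essentially the classic struct-of-arrays versus array-of-structs swap. The index expressions below are illustrative only:

```c
// Illustrative only: how element (j,k) of contact i's D^T might be addressed
// before and after the layout change.
inline float load_v1_0(__global const float* J, int i, int j, int k, int num_contacts)
{
    // v1.0-style: the values for a given (j,k) are contiguous across contacts.
    return J[(j * 3 + k) * num_contacts + i];
}

inline float load_v1_1(__global const float* J, int i, int j, int k)
{
    // v1.1-style: all 18 values of one contact are contiguous.
    return J[i * 18 + (j * 3 + k)];
}
```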
Version 1.2
Based on the clpeak benchmark, using float4 should give me better performance.
Note that here I am using float3, which is stored as a float4 in memory.
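A sketch of what the float3 variant can look like follows. The packing assumed here (six float3 values per contact per body: translational and rotational halves of the normal row and the two tangent rows, with x packed as 3 floats per contact) is illustrative and may differ from the real code:

```c
// Sketch of a v1.2-style kernel. Each float3 occupies 16 bytes, like a float4.
// Assumed JA/JB layout per contact: normal trans, normal rot,
// tangent-u trans, tangent-u rot, tangent-v trans, tangent-v rot.
__kernel void spmv_v1_2(__global const float3* JA,
                        __global const float3* JB,
                        __global const float* x,     // assumed: 3 packed floats per contact
                        __global float* outA,        // 6 floats per contact
                        __global float* outB,
                        const int num_contacts)
{
    int i = get_global_id(0);
    if (i >= num_contacts) return;

    float3 xi = vload3(i, x);   // the 3 multipliers for this contact

    // D^T * x written as a linear combination of the rows of D.
    float3 a_t = xi.x * JA[i*6 + 0] + xi.y * JA[i*6 + 2] + xi.z * JA[i*6 + 4];
    float3 a_r = xi.x * JA[i*6 + 1] + xi.y * JA[i*6 + 3] + xi.z * JA[i*6 + 5];
    float3 b_t = xi.x * JB[i*6 + 0] + xi.y * JB[i*6 + 2] + xi.z * JB[i*6 + 4];
    float3 b_r = xi.x * JB[i*6 + 1] + xi.y * JB[i*6 + 3] + xi.z * JB[i*6 + 5];

    vstore3(a_t, i * 2 + 0, outA);   // translational part of Output_A
    vstore3(a_r, i * 2 + 1, outA);   // rotational part of Output_A
    vstore3(b_t, i * 2 + 0, outB);
    vstore3(b_r, i * 2 + 1, outB);
}
```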
Version 1.3
Preload all of the data into registers (see the sketch under version 1.4 below).
Version 1.4
Store values to float3; this allows the math to be written more succinctly.
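Putting versions 1.3 and 1.4 together, the kernel body ends up looking something like the sketch below (same illustrative packing as the earlier float3 sketch, with the outputs now written through float3 pointers):

```c
// Sketch in the spirit of v1.3/v1.4: preload the jacobian rows into private
// variables (registers), then express the transpose-multiply with float3 math
// and store the results directly as float3.
__kernel void spmv_v1_4(__global const float3* JA,
                        __global const float3* JB,
                        __global const float* x,
                        __global float3* outA,   // 2 float3 per contact: trans, rot
                        __global float3* outB,
                        const int num_contacts)
{
    int i = get_global_id(0);
    if (i >= num_contacts) return;

    float3 xi = vload3(i, x);

    // v1.3: preload everything into registers before doing the math.
    float3 An_t = JA[i*6 + 0], An_r = JA[i*6 + 1];   // normal row: trans, rot
    float3 Au_t = JA[i*6 + 2], Au_r = JA[i*6 + 3];   // tangent u row
    float3 Av_t = JA[i*6 + 4], Av_r = JA[i*6 + 5];   // tangent v row
    float3 Bn_t = JB[i*6 + 0], Bn_r = JB[i*6 + 1];
    float3 Bu_t = JB[i*6 + 2], Bu_r = JB[i*6 + 3];
    float3 Bv_t = JB[i*6 + 4], Bv_r = JB[i*6 + 5];

    // v1.4: succinct float3 math and float3 stores.
    outA[i*2 + 0] = xi.x * An_t + xi.y * Au_t + xi.z * Av_t;
    outA[i*2 + 1] = xi.x * An_r + xi.y * Au_r + xi.z * Av_r;
    outB[i*2 + 0] = xi.x * Bn_t + xi.y * Bu_t + xi.z * Bv_t;
    outB[i*2 + 1] = xi.x * Bn_r + xi.y * Bu_r + xi.z * Bv_r;
}
```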
Version 1.5
Use host pointers in the buffer creation calls. This allows memory already allocated on the host to be used instead of re-allocating it. Also, if using a GPU or accelerator, the memory will be copied to the device when it is used (this can be slow).
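On the host side this just means creating the buffers with the CL_MEM_USE_HOST_PTR flag, roughly like this (names are placeholders):

```c
// Create an OpenCL buffer that wraps memory already allocated on the host.
// The runtime may use host_data directly (no extra allocation on a CPU device),
// or copy it to the device the first time the buffer is used.
#include <CL/cl.h>

cl_mem wrap_host_memory(cl_context ctx, float* host_data, size_t bytes, cl_int* err)
{
    return clCreateBuffer(ctx,
                          CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                          bytes,
                          host_data,
                          err);
}
```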
Version 1.6
float16 math
Version 1.7
float8 math
Version 1.8
float8 math and store.
Weirdly, this code causes a segmentation fault. After a bit of digging around, it turns out there is a bug in the AVX implementation for float8 (link). Adding the -fdisable-avx flag allows the code to run. Interestingly, the performance does not suffer.
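For versions 1.7 and 1.8 the idea is to work on a whole row of D at once. One way to do that, assuming each 6-entry row and each 6-value output block is padded out to 8 floats (the padding scheme here is illustrative), is sketched below:

```c
// Sketch in the spirit of v1.7/v1.8: each row of a body's D (x,y,z,u,v,w) is
// padded to a float8, and each 6-value output block is padded to 8 floats.
__kernel void spmv_v1_8(__global const float8* JA,   // 3 float8 rows per contact
                        __global const float8* JB,
                        __global const float* x,
                        __global float* outA,         // 8 floats per contact (padded)
                        __global float* outB,
                        const int num_contacts)
{
    int i = get_global_id(0);
    if (i >= num_contacts) return;

    float3 xi = vload3(i, x);

    // D^T * x as a linear combination of the (padded) rows of D.
    float8 a = xi.x * JA[i*3 + 0] + xi.y * JA[i*3 + 1] + xi.z * JA[i*3 + 2];
    float8 b = xi.x * JB[i*3 + 0] + xi.y * JB[i*3 + 1] + xi.z * JB[i*3 + 2];

    vstore8(a, i, outA);   // the float8 store of v1.8; the last 2 lanes are padding
    vstore8(b, i, outB);
}
```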
Version 1.9
Knowing a bit more about my problem, I can say that the entries in JxB are the negatives of the entries in JxA. This is due to the way that contact jacobians are computed. The same does not hold true for the rotational components of the jacobians.
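In other words, the translational block of body B's jacobian never needs to be loaded; its contribution is just the negation of body A's. A sketch using the same illustrative float3 packing as before (the measured 1.8/1.9 kernels actually keep the float8 store):

```c
// Sketch in the spirit of v1.9: only the rotational rows of body B are read;
// the translational part of Output_B is the negation of Output_A's.
__kernel void spmv_v1_9(__global const float3* JA,      // 6 float3 per contact
                        __global const float3* JB_rot,  // 3 rotational float3 per contact
                        __global const float* x,
                        __global float3* outA,
                        __global float3* outB,
                        const int num_contacts)
{
    int i = get_global_id(0);
    if (i >= num_contacts) return;

    float3 xi = vload3(i, x);

    float3 a_t = xi.x * JA[i*6 + 0] + xi.y * JA[i*6 + 2] + xi.z * JA[i*6 + 4];
    float3 a_r = xi.x * JA[i*6 + 1] + xi.y * JA[i*6 + 3] + xi.z * JA[i*6 + 5];
    float3 b_r = xi.x * JB_rot[i*3 + 0] + xi.y * JB_rot[i*3 + 1] + xi.z * JB_rot[i*3 + 2];

    outA[i*2 + 0] = a_t;
    outA[i*2 + 1] = a_r;
    outB[i*2 + 0] = -a_t;   // translational block of D_B is the negative of D_A's
    outB[i*2 + 1] = b_r;
}
```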
Version 2.0
With 2.0 I explicitly compute the three vectors in the Jacobian from the contact normal. This decreases the amount of memory I need to read in. Note that I still need to disable AVX because of the float8 store. The next version will see what happens with AVX enabled and float4 stores.
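The core of this is building two tangent directions from the contact normal inside the kernel, so the translational rows never have to be streamed from memory. The helper below is one standard construction, not necessarily the exact one my code uses, and the sign conventions for the A and B blocks are omitted:

```c
// Build two tangent directions u, v orthogonal to the contact normal n.
// The translational blocks of the contact jacobian then follow directly
// from n, u and v, so they no longer need to be loaded from memory.
void basis_from_normal(float3 n, float3* u, float3* v)
{
    // Pick a helper axis that is not nearly parallel to n.
    float3 axis = (fabs(n.x) > 0.9f) ? (float3)(0.0f, 1.0f, 0.0f)
                                     : (float3)(1.0f, 0.0f, 0.0f);
    *u = normalize(cross(n, axis));
    *v = cross(n, *u);
}
```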
Version 2.1
Back to using float4 stores, meaning that I can re-enable AVX. Sadly, performance doesn't change much.
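Roughly, the change is to write the padded 8-float output block with two float4 stores instead of one float8 store:

```c
// Illustrative: store the padded 8-float result with two float4 stores instead
// of a single float8 store, side-stepping the buggy AVX float8 path.
void store_padded_result(float8 a, __global float* out, int i)
{
    vstore4(a.lo, 2 * i + 0, out);   // first four of the six values
    vstore4(a.hi, 2 * i + 1, out);   // last two values plus two lanes of padding
}
```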