OpenCL Kernel Setup
Most of the content in this post used to be part of another post. I felt it was important enough to have its own post, so I moved it and made some minor changes.
For a working example, please see the following GitHub repository
Setting up OpenCL
Getting started with OpenCL isn’t very straightforward. Compared to OpenMP and CUDA, where you need 1-2 lines to run a function in parallel, OpenCL needs 30-40 lines of code just to get started.
The process of starting up OpenCL can be split into several parts:
- Getting our OpenCL platform
  - A platform specifies the OpenCL implementation
    - EX: AMD, Intel, NVIDIA, Apple are all valid platforms
  - Multiple platforms can exist on a single machine
  - Apple is a special case: they have a custom OpenCL implementation
- Get the devices for a platform
  - Enumerate all of the devices for a specified platform
    - EX: AMD CPU, AMD APU, AMD GPU
  - On an Apple platform you might see:
    - Intel CPU, Intel Integrated GPU, NVIDIA GPU
- Create an OpenCL context for a specified device
  - Can create multiple contexts, one for each device
- Create a Command Queue
  - This queue is used to specify operations such as kernel launches and memory copies.
  - Operations sent to the queue can be executed in order or out of order; the user controls this when the queue is created.
- Create our Program for a specified context
  - Read in our kernel as a string
  - Create a program from this kernel
  - Compile the program for our device
- Create a kernel from our program
  - A program can have multiple kernel functions inside of it. This specifies which one we want to run.
- Specify arguments to the kernel
  - Provide a pointer and argument number for the kernel.
And there we are!
At this point we can allocate memory, copy it to the device, and run our kernel as we would in CUDA.
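One of the steps listed above, reading the kernel in as a string, needs no OpenCL calls at all. Here is a minimal sketch in plain C; the helper name `read_kernel_source` and the idea of keeping the kernel in its own file are assumptions made for illustration, not something taken from the repository.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads the whole file at `path` into a
 * NUL-terminated heap buffer. Returns NULL on failure; the caller
 * frees the result. If out_len is non-NULL it receives the number
 * of bytes actually read. */
char *read_kernel_source(const char *path, size_t *out_len)
{
    FILE *fp = fopen(path, "rb");
    if (!fp) return NULL;

    char *buf = NULL;
    long size = -1;

    /* Seek to the end to learn the file size, then rewind. */
    if (fseek(fp, 0, SEEK_END) == 0) size = ftell(fp);
    if (size >= 0 && fseek(fp, 0, SEEK_SET) == 0) {
        buf = malloc((size_t)size + 1);
        if (buf) {
            size_t got = fread(buf, 1, (size_t)size, fp);
            buf[got] = '\0'; /* terminate so it can be used as a C string */
            if (out_len) *out_len = got;
        }
    }
    fclose(fp);
    return buf;
}
```

The returned string is what you would hand to `clCreateProgramWithSource` when creating the program.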
Example Code
The following example code can be simplified if the target platform/device numbers are known ahead of time. This isn’t always the case, so we must first count the platforms/devices and then pick which one we want. I am not going to go into specifics about some of the options in the code below; I will leave that for a different post.
- Getting our OpenCL platform
- Get the devices for a platform
- Create an OpenCL context for a specified device
- Create a Command Queue (with profiling enabled, needed for timing kernels)
- Create our Program for a specified context
- Build the program
- Create a kernel from our program
- Specify arguments to the kernel
- Run the Kernel
I have glossed over some of the implementation details, which need to be dealt with on a case-by-case basis. This gives an idea of the steps involved in getting a kernel running.