cf4ocl (C Framework for OpenCL)
v2.1.0
Object-oriented framework for developing and benchmarking OpenCL projects in C/C++
Developing an application with cf4ocl.
This tutorial is based on the canon example available in the examples folder. The goal is to add two vectors, a and b, as well as a constant d, and save the result in a third vector, c. The OpenCL kernel which performs this operation is given in the following code:
For the purpose of this tutorial, we'll assume the kernel code is in a file called mysum.cl, and the host code in mysum.c.
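As a rough guide to what each work-item in such a kernel computes, the operation can be sketched on the host in plain C. This is only an illustration under assumptions: the name sum_reference and the choice of uint32_t as the element type are not taken from the example itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Host-side reference for the operation described above:
 * c[i] = a[i] + b[i] + d for every element. Each loop iteration
 * corresponds to what one OpenCL work-item would compute.
 * sum_reference is a hypothetical helper, not a cf4ocl function. */
void sum_reference(const uint32_t *a, const uint32_t *b, uint32_t *c,
                   uint32_t d, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i] + d;
    }
}
```

Calling sum_reference(a, b, c, 5, 3) with a = {1, 2, 3} and b = {10, 20, 30} leaves c = {16, 27, 38}. A reference like this is also handy for validating what the device returns.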
The cf4ocl header should be included at the beginning of the mysum.c file:
The next step is to create a context with an OpenCL device where we can perform our computation. cf4ocl has several constructor functions for creating contexts with different types of devices, some very simple, some very flexible. For example, ccl_context_new_from_menu() allows the user to select the OpenCL device if more than one is available in the system, and returns a context containing the selected device:
Where we pass NULL, we could have passed an error management object, which we'll discuss in detail further ahead. Error-throwing cf4ocl functions signal errors in two ways: 1) using the return value; and 2) populating the error management object. In this case, because we're not passing this object, we have to rely on the return value to check for errors. A NULL return value indicates an error in all cf4ocl constructors:
In cf4ocl, all objects created with new constructors must be released using the respective destroy destructors. For contexts, this is the ccl_context_destroy() function:
We now have a compilable, leak-free cf4ocl program, although it doesn't do very much yet:
We can compile the program with gcc (or clang), and run it:
With Clang the command is the same; just replace gcc with clang.
The goal of the program is to sum two vectors and a constant. As such, we need to declare three host vectors, two of which will be initialized with some values to sum, and the third one will be used to hold the final result. We also need to declare the respective device buffers and the constant.
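The host side of this step might look like the sketch below. The vector length VECSIZE, the fill pattern and the helper name init_host_vectors are assumptions made for illustration; they are not taken from the example.

```c
#include <stddef.h>
#include <stdint.h>

#define VECSIZE 8 /* Assumed vector length, chosen only for this sketch. */

/* Fill the two input vectors with known values so that the result of
 * the sum is easy to predict: vec_a[i] = i and vec_b[i] = 2 * i.
 * init_host_vectors is a hypothetical helper, not a cf4ocl function. */
void init_host_vectors(uint32_t *vec_a, uint32_t *vec_b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        vec_a[i] = (uint32_t) i;
        vec_b[i] = (uint32_t) (2 * i);
    }
}
```

In host code this would be used after declaring uint32_t vec_a[VECSIZE], vec_b[VECSIZE] and vec_c[VECSIZE], with the constant declared alongside them, e.g. uint32_t d = 3. The third vector is left uninitialized because it only receives the result.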
Two of the device buffers, a and b, will be initialized with the values of the corresponding host vectors:
Don't forget the destructors at the end:
The complete program so far can be compiled and executed. It still doesn't do anything useful, but we're getting close.
The ccl_queue_new() constructor provides the simplest way to create a command queue. However, command queue creation requires a device. The context contains a reference to the selected device, which can be fetched using the ccl_context_get_device() function:
There's no need to release the device object. It will be automatically released when the context is destroyed, in accordance with the new/destroy rule.
Now we can create the command queue. We don't require any special queue properties for now, so we pass 0 as the third argument to ccl_queue_new():
Both ccl_context_get_device() and ccl_queue_new() expect an error handling object as the last argument. By passing NULL, we must rely on the return value of these functions in order to check for errors. Here's the complete code so far:
Compile and run the code. As expected, nothing special happens yet.
cf4ocl provides several constructors for creating program objects. When a single OpenCL C kernel source file is involved, the most suitable constructor is ccl_program_new_from_source_file():
Building the program object is just as easy. For this purpose, we use the ccl_program_build() function, which returns CL_TRUE if the build is successful or CL_FALSE otherwise:
Here's the current state of our code:
Compile and run the code. Don't forget to put the mysum.cl file containing the kernel source code in the same folder, otherwise the program object will not be successfully created.
cf4ocl greatly simplifies the execution of OpenCL programs. Instead of creating a kernel with clCreateKernel(), setting kernel arguments one-by-one with clSetKernelArg(), and finally executing the kernel with clEnqueueNDRangeKernel(), cf4ocl allows client code to do this using a single function:
Buffer, image and sampler objects can be passed directly as kernel arguments. Local and private variables are passed using the ccl_arg_local() and ccl_arg_priv() macros, respectively.
A local work size vector is expected as the 6th argument to ccl_program_enqueue_kernel(). In this example, we pass NULL, which means that the local work size is to be automatically determined by the OpenCL implementation (as specified in the clEnqueueNDRangeKernel() documentation). Often we need more control over this value, because OpenCL implementations don't let us know what local work size was effectively used. It can be a bit of a chore to determine a local work size, especially when multiple dimensions are involved. Among other things, it's necessary to check kernel and device limits. ccl_kernel_suggest_worksizes() is a versatile function which can help in this regard.
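One routine part of choosing work sizes by hand is making the global work size a multiple of the chosen local work size, which OpenCL 1.x requires when a local size is given explicitly. The helper below is a minimal sketch of that rounding step only; round_up_gws is a hypothetical name, and this is not a description of how ccl_kernel_suggest_worksizes() works internally.

```c
#include <stddef.h>

/* Round a requested number of work-items up to the nearest multiple of
 * a chosen local work size. round_up_gws is a hypothetical helper for
 * illustration; it is not part of cf4ocl or OpenCL. */
size_t round_up_gws(size_t work_items, size_t lws)
{
    if (lws == 0) {
        return work_items; /* No local size chosen: leave unchanged. */
    }
    size_t remainder = work_items % lws;
    return remainder == 0 ? work_items : work_items + (lws - remainder);
}
```

For 1000 elements and a local work size of 64 this yields 1024, so the kernel needs a bounds check against the real vector size to keep the 24 surplus work-items from touching memory past the end of the vectors.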
While ccl_program_enqueue_kernel() simplifies executing a kernel (including setting its arguments), cf4ocl provides additional functions which give client code finer control over this process.
To check the results of the kernel execution, it's necessary to read the contents of device buffer c into host buffer vec_c:
Now we can check the results:
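A plain-C sketch of such a check is shown below. It assumes the host vectors hold 32-bit unsigned integers and that the inputs are still available on the host. The helper name first_mismatch and the names vec_a, vec_b and d are illustrative; only vec_c appears in the text above.

```c
#include <stddef.h>
#include <stdint.h>

/* Compare the values read back from the device (vec_c) with the sum
 * recomputed on the host. Returns the index of the first element that
 * differs, or n when every element matches.
 * first_mismatch is a hypothetical helper, not a cf4ocl function. */
size_t first_mismatch(const uint32_t *vec_a, const uint32_t *vec_b,
                      const uint32_t *vec_c, uint32_t d, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t expected = (uint32_t) (vec_a[i] + vec_b[i] + d);
        if (vec_c[i] != expected) {
            return i;
        }
    }
    return n;
}
```

A return value equal to n means the device result agrees with the host computation. Any smaller value points at the first offending element, which is usually more helpful when debugging than a bare pass/fail flag.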
We now have a fully working OpenCL program, simplified with cf4ocl:
Our code may be correctly implemented, but a number of OpenCL errors can still occur. Checking the return values of cf4ocl functions allows us to determine that something went wrong, but not what went wrong. Fortunately, all error-throwing cf4ocl functions accept the memory location of a CCLErr error handling object, usually as the last argument. If given, this object will be populated with an error code and an error domain, as well as an error message, if an error occurs. In our program we're passing NULL where the CCLErr object memory location was expected, so no information about errors is made available to the caller. To change the way we're handling errors, we must first declare a CCLErr object, and initialize it to NULL:
Then, we should pass the memory location of this object to all cf4ocl error-throwing functions, e.g.:
Now we can check this object after cf4ocl function calls. Let's create a small macro to do so:
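A sketch of such a macro follows. To stay self-contained it uses a stand-in type, DemoErr, carrying the three pieces of information mentioned above (domain, code and message). DemoErr, HANDLE_ERROR, demo_step and run_steps are all hypothetical names, and real code would work with a CCLErr pointer instead.

```c
#include <stddef.h>
#include <stdio.h>

/* Stand-in for an error object; in real code this role is played by
 * CCLErr. The field layout here is an assumption for illustration. */
typedef struct {
    int domain;
    int code;
    const char *message;
} DemoErr;

/* If err is set, report it and leave the enclosing function with a
 * non-zero status. Wrapped in do/while so it acts as one statement. */
#define HANDLE_ERROR(err) \
    do { \
        if ((err) != NULL) { \
            fprintf(stderr, "Error (domain %d, code %d): %s\n", \
                (err)->domain, (err)->code, (err)->message); \
            return 1; \
        } \
    } while (0)

/* Hypothetical error-throwing function: on failure it populates *err,
 * mimicking how the caller's error pointer gets filled in. */
static void demo_step(int fail, DemoErr **err)
{
    static DemoErr failure = { 1, -5, "demo failure" };
    if (fail && err != NULL) {
        *err = &failure;
    }
}

/* Start with a NULL error pointer, pass its address to the call and
 * check it right afterwards. Returns 0 on success, 1 on error. */
int run_steps(int fail)
{
    DemoErr *err = NULL;
    demo_step(fail, &err);
    HANDLE_ERROR(err);
    return 0;
}
```

The pattern is the same one described above: the pointer starts out as NULL, its address is handed to every error-throwing call, and a non-NULL value afterwards means something went wrong. With a real, dynamically allocated error object, the object would typically also need to be released once it has been reported.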
We can also remove the status and evt variables, because we don't rely on them anymore for error checking. Here's the complete code, with more informative error checking:
The error checking strategy in our code is just an example. Client code can implement any strategy. More information about this topic is available in the user guide.
Profiling OpenCL code by hand, i.e. by gathering and processing information about all cl_event objects, is an extremely verbose and error-prone process. However, it comes for free with cf4ocl. Well, mostly. The first change we must perform on our program is to enable profiling when the command queue is created:
Additionally, we copied the host vectors to the device implicitly during device buffer creation. These implicit transfers don't produce OpenCL events, and are thus unavailable for profiling. As such, we should make these operations explicit:
We can now profile our code using functionality provided by the profiler module:
Our final code, with error checking and profiling, is as follows:
Compile and run the code. A profiling summary will be printed on the screen, something like:
Aggregate times by event :
------------------------------------------------------------------
| Event name                     | Rel. time (%) | Abs. time (s) |
------------------------------------------------------------------
| NDRANGE_KERNEL                 |       64.8583 |    1.7993e-05 |
| WRITE_BUFFER                   |       31.8254 |    8.8290e-06 |
| READ_BUFFER                    |        3.3163 |    9.2000e-07 |
------------------------------------------------------------------
                                 |         Total |    2.7742e-05 |
                                 ---------------------------------
Event overlaps : None
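The relative times in such a summary are each event's share of the total absolute time. The sketch below redoes that arithmetic in plain C; relative_times is a hypothetical helper and does not reflect how the profiler module produces its output.

```c
#include <stddef.h>

/* Compute each entry's share of the total, in percent, and return the
 * total. abs_time holds absolute times in seconds; rel receives the
 * percentages. relative_times is a hypothetical helper. */
double relative_times(const double *abs_time, double *rel, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        total += abs_time[i];
    }
    for (size_t i = 0; i < n; i++) {
        rel[i] = total > 0.0 ? 100.0 * abs_time[i] / total : 0.0;
    }
    return total;
}
```

Feeding in the three absolute times from the summary above (1.7993e-05, 8.8290e-06 and 9.2000e-07 seconds) gives a total of about 2.7742e-05 s and shares of roughly 64.86 %, 31.83 % and 3.32 %, in line with the reported figures.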
While this profiling summary is useful, the profiler module can do much more.
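To give a feel for what an overlap check involves, the sketch below models an event as a pair of start and end timestamps in nanoseconds (the unit OpenCL uses for event profiling information) and computes how long two events ran concurrently. EventSpan, span_seconds and span_overlap_ns are hypothetical names for illustration, not the profiler module's API.

```c
#include <stdint.h>

/* Start and end instants of one event, in nanoseconds. */
typedef struct {
    uint64_t start_ns;
    uint64_t end_ns;
} EventSpan;

/* Duration of an event in seconds. */
double span_seconds(EventSpan e)
{
    return (double) (e.end_ns - e.start_ns) * 1e-9;
}

/* Time, in nanoseconds, during which both events were running.
 * Returns 0 when the two intervals do not intersect. */
uint64_t span_overlap_ns(EventSpan x, EventSpan y)
{
    uint64_t start = x.start_ns > y.start_ns ? x.start_ns : y.start_ns;
    uint64_t end = x.end_ns < y.end_ns ? x.end_ns : y.end_ns;
    return end > start ? end - start : 0;
}
```

A summary reporting no overlaps corresponds to span_overlap_ns() returning 0 for every pair of events, which is what one would expect when all commands go through a single in-order queue.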