OpenCL: Running a KERNEL

CODE: 1-openclTest.cpp  1-openClUtilities.cpp  1-openClUtilities.h  Makefile zeroValuesKernel.cl

 PS: rename the file "1-Makefile" to simply "Makefile" (without quotes).

 

 

Now, everyone wants to make really good use of the GPUs we all have in almost every computer. From a Raspberry Pi to entire servers full of GPUs, passing through the very powerful HD video card in your Desktop, the way to fully exploit this extra power is through Kernels. I mean, a Kernel is a little portion of your sequential code, rewritten for parallel execution.

 

This parallelization process is complex and full of little steps. It only makes sense to parallelize something that is really occupying the CPU (the processor) of your machine, so first you need to profile your code (*1). After profiling it, and figuring out which functions are worth parallelizing, you need to adapt your program to use OpenCL, and prepare it to read, parse, and execute the extra parallel function you created.

 

I am not going to cover how to parallelize your code; you need to study how to approach the problem in an entirely new and efficient way. What I am going to do is provide a full example of how to execute a Kernel.

 

To use the GPU/CPU through OpenCL, one needs (a minimal sketch of the first steps follows right after the list):

-> A context (linked to a device)

-> A Program         (Program to host Kernels, that will compile code for different devices)

-> A Kernel         (method running on the device)

-> A Command Queue (to operate a device, either with FIFO or Events)

-> A Buffer          (Allocation in the Global Mem. of a Device -- Linked to a context + device)

-> Write to Buffer  (passing arguments into the Global Memory of the Device)

-> Execute Kernel        (enqueueing the parallel execution in the GPU/CPU)

-> Read from buffer   (Get the result from the device back into the HOST program)
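For reference, here is a minimal sketch of the first items on that list (picking a device, creating a Context and a Command Queue), assuming the plain OpenCL C API and the CL/cl.h header (OpenCL/opencl.h on OSX); error handling is reduced to a tiny helper so the sequence stays visible:

    #include <CL/cl.h>
    #include <cstdio>
    #include <cstdlib>

    // Abort on any OpenCL error, printing which step failed.
    static void check(cl_int err, const char *step) {
        if (err != CL_SUCCESS) {
            fprintf(stderr, "%s failed (error %d)\n", step, err);
            exit(EXIT_FAILURE);
        }
    }

    int main() {
        cl_int err;

        // Pick the first platform and the first GPU device on it.
        cl_platform_id platform;
        check(clGetPlatformIDs(1, &platform, NULL), "clGetPlatformIDs");
        cl_device_id device;
        check(clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL),
              "clGetDeviceIDs");

        // A Context linked to that device.
        cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
        check(err, "clCreateContext");

        // A Command Queue to operate the device (in-order / FIFO by default).
        cl_command_queue queue = clCreateCommandQueue(context, device, 0, &err);
        check(err, "clCreateCommandQueue");

        // ... Program, Kernel, Buffers and execution go here (see below) ...

        clReleaseCommandQueue(queue);
        clReleaseContext(context);
        return 0;
    }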

 

I know, it is much more complicated than CUDA. An alternative could be the Simple-OpenCL library, but I don't know how well it ports to non-Linux systems, and therefore I refuse to use it.

 

There is not much to say about the first steps. This is the logic of OpenCL:

[Figure: OpenCL Steps]

 

There is, above all, a Context. This context owns a queue for Commands, the Program handler, the binary Kernels compiled for a specific device (or devices), etc. Next, you have the creation of the Program and the Kernel. Why do you need a Program if you also need a Kernel handler? Well, since OpenCL is so generic, the Program will basically be your runtime compiler, INSIDE your program; if it succeeds, you end up with a binary compiled for a specific device (GPU, CPU, etc., whatever the device of the context is). For this, you have mainly two ways of passing your parallel program (the Kernel):

 

1) Source String:
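A sketch of this first option, continuing from the context created above (the kernel name addTen is just illustrative, not necessarily the one used in the downloadable code): the whole kernel lives in a C string and goes straight to clCreateProgramWithSource.

    // Kernel source embedded directly in the host code as a string.
    const char *kernelSource =
        "__kernel void addTen(__global int *data) {  \n"
        "    int i = get_global_id(0);               \n"
        "    data[i] = data[i] + 10;                 \n"
        "}                                           \n";

    cl_program program =
        clCreateProgramWithSource(context, 1, &kernelSource, NULL, &err);
    check(err, "clCreateProgramWithSource");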

 

2) Kernel File with .cl extension:
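A sketch of the second option, again continuing from the same context: the only difference is that the source is read from the .cl file at runtime before being handed to clCreateProgramWithSource; then the Program is built and the Kernel handle is created (the kernel name addTen is, once more, just illustrative):

    // Needs <fstream>, <sstream> and <string>.
    // Read the whole .cl file into a string at runtime.
    std::ifstream file("zeroValuesKernel.cl");
    std::stringstream ss;
    ss << file.rdbuf();
    std::string src = ss.str();
    const char *source = src.c_str();
    size_t sourceLen = src.size();

    cl_program program =
        clCreateProgramWithSource(context, 1, &source, &sourceLen, &err);
    check(err, "clCreateProgramWithSource");

    // The runtime compilation, for the device(s) of the context.
    check(clBuildProgram(program, 1, &device, NULL, NULL, NULL), "clBuildProgram");

    // A handle to one kernel function defined inside the program.
    cl_kernel kernel = clCreateKernel(program, "addTen", &err);
    check(err, "clCreateKernel");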

 

 

I clearly prefer the second option, for the sake of code organization and visibility. After this comes the interesting part, which is starting to give commands to the device (GPU, ...). What commands? Well, you need to allocate memory for the arguments of your parallel function, you need to copy those arguments to the device, you need to start the execution on the device, and, after that, you need to retrieve the data back from the device to the HOST (your sequential program). For all of this, you create a Command Queue and the several Buffers (those allocations I told you about).
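Continuing the same sketch (the variable names are mine, not necessarily those of the downloadable code), those commands map to the following calls; CL_TRUE makes the write and read blocking, so no extra synchronization is needed here:

    // Needs <vector>.
    const size_t N = 33554432;                       // 128 MB of ints
    std::vector<int> hostData(N);
    for (size_t i = 0; i < N; ++i) hostData[i] = (int)i;

    // A Buffer: an allocation in the global memory of the device.
    cl_mem buffer = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                   N * sizeof(int), NULL, &err);
    check(err, "clCreateBuffer");

    // Write to Buffer: copy the argument from the HOST into device memory.
    check(clEnqueueWriteBuffer(queue, buffer, CL_TRUE, 0, N * sizeof(int),
                               hostData.data(), 0, NULL, NULL),
          "clEnqueueWriteBuffer");

    // Tell the Kernel which buffer is its argument.
    check(clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffer), "clSetKernelArg");

    // Execute Kernel: one work-item (thread) per element.
    size_t globalSize = N;
    check(clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL,
                                 0, NULL, NULL),
          "clEnqueueNDRangeKernel");

    // Read from buffer: get the result back into the HOST program.
    check(clEnqueueReadBuffer(queue, buffer, CL_TRUE, 0, N * sizeof(int),
                              hostData.data(), 0, NULL, NULL),
          "clEnqueueReadBuffer");

    clReleaseMemObject(buffer);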

 

 

Instead of posting a bunch of lines of code here, I am giving you the complete code. It is fully commented, so you will be able to understand the setup sequence I described here. The example generates an array of 128 MB of sequential numbers (0, 1, 2, 3, ..., 33554431), which is pretty much the allocation limit for a single object on my GPU (1/4 of its 512 MB of RAM per maximum allocation). If you don't know the maximum allocation of your device, execute the initial code and check your "Global Memory Size"; from that, you can allocate 1/4 of it in one allocation. Then I add the value 10 to each element in a parallel way, so I end up having one thread for each element doing that sum.
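The kernel itself only has to add 10 to the one element that corresponds to its own thread; it looks essentially like this (the exact version is in the zeroValuesKernel.cl file of the download, and the function name here is just illustrative):

    // Inside the .cl file -- each work-item (thread) handles exactly one element.
    __kernel void addTen(__global int *data)
    {
        int i = get_global_id(0);    // this thread's index into the array
        data[i] = data[i] + 10;      // 0 -> 10, 1 -> 11, ..., 33554431 -> 33554441
    }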

 

The expected result is 10, 11, 12, 13, ..., 33554441. And, in fact, here it is, as expected:
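If you want to verify the result on the host side instead of eyeballing the output, a quick check like this can be appended to the sketch above:

    // Verify: element i must now hold i + 10.
    bool ok = true;
    for (size_t i = 0; i < N; ++i) {
        if (hostData[i] != (int)(i + 10)) { ok = false; break; }
    }
    printf("First: %d  Last: %d  ->  %s\n",
           hostData[0], hostData[N - 1], ok ? "as expected" : "WRONG");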

 

 

 

HAVE FUN, parallelizing your codes! =)

 

 

 

(*1) -> How to profile the code is an entirely different topic, but here are my recommended tools:

OSX   -> I hate them all. I recommend installing a virtual machine and profiling elsewhere, or measuring the times yourself with a timer. An alternative is the Instruments tool, but it didn't work for me.

LINUX -> Use Valgrind if you coded in C/C++. You can also use Qt Creator for more options, like checking for memory leaks.

WIN     -> I would definitely use Visual Studio.
