Wednesday, January 22, 2014

Getting the Specifications of CPU and GPU cards, OpenCL + OSX

Code: openclTest.cpp    Makefile




For several reasons I chose to have a go with OpenCL. Being mainly a high-performance computing engineer, and even though I prefer NVIDIA CUDA, OpenCL is very interesting, and a very nice treat where parallelism, portability, and general-purpose computation are concerned. So I grabbed my 27" iMac and started jumping around this Quick General Guide from iVEC.


For other systems like Linux and Windows you need to install OpenCL explicitly, but on OS X it already comes embedded, so you only need to learn how to link your program against it. If you're on OS X like me, and you love things uncomplicated and wrapped in a pleasant-to-the-eye layer, all you need to do is use Xcode, create your project, and add the framework in this place:

[Screenshot: adding the OpenCL framework to an Xcode project]


If you're like me, though, you prefer the Linux style: everything raw but uncomplicated. For me, programming is all about the program in text files plus a Makefile. It is clean, with absolutely no extra garbage, and easier to port to other platforms. So grab a C/C++ compiler and add the flag "-framework OpenCL", and that's it. It is different from Linux, but at least I am telling you the secret here: no secret extra hours of searching the Web for why you're getting this or that compile error, no adding OpenCL ports and tons of extra lines specifying include paths and library paths.


To compile from the console: g++ -Wall -g -o openclTest openclTest.cpp -framework OpenCL

To execute: ./openclTest


The result, for my iMac, is the following:

[Screenshot: OpenCL device specifications reported on OS X]





There is also some sample code that usually comes with the AMD APP SDK, called CLInfo. A snippet of its output:


$ ./CLInfo

Number of platforms:
Platform Profile:
Platform Version:               OpenCL 1.2 AMD-APP (938.1)
Platform Name:                  AMD Accelerated Parallel Processing
Platform Vendor:                Advanced Micro Devices, Inc.

Number of devices:

Device Type:
Board name:                     AMD Radeon HD 7900 Series
Device Topology:                PCI[ B#1, D#0, F#0 ]
Max compute units:
Max work items dimensions:
Max work group size:
Preferred vector width char:
Local memory type:
Local memory size:
Vendor:                         Advanced Micro Devices, Inc.
Device OpenCL C version:        OpenCL C 1.2
Driver version:                 CAL 1.4.1741 (VM)
Max Constant Args:
Global Mem Cache Size:

Device Type:
Device ID:
Max compute units:
Max work group size:
Name:                           Intel(R) Core(TM) i5 CPU 680 @ 3.60 GHz
Device OpenCL C version:        OpenCL C 1.2
Global Mem Cache Size:


CLInfo retrieves most of the specifications of all the OpenCL-capable devices in your system. I am posting the full code here, so it is just compile and run. That said, I personally find the code a bit bitter for a beginner's eyes, so I recommend using it only to get all the specs you need (or just to try things out). Other than that, I recommend the other code examples I've put here, as they're friendlier, easier to understand, and faster to copy and reuse for your needs.


Code: clInfo.cpp

Published by fxsf at 13:18
Tuesday, January 21, 2014

OpenCL: Running a KERNEL

CODE: 1-openclTest.cpp  1-openClUtilities.cpp  1-openClUtilities.h  Makefile zeroValuesKernel.cl

 PS: rename the file "1-Makefile" to simply "Makefile" (without quotes).



Now, everyone wants to make really good use of the GPUs we all have, in almost every computer. From a Raspberry Pi to complete servers full of GPUs, passing through the very powerful HD video card in your desktop, the way to fully exploit this extra power is through kernels. A kernel is a little portion of your sequential code, rewritten for parallel execution.


This parallelization process is complex and full of little steps. It only makes sense to parallelize something that is really occupying the CPU (the processor) of your machine, so you first need to profile your code (*1). After profiling it and figuring out which functions are worth parallelizing, you need to adapt your program to use OpenCL and prepare it to read, parse, and execute the extra parallel function you created.


I am not going to cover how to parallelize some code; you need to study how to approach the problem in an entirely new and efficient way. What I am going to do is provide a full example of how to execute a Kernel.


To use the GPU/CPU through OpenCL, one needs:

-> A context (linked to a device)

-> A Program         (Program to host Kernels, that will compile code for different devices)

-> A Kernel         (method running in the device)

-> A Command Queue (to operate a device, either with FIFO or Events)

-> A Buffer          (Allocation in the Global Mem. of a Device -- Linked to a context + device)

-> Write to Buffer  (passing arguments into the Global Memory of the Device)

-> Execute Kernel        (enqueueing the parallel execution in the GPU/CPU)

-> Read from buffer   (Get the result from the device back into the HOST program)


I know, it is much more complicated than CUDA. An alternative might be the Simple-OpenCL library, but I don't know how well it ports to non-Linux systems, and therefore I refuse to use it.


There is not much to say about the first steps. This is the logic of OpenCL:

[Diagram: the sequence of OpenCL setup steps]


There is, above all, a Context. This context owns a queue for commands, the program handler, the binary kernels compiled for a specific device, etc. Next comes the creation of the Program and the Kernel. Why do you need a Program if you also need a Kernel handler? Well, since OpenCL is so generic, the Program will basically be your runtime compiler, INSIDE your program; if it succeeds, you end up with a binary compiled for a specific device (GPU, CPU, whatever the device of the context is). For this, you have mainly two ways of passing in your parallel program (the Kernel):


1) Source String:


2) Kernel File with .cl extension:
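With this option the kernel sits in its own .cl file (like zeroValuesKernel.cl above) and the host reads it into a string at runtime before handing it to the Program. A minimal loader sketch; the function name loadKernelFile is mine, not from the original utilities:

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Read a whole .cl kernel file into a string; its c_str() pointer can then
// be passed to clCreateProgramWithSource exactly like an inline source string.
std::string loadKernelFile(const std::string &path) {
    std::ifstream file(path.c_str());
    if (!file)
        throw std::runtime_error("cannot open kernel file: " + path);
    std::ostringstream contents;
    contents << file.rdbuf();   // slurp the entire file
    return contents.str();
}
```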



I clearly prefer the second option, for the sake of code organization and visibility. After this comes the interesting part: giving commands to the device (GPU, ...). What commands? Well, you need to allocate memory for the arguments of your parallel function, copy those arguments to the device, start the execution on the device, and, after that, retrieve the data back from the device to the HOST (your sequential program). For all of this, you create a Command Queue and the several Buffers (those allocations I told you about).



Instead of posting a bunch of lines of code here, I am giving you the complete code. It is fully commented, so you will be able to follow the setup sequence I described here. The example generates an array of 128 MB of sequential numbers (0, 1, 2, 3, ..., 33554431), which is pretty much the allocation limit for a single object on my GPU (1/4th of the 512 MB of RAM, per maximum allocation). If you don't know the maximum allocation of your device, run the code from the previous post and look at your "Global Memory Size": you can allocate 1/4th of it in one allocation. Then I add the value 10 to each element in parallel, so I end up with one thread per element doing that sum.


The expected result is 10, 11, 12, 13, ......, 33554441. And, in fact, here it is as expected:




HAVE FUN, parallelizing your codes! =)




(*1) -> How to profile the code is an entirely different topic, but here are my recommended tools:

OSX   -> I hate them all. I recommend installing a virtual machine and profiling elsewhere, or measuring the times yourself with a timer. An alternative is the Instruments tool, but it didn't work for me.

LINUX -> Use Valgrind if you coded in C/C++. You can also use Qt Creator for more options, like detecting memory leaks.

WIN     -> I would definitely use Visual Studio.

Published by fxsf at 13:32
