Tags: fridayqna opencl
I'm back in action at last, so it's time for another Friday Q&A. This week, both someone named "charles" and Brian Olsen have suggested that I talk about OpenCL, so I'm going to go through the basics of what OpenCL is and how to do some simple computation with it.
SMUGOpenCL
I made heavy use of Chris Liscio's SMUGOpenCL library, which provides some nice Objective-C wrappers for basic OpenCL functionality.
What is OpenCL?
OpenCL is a natural outgrowth of the move toward doing general-purpose computation on graphics cards using OpenGL. The graphics card is actually the most powerful number cruncher in many PCs, and taking advantage of its power for non-graphics computations has been an emerging trend in recent years. My GPULife screensaver is an example of using OpenGL to do non-graphics computations, in this particular case computing the Game of Life.
OpenGL, while providing enough power to do general-purpose computations with the GLSL shader language, is awkward for non-graphical purposes. GLSL is still heavily graphics-oriented, and lacks things which many non-graphics computations need, like the ability to do random memory access for writing. Working around these limitations is possible, but is hard to do and can greatly hurt efficiency.
Not all graphics cards are capable of GLSL, and CPU emulation can be slow. A truly adaptable program will run computations on the GPU when possible and fall back to the CPU otherwise, ideally with vectorized, parallelized CPU code, but OpenGL doesn't make this easy.
OpenCL does make it easy. OpenCL allows you to program your GPU using normal C, with all the pointer craziness and random memory access that implies. Of course your performance will be better with more structured accesses, just like with anything, but for when you really need random accesses, OpenCL allows it. OpenCL can also target both GPUs and CPUs, and when you need to fall back to CPUs, OpenCL will do its best to automatically vectorize and parallelize your code. OpenCL is structured so that it can do a much better job at these tasks than your normal compiler in many cases. Since you can use essentially the same code for both CPU and GPU, this means that you can maintain less code, run on any hardware, and still take full advantage of beefy GPUs when they're available.
Overview
OpenCL is a library that you access at runtime. You give programs to OpenCL, which transforms them into kernels, which are functions that you can then call using OpenCL.
Unlike normal C programming, you don't compile OpenCL kernels ahead of time. Instead, you feed the raw textual source to OpenCL, which compiles it at runtime. This is necessary because you can't know in advance what sort of hardware you're going to target. Even when you're targeting CPUs, runtime compilation lets OpenCL specialize the code for the specific CPU's vector capabilities and number of cores.
These kernels are written in plain C with some basic extensions. When compiled, they can't be directly accessed as C functions. Instead, you set arguments and make calls using OpenCL functions, which then manage the execution of the kernels.
A Frequency Counter
To illustrate the use of OpenCL, I built a byte frequency counter. This just loads a file and counts the number of times each byte value occurs, and prints out the result. This is not the most useful program in the world, but it's a decent illustration of how to use OpenCL.
Code
As usual, I'll be presenting excerpts here, but the full code is available from my public subversion repository:
svn co http://mikeash.com/svn/OpenCLFreqCount/
Or just click on the URL above to browse it.
Plain C Implementation
I first wrote a non-OpenCL implementation of the frequency counter. This is useful to understand the problem, and also for debugging, to ensure that the OpenCL version produces the correct output. The function is simple: it takes an NSData and returns an NSData containing an array of uint32_ts representing the frequency counts. The computation is straightforward:
static NSData *SimpleFreqCount(NSData *inData)
{
    NSMutableData *freqCount = [NSMutableData dataWithLength: 256 * sizeof(uint32_t)];
    uint32_t *freqs = [freqCount mutableBytes];
    
    const unsigned char *ptr = [inData bytes];
    NSUInteger len = [inData length];
    for(NSUInteger i = 0; i < len; i++)
        freqs[ptr[i]]++;
    
    return freqCount;
}
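If you want to play with the counting logic outside of Foundation, here's the same computation as plain C (my own translation, not part of the article's project):

```c
#include <stdint.h>
#include <string.h>

/* Plain-C analog of SimpleFreqCount: tally how many times each of the
   256 possible byte values occurs in a buffer. */
static void FreqCount(const unsigned char *ptr, size_t len, uint32_t freqs[256])
{
    memset(freqs, 0, 256 * sizeof(uint32_t));
    for(size_t i = 0; i < len; i++)
        freqs[ptr[i]]++;
}

/* Example: FreqCount((const unsigned char *)"hello", 5, freqs)
   leaves freqs['l'] == 2 and freqs['h'] == 1. */
```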
An OpenCL kernel gets invoked many times in parallel with the same parameters. Each instance of the kernel can differentiate itself from the others by examining its work-item ID. Work-item IDs can get complex, but in the simplest case, each call to the kernel has a single ID which you can fetch by calling get_global_id(0).
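Conceptually, enqueueing a kernel over N work items behaves like this serial loop over IDs, although OpenCL is free to run the iterations in any order and in parallel (the kernel and function names here are mine, purely for illustration):

```c
#include <stddef.h>

/* A toy "kernel" that squares one element; its work-item ID selects the element. */
static void square_kernel(size_t global_id, const int *in, int *out)
{
    out[global_id] = in[global_id] * in[global_id];
}

/* Serial emulation of running the kernel with a global work size of n.
   A real OpenCL runtime may execute these invocations concurrently. */
static void enqueue_emulated(size_t n, const int *in, int *out)
{
    for(size_t gid = 0; gid < n; gid++)
        square_kernel(gid, in, out);
}
```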
Because the kernels can execute in parallel, there are the standard problems with concurrent data access. To avoid clashes, I decided to write the frequency count in two stages.
The first stage will go through the input data in blocks of 256 bytes and compute a local frequency count of the bytes just in that block. This local count will be stored into a large array which contains one local frequency count per block. The second stage will then go through and add all of the local counts together into one global count.
OpenCL Kernel Code
Here's what the start of the freqcount kernel, the first stage kernel, looks like:

__kernel void freqcount(__global const unsigned char *input, __global unsigned short *output)
{

Note the __kernel keyword. This is an OpenCL-specific keyword which indicates that this function is a kernel, which is to say that it can be accessed from the outside program. It's also possible to write functions that can only be accessed by other OpenCL functions. (Note also the __global qualifiers: OpenCL requires pointer arguments to a kernel to declare which address space they live in.)
Next, the kernel gets its work-item ID:
    const uint index = get_global_id(0);
    const uint start = index * 256;
The body of the kernel then loops through its 256-byte block of input, getting the value of each byte, and incrementing the corresponding slot in output:
    for(uint i = 0; i < 256; i++)
    {
        uint value = input[start + i];
        output[start + value]++;
    }
}
The freqsum kernel is the second stage. It uses a fixed number of work items, 256, one for each entry in the frequency count. Each work item then loops through all of the local count arrays to compute a final total. This is what the kernel looks like:
__kernel void freqsum(const unsigned int count, __global unsigned short *freqs, __global unsigned int *totals)
{
    const uint index = get_global_id(0);
    
    for(uint i = 0; i < count; i++)
        totals[index] += freqs[index + i * 256];
}
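To convince yourself that the two-stage scheme adds up, the kernels can be emulated in serial C, with each loop iteration standing in for one work item (this harness is mine, not part of the project):

```c
#include <stdint.h>
#include <stddef.h>

/* Stage 1 emulation: one "work item" per 256-byte block, keeping a
   two-byte local count per byte value. output must hold blocks * 256
   entries and start out zeroed, just as the NSMutableData buffer does. */
static void freqcount_emulated(const unsigned char *input, uint16_t *output, size_t blocks)
{
    for(size_t index = 0; index < blocks; index++)
    {
        size_t start = index * 256;
        for(size_t i = 0; i < 256; i++)
        {
            unsigned value = input[start + i];
            output[start + value]++;
        }
    }
}

/* Stage 2 emulation: one "work item" per byte value, summing that value's
   slot across all of the local count arrays. totals must start out zeroed. */
static void freqsum_emulated(unsigned count, const uint16_t *freqs, uint32_t *totals)
{
    for(size_t index = 0; index < 256; index++)
        for(unsigned i = 0; i < count; i++)
            totals[index] += freqs[index + (size_t)i * 256];
}
```

Feeding this 512 bytes that cycle through every byte value twice yields a total of 2 for every value, matching what a direct count would produce.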
Building the kernels was easy in this case; calling them is a bit more work. The first thing I do is pad the incoming data out to a multiple of 256 bytes so that it plays nicely with the kernel's 256-byte chunking:
static NSData *CLFreqCount(NSData *inData)
{
    NSMutableData *data = [NSMutableData dataWithData: inData];
    
    // pad data to multiple of 256
    NSUInteger dataLength = [data length];
    NSUInteger paddedLength = dataLength + 255 - (dataLength + 255) % 256;
    NSUInteger pad = paddedLength - dataLength;
    [data setLength: paddedLength];
(The amount of padding is saved in the pad variable so that it can be subtracted off at the end.)
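The padding arithmetic is worth a quick sanity check: it rounds any length up to the next multiple of 256, and leaves exact multiples alone. Extracted into a plain C helper (my own name):

```c
#include <stddef.h>

/* Round len up to the next multiple of 256, as CLFreqCount does. */
static size_t PaddedLength(size_t len)
{
    return len + 255 - (len + 255) % 256;
}

/* PaddedLength(1) == 256, PaddedLength(256) == 256, PaddedLength(257) == 512. */
```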
Next I create buffers to hold the local counts and the final count. The local counts array is twice as large as the (padded) incoming data, because each 256-byte block needs a two-byte counter for each of the 256 possible byte values. The final count array holds just 256 32-bit integers:
    // create large output area
    NSMutableData *largeOutput = [NSMutableData dataWithLength: paddedLength * 2];
    
    // and the final totals area
    NSMutableData *freqCount = [NSMutableData dataWithLength: 256 * sizeof(uint32_t)];
The first thing to do is to set up a context in which the kernels can be executed:
    SMUGOpenCLContext *context = [[SMUGOpenCLContext alloc] initCPUContext];
You could call initGPUContext instead to execute the code on the GPU. The choice is yours, not the system's, and creating a GPU context will fail if your GPU can't handle OpenCL, so you need to handle that case carefully. For this simple example, I just always use a CPU context.
Next, I load the OpenCL program into the context, and fetch the two kernels out of the program. The CLFreqCountSourceString function just returns an NSString containing the code for the two kernels:
    SMUGOpenCLProgram *program = [[SMUGOpenCLProgram alloc] initWithContext: context sourceString: CLFreqCountSourceString()];
    SMUGOpenCLKernel *freqCountKernel = [program kernelNamed: @"freqcount"];
    SMUGOpenCLKernel *freqSumKernel = [program kernelNamed: @"freqsum"];
Next, the NSData objects need to be wrapped in cl_mem objects so that OpenCL can use them as kernel arguments. SMUGOpenCL makes this really easy:
    cl_mem dataCL = [data getOpenCLBufferForReadingInContext: context];
    cl_mem largeOutputCL = [largeOutput getOpenCLBufferForWritingInContext: context];
    cl_mem freqCountCL = [freqCount getOpenCLBufferForWritingInContext: context];
With the buffers created, the kernel arguments can be set:

    cl_int err;
    err = [freqCountKernel setArgument: 0 withSize: sizeof(dataCL) data: &dataCL];
    if(err)
        ERROR("OpenCL error: %lld", (long long)err);
    
    err = [freqCountKernel setArgument: 1 withSize: sizeof(largeOutputCL) data: &largeOutputCL];
    if(err)
        ERROR("OpenCL error: %lld", (long long)err);
    
    cl_uint sumCount = paddedLength / 256;
    err = [freqSumKernel setArgument: 0 withSize: sizeof(sumCount) data: &sumCount];
    if(err)
        ERROR("OpenCL error: %lld", (long long)err);
    
    err = [freqSumKernel setArgument: 1 withSize: sizeof(largeOutputCL) data: &largeOutputCL];
    if(err)
        ERROR("OpenCL error: %lld", (long long)err);
    
    err = [freqSumKernel setArgument: 2 withSize: sizeof(freqCountCL) data: &freqCountCL];
    if(err)
        ERROR("OpenCL error: %lld", (long long)err);
Finally, it's time to run the kernels. To do that, I have to provide a global work size and a local work size. The global work size is the total number of work items, which determines the range of values that get_global_id(0) can return. The local work size is something I honestly don't quite understand, and I just use a SMUGOpenCL method to get a good size. The sizes are passed as arrays because some fancy stuff can be done (presumably for multi-dimensional data and such) by passing multiple values, but I only need one:
    size_t globalSizeCount[] = { paddedLength / 256 };
    size_t localSizeCount[] = { [context workgroupSizeForKernel: freqCountKernel] };
    [context enqueueKernel: freqCountKernel
        withWorkDimensions: 1
            globalWorkSize: globalSizeCount
             localWorkSize: localSizeCount];
    
    size_t globalSizeSum[] = { 256 };
    size_t localSizeSum[] = { [context workgroupSizeForKernel: freqSumKernel] };
    [context enqueueKernel: freqSumKernel
        withWorkDimensions: 1
            globalWorkSize: globalSizeSum
             localWorkSize: localSizeSum];
    
    [context finish];
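One constraint worth knowing about those two sizes: in OpenCL 1.x, the global work size must be evenly divisible by the local work size, which is presumably part of what workgroupSizeForKernel: accounts for. A tiny check with made-up numbers:

```c
#include <stddef.h>

/* Nonzero if localSize is an acceptable local work size for globalSize
   under OpenCL 1.x's divisibility rule. */
static int ValidWorkSizes(size_t globalSize, size_t localSize)
{
    return localSize != 0 && globalSize % localSize == 0;
}

/* For example, 4096 work items (1 MiB of data in 256-byte blocks) with a
   local size of 64 is fine; a local size of 100 is not. */
```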
Once the kernels have finished, the final counts are sitting in freqCount, except that they're incorrect due to the padding. Since the padding bytes are all zero, I do a quick fix-up of the frequency count for zero:
    uint32_t *freqs = [freqCount mutableBytes];
    
    // compensate for the padding
    freqs[0] -= pad;
    
    [program release];
    [context release];
    
    return freqCount;
}
Conclusion
OpenCL isn't too complicated to get started with, and SMUGOpenCL makes it especially easy. Performance is trickier. With a CPU context, OpenCL was significantly slower than the non-OpenCL code for a single pass through the data. With some loops to force it to do more work, reducing the relative overhead of OpenCL setup, the OpenCL version pulled ahead a bit. My Mac Pro's video card doesn't do OpenCL, so I didn't test GPU speed, but I'd guess that these kernels are not well adapted to the GPU. How to wrest the maximum speed out of OpenCL is beyond the scope of today's post. OpenCL is an exciting new technology, and it's much easier to get started with than doing general-purpose computation through OpenGL. I hope I've shown you enough to start experimenting on your own.
That's it for this week. Come back in another seven days for the next edition of Friday Q&A. Until then, keep sending your ideas for posts. Friday Q&A is driven by reader suggestions, so if you have a topic that you'd like to see covered here, send it in!
Comments:
Does OpenCL guarantee that setting a single byte in a shared array like your kernel does will be "thread" safe?
During the SL seeds, Ian Ollman posted to the Developer Forums "Why you should vectorize and some tips to get started". Unfortunately, my ADC membership has since expired and it doesn't seem to have been converted to a tech note, but if you have access it could help you get more performance out of OpenCL.
In answer to your question, that probably is not guaranteed. I didn't even think about that. You'd probably need to make the final results array use a larger data type in order to be safe.
It explains what SMUGOpenCL does under the hood and more (probably more than you want to know...)
Is that even possible in OpenCL?