Dennis Núñez

PhD (c) in AI and Neuroimaging. CEA / Inria / Université Paris-Saclay

Starting with OpenCL on Parallella Board

Note: All next steps should be done on Parallella Board.

STDCL® is a portable API for targeting compute offload accelertors and co-processors. STDCL is designed in a style inspired by familiar and traditional UNIX APIs for C programming. The implementation of STDCL leverages OpenCL™ to provide broad portability across modern accelerator and co-processor architectures.

STDCL provides the following features and support:

- Default compute contexts.

- Dynamic loader for compute-offload programs.

- Distributed memory management.

- Compute offload kernel execution.

- Synchronization.

More info:

Create folder

Go to the projects folder with "cd", or create it with "mkdir".

Into the projects folder:

Create a folder:

$ mkdir matvecmult

Host code

Create the file "matvecmult_host.c":

$ sudo nano matvecmult_host.c

and write:

/* matvecmult_host.c */ #include <stdio.h> #include <stdlib.h> #include <stdcl.h> inline int parity( int x ) { return ( (x%2)? +1 : -1 ); } int main() { int i,j; unsigned int n = 1024; /* allocate device-shareable memory */ float* aa = (float*)clmalloc(stdacc,n*n*sizeof(float),0); float* b = (float*)clmalloc(stdacc,n*sizeof(float),0); float* c = (float*)clmalloc(stdacc,n*sizeof(float),0); /* initialize matrix aa[] and vector b[], and zero result vector c[] */ for(i=0; i < n; i++) for(j=0; j < n; j++) aa[i*n+j] = (1.0/n/n)*i*j*parity(i*j); for(i=0; i < n; i++) b[i] = (1.0/n)*i*parity(i); for(i=0; i < n; i++) c[i] = 0.0f; /* sync data with device memory */ clmsync(stdacc,0,aa,CL_MEM_DEVICE|CL_EVENT_NOWAIT); clmsync(stdacc,0,b,CL_MEM_DEVICE|CL_EVENT_NOWAIT); clmsync(stdacc,0,c,CL_MEM_DEVICE|CL_EVENT_NOWAIT); /* perform calculation */ clndrange_t ndr = clndrange_init1d( 0, n, 16 ); clexec(stdacc,0,&ndr,matvecmult_kern,n,aa,b,c); /* sync data with host memory */ clmsync(stdacc,0,c,CL_MEM_HOST|CL_EVENT_NOWAIT); /* block until co-processor is done */ clwait(stdacc,0,CL_ALL_EVENT); for(i=0; i < n; i++) printf("%d %f %f\n",i,b[i],c[i]); clfree(aa); clfree(b); clfree(c); }

Then save and exit.

Kernel code

Create the file "":

$ sudo nano

and write:

/* */ #include <stdcl.h> void matvecmult_kern( unsigned int n, float* aa, float* b, float* c ) { int i,j; i = get_global_id(0); float tmp = 0.0f; for(j=0; j < n; j++) tmp += aa[i*n+j] * b[j]; c[i] = tmp; }

Then save and exit.

Program Compilation

Using the COPRTHR clcc compiler tools, building the program is easy and follows a standard compilation model. The kernel code is first compiled using clcc and then the host program can be compiled with the kernel code being directly linked to create a single executable program.

$ clcc -k -c

$ gcc -o matvecmult.x matvecmult_host.c matvecmult_kern.o -I/usr/local/browndeer/include \ -L/usr/local/browndeer/lib -lstdcl -locl

We can verify that the kernel code is linked in using the clnm command:

$ clnm matvecmult.x

The last command should shows:

clnm: 'matvecmult_kern.c' bin [coprthr:ARMv7] clnm: 'matvecmult_kern.c' bin [coprthr:E16G Needham] clnm: 'matvecmult_kern.c' ksym matvecmult_kern clnm: 'matvecmult_kern.c' src [<generic>]


We can then run the program that offloads the parallel computation to the Epiphany co-processor:

$ ./matvecmult.x