Heterogeneous Computing with OpenCL. Gaster, Howes, Kaeli, Mistry, Schaa. Book.

This is not expected to be of interest to anyone else yet, but I’m taking notes on OpenCL because it seems to do some very nice stuff.

Ch 2

  1. the global id in the kernel identifying what piece of work to be done can be up to 3-dimensional (p. 18)
  2. It is possible to set work group sizes; the items in a workgroup can synchronize with each other and they share memory (p. 18)
  3. There are functions to check what platforms are available (p. 20)
  4. A context is an abstract container that exists on the host.  It coordinates the mechanisms for host-device integration, manages memory, and keeps track of programs and kernels made for each device (p. 22)
  5. Communication with a device occurs by submitting commands to a command queue.  Once the host decides which devices to work with and the context is created, one command queue needs to be made per device. (p. 22)
    1. A flag can be set that allows operations to occur out of order which can improve perf
  6. Any operation that places an item in the command queue creates an event (p.23).  They represent dependencies and provide a mechanism for profiling
  7. There are 2 types of memory objects: buffers and images.  Buffers are equivalent to arrays in C, stored contiguously.  Images are “opaque” objects.  Any memory object is valid only in a single context.  Movement to and from devices is managed by OpenCL as necessary (p. 24)
    1. Buffers can be read-only, write-only, or read-write
    2. Buffers belong to a context, not a device, so whether a buffer is copied to a device in full or only in part is up to the runtime
  8. Images abstract the storage of data to allow for device-specific optimizations (p 25), so I probably won’t use them
    1. They can’t be directly referenced, like arrays.  Data is not guaranteed to be adjacent in memory
    2. Basically specially set up for the GPU
  9. There are a couple of synchronization methods (p.26):
    1. clFinish() blocks until all commands in the queue are done
    2. clFlush() makes sure all commands in the queue have been issued to the device, but not necessarily finished
  10. There are ways to manipulate loading, saving, compiling OpenCL code (p 27)
  11. Memory:
    1. There are different memory spaces available.  (29)
    2. Global memory is visible to all compute units on the device. When data is transferred from host to device, data will be in global memory
    3. Constant memory is a part of global memory that can be read simultaneously by many work items and cannot change. Data is mapped there with the __constant qualifier
    4. Each workgroup has shared local memory
    5. Each work item in a group has its own private memory
    6. Diagram of how the mapping works in a concrete example (30)
    7. In some cases, it is worth manually copying items from global memory to local for speedups, particularly in GPUs (32)
  12. Kernels (31):
    1. A kernel is executed once for each work item created
    2. Kernels return void
    3. Buffers passed in can be of type global or constant (images always live in global memory)
    4. Specifications for read/write access are also passed in with the parameters
    5. When local data is declared, memory is allocated once per workgroup and shared by all of its work items
    6. Variables defined in the kernel itself can also be given a memory space to live in
    7. Kernels clean up local memory and other things for themselves
  13. Sweet lord, the amount of code required to do all this in C is terrible (37)
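
To make the indexing from items 1, 2 and 12.1 concrete, here is a minimal pure-Python sketch. It is not OpenCL and not code from the book; `enumerate_work_items` is a made-up helper. It assumes the OpenCL 1.x rule that each global size is a multiple of the work group size, and no global offset, in which case the global id in each dimension is group id times local size plus local id.

```python
from itertools import product


def enumerate_work_items(global_size, local_size):
    """Yield (global_id, group_id, local_id) for every work item in an NDRange.

    Hypothetical helper: it only mimics what get_global_id, get_group_id
    and get_local_id would report inside a kernel.
    """
    if len(global_size) != len(local_size) or not 1 <= len(global_size) <= 3:
        raise ValueError("an NDRange has 1 to 3 dimensions")
    if any(g % l for g, l in zip(global_size, local_size)):
        raise ValueError("each global size must be a multiple of the local size")
    # One iteration per work item; a kernel body would run once for each.
    for gid in product(*(range(g) for g in global_size)):
        group_id = tuple(g // l for g, l in zip(gid, local_size))
        local_id = tuple(g % l for g, l in zip(gid, local_size))
        yield gid, group_id, local_id


# A 4x6 range split into 2x3 work groups gives 24 work items in 4 groups.
items = list(enumerate_work_items((4, 6), (2, 3)))
```

On a real device the work items run in no particular order; the loop order here is just an artifact of the emulation.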

Ch 3 – OpenCL Device Architectures

  1. Although the idea is to make code really simple for all sorts of processor types, in reality we aren’t there yet.  It’s important to know how each piece of hardware works. (41)
  2. Superscalar execution – CPUs have areas that execute code out of order to keep themselves busy, but much of this is throwaway work
  3. VLIW – Moves the reordering of code that keeps the CPU busy out of the hardware and into the compiler
  4. SIMD is big in GPUs
  5. GPU architecture (53)
  6. Although the common arguments are that CPUs are serial (a couple of threads) and GPUs are parallel (thousands of threads), the lines are really blurry and the differences aren’t fundamental (55)
  7. Diagram of different architectures: AMD X6, i7, Radeon 6970 (happily this book was written recently, so these bits are relevant) (57)
  8. A common tradeoff is increasing core counts, or increasing threads per core (58)
  9. Sum-up of common CPU features: basically set up for fast single (or dual) thread execution
  10. Discussion of GPUs (60) “…GPUs are simply multithreaded processors with their parameterization aimed at processing large numbers of pixels very efficiently”
    1. Have high memory bandwidth, but poor latency
  11. AMD’s APUs, which put a CPU and a GPU together, don’t have the bandwidth of regular GPUs, but data doesn’t have to go through the PCI bus, so they are probably advantageous when bits of processing have to happen on each

Ch 4 – Basic OpenCL Examples

  1. (70) Diagram of the programming steps, in order: query platform, query devices, create command queue (platform layer); create buffers (runtime layer); compile program, compile kernel (compiler); set arguments, execute kernel (runtime layer)
  2. Has examples of matrix multiplication, image rotation, image convolution
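
As a rough picture of the matrix multiplication example, the usual decomposition is one work item per output element. The sketch below is a pure-Python emulation under that assumption, not the book's code; `matmul_kernel` and `launch_2d` are hypothetical names, and the nested loop in `launch_2d` stands in for what the runtime would do in parallel.

```python
def matmul_kernel(row, col, a, b, c, width_a):
    """Body run by a single work item: computes the one element c[row][col]."""
    acc = 0.0
    for k in range(width_a):
        acc += a[row][k] * b[k][col]
    c[row][col] = acc


def launch_2d(kernel, global_size, *args):
    """Stand-in for enqueueing a 2D range: call the kernel once per work item."""
    rows, cols = global_size
    for row in range(rows):
        for col in range(cols):
            kernel(row, col, *args)


a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]    # 3x2 input
b = [[7.0, 8.0, 9.0], [10.0, 11.0, 12.0]]   # 2x3 input
c = [[0.0] * 3 for _ in range(3)]           # 3x3 output, one work item each
launch_2d(matmul_kernel, (3, 3), a, b, c, 2)
```

No work item writes an element that another one reads or writes, so no synchronization between them is needed.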

Ch 5 – Understanding OpenCL’s Concurrency and Execution Model

  1. Can synchronize across operations within a workgroup, but in most cases it is best to restrict such communication to increase performance (88)
  2. On GPUs, as many as 64 work items can execute “in lock step as a single hardware thread on SIMD unit”. On AMD this is called a wavefront, and on NVIDIA it’s called a warp.  They say this is a simpler model than the SIMD that runs on CPUs
  3. Because of this SIMD execution, for a given device an OpenCL dispatch should be a whole multiple of the device’s SIMD width
    1. This can be found by getInfo() in the runtime… (details b. 88)
  4. In addition to getting the global id across one of the 3 dimensions, you can also query for other things such as workgroup membership, workgroup size, workgroup ID (89)
  5. In general, there are no ordering guarantees between work items
  6. Synchronization on CPUs is much simpler than on GPUs at a fundamental level.  Because of this, global synchronization is only allowed at kernel boundaries.  There is no way of specifying ordering of work items outside a work group
  7. A barrier in a workgroup is a synchronization point where all work items wait until every item in that work group has reached that point; it seems like this can cause deadlocks if not used carefully (b 90)
  8. Enqueueing work items, transmitting data, etc are done asynchronously.  The following commands can be enqueued (94):
    1. Kernel execution commands
    2. Memory commands
    3. Synchronization commands
  9. All kernel and synchronization commands are enqueued asynchronously.  From the point of view of the host, completion is only guaranteed at a sync point. The sync points are:
    1. clFinish command that blocks until an entire queue completes execution
    2. Waiting on the completion of a specific event
    3. Execution of a blocking memory op (this is the simplest, can be done with a call to enqueueReadBuffer with the 2nd param CL_TRUE)
  10. Memory is guaranteed only to be consistent at sync points (96). Even then, consistency is only guaranteed within the runtime, not as seen through the host API.  For the host to see consistent memory, one of the above blocking operations must be used
  11. Memory is associated with contexts, not devices
  12. OpenCL generates task graphs to coordinate dependencies, often based on events (97 w/ex).  Syncing on events can only be done within the same context
  13. Multiple command queues for different devices can be made in the same context
  14. Multiple devices in the same context can be set up to wait for each other, or work independently (100)
  15. Error conditions and profiling are handled through events in OpenCL: getInfo on an event returns things such as its command queue, context, command type, and execution status (103)
  16. User events allow enqueueing of commands that depend on the completion of an arbitrary task.  They give the example of making sure a buffer isn’t overwritten by an async OpenCL read until the buffer is no longer used
  17. Event callbacks allow callbacks to be set up when a specific execution status is reached.  In practice, this is used in cases where the CPU and GPU interleave processing (104)
  18. Callbacks should be used with caution for a number of reasons (104); among them, the callback has to be thread safe and some behavior is undefined
  19. Can write native kernels in standard C that are compiled by a regular compiler and can use normal C libraries, but otherwise function like normal kernels. A native kernel basically boxes up some C code and passes it over to where it will be executed (107)
  20. Command barriers and markers (108).  Barriers block until all items finish, but markers do not block execution.
    1. The mirror of a marker is waitForEvents, which blocks until an event occurs
  21. Buffers (110)
  22. Images (113) are structures only really relevant to GPUs, which are basically regular images and have support for particular operations (such as filtering) specific to images (efficiency depends on how they are stored in memory)
  23. Memory model: global, local, constant, private (116)
  24. Because of perf, the consistency of memory is more relaxed than in normal code:
    1. Within a work item, memory operations are ordered predictably.  Any two reads and writes to the same address are not reordered
    2. Between work items in a workgroup, memory is only guaranteed consistent at a barrier
    3. Between workgroups there are no guarantees of consistency until completion of the kernel
  25. Because of this the compiler only really needs to make the last write to a particular address visible outside a work item
  26. Fences allow some level of communication between work items and workgroups but there are still no guarantees of ordering.  The fence itself doesn’t imply any synchronization, but the barrier does
  27. There are some predefined atomic methods, such as for addition
  28. (122) particulars on memory and registers on GPUs
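
The barrier rule from items 7 and 24.2 is easier to see in a running example. This is a minimal sketch in plain Python, not OpenCL: one thread per work item, a list standing in for the workgroup's local memory, and `threading.Barrier` standing in for the kernel barrier. `reduce_workgroup` is a hypothetical name, and the tree-shaped sum is just a common pattern picked for illustration.

```python
import threading


def reduce_workgroup(values):
    """Sum `values` using one thread per work item, the way a workgroup might."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("workgroup size must be a power of two")
    local = list(values)             # stands in for shared local memory
    barrier = threading.Barrier(n)   # stands in for the workgroup barrier

    def work_item(lid):
        stride = n // 2
        while stride > 0:
            if lid < stride:
                local[lid] += local[lid + stride]
            # Every work item calls the barrier, including the idle ones.
            # If the call sat inside the `if`, the items that do reach it
            # would wait forever in this emulation.
            barrier.wait()
            stride //= 2

    threads = [threading.Thread(target=work_item, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return local[0]


total = reduce_workgroup([1, 2, 3, 4, 5, 6, 7, 8])
```

Within one step the active items read from the upper half of the range and write to the lower half, so they never touch the same slot; the barrier is what makes those writes visible before the next step reads them.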

Ch 6 – Dissecting a CPU/GPU OpenCL Implementation

  1. On a CPU the work is spread around processors, but each workgroup runs entirely in a single thread, which gives a good balance between overhead and synchronization. So there is no parallelism within a workgroup, only between workgroups
  2. On GPUs the cost of thread scheduling needs to be small because there are many threads and most computations on threads are small (130)
  3. Workloads need to map properly to the underlying hardware to attain good performance
  4. PCIExpress bus isn’t too slow.  On the example machine they use, it’s 8GB/s, and the bus to DDR3 RAM is 22GB/s; the GPU has 170GB/s, though.  The GPU is high bandwidth, high latency
  5. (133) The type of code that gets generated for running on the GPU
  6. Branching is done at the granularity of wavefronts, and any sub-wavefront branching requires redoing parts in the wave scheduler until all combinations have finished executing, which generates inefficiency (135)
  7. Unrolling loops or vectorizing code makes GPU processing most efficient (137)
  8. <This chapter is mainly about GPU particulars so I’m not reading very carefully>
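
A back-of-the-envelope way to see the branching cost from item 6 is to count how many times each wavefront has to run the branch. The sketch below is a deliberately simplified model, not how any particular GPU schedules work: it assumes a wavefront of 64 work items (the figure from the Ch 5 notes) and charges one pass per distinct branch direction taken inside a wavefront. `wavefront_passes` is a hypothetical name.

```python
def wavefront_passes(branch_taken, wavefront_size=64):
    """Count passes over an if/else under a simplified divergence model.

    branch_taken[i] is True if work item i takes the `if` side. A wavefront
    whose items all agree costs one pass; a mixed wavefront costs two.
    """
    passes = 0
    for start in range(0, len(branch_taken), wavefront_size):
        paths = set(branch_taken[start:start + wavefront_size])
        passes += len(paths)
    return passes


n = 256  # four wavefronts of 64 work items
coherent = [(i // 64) % 2 == 0 for i in range(n)]  # whole wavefronts agree
divergent = [i % 2 == 0 for i in range(n)]         # neighbours disagree

coherent_cost = wavefront_passes(coherent)
divergent_cost = wavefront_passes(divergent)
```

Both layouts send half of the work items down each side of the branch, but in this model the interleaved one costs twice as many passes because every wavefront has to run both sides.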

Ch 7 – Case study in convolution

  1. Might be relevant for processing data from rollouts when action space is multidimensional
  2. Picking the right work group size depends on the particular architecture (GPU manufacturer, model even)
  3. Unrolling loops can make code 2-2.5x faster
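
For reference, this is what unrolling means for a small filter, as a minimal plain-Python sketch rather than the book's kernel. `filter_loop` and `filter_unrolled3` are hypothetical names, the filter is 1D with exactly 3 taps to keep it short, and it is applied as a sliding dot product without flipping the taps. Timing this in Python says nothing about the GPU speedup quoted above; the point is only that the two forms compute the same thing.

```python
def filter_loop(signal, taps):
    """Sliding dot product with a generic inner loop over the filter taps."""
    out = []
    for i in range(len(signal) - len(taps) + 1):
        acc = 0.0
        for k in range(len(taps)):
            acc += signal[i + k] * taps[k]
        out.append(acc)
    return out


def filter_unrolled3(signal, taps):
    """Same computation with the inner loop written out for exactly 3 taps."""
    if len(taps) != 3:
        raise ValueError("this version is unrolled for 3 taps only")
    t0, t1, t2 = taps
    return [
        signal[i] * t0 + signal[i + 1] * t1 + signal[i + 2] * t2
        for i in range(len(signal) - 2)
    ]


signal = [1.0, 2.0, 3.0, 4.0, 5.0]
taps = [0.5, 1.0, -0.5]
looped = filter_loop(signal, taps)
unrolled = filter_unrolled3(signal, taps)
```

The unrolled version trades generality for having no loop counter or loop branch in the inner computation, which is the part that matters on hardware where branches are expensive.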
