Optimization Strategy

From OpenCV on the Cell

Parallel processing

Six specialized vector processors (SPUs) are available on the PS3. When the processing of one part of an image does not depend on any other part, the image data can be divided into six parts and each SPU can process one part in parallel, as shown in the following figure. Ideally, this reduces the processing time to about 1/6. For the data transfer involved in this case, refer to Hiding data-access latencies.


Figure: Parallel processing
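The division of the image among the SPUs can be sketched as follows. This is a minimal, host-side illustration rather than code from the OpenCV port: it assumes the image is split by rows into contiguous strips, and the names RowRange and split_rows are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open range of image rows [begin, end) assigned to one worker (SPU).
// Hypothetical helper type, introduced only for this sketch.
struct RowRange {
    std::size_t begin;
    std::size_t end;
};

// Split `height` rows into `workers` contiguous strips whose sizes differ
// by at most one row. The first (height % workers) strips get the extra row.
std::vector<RowRange> split_rows(std::size_t height, std::size_t workers) {
    assert(workers > 0);
    std::vector<RowRange> strips;
    strips.reserve(workers);
    const std::size_t base = height / workers;
    const std::size_t extra = height % workers;
    std::size_t row = 0;
    for (std::size_t i = 0; i < workers; ++i) {
        const std::size_t count = base + (i < extra ? 1 : 0);
        strips.push_back({row, row + count});
        row += count;
    }
    return strips;
}
```

For example, split_rows(480, 6) yields six strips of 80 rows each. Each strip would then be handed to one SPU. A filter that reads neighboring rows would need overlapping strips, which this sketch does not handle.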

Hiding data-access latencies

When an SPU processes data, it must first fetch the data from main memory and then write the results back after processing completes. DMA is used for these transfers, but the SPU is kept waiting while a transfer is in progress (see 'Single buffer DMA' in the following figure).

To hide these data-access latencies, use double-buffering techniques. As 'Double buffer DMA' in the following figure shows, the transfer latency can be hidden by overlapping it with computation. Please refer to here for details.


Figure: Double-buffering ('Single buffer DMA' and 'Double buffer DMA')
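The control flow of double buffering can be sketched in portable C++ as follows. This is an illustration under stated assumptions, not the SPU implementation: fetch_chunk and store_chunk are hypothetical stand-ins that copy synchronously with memcpy, whereas on a real SPU they would be asynchronous DMA transfers, with a wait for completion before each buffer is used. The per-chunk work (inverting pixels) is only an example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-ins for DMA between main memory and local store.
// They copy synchronously here so that the sketch runs on any host.
static void fetch_chunk(unsigned char* local, const unsigned char* main_mem, std::size_t n) {
    std::memcpy(local, main_mem, n);
}
static void store_chunk(unsigned char* main_mem, const unsigned char* local, std::size_t n) {
    std::memcpy(main_mem, local, n);
}

// Example per-chunk work: invert every pixel in place.
static void process_chunk(unsigned char* buf, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) buf[i] = static_cast<unsigned char>(255 - buf[i]);
}

// Process `total` bytes in pieces of at most `chunk` bytes using two buffers:
// while one buffer is being processed, the other is already being filled.
void process_double_buffered(const unsigned char* src, unsigned char* dst,
                             std::size_t total, std::size_t chunk) {
    assert(chunk > 0);
    if (total == 0) return;

    std::vector<unsigned char> storage(2 * chunk);
    unsigned char* buf[2] = {storage.data(), storage.data() + chunk};

    int cur = 0;
    std::size_t offset = 0;
    std::size_t len = std::min(chunk, total);
    fetch_chunk(buf[cur], src, len);  // prime the first buffer

    while (offset < total) {
        const std::size_t next_offset = offset + len;
        std::size_t next_len = 0;
        if (next_offset < total) {
            next_len = std::min(chunk, total - next_offset);
            // Start the next transfer into the other buffer before working
            // on the current one (asynchronous on real hardware).
            fetch_chunk(buf[cur ^ 1], src + next_offset, next_len);
        }
        // On an SPU, wait here until the transfer into buf[cur] has completed.
        process_chunk(buf[cur], len);
        store_chunk(dst + offset, buf[cur], len);

        offset = next_offset;
        len = next_len;
        cur ^= 1;  // swap buffers
    }
}
```

The point of the structure is the order of operations inside the loop: the fetch for chunk k+1 is issued before chunk k is processed, so with asynchronous transfers the copy time overlaps with the computation.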



SIMDizing

SIMD (Single Instruction, Multiple Data) is a technique for data-level parallelism on vector processors such as the PPU or SPU. In image processing, if the processing of each pixel does not depend on the others, multiple pixels can be processed together by SIMDization.

For more information, see SPU and SIMD optimization.

Figure: SIMDizing
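As a rough illustration of processing several pixels at once, the sketch below inverts an 8-bit image 16 pixels at a time. It is an assumption-laden example: it uses the generic vector extension of GCC and Clang instead of the SPU intrinsics so that it can be built on an ordinary host, and the names v16u8 and invert_pixels are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Sixteen unsigned 8-bit lanes in one 128-bit vector (GCC/Clang extension).
typedef unsigned char v16u8 __attribute__((vector_size(16)));

// Invert an 8-bit image: dst[i] = 255 - src[i].
void invert_pixels(const unsigned char* src, unsigned char* dst, std::size_t n) {
    assert(n == 0 || (src != nullptr && dst != nullptr));

    v16u8 full;
    std::memset(&full, 0xFF, sizeof full);   // every lane holds 255

    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        v16u8 v;
        std::memcpy(&v, src + i, sizeof v);  // load 16 pixels, any alignment
        v = full - v;                        // one subtraction covers 16 pixels
        std::memcpy(dst + i, &v, sizeof v);  // store 16 pixels
    }
    // Scalar loop for the remaining pixels (fewer than 16).
    for (; i < n; ++i) dst[i] = static_cast<unsigned char>(255 - src[i]);
}
```

The vector loop works only because each output pixel depends on a single input pixel. An operation in which one pixel's result feeds the next could not be grouped this way.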

SPU program size reduction

An SPU program must execute within the 256 KB of local store. This is often not enough for a C++ program, and because the OpenCV library uses C++, the program size easily becomes large. For example, an OpenCV library function such as cvCanny exceeds the local store size when compiled with only the -O3 option.

The following table shows the compile options and the resulting section sizes of the object when the cvRandArr function is compiled with spu-g++. According to the table, the -fno-exceptions option is effective for size reduction.


Section size of cvRandArr function (sizes in hexadecimal)

Option                                                   text    rodata  data    bss     Description
-O0                                                      021c88  0027b0  0008f0  008e40  No optimization
-O3                                                      021868  0027b0  0008f0  008e40  Optimize
-Os                                                      0217d8  0027b0  0008f0  008e40  Optimize for size
-fno-rtti                                                021c88  0027b0  0008f0  008e40  Disable generation of runtime type identification
-fno-exceptions                                          012e78  001340  0008d0  008600  Disable exception handling
-ffunction-sections -fdata-sections                      021c88  0027b0  0008f0  008d70  Place each function or data item into its own section in the output file (the -Wl,-gc-sections option is required when linking)
-Os -fno-exceptions -ffunction-sections -fdata-sections  0129d8  001340  0008d0  008530  All the options that had an effect on the size of cvRandArr
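To put the table in terms of the 256 KB local store, the section sizes can be summed and compared. The sketch below is a worked calculation, not part of the OpenCV port. It assumes that the table values are byte counts and that all four sections occupy local store, and the names SectionSizes, footprint, bytes_saved, and local_store_percent are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Section sizes of one object, taken from the table above.
struct SectionSizes {
    std::uint32_t text, rodata, data, bss;
};

constexpr std::uint32_t kLocalStoreBytes = 256 * 1024;  // SPU local store

// Total size of the four sections.
constexpr std::uint32_t footprint(const SectionSizes& s) {
    return s.text + s.rodata + s.data + s.bss;
}

// Row "-O0" and row "-Os -fno-exceptions -ffunction-sections -fdata-sections".
const SectionSizes kO0      = {0x021c88, 0x0027b0, 0x0008f0, 0x008e40};
const SectionSizes kCombined = {0x0129d8, 0x001340, 0x0008d0, 0x008530};

// Bytes saved by going from `before` to `after`.
std::uint32_t bytes_saved(const SectionSizes& before, const SectionSizes& after) {
    assert(footprint(before) >= footprint(after));
    return footprint(before) - footprint(after);
}

// Share of the local store taken by the four sections, in percent.
double local_store_percent(const SectionSizes& s) {
    return 100.0 * footprint(s) / kLocalStoreBytes;
}
```

Under these assumptions, the -O0 build totals 187,240 bytes (about 71% of local store) and the combined options total 117,528 bytes (about 45%), a saving of 69,712 bytes. The stack and any image buffers must also fit in the remaining space.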


Optimizing for speed

The SPU hardware does not fully support the IEEE floating-point standard. By default, however, spu-gcc generates code that generally follows the IEEE standard. If the -ffast-math option is used, the floating-point behavior is essentially dictated by the SPU hardware. In this case, note that results may differ from the original because of differences in calculation accuracy.


The following table shows the effect of the -ffast-math option, measured on the cvRandArr function. With the option, execution is about 2.5 times faster.

Option                Execution time (msec)
without -ffast-math   143.85
with -ffast-math      58.11
Time ratio            40.40%