Optimization Strategy
From OpenCV on the Cell
Parallel processing
On the PS3, six specialized vector processors (SPUs) are available. When the processing steps do not depend on one another, the image data can be divided into six parts and each SPU can process one part in parallel. Processing time is thereby reduced to roughly 1/6 in the best case. For the data transfer involved, please refer to Hiding data-access latencies.
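As an illustration of the division step, the following is a minimal sketch that splits an image into contiguous row ranges, one per SPU. The names `RowRange` and `split_rows` are hypothetical and not part of OpenCV or the Cell SDK; launching the SPU threads themselves is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open range of image rows [begin, end) assigned to one worker.
struct RowRange {
    std::size_t begin;
    std::size_t end;
};

// Split `height` rows into `workers` contiguous ranges of nearly equal size.
// The default of 6 matches the number of SPUs mentioned in the text.
std::vector<RowRange> split_rows(std::size_t height, std::size_t workers = 6) {
    assert(workers > 0);
    std::vector<RowRange> ranges;
    ranges.reserve(workers);
    const std::size_t base = height / workers;   // rows every worker gets
    const std::size_t extra = height % workers;  // leftover rows to spread out
    std::size_t row = 0;
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t count = base + (w < extra ? 1 : 0);
        ranges.push_back({row, row + count});
        row += count;
    }
    return ranges;
}
```

Splitting by rows keeps each part contiguous in memory, which tends to suit bulk transfers. This only applies when the rows can be processed independently, as the text requires.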
Hiding data-access latencies
When an SPU processes data, it must first fetch the data from main memory and then write the results back after processing completes. DMA is used for these transfers, but the SPU is kept waiting while a transfer is in progress (see 'Single buffer DMA' in the following figure).
To hide these data-access latencies, use double-buffering techniques. As 'Double buffer DMA' in the following figure shows, the transfer latency can be hidden behind computation. Please refer to here for details.
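The control flow of double buffering can be sketched in portable C++ as below. This is not SPU code: `start_fetch`, `wait_fetch`, and `store_result` are hypothetical stand-ins implemented with a synchronous `memcpy`, whereas on a real SPU they would correspond to asynchronous DMA requests and a wait on their completion. The chunk size and the thresholding kernel are also assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kChunkBytes = 4096;  // hypothetical transfer size

// Hypothetical stand-ins for DMA. Here the copy is synchronous; on an SPU
// the fetch would be started asynchronously and waited on later.
static void start_fetch(std::uint8_t* local, const std::uint8_t* remote, std::size_t n) {
    std::memcpy(local, remote, n);
}
static void wait_fetch() { /* would block until the pending fetch completes */ }
static void store_result(std::uint8_t* remote, const std::uint8_t* local, std::size_t n) {
    std::memcpy(remote, local, n);
}

// Example kernel: binary threshold on 8-bit pixels.
static void threshold_chunk(std::uint8_t* buf, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        buf[i] = static_cast<std::uint8_t>(buf[i] >= 128 ? 255 : 0);
    }
}

void threshold_double_buffered(const std::uint8_t* src, std::uint8_t* dst, std::size_t total) {
    static std::uint8_t buf[2][kChunkBytes];  // two local buffers
    if (total == 0) return;
    int cur = 0;
    // Prime the pipeline: request the first chunk.
    start_fetch(buf[cur], src, std::min(kChunkBytes, total));
    for (std::size_t off = 0; off < total; off += kChunkBytes) {
        const std::size_t n = std::min(kChunkBytes, total - off);
        wait_fetch();  // the current buffer is now valid
        const std::size_t next = off + n;
        if (next < total) {
            // Request the next chunk into the other buffer before computing,
            // so the transfer can overlap with the work on the current one.
            start_fetch(buf[cur ^ 1], src + next, std::min(kChunkBytes, total - next));
        }
        threshold_chunk(buf[cur], n);
        store_result(dst + off, buf[cur], n);
        cur ^= 1;  // swap buffers
    }
}
```

With truly asynchronous transfers, the write-back of a buffer would also need to complete before that buffer is reused for a later fetch; the synchronous stand-ins make that implicit here.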
SIMDizing
SIMD (Single Instruction, Multiple Data) is a technique for data-level parallelism on vector processors such as the PPU or SPU. In image processing, if the operations on individual pixels do not depend on one another, multiple pixels can be processed together by SIMDization.
For more information, see SPU and SIMD optimization.
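To make the idea concrete, here is a small sketch that inverts 8-bit pixels 16 at a time. It assumes a GCC- or Clang-compatible compiler and uses the generic `vector_size` extension as a portable stand-in for the 128-bit vector types of the SPU; it does not use SPU intrinsics, and the name `invert_pixels` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// 16 unsigned 8-bit lanes in one 128-bit vector (GCC/Clang extension).
typedef unsigned char v16u8 __attribute__((vector_size(16)));

// Invert every pixel (255 - p, which equals ~p for 8-bit values).
void invert_pixels(const std::uint8_t* src, std::uint8_t* dst, std::size_t n) {
    std::size_t i = 0;
    // Vector part: one operation handles 16 pixels.
    for (; i + 16 <= n; i += 16) {
        v16u8 v;
        std::memcpy(&v, src + i, sizeof v);  // load without alignment assumptions
        v = ~v;                              // applied to all 16 lanes at once
        std::memcpy(dst + i, &v, sizeof v);
    }
    // Scalar tail for the remaining pixels.
    for (; i < n; ++i) {
        dst[i] = static_cast<std::uint8_t>(~src[i]);
    }
}
```

The operation qualifies for SIMDization because each output pixel depends only on the input pixel at the same position.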
SPU program size reduction
An SPU program must execute within the 256 KB local store. However, this is often not enough for a C++ program. Because the OpenCV library is written in C++, program size easily becomes large. For example, an OpenCV library function such as cvCanny exceeds the local store size when compiled with only the -O3 option.
The following table shows the compile options and the resulting section sizes (in hexadecimal) of the object produced when compiling the cvRandArr function with spu-g++. According to these results, the -fno-exceptions option is effective for size reduction.
| Option | text | rodata | data | bss | Description |
|---|---|---|---|---|---|
| -O0 | 021c88 | 0027b0 | 0008f0 | 008e40 | No optimization |
| -O3 | 021868 | 0027b0 | 0008f0 | 008e40 | Optimize |
| -Os | 0217d8 | 0027b0 | 0008f0 | 008e40 | Optimize for size |
| -fno-rtti | 021c88 | 0027b0 | 0008f0 | 008e40 | Disable generation of runtime type identification. |
| -fno-exceptions | 012e78 | 001340 | 0008d0 | 008600 | Disable exception handling. |
| -ffunction-sections -fdata-sections | 021c88 | 0027b0 | 0008f0 | 008d70 | Place each function or data item into its own section in the output file. (The -Wl,-gc-sections option is required when linking.) |
| -Os -fno-exceptions -ffunction-sections -fdata-sections | 0129d8 | 001340 | 0008d0 | 008530 | All the options that reduced the size of cvRandArr |
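The combined effect can be worked out by summing the four sections of the first and last rows. The sketch below assumes the table values are hexadecimal byte counts and that all four sections count toward the footprint; the identifiers are hypothetical. Under those assumptions the total drops from 187,240 bytes (-O0) to 117,528 bytes (all size options), a saving of 69,712 bytes, or about 37%.

```cpp
#include <cassert>
#include <cstdint>

// Section sizes of one object, taken from the table (assumed to be bytes).
struct SectionSizes {
    std::uint32_t text;
    std::uint32_t rodata;
    std::uint32_t data;
    std::uint32_t bss;
};

constexpr std::uint32_t total_bytes(const SectionSizes& s) {
    return s.text + s.rodata + s.data + s.bss;
}

// Row "-O0" of the table.
constexpr SectionSizes kBaselineO0{0x021c88, 0x0027b0, 0x0008f0, 0x008e40};
// Row "-Os -fno-exceptions -ffunction-sections -fdata-sections".
constexpr SectionSizes kCombined{0x0129d8, 0x001340, 0x0008d0, 0x008530};

// Size of the SPU local store stated in the text.
constexpr std::uint32_t kLocalStoreBytes = 256 * 1024;

constexpr std::uint32_t bytes_saved() {
    return total_bytes(kBaselineO0) - total_bytes(kCombined);
}
```

Both totals are for this single object only, so they say nothing by themselves about whether a complete SPU program fits in `kLocalStoreBytes`.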
Optimizing for speed
The SPU hardware does not fully support the IEEE floating-point standard. However, spu-gcc generally generates code that follows the IEEE standard. When the -ffast-math option is used, floating-point behavior is essentially dictated by the SPU hardware. In this case, note that results may differ from the original because of differences in calculation accuracy.
The following table shows the effect of the -ffast-math option, measured on the cvRandArr function. With the option, execution is about 2.5 times faster.
| Option | Execution time (ms) |
|---|---|
| without -ffast-math | 143.85 |
| with -ffast-math | 58.11 |
| Time ratio | 40.40% |
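Because results may differ slightly once -ffast-math is enabled, a check against reference output can compare values within a tolerance instead of requiring exact equality. The following is a minimal sketch of such a check; the function names and default tolerances are assumptions, not values taken from the measurements above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>

// True if a and b agree within a relative or absolute tolerance.
// The default tolerances are illustrative only.
bool nearly_equal(float a, float b, float rel_tol = 1e-5f, float abs_tol = 1e-6f) {
    const float diff = std::fabs(a - b);
    const float scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= std::max(abs_tol, rel_tol * scale);
}

// Count elements of `out` that differ from `ref` by more than the tolerance.
std::size_t count_mismatches(const float* ref, const float* out, std::size_t n,
                             float rel_tol = 1e-5f) {
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (!nearly_equal(ref[i], out[i], rel_tol)) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

A suitable tolerance depends on the function being tested and on how its output is used afterwards.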



