Mastering Memory Bandwidth in SoCs

How Simulation Resolves Memory Bandwidth Restrictions in SoCs

Cameras play a critical role in many of today’s advanced driver assistance systems (ADAS) and autonomous driving (AD) applications. Powerful system-on-chip (SoC) devices, with their integrated image signal processors (ISPs) and other accelerators, are well placed to offload the execution of computer vision algorithms. However, despite the availability of SoCs featuring these highly capable heterogeneous processing cores, automotive system developers still struggle to ensure that decisions based upon camera input are delivered on time, every time.

The challenges faced are two-fold. The first relates to the memory interconnects. While each processor and accelerator provides plenty of processing performance, they typically operate on data located in a shared memory accessible over a single interconnect. NXP’s second-generation S32V234 vision processor is an example of this, with its APEX image cognition processors sharing the AXI interconnect with the GPU and the quad Arm® Cortex®-A53 processors. The situation is similar for the Renesas R-Car V3H. Also built around a quad Arm Cortex-A53, it features two ISPs and a hardware convolutional neural network (CNN) block, as well as an image pipeline and further computer vision accelerators. Memory access bottlenecks typically occur as the various processors attempt to fetch data and write back their results.

Maintaining Temporal Consistency

This leads to the second challenge, that of temporal consistency. Because other parts of the application also access the memory over the interconnect, it becomes difficult to guarantee that the execution paths for image processing complete within the required timeframe.

In such complex applications, two aspects must be carefully considered. The first is to ensure that the image processing algorithms are efficient. The second is to review the order in which the image processing blocks and other system code access memory. This requires an efficient, dynamic system architecture so that memory bottlenecks are minimized or avoided. The starting point is to define the maximum end-to-end latencies from sensor to output or actuator. These are then split into execution-time budgets for each processing stage. The execution time of each element in the chain can be determined by experiment or estimated from experience with previous projects.
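
As a rough illustration of splitting an end-to-end latency into per-stage budgets, the following Python sketch scales a set of estimated execution times so that together they fill the overall budget. The stage names follow the example used in this article, but every number, the `Stage` class, and the `allocate_budget` helper are hypothetical and not taken from any real project or tool.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    estimate_ms: float  # estimated execution time; all values here are hypothetical


def allocate_budget(stages, end_to_end_ms):
    """Scale the per-stage estimates proportionally to fill the end-to-end budget.

    Returns (budgets, factor). A factor above 1 means the estimates leave
    headroom; a factor below 1 means they already exceed the budget.
    """
    total = sum(stage.estimate_ms for stage in stages)
    if total <= 0:
        raise ValueError("stage estimates must sum to a positive value")
    factor = end_to_end_ms / total
    budgets = {stage.name: stage.estimate_ms * factor for stage in stages}
    return budgets, factor


# Hypothetical chain: sensor -> ISP -> accelerator -> application code -> fusion
chain = [Stage("ISP", 4.0), Stage("ACC", 6.0), Stage("APP", 8.0), Stage("SDF", 2.0)]
budgets, factor = allocate_budget(chain, end_to_end_ms=30.0)

for name, budget in budgets.items():
    print(f"{name}: {budget:.1f} ms")
print(f"headroom factor: {factor:.2f}")
```

Proportional scaling is only one possible policy. A real project might instead reserve fixed margins for stages whose timing is known to be less predictable.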

Imaging Application Processing Path

With this knowledge, the entire processing path can be modeled. This is shown in Figure 1, where images from two cameras are processed by two ISPs, followed by two dedicated accelerators. The results are then analyzed in two separate application code blocks before being fused in a sensor data fusion (SDF) block. In terms of timing, it is essential that the SDF regularly receives fresh input via both paths and is not left to make decisions based upon old image samples from either path.
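
The freshness requirement on the SDF inputs can be expressed as a simple data-age check. The sketch below is a minimal Python illustration. It assumes that result arrival times and SDF activation times are available as lists of milliseconds. The timestamps, the 33 ms limit, and the `sample_ages` helper are all hypothetical.

```python
from bisect import bisect_right


def sample_ages(arrivals_ms, activations_ms):
    """For each SDF activation, return the age of the newest result that
    arrived at or before it, or None if no result has arrived yet."""
    arrivals = sorted(arrivals_ms)
    ages = []
    for t in activations_ms:
        i = bisect_right(arrivals, t)  # number of results that arrived up to time t
        ages.append(None if i == 0 else t - arrivals[i - 1])
    return ages


# Hypothetical arrival times of results on the two camera paths (ms)
path_a = [5, 38, 71, 104]
path_b = [8, 41, 95, 128]  # the third result on this path is delayed
sdf_activations = [40, 80, 120]
max_age_ms = 33  # assumed freshness limit

ages_a = sample_ages(path_a, sdf_activations)
ages_b = sample_ages(path_b, sdf_activations)

# Activations at which the newest result is missing or older than the limit
stale_a = [t for t, age in zip(sdf_activations, ages_a) if age is None or age > max_age_ms]
stale_b = [t for t, age in zip(sdf_activations, ages_b) if age is None or age > max_age_ms]
print("path A ages:", ages_a, "stale at:", stale_a)
print("path B ages:", ages_b, "stale at:", stale_b)
```

In this made-up data set, the delayed result on the second path means the SDF activation at 80 ms would still be working with a result that is 39 ms old.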

Figure 1: A simplified example automotive vision system using image processors, accelerators, and application code that is reliant on shared memory. It must deliver correct and timely results to a sensor data fusion block.

In the example shown, the ISPs have access to a dedicated memory and may also be able to automate data transfer using direct memory access (DMA). While this reduces the complexity of the task, the two ISPs still share that memory and must take turns to submit their results. The hardware accelerators (ACC), application code (APP), and SDF also share a common memory, so the order of their accesses must be determined and appropriately interleaved with memory accesses from other parts of the system.
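
To make the effect of a shared memory concrete, the following Python sketch models a single bus that serves one transfer at a time and grants pending requests by fixed priority. It is a deliberately simplified, non-preemptive model. It does not describe how any particular SoC interconnect or arbiter behaves, and the requester names, times, and priorities are hypothetical.

```python
def arbitrate(requests):
    """Serialize transfers on one shared bus.

    requests: list of (name, ready_ms, transfer_ms, priority) tuples, where
    a lower priority number wins. Transfers are not preempted once started.
    Returns a dict mapping name to (start_ms, finish_ms).
    """
    pending = list(requests)
    now = 0.0
    schedule = {}
    while pending:
        ready = [r for r in pending if r[1] <= now]
        if not ready:
            # Bus is idle: jump ahead to the next request that becomes ready
            now = min(r[1] for r in pending)
            continue
        chosen = min(ready, key=lambda r: r[3])
        name, _, transfer_ms, _ = chosen
        schedule[name] = (now, now + transfer_ms)
        now += transfer_ms
        pending.remove(chosen)
    return schedule


# Hypothetical requesters: (name, ready_ms, transfer_ms, priority)
requests = [
    ("ACC0", 0.0, 2.0, 1),
    ("ACC1", 0.0, 2.0, 2),
    ("APP0", 1.0, 3.0, 0),
    ("SDF", 1.5, 1.0, 3),
]
schedule = arbitrate(requests)
for name, (start, finish) in schedule.items():
    print(f"{name}: bus from {start:.1f} to {finish:.1f} ms")
```

With these assumed values, the SDF request is ready at 1.5 ms but is not served until 7.0 ms, because every other requester has a higher priority. Changing the priorities or the ready times changes who waits, which is the kind of trade-off the access ordering has to settle.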

Accounting for SoC Interconnects in System Simulation

Building a model in chronSUITE 3.0 lets the development team assess its options and decide which algorithms should run on hardware accelerators and which on application cores. The team can also determine how to prioritize and order the code or tasks. Through rounds of simulation using chronSIM, part of the chronSUITE toolkit, the execution timing, including jitter, can be evaluated using chronVIEW. The model also reflects the restrictions imposed by the memory buses (Figure 2). Because the individual elements of the implementation are visualized, trade-offs between timing requirements and processor and bus load are easier to make.
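
The idea of evaluating execution timing with jitter over many simulation rounds can be illustrated with a generic Monte Carlo sketch in Python. This is not how chronSIM works internally and uses no chronSUITE interface. It simply draws each stage's execution time from an assumed uniform range and counts how often the end-to-end deadline is missed. The stage values, the deadline, and the `simulate_chain` helper are hypothetical.

```python
import random


def simulate_chain(stages, deadline_ms, runs=10_000, seed=0):
    """Monte Carlo estimate of end-to-end latency for a sequential chain.

    stages: list of (nominal_ms, jitter_ms); in each run a stage takes a
    time drawn uniformly from [nominal_ms, nominal_ms + jitter_ms].
    Returns (worst_observed_ms, violation_rate).
    """
    rng = random.Random(seed)  # fixed seed so repeated runs are comparable
    worst = 0.0
    violations = 0
    for _ in range(runs):
        total = sum(nominal + rng.uniform(0.0, jitter) for nominal, jitter in stages)
        worst = max(worst, total)
        if total > deadline_ms:
            violations += 1
    return worst, violations / runs


# Hypothetical chain: ISP, accelerator, application code, fusion
chain = [(4.0, 1.0), (6.0, 2.0), (8.0, 3.0), (2.0, 0.5)]
worst_ms, violation_rate = simulate_chain(chain, deadline_ms=25.0)
print(f"worst observed latency: {worst_ms:.2f} ms")
print(f"share of runs missing the deadline: {violation_rate:.2%}")
```

A uniform distribution and independent stages are simplifying assumptions. A model that includes bus contention would make the stage times depend on each other.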

Variations such as changed priorities or a different execution order can be tested in simulation without having to deploy the code on hardware. Mitigation approaches can likewise be developed and simulated quickly if the system requirements change during the project or functional issues are discovered after integration with the rest of the hardware. This makes resolving timing-related issues more efficient and speeds up overall problem resolution.

Figure 2: By modeling the vision system processing elements (top half) together with the memory interconnect accesses (bottom half), the end-to-end execution time can be understood. This allows optimizations to be developed when, as here, timing requirements are violated.

For further examples of how chronSUITE 3 helps automotive development teams tackle timing challenges earlier in the design of complex, multicore systems, take a look at our design page.