Notes from a parallel universe II

10 October 2017

Last time, we looked at the progression of devices from Intel that have dominated the rugged embedded computing market at the high-performance end, both for SBCs and for signal processing. A graphic showing the story to date was omitted, so here it is (below). It depicts the progression of processors from 1st Gen through the current 7th Gen and shows example SBCs in both 3U and 6U formats. It also marks the timeline of the significant architectural introductions that gave us inflection points on the performance curve, especially in the single precision floating point (SP FP) vector operations that tend to dominate embedded signal processing.

We also looked forward to the introduction of AVX512 to this line of processors. Since then, Intel has made some more details public. Particularly relevant is the fact that some CPUs (in the ‘server’ line) will have both AVX512 and two fused multiply-add (FMA) execution units, while some (in the ‘client’ line) will get AVX512, but only one FMA unit.

This has some implications for our applications. If you consider purely peak theoretical GFLOPS, then client CPUs will deliver numbers equivalent (assuming the same base operating frequency) to those of previous processors with AVX2, while server chips will get double the peak performance. That is because current AVX2 CPUs already have two FMA units, but their pipeline is only 256 bits wide.
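To make that arithmetic concrete, here is a minimal sketch in C of how peak SP FP throughput per core falls out of vector width, FMA unit count and clock speed. The 2.0 GHz figure is purely illustrative and not tied to any particular part.

```c
#include <stdio.h>

/* Peak single-precision GFLOPS per core:
 *   (vector bits / 32 bits per float) lanes
 *   x 2 FLOPs per FMA (one multiply + one add)
 *   x number of FMA units
 *   x clock frequency in GHz
 */
static double peak_sp_gflops(int vector_bits, int fma_units, double ghz)
{
    int lanes = vector_bits / 32;          /* SP floats per vector register */
    return lanes * 2.0 * fma_units * ghz;  /* FLOPs per cycle x GHz */
}

int main(void)
{
    double ghz = 2.0;  /* placeholder base frequency, same for all cases */

    /* AVX2, 256-bit, two FMA units: 8 x 2 x 2 = 32 FLOPs/cycle */
    printf("AVX2,   2 FMA: %.0f GFLOPS/core\n", peak_sp_gflops(256, 2, ghz));

    /* AVX512 'client', one FMA unit: 16 x 2 x 1 = 32 FLOPs/cycle */
    printf("AVX512, 1 FMA: %.0f GFLOPS/core\n", peak_sp_gflops(512, 1, ghz));

    /* AVX512 'server', two FMA units: 16 x 2 x 2 = 64 FLOPs/cycle */
    printf("AVX512, 2 FMA: %.0f GFLOPS/core\n", peak_sp_gflops(512, 2, ghz));

    return 0;
}
```

At the same clock, the single-FMA AVX512 case and the dual-FMA AVX2 case both work out to 32 FLOPs per cycle per core; only the dual-FMA AVX512 case doubles that.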

However, a developer needs to consider how achievable this performance is: it assumes that the algorithm being executed can issue back-to-back dual FMA operations, which is only possible in certain circumstances.

Our initial testing has shown that we can achieve the doubling in performance when executing a complex matrix product which heavily utilizes FMA instructions. Other algorithms may not fare so well depending on the instruction mix. On the other hand, some algorithms will run faster on a 512-bit pipeline with one FMA unit than on a 256-bit pipeline with two FMA units. As the saying goes: your mileage may vary.
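To illustrate the kind of FMA-heavy inner loop we are talking about, here is a minimal sketch using AVX512 intrinsics. This is not our benchmark code or a complex matrix product, just a real-valued dot product showing the multiply-accumulate pattern, with two independent accumulators so that back-to-back FMAs can be issued to both units on parts that have them (a production kernel would typically unroll further to hide FMA latency).

```c
#include <immintrin.h>   /* AVX512F intrinsics */
#include <stddef.h>

/* Single-precision dot product; n is assumed to be a multiple of 32. */
float dot_avx512(const float *a, const float *b, size_t n)
{
    __m512 acc0 = _mm512_setzero_ps();
    __m512 acc1 = _mm512_setzero_ps();

    for (size_t i = 0; i < n; i += 32) {
        __m512 a0 = _mm512_loadu_ps(a + i);
        __m512 b0 = _mm512_loadu_ps(b + i);
        __m512 a1 = _mm512_loadu_ps(a + i + 16);
        __m512 b1 = _mm512_loadu_ps(b + i + 16);
        acc0 = _mm512_fmadd_ps(a0, b0, acc0);   /* acc0 += a0 * b0 */
        acc1 = _mm512_fmadd_ps(a1, b1, acc1);   /* acc1 += a1 * b1 */
    }

    /* Horizontal sum of the two accumulators */
    float tmp[16];
    _mm512_storeu_ps(tmp, _mm512_add_ps(acc0, acc1));
    float sum = 0.0f;
    for (int i = 0; i < 16; i++)
        sum += tmp[i];
    return sum;
}
```

Built with an AVX512-capable compiler (for example with -mavx512f), the loop body should issue two vector FMAs per iteration; whether those sustain peak rate in practice depends on the memory system keeping the loads fed.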

Enter ARM

What, then, is going on in the non-Intel world? To date, this has been dominated mostly by Power Architecture. PowerPC was long the chip of choice for power-efficient processing thanks to the AltiVec vector engine, and for applications requiring safety certification thanks to architectural features such as its memory structure. A hiatus in the availability of AltiVec reduced its attractiveness for signal processing for a few years, and now we are faced with a sparse roadmap.

Enter ARM. Like Power, ARM is a family of RISC processor architectures licensed to and manufactured by a number of vendors, including NXP/Qualcomm, NVIDIA (in its Tegra SoCs), TI, Broadcom, and many more. In addition, FPGA vendors including Xilinx, Altera/Intel, and Microsemi have ARM cores hardwired into some of their system-on-chip products.

ARM devices can include the NEON SIMD extension, which provides 128-bit-wide processing of various data types including SP FP. Some include the Mali integrated graphics processor, which can support GPGPU processing via OpenCL as well as graphics with OpenGL. Add to this large core counts, multiple PCIe lanes, multiple Ethernet ports supporting rates up to 100GbE, and DDR4 memory support; do all this in a power envelope of just over 30W, and it gets interesting.
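For comparison, the same multiply-accumulate pattern on a NEON-capable ARMv8 core works on 128-bit vectors, four SP floats at a time. A minimal sketch follows; again, this is just an illustrative dot product, not code from a shipping library.

```c
#include <arm_neon.h>   /* NEON intrinsics */
#include <stddef.h>

/* Single-precision dot product on 128-bit NEON vectors (4 floats per
 * register); n is assumed to be a multiple of 4.
 */
float dot_neon(const float *a, const float *b, size_t n)
{
    float32x4_t acc = vdupq_n_f32(0.0f);

    for (size_t i = 0; i < n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        acc = vfmaq_f32(acc, va, vb);   /* acc += va * vb (fused) */
    }

    /* Horizontal sum of the four lanes */
    float32x2_t lo = vget_low_f32(acc);
    float32x2_t hi = vget_high_f32(acc);
    float32x2_t s  = vadd_f32(lo, hi);
    return vget_lane_f32(vpadd_f32(s, s), 0);
}
```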

It should also be noted that many ARM implementations include features that are expected of modern, secure systems, such as hardware support for virtualization, secure boot, and so on.

Abaco has a rich portfolio of board and system products based on high-performance Intel processors and power-efficient ARM processors to support a wide variety of rugged, embedded applications. We would love to hear what you need for your next program.

Peter Thompson

Peter Thompson is Vice President, Product Management at Abaco Systems. He first started working on High Performance Embedded Computing systems when a 1 MFLOP machine was enough to give him a hernia while carrying it from the parking lot to a customer’s lab. He is now very happy to have 27,000 times more compute power in his phone, which weighs considerably less.