Six months ago Flex Logix flagged up a move into neural network acceleration with plans for an nnMAX 512 licensable tile (see Flex Logix tips move into NN processing). However, prior to tape-out, and in response to conversations with potential customers, Flex Logix has decided to increase the size of the tile to 1,024 DSP multiply-accumulate (MAC) units.
The result is an nnMAX tile of 1,024 MACs with local SRAM that, in a 16nm FinFET process, has a peak performance of approximately 2.1 TOPS. nnMAX tiles can be arrayed into NxN grids of any size without any GDS change, with varying amounts of SRAM as needed to optimize for the target neural network model, scaling up to more than 100 TOPS peak performance.
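The quoted per-tile figure is consistent with a simple back-of-the-envelope check, assuming the usual convention that one multiply-accumulate counts as two operations and using the 1.067 GHz TSMC16FFC clock frequency Flex Logix quotes for the InferX X1:

```python
# Sanity check of the ~2.1 TOPS peak-performance figure for one nnMAX tile.
# Assumption (not stated explicitly in the article): one MAC = 2 ops/cycle,
# the standard convention for quoting TOPS.
macs_per_tile = 1024          # MACs in the enlarged nnMAX tile
clock_hz = 1.067e9            # 1.067 GHz on TSMC16FFC (from the figure caption)
ops_per_mac = 2               # multiply + accumulate

tops_per_tile = macs_per_tile * clock_hz * ops_per_mac / 1e12
print(f"peak: {tops_per_tile:.2f} TOPS per tile")
```

This gives roughly 2.19 TOPS, which Flex Logix rounds to "approximately 2.1 TOPS".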
The reason for the change is a piece of mathematics known as the Winograd transform and how it applies to convolutional neural networks (CNNs), according to Geoff Tate, CEO of Flex Logix. It turns out that the Winograd transform, when applied to CNNs, can provide superior efficiency and speed up calculations, but it also requires clusters of 16 MACs close together.
There are implications for loss of resolution, so to preserve INT8 accuracy the Winograd calculations are done with 12-bit resolution.
The result was that a slightly larger tile was desirable, but for 3x3 convolutions with a stride of one – which can represent 75 percent of CNN operations – the transform provides a speed-up of about 2.25x.
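The 2.25x figure matches the multiply count of the classic Winograd F(2x2, 3x3) algorithm, which produces a 2x2 output tile of a stride-one 3x3 convolution using only 16 elementwise multiplies – note the match with the 16-MAC clusters mentioned above – where a direct computation needs 36, a 36/16 = 2.25x reduction. The pure-Python sketch below illustrates the idea with the standard Lavin-Gray transform matrices; it is an illustration of the general technique, not Flex Logix's implementation:

```python
# Winograd F(2x2, 3x3): compute a 2x2 output tile of a 3x3, stride-1
# convolution (CNN-style correlation) from a 4x4 input tile with 16
# elementwise multiplies instead of the direct method's 36 (2.25x fewer).

def matmul(a, b):
    """Plain-Python multiply for small matrices (lists of lists)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

# Fixed transform matrices for F(2x2, 3x3).
B_T = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G   = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]
A_T = [[1, 1, 1, 0], [0, 1, -1, -1]]

def winograd_2x2_3x3(d, g):
    """d: 4x4 input tile, g: 3x3 filter -> 2x2 output tile."""
    U = matmul(matmul(G, g), transpose(G))        # filter transform
    V = matmul(matmul(B_T, d), transpose(B_T))    # input transform
    M = [[U[i][j] * V[i][j] for j in range(4)]    # the only multiplies
         for i in range(4)]                       # that scale with data: 16
    return matmul(matmul(A_T, M), transpose(A_T)) # output transform

def direct_2x2_3x3(d, g):
    """Reference: direct correlation, 36 multiplies for the same tile."""
    return [[sum(d[i + u][j + v] * g[u][v]
                 for u in range(3) for v in range(3))
             for j in range(2)] for i in range(2)]
```

The transform matmuls use only additions and halvings of constants, so in hardware the data-dependent multiplier count is what matters; the 12-bit internal resolution mentioned above compensates for the extra dynamic range the transforms introduce.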
InferX X1 edge inference coprocessor: 1.067GHz clock frequency on TSMC16FFC. Source: Flex Logix Technologies Inc.
The InferX X1 edge coprocessor will have four such tiles and a single 32-bit wide LPDDR4 interface to DRAM.