
For the past few years, the battle over AI, deep learning, and other HPC (High-Performance Computing) workloads has been mostly a two-horse race. It's between Nvidia, the first company to launch a GPGPU architecture that could theoretically handle such workloads, and Intel, which has continued to focus on increasing the number of FLOPS its Core processors can handle per clock cycle. AMD is ramping up its own Radeon Instinct and Vega Frontier Edition cards to tackle AI as well, though the company has yet to win much market share in that arena. But now there's an emerging fourth player: Fujitsu.

Fujitsu's new DLU (Deep Learning Unit) is meant to be 10x faster than existing solutions from its competitors, with support for Fujitsu's torus interconnect. It's not clear if this refers to Tofu (torus fusion) 1, which the existing K computer uses, or if the platform will also support Tofu2, which improves bandwidth from 40Gbps to 100Gbps (from 5GB/s to 12.5GB/s). Tofu2 would seem to be the much better option, but Fujitsu hasn't clarified that point yet.
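The parenthetical figures are just the same link rates expressed in bytes rather than bits; dividing the gigabit rate by 8 gives the gigabyte rate:

```python
# Sanity check on the quoted Tofu link rates: gigabits per second
# divided by 8 bits per byte gives gigabytes per second.
for gbps in (40, 100):  # Tofu1 and Tofu2 link rates
    print(f"{gbps} Gbps = {gbps / 8:.1f} GB/s")  # prints 5.0 and 12.5 GB/s
```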

Fujitsu DLU overview

Underneath the DLU is an unspecified number of DPUs (Deep Learning Processing Units). The DPUs are capable of running FP32, FP16, INT16, and INT8 data types. According to Top500, Fujitsu has previously demonstrated that INT8 can be used without a significant loss of accuracy. Depending on the design specs, this may be one way Fujitsu hopes to hit its performance-per-watt targets.
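Fujitsu hasn't published how its INT8 support works, but the general idea behind low-precision inference is straightforward: map FP32 values onto the INT8 range with a scale factor, and accept a small rounding error. Here's a minimal sketch, assuming simple symmetric per-tensor quantization (the function names and scaling scheme are illustrative, not Fujitsu's actual design):

```python
# Illustrative sketch of symmetric linear quantization, the general
# technique behind low-precision inference. Fujitsu hasn't disclosed
# its scheme; this only shows why INT8 can preserve accuracy.
import numpy as np

def quantize_int8(x: np.ndarray):
    """Map float32 values onto the INT8 range [-127, 127]."""
    scale = np.abs(x).max() / 127.0  # one scale factor per tensor (assumed)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 values from the INT8 codes."""
    return q.astype(np.float32) * scale

weights = np.random.randn(1000).astype(np.float32)
q, scale = quantize_int8(weights)
error = np.abs(weights - dequantize(q, scale)).mean()
print(f"mean absolute rounding error: {error:.5f}")
```

For typical weight distributions, the mean rounding error is tiny relative to the weights themselves, which is why INT8 can hold accuracy while moving a quarter of the data FP32 requires per value.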

Here's what we know about the underlying design:

DLU design.

Each of the DPUs contains 16 DPEs (Deep Learning Processing Elements), and each DPE has 8 SIMD units with a very large register file (no cache) under software control. The entire DPU is controlled by a separate master core, which manages execution and memory access between the DPU and its on-chip memory controller.

So just to clarify: The DLU is the entire silicon chip, including memory, register files, everything. DPUs are controlled by a separate master controller and negotiate memory accesses. The DPUs are made up of DPEs with their eight SIMD units, and this is where the number crunching takes place. At a very high level, we've seen both AMD and Nvidia use similar ways of grouping resources into compute units (CUs), with certain resources duplicated per CU and each CU having an associated number of cores.
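To make the nesting concrete, here's a toy model of that hierarchy. The DPE and SIMD counts come from Fujitsu's disclosures; the per-chip DPU count has not been announced, so the configurations below are purely hypothetical:

```python
# Toy model of the DLU hierarchy described above: a DLU holds some
# number of DPUs, each DPU holds 16 DPEs, and each DPE has 8 SIMD units.
# The DPU count is an assumption; Fujitsu hasn't disclosed it.
DPES_PER_DPU = 16   # stated by Fujitsu
SIMD_PER_DPE = 8    # stated by Fujitsu

def total_simd_units(dpu_count: int) -> int:
    """Total SIMD units on the chip for a hypothetical DPU count."""
    return dpu_count * DPES_PER_DPU * SIMD_PER_DPE

for dpus in (4, 8, 16):  # hypothetical chip configurations
    print(f"{dpus:2d} DPUs -> {total_simd_units(dpus):4d} SIMD units")
```

Presumably, varying the DPU count is how Fujitsu would scale the part across power envelopes, much as AMD and Nvidia vary CU and SM counts across a GPU family.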

Fujitsu is already planning a second-generation core that will be embedded directly alongside a CPU, rather than being a discrete off-chip component. The company hopes to have the first-generation device ready for sale sometime in 2022, with no firm date given for the introduction of the second-gen device.