XLA on CPU -- where do the gains come from?

I understand that XLA performs automatic kernel fusion for a computational graph, which comes in handy for reducing memory bandwidth usage on a GPU. What gains can one derive from using XLA on a CPU? Is it the same principle, fusing computations and not writing intermediate results out to memory? I would appreciate a layman's explanation.
Yes, basically it's what you said.
In general, the more information (or "context") you, as a compiler, have about a set of computations, the better you can optimize them.
As pointed out on the XLA page, the single most important feature of XLA is fusion.
Instead of computing x + y*z as two separate operations, it can be computed as a single fused multiply-add (FMA) operation. This is not only (generally) faster, but it also avoids an intermediate result, which may have lower precision and would need to be stored somewhere.
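As a toy illustration, here is a minimal sketch (assuming TensorFlow 2.x; whether the CPU backend actually emits an FMA instruction depends on your hardware and XLA version):

    import tensorflow as tf

    @tf.function(jit_compile=True)  # ask TensorFlow to compile this with XLA
    def fma(x, y, z):
        # XLA can lower this to a single fused multiply-add, so the product
        # y * z is never rounded and materialized as a separate tensor.
        return x + y * z

    x, y, z = (tf.random.normal([1_000_000]) for _ in range(3))
    print(fma(x, y, z)[:5])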
Without XLA, TensorFlow probably works by taking a set of data from memory, applying one of a predefined set of kernels to it, and storing each partial result back in memory so the next kernel can consume it.
With XLA, linear algebra patterns are recognized and optimized further by combining multiple kernels into one, avoiding unnecessary round trips to memory.
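If you are curious, you can ask TensorFlow to dump the HLO that XLA produced and watch several element-wise kernels collapse into one fusion. A sketch, assuming TensorFlow 2.x (experimental_get_compiler_ir is, as the name says, experimental, and the stage names may vary by version):

    import tensorflow as tf

    @tf.function(jit_compile=True)
    def pipeline(x):
        # Three element-wise ops; unfused, each would write a full
        # intermediate tensor to memory and read it back.
        return tf.tanh(tf.exp(x) * 2.0 + 1.0)

    x = tf.random.normal([1024, 1024])
    # Dump the XLA-optimized HLO; the ops show up as a single "fusion".
    print(pipeline.experimental_get_compiler_ir(x)(stage="optimized_hlo"))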
Modern mainstream CPUs support "vectors" (in jargon: SIMD), and some support linear algebra operations much as GPUs do.
So yes, it's the same principle (though GPUs can do far more linear algebra operations in parallel, so the gain is bigger there).
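To see the effect on CPU yourself, a rough sketch like the following compares the same chain of operations with and without XLA (illustrative only; timings depend on hardware, TensorFlow version, and tensor size):

    import timeit
    import tensorflow as tf

    x = tf.random.normal([4_000_000])

    def softplus_ish(t):
        # A chain of element-wise ops that XLA can fuse into one SIMD loop.
        return tf.math.log1p(tf.exp(-tf.abs(t))) + tf.maximum(t, 0.0)

    graph = tf.function(softplus_ish)                    # graph mode, no XLA
    fused = tf.function(softplus_ish, jit_compile=True)  # graph mode + XLA

    graph(x); fused(x)  # warm up (triggers tracing and compilation)
    print("without XLA:", timeit.timeit(lambda: graph(x), number=100))
    print("with XLA:   ", timeit.timeit(lambda: fused(x), number=100))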