XLA on CPU -- where do the gains come from?

I understand that XLA performs automatic kernel fusion for a computational graph, which comes in handy for reducing memory bandwidth usage on a GPU. What gains can one derive from using XLA on a CPU? Is it the same principle, fusing computations and not writing intermediate results out to memory? I would appreciate a layman's explanation.
Yes, basically it's what you said.
In general, the more information (or "context") you, as a compiler, have about a set of computations, the better you can optimize them.
As pointed out on the XLA page, the single most important feature of XLA is fusion.
Instead of computing x + y*z as two separate operations, it can be computed as a single fused multiply-add (FMA) operation. This is not only (generally) faster, but it also avoids an intermediate result, which may have lower precision and would need to be stored somewhere.
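As a toy illustration, here is a minimal sketch (assuming TensorFlow 2.x; whether the CPU backend actually emits an FMA instruction depends on your hardware and XLA version):

    import tensorflow as tf

    @tf.function(jit_compile=True)  # ask TensorFlow to compile this with XLA
    def fma(x, y, z):
        # XLA can lower this to a single fused multiply-add, so the product
        # y * z is never rounded and materialized as a separate tensor.
        return x + y * z

    x, y, z = (tf.random.normal([1_000_000]) for _ in range(3))
    print(fma(x, y, z)[:5])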
Without XLA, TensorFlow probably works by taking a set of data from memory, applying one of a predefined set of kernels to it, and storing each partial result back in memory so the next kernel can consume it.
With XLA, linear algebra patterns are recognized and optimized further by combining multiple kernels into one, avoiding unnecessary round trips to memory.
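If you are curious, you can ask TensorFlow to dump the HLO that XLA produced and watch several element-wise kernels collapse into one fusion. A sketch, assuming TensorFlow 2.x (experimental_get_compiler_ir is, as the name says, experimental, and the stage names may vary by version):

    import tensorflow as tf

    @tf.function(jit_compile=True)
    def pipeline(x):
        # Three element-wise ops; unfused, each would write a full
        # intermediate tensor to memory and read it back.
        return tf.tanh(tf.exp(x) * 2.0 + 1.0)

    x = tf.random.normal([1024, 1024])
    # Dump the XLA-optimized HLO; the ops show up as a single "fusion".
    print(pipeline.experimental_get_compiler_ir(x)(stage="optimized_hlo"))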
Modern mainstream CPUs support "vectors" (in jargon: SIMD), and some support linear algebra operations much as GPUs do.
So yes, it's the same principle (though GPUs can do far more linear algebra operations in parallel, so the gain is bigger there).
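To see the effect on CPU yourself, a rough sketch like the following compares the same chain of operations with and without XLA (illustrative only; timings depend on hardware, TensorFlow version, and tensor size):

    import timeit
    import tensorflow as tf

    x = tf.random.normal([4_000_000])

    def softplus_ish(t):
        # A chain of element-wise ops that XLA can fuse into one SIMD loop.
        return tf.math.log1p(tf.exp(-tf.abs(t))) + tf.maximum(t, 0.0)

    graph = tf.function(softplus_ish)                    # graph mode, no XLA
    fused = tf.function(softplus_ish, jit_compile=True)  # graph mode + XLA

    graph(x); fused(x)  # warm up (triggers tracing and compilation)
    print("without XLA:", timeit.timeit(lambda: graph(x), number=100))
    print("with XLA:   ", timeit.timeit(lambda: fused(x), number=100))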