I have a legacy Windows DLL (written in C++) for which I need to maintain a 32-bit version alongside the 64-bit version. I'm updating the heavy math code with SIMD using Agner Fog's vector class library, and I'm seeing little or no speed improvement in the 32-bit version when compiling with AVX as compared to SSE4.2. I'm aware that 32-bit code only ever has 8 vector registers available, but I'm not clear (after much searching) on exactly what this means when compiling with AVX, AVX2 or AVX512. Are there compiler options (Microsoft or Clang) that will give me some worthwhile speed improvement over SSE4.2 (for simple loops of floating-point operations), or should I just save myself some trouble and compile the 32-bit version with SSE4.2?
Are there any real benefits to compiling a 32-bit version of my DLL with AVX or higher?
294 Views Asked by dts At
I'm answering this question myself even though the question should arguably just be deleted ... maybe it will help someone, sometime.
By the time I got my SIMD code punched up (aligning the memory made a big difference) and fiddled around with MSVC compiler options, my 32-bit build started behaving exactly as expected when comparing no SIMD to SSE4.2, AVX and AVX512. Benchmarking the sample code below showed speed improvement ratios of 48%, 22% and 10% for SSE4.2, AVX and AVX512, respectively, for the 32-bit build.
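For anyone who wants to try the alignment point, here is a minimal sketch of allocating 64-byte-aligned float buffers with C++17 aligned `operator new`. This is not the benchmark code mentioned above. The names `kAlign`, `make_aligned_floats`, `is_aligned` and `scale_add` are hypothetical, and the loop is plain scalar C++ so it compiles without the vector class library. With VCL, buffers aligned like this should let you use the aligned `load_a` / `store_a` members instead of the unaligned `load` / `store`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>

// Assumption: 64 bytes covers SSE (16), AVX (32) and AVX-512 (64) vectors.
constexpr std::size_t kAlign = 64;

// Deleter that matches the aligned operator new[] used below.
struct AlignedDelete {
    void operator()(float* p) const noexcept {
        ::operator delete[](p, std::align_val_t(kAlign));
    }
};
using AlignedFloats = std::unique_ptr<float[], AlignedDelete>;

// Allocate n floats whose start address is a multiple of kAlign.
AlignedFloats make_aligned_floats(std::size_t n) {
    void* raw = ::operator new[](n * sizeof(float), std::align_val_t(kAlign));
    return AlignedFloats(static_cast<float*>(raw));
}

// True if p is a multiple of `alignment` bytes.
bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Simple floating-point loop: out[i] = a[i] * k + b[i].
void scale_add(const float* a, const float* b, float* out,
               std::size_t n, float k) {
    assert(is_aligned(a, kAlign) && is_aligned(b, kAlign) &&
           is_aligned(out, kAlign));
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] * k + b[i];
    }
}
```

Typical use would be `auto a = make_aligned_floats(n);` and passing `a.get()` to the loop. Rounding `n` up to a multiple of the vector width avoids a scalar tail loop, at the cost of a little padding.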
Oddly, the 64-bit build runs much faster with no SIMD but slightly SLOWER than the 32-bit build for all three SIMD options (a good subject for a new question).
I compiled without the /Qpar switch and with /Qvec-report:2 /Qpar-report:2 to verify, to the extent possible, that no auto-vectorization or auto-parallelization was going on.
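When juggling several builds like this, it can help to have each one report what it was actually compiled for. Here is a rough sketch, assuming the usual predefined macros: `__AVX__`, `__AVX2__` and `__AVX512F__` are set by both MSVC (via `/arch`) and Clang, while `__SSE4_2__` is, as far as I know, set by Clang/GCC but traditionally not by MSVC. The function names `simd_target` and `pointer_bits` are hypothetical. VCL also has its own `INSTRSET` macro, which is not used here.

```cpp
#include <string>

// Highest instruction set the compiler was told to target, judged only
// from predefined macros. This says nothing about the CPU it runs on.
std::string simd_target() {
#if defined(__AVX512F__)
    return "AVX512";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE4_2__)
    return "SSE4.2";  // Clang/GCC; MSVC may fall through to SSE2 here
#elif defined(__SSE2__) || defined(_M_X64) || \
      (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    return "SSE2";
#else
    return "none/unknown";
#endif
}

// 32 for a 32-bit build, 64 for a 64-bit build.
int pointer_bits() {
    return static_cast<int>(sizeof(void*) * 8);
}
```

Printing `pointer_bits()` and `simd_target()` at the start of a benchmark run makes it harder to compare the wrong pair of binaries by accident.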