I need to do a lexicographic comparison of a small number of small unsigned integers. If there are (for example) 8 8-bit integers, the obvious approach is to byteswap them and do an ordinary integer compare in a GPR. If there are 2 32-bit integers, a 32-bit rotate and an ordinary compare will do the trick. What if there are 4 16-bit integers? Obviously with a vector register it is easy to shuffle them, but is there an efficient approach—either to reversing their order, or to doing the compare without reversing order—using only GPR?
Performantly reverse the order of 16-bit quantities within a 64-bit word
189 Views Asked by Moonchild At
1
There are 1 best solutions below
Related Questions in PERFORMANCE
- Upsert huge amount of data by EFCore.BulkExtensions
- How can I resolve this error and work smoothly in deep learning?
- Efficiently processing many small elements of a collection concurrently in Java
- Theme Preloader for speed optimization in WordPress
- I need help to understand the time wich my simple ''hello world'' is taking to execute
- Non-blocking state update
- Do conditional checks cause bottlenecks in Javascript?
- Performance of sketch drastically decreases outside of the P5 Web Editor
- sample query for review for improvement on big query
- Is there an indexing strategy in Postgres which will operate effectively for JOINs with ORs
- Performance difference between two JavaScript code snippets for comparing arrays of strings
- C++ : Is there an objective universal way to compare the speed of iterative algorithms?
- How to configure api http request with load testing
- the difference in terms of performance two types of update in opensearch
- Sveltekit : really long to send the first page and intense CPU computation
Related Questions in ASSEMBLY
- Is there some way to use printf to print a horizontal list of decrementing hex digits in NASM assembly on Linux
- How to call a C language function from x86 assembly code?
- Binary Bomb Phase 2 - Decoding Assembly
- AVR Assembly Clock Cycle
- Understanding the differences between mov and lea instructions in x86 assembly
- ARM Assembly code is not executing in Vitis IDE
- Which version of ARM does the M1 chip run on?
- Why would %rbp not be equal to the value of %rsp, which is 0x28?
- Move immediate 8-bit value into RSI, RDI, RSP or RBP
- Unable to run get .exe file from assembly NASM
- DOSbox automatically freezes and crashes without any prompt warnings
- Load function written in amd64 assembly into memory and call it
- link.exe unresolved external symbol _mainCRTStartup
- x86 Wrote a boot loader that prints a message to the screen but the characters are completely different to what I expected
- running an imf file using dosbox in parallel to a game
Related Questions in OPTIMIZATION
- Optimize LCP ReactJs
- Efficiently processing many small elements of a collection concurrently in Java
- How to convert the size of the HTML document from 68 Kb to the average of 33 Kb?
- Optimizing Memory-Bound Loop with Indirect Prefetching
- Google or-tools soft constraint issue
- How to find function G(x), and make for every x, G(x) always returns fixed point for another function F(G(x))
- Trying to sort a set of words with the information theory to solve Worlde in Python but my program is way to slow
- Do conditional checks cause bottlenecks in Javascript?
- Hourly and annual optimization problem over matrix
- Sending asynchronous requests without a pre-defined task list
- DBT - Using SELECT * in the staging layer
- Using `static` on a AVX2 counter function increases performance ~10x in MT environment without any change in Compiler optimizations
- Is this a GCC optimiser bug or a feature?
- Performance difference between two JavaScript code snippets for comparing arrays of strings
- Distribute a list of positive numbers into a desired number of sets, aiming to have sums as close as possible between them
Related Questions in X86-64
- What is causing the store latency in this program?
- Move immediate 8-bit value into RSI, RDI, RSP or RBP
- What is Win32 x86-64 CONTEXT::VectorRegister for?
- Why does MSVC never return struct in RAX for member-functions?
- How to change UP (direction) flag in x86 assembly to 1?
- docker inspect splunkImage Container ID: Warining: cannot create \"/opt/splunk/var/log/splunk
- Infinite loop while trying to print numbers 1 to 10 in assembly x86 64 bits
- Get the address and size of a loaded shared object on memory from C
- What a reason for C2148 or similar errors on another compilers?
- In a Linux signal handler, will x86 extended state always be in XSAVE format, or can it be in XSAVEC format as well?
- ASM register-variable from existing register-value in clang
- Smallest possible 64-bit MASM GUI application not working correctly
- How do I fix the jsonobject architecture problem I am having in PyCharm CE when the terminal says the package is installed?
- x86 Assembly: handling exponent 1 in power calculation
- How to navigate to the structure definition for the target architecture when cross-compiling on Ubuntu with VS Code?
Related Questions in SWAR
- Can packing variables or parameters into structures/unions introduce unforseen performance penalties?
- Speed up strlen using SWAR in x86-64 assembly
- How to check if a register contains a zero byte without SIMD instructions
- Add two vectors (uint64_t type) with saturation for each int8_t element
- SIMD-within-a-register version of min/max
- Fastest way to find 16bit match in a 4 element short array?
- Multiplication of two packed signed integers in one
- Performantly reverse the order of 16-bit quantities within a 64-bit word
- bit twiddling to right pack bits
- How to implement SWAR unsigned less-than?
- How to write a SWAR comparison which puts 0xFF in a lane on matches?
- SWAR byte counting methods from 'Bit Twiddling Hacks' - why do they work?
- Can a register hold multiple values at a time?
- Subtracting packed 8-bit integers in an 64-bit integer by 1 in parallel, SWAR without hardware SIMD
- Compare 64-bit integers by segments
Trending Questions
- UIImageView Frame Doesn't Reflect Constraints
- Is it possible to use adb commands to click on a view by finding its ID?
- How to create a new web character symbol recognizable by html/javascript?
- Why isn't my CSS3 animation smooth in Google Chrome (but very smooth on other browsers)?
- Heap Gives Page Fault
- Connect ffmpeg to Visual Studio 2008
- Both Object- and ValueAnimator jumps when Duration is set above API LvL 24
- How to avoid default initialization of objects in std::vector?
- second argument of the command line arguments in a format other than char** argv or char* argv[]
- How to improve efficiency of algorithm which generates next lexicographic permutation?
- Navigating to the another actvity app getting crash in android
- How to read the particular message format in android and store in sqlite database?
- Resetting inventory status after order is cancelled
- Efficiently compute powers of X in SSE/AVX
- Insert into an external database using ajax and php : POST 500 (Internal Server Error)
Popular # Hahtags
Popular Questions
- How do I undo the most recent local commits in Git?
- How can I remove a specific item from an array in JavaScript?
- How do I delete a Git branch locally and remotely?
- Find all files containing a specific text (string) on Linux?
- How do I revert a Git repository to a previous commit?
- How do I create an HTML button that acts like a link?
- How do I check out a remote Git branch?
- How do I force "git pull" to overwrite local files?
- How do I list all files of a directory?
- How to check whether a string contains a substring in JavaScript?
- How do I redirect to another webpage?
- How can I iterate over rows in a Pandas DataFrame?
- How do I convert a String to an int in Java?
- Does Python have a string 'contains' substring method?
- How do I check if a string contains a specific word?
For the reverse alone, here's my attempt:
It's convenient to be able to use a 32-bit rotate to swap two adjacent words.
We've got two parallel dependency chains of two uops each, followed by one more uop to merge them.