How to call _mm256_mul_ph from Rust?

_mm256_mul_ps is the Intel intrinsic for "Multiply packed single-precision (32-bit) floating-point elements". _mm256_mul_ph is the intrinsic for "Multiply packed half-precision (16-bit) floating-point elements".

I can call _mm256_mul_ps from Rust with use std::arch::x86_64::*;, e.g.

#[inline]
fn mul(v: __m256, w: __m256) -> __m256 {
     unsafe { _mm256_mul_ps(v, w) }
}

However, it appears to be difficult to call _mm256_mul_ph. Can one call _mm256_mul_ph in Rust?

There are 2 best solutions below

1
kmdreko On BEST ANSWER

What you are looking for requires the AVX-512 FP16 and AVX-512 VL instruction sets - the former of which does not appear to have any support within Rust at the moment.

You can potentially create your own intrinsics as-needed by using the asm! macro. The assembly for _mm256_mul_ph looks like this:

vmulph ymm, ymm, ymm

So the equivalent in Rust would be written like so:

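use std::arch::asm;
use std::arch::x86_64::__m256i;

// NOTE: "avx2" is only a stand-in gate; see the caveats below.
#[cfg(target_feature = "avx2")]
unsafe fn _mm256_mul_ph(a: __m256i, b: __m256i) -> __m256i {
    let dst: __m256i;

    // vmulph dst, a, b: lane-wise multiply of 16 packed f16 values.
    asm!(
        "vmulph {0}, {1}, {2}",
        out(ymm_reg) dst,
        in(ymm_reg) a,
        in(ymm_reg) b,
        options(pure, nomem, nostack),
    );

    dst
}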

To make your own intrinsics for other instructions, follow the guidelines for Rust's inline assembly and make sure you understand exactly what the instructions do. Inline assembly is unsafe and can cause very weird behavior if specified improperly.

Caveats:

  • This only works because the underlying machinery (LLVM 17 as of Rust 1.76) supports this instruction set. If you try this method with a brand new instruction set (or with an older toolchain), it may fail to compile with an "invalid instruction" error.

  • __m256i is used in lieu of a __m256h type since the latter does not exist. Currently __m256i is used as a "bag of bits" (documentation's words) so you must keep track yourself that it is holding f16x16 values.

  • The target_feature = "avx2" condition is woefully inadequate for properly limiting it to targets that can actually run this function. There is no avx512fp16 target feature flag available within Rust (_mm256_mul_ph in particular needs the avx512vl target feature flag as well but that also is not supported).

    You will need to either make sure that you only compile and run this code on architectures that support it (currently Intel Sapphire Rapids CPUs, as far as I can tell), or, perhaps better, introduce your own compilation flag which, while not perfect, would hopefully limit the scope of things going wrong.

    If compiled and executed on an unsupported architecture, you'll receive an "illegal instruction" error (in the best case).
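The "only run it where it is supported" caveat can be partly covered by a runtime check before calling the hand-written intrinsic. This is a minimal sketch, assuming an x86_64 target. The name has_avx512_fp16_vl is a hypothetical helper, not a std API. It uses is_x86_feature_detected! for AVX-512F and AVX-512VL (which also accounts for OS support of the AVX-512 register state) and reads the AVX512_FP16 bit directly from CPUID (leaf 7, subleaf 0, EDX bit 23), on the assumption that the toolchain in use cannot detect that feature by name.

```rust
/// Hypothetical helper: true if the CPU reports AVX-512F, AVX-512VL and
/// AVX512_FP16, which is what vmulph on ymm registers needs.
#[cfg(target_arch = "x86_64")]
pub fn has_avx512_fp16_vl() -> bool {
    use std::arch::x86_64::__cpuid_count;

    // The std macro checks both the CPU flags and that the OS has enabled
    // the AVX-512 register state.
    if !(is_x86_feature_detected!("avx512f") && is_x86_feature_detected!("avx512vl")) {
        return false;
    }

    // CPUID leaf 7, subleaf 0: EDX bit 23 is AVX512_FP16.
    // (`unsafe` is required on older toolchains and merely redundant on newer ones.)
    #[allow(unused_unsafe)]
    let leaf7 = unsafe { __cpuid_count(7, 0) };
    (leaf7.edx >> 23) & 1 == 1
}

/// On other architectures the instruction can never be available.
#[cfg(not(target_arch = "x86_64"))]
pub fn has_avx512_fp16_vl() -> bool {
    false
}
```

A caller would branch on has_avx512_fp16_vl() once and fall back to a scalar or f32 path when it returns false, rather than hitting an "illegal instruction" at run time.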

2
Arseni Kavalchuk On

Another option is to use FFI: call the intrinsics from compiled C code and link that code with Rust.

Create a helper C file with a wrapper function and include immintrin.h:

src/simd_helper.c

#include <immintrin.h>

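// _mm256_mul_ph operates on __m256h; __m256i is used at the boundary
// to match the Rust declaration, so cast in and out.
__m256i call_mm256_mul_ph(__m256i a, __m256i b) {
    __m256h product = _mm256_mul_ph(_mm256_castsi256_ph(a), _mm256_castsi256_ph(b));
    return _mm256_castph_si256(product);
}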

__m512d call_mm512_mul_pd(__m512d a, __m512d b) {
    return _mm512_mul_pd(a, b);
}

In your Rust source declare this helper function as external:

src/main.rs

#![feature(simd_ffi)]
#![feature(stdarch_x86_avx512)]
// Some types and functions are unstable and require
// adding the features above and using the nightly toolchain
use std::arch::x86_64::{
    __m256, __m256i, __m512d, _mm256_castps_si256, _mm256_loadu_ps, _mm512_set1_pd,
};

// Declare the helper functions as extern "C"
extern "C" {
    fn call_mm256_mul_ph(a: __m256i, b: __m256i) -> __m256i;
    fn call_mm512_mul_pd(a: __m512d, b: __m512d) -> __m512d;
}

fn main() {
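    // example 1: works only on CPUs with AVX512_FP16.
    // Note: the casts below reinterpret the f32 bits as f16 lanes without
    // converting them, so this only demonstrates the call itself.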
    unsafe {
        let a: [f32; 8] = [1.2, 1.2, 1.2, 1.2, 1.2, 1.2, 1.2, 1.2];
        let fp32_v1: __m256 = _mm256_loadu_ps(a.as_ptr());
        let fp32_pack_v1: __m256i = _mm256_castps_si256(fp32_v1);

        let b: [f32; 8] = [2.1, 2.1, 2.1, 2.1, 2.1, 2.1, 2.1, 2.1];
        let fp32_v2: __m256 = _mm256_loadu_ps(b.as_ptr());
        let fp32_pack_v2: __m256i = _mm256_castps_si256(fp32_v2);

        let result = call_mm256_mul_ph(fp32_pack_v1, fp32_pack_v2);
        println!("{:?}", result);
    }
    // example 2: for AVX512F, to check that the approach works
    unsafe {
        let a: __m512d = _mm512_set1_pd(1.2);
        let b: __m512d = _mm512_set1_pd(2.1);
        let result = call_mm512_mul_pd(a, b);
        println!("{:?}", result);
    }
}

Note the crate-level feature attributes required for this example to work.
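Because __m256i is only a bag of bits, the inputs to call_mm256_mul_ph have to contain real f16 values to get meaningful products. One way to prepare them on stable Rust is the F16C conversion intrinsics. This is a sketch under assumptions: pack_f16x16, unpack_f16x16 and roundtrip_f16 are hypothetical helper names, and it only converts between f32 and f16; it does not perform the FP16 multiply itself.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Converts 16 f32 values to f16 and packs them into one __m256i
/// (lanes 0..8 in the low half, lanes 8..16 in the high half).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx,f16c")]
unsafe fn pack_f16x16(values: &[f32; 16]) -> __m256i {
    let lo = _mm256_cvtps_ph::<_MM_FROUND_TO_NEAREST_INT>(_mm256_loadu_ps(values.as_ptr()));
    let hi = _mm256_cvtps_ph::<_MM_FROUND_TO_NEAREST_INT>(_mm256_loadu_ps(values.as_ptr().add(8)));
    _mm256_set_m128i(hi, lo)
}

/// Converts 16 packed f16 values back to f32.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx,f16c")]
unsafe fn unpack_f16x16(v: __m256i) -> [f32; 16] {
    let mut out = [0.0f32; 16];
    let lo = _mm256_castsi256_si128(v);
    let hi = _mm256_extractf128_si256::<1>(v);
    _mm256_storeu_ps(out.as_mut_ptr(), _mm256_cvtph_ps(lo));
    _mm256_storeu_ps(out.as_mut_ptr().add(8), _mm256_cvtph_ps(hi));
    out
}

/// Safe wrapper: returns None when AVX or F16C is not available at run time.
pub fn roundtrip_f16(values: &[f32; 16]) -> Option<[f32; 16]> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") && is_x86_feature_detected!("f16c") {
            // SAFETY: the required CPU features were just detected.
            return Some(unsafe { unpack_f16x16(pack_f16x16(values)) });
        }
    }
    None
}
```

With helpers like these, the two operands would be built with pack_f16x16, passed to call_mm256_mul_ph, and the result read back with unpack_f16x16. Values that are not exactly representable in f16 are rounded during packing.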

To build this, you need to compile the C sources with the corresponding AVX flags. You can compile C sources directly in the Rust project with the help of the cc crate, which compiles C/C++ files into a static library that you can link with Rust, and a custom build.rs file:

Cargo.toml

[package]
name = "intrinsic"
version = "0.1.0"
edition = "2021"

[dependencies]

[build-dependencies]
cc = "1.0"

[[bin]]
name = "mul_vec"
path = "src/main.rs"

build.rs

fn main() {
    println!("cargo:rerun-if-changed=src/simd_helper.c");
    // Use the `cc` crate to build a C file and statically link it.
    cc::Build::new()
        .file("src/simd_helper.c")
        .flag("-mavx512f")
        .flag("-mavx512fp16")
        .compile("simd_helper");
    // Tell cargo to link the object file
    println!("cargo:rustc-link-search=native=.");
    println!("cargo:rustc-link-lib=static=simd_helper");
}

Note that you need to build this with the +nightly flag because it uses unstable features.

cargo +nightly build

I don't have a machine with AVX512_FP16, so I used _mm512_mul_pd as an example to build and run on my machine.

Hope this helps. The full Cargo project is on GitHub.