android simd arm
This blog post is the last one of a series exploring SIMD support with Rust on Android. In the previous two posts, I introduced how to compile Rust libraries for Android and detect SIMD instructions supported by the CPU at runtime.
Today, we’ll see how to effectively use the SIMD instructions themselves, and get the most performance out of them. After an introduction on running Rust benchmarks (and unit tests) on Android devices, we’ll measure the performance in various scenarios offered by Rust, and see that the overhead of CPU feature detection can be non-trivial. I’ll then describe various ways to reduce this overhead.
Lastly, I’ll present updated benchmarks on ARM of Horcrux, my Rust implementation of Shamir’s Secret Sharing, and see how they compare to Intel.
We're announcing the start of the Portable SIMD Project Group within the Libs team. This group is dedicated to making a portable SIMD API available to stable Rust users.
In the previous article on auto-vectorization we looked at the different SIMD instruction set families on x86-64. We saw how the target-feature compiler flag and the #[target_feature()] attribute gave us more control over the instructions used in the generated assembly.
There is a related compiler flag target-cpu we didn’t touch on, so it’s worth taking a look at how it affects the generated code.
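As a rough illustration of what these flags change, the sketch below reports which SIMD features the binary was compiled with. The function name `compiled_simd_level` is hypothetical; `cfg!(target_feature = "...")` is evaluated at compile time, so the result depends on flags such as `-C target-feature=+avx2` or `-C target-cpu=native`, not on the machine the program later runs on.

```rust
/// Hypothetical helper: names the highest SIMD level this binary was
/// *compiled* for. It does not inspect the CPU at runtime.
///
/// Example builds that would change the answer:
///   RUSTFLAGS="-C target-feature=+avx2" cargo build --release
///   RUSTFLAGS="-C target-cpu=native"    cargo build --release
pub fn compiled_simd_level() -> &'static str {
    if cfg!(target_feature = "avx2") {
        "avx2"
    } else if cfg!(target_feature = "sse2") {
        // SSE2 is part of the default x86-64 baseline.
        "sse2"
    } else if cfg!(target_feature = "neon") {
        "neon"
    } else {
        "baseline"
    }
}
```

With a default x86-64 build this is expected to report the SSE2 baseline; target-cpu effectively switches on every feature the named CPU supports in one go.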
Since the last post about SIMD library plans, I’ve been experimenting. Needless to say, it turned out a bit different than originally planned, but I’ve something I’d like to share. Maybe it’ll be useful for someone or maybe it’ll at least spark some more progress in the area.
If you don’t care about the chatter and just want to use it, it’s called slipstream and is available on crates.io. It’s an early release and will need some more work, but it can be experimented with (it has documentation and probably won’t eat any kittens when used). If you want to help out, scroll down to see what needs help (or decide on your own what part needs improving 😇).
I believe Rust is a great language to make SIMD actually usable for ordinary humans. I played with libraries that make it accessible two years ago (or was it three?) and my impression was „Whoa! This is cool. I can’t wait until this is usable on stable.“ The libraries back then were stdsimd and faster.
Fast forward to today. I considered using some SIMD operations in a project at work. I have some bitsets and wanted to do operations like bitwise AND on them. If I represent them as a bunch of unsigned integers, using SIMD on them makes sense. But for that, I need to compile on stable, I want the code to be readable, and I don’t want to write multiple versions of the code to support multiple levels of SIMD support.
The thing is, while using SIMD on stable is possible, the standard library offers only the intrinsics. These are good enough as the low-level stuff to build a library on top, but none of the current ones quite cut it.
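To make the bitset use case concrete, here is a minimal sketch in plain stable Rust. It does not use slipstream or any other crate. It assumes the bitsets are stored as `u64` slices and works in fixed-size chunks, which gives the compiler a good chance of auto-vectorizing the inner loop. The name `and_assign` and the chunk width `LANES` are assumptions made for illustration.

```rust
/// Words processed per chunk (an assumption: 4 x u64 = 256 bits).
const LANES: usize = 4;

/// Hypothetical helper: computes `dst &= src` word by word.
pub fn and_assign(dst: &mut [u64], src: &[u64]) {
    assert_eq!(dst.len(), src.len(), "bitsets must have the same length");

    let mut d_chunks = dst.chunks_exact_mut(LANES);
    let mut s_chunks = src.chunks_exact(LANES);

    // Fixed-size chunks: the inner loop has a constant trip count,
    // which the optimizer can turn into vector instructions.
    for (d, s) in (&mut d_chunks).zip(&mut s_chunks) {
        for i in 0..LANES {
            d[i] &= s[i];
        }
    }

    // Leftover words that did not fill a whole chunk.
    for (d, s) in d_chunks.into_remainder().iter_mut().zip(s_chunks.remainder()) {
        *d &= *s;
    }
}
```

Whether this actually compiles to SIMD instructions depends on the optimization level and target features, so it is worth checking the generated assembly.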
In the previous article on auto-vectorization we treated instructions as either SIMD (Single Instruction Multiple Data) or non-SIMD. We also assumed that SIMD meant four values at a time.
That was true for the way we wrote and compiled our code in that article, but we're going to expand beyond that. There is a progression of SIMD instruction families, with new releases of CPUs from Intel and AMD supporting new instructions we can use to increase the performance of our code.
If our goal is to get the best performance we need to take advantage of all the possible SIMD instructions on our hardware.
In this article we're going to:
* Look at the compiler output when targeting the different SIMD instruction set families.
* Benchmark the different instruction sets.
* Look at how we can structure our Rust code to support compiling for multiple instruction sets and then selecting at runtime the one to use.
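The last point, compiling for several instruction sets and picking one at runtime, can be sketched roughly as follows. This is an assumption-laden example and not the article's own code: the function names `scale`, `scale_fallback` and `scale_avx2` are hypothetical, and the AVX2 variant relies on auto-vectorization, not hand-written intrinsics.

```rust
/// Baseline version, compiled with the default target features.
fn scale_fallback(data: &mut [f32], gain: f32) {
    for x in data.iter_mut() {
        *x *= gain;
    }
}

/// Same source, but the compiler is allowed to use AVX2 inside this function.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn scale_avx2(data: &mut [f32], gain: f32) {
    for x in data.iter_mut() {
        *x *= gain;
    }
}

/// Hypothetical entry point: checks the CPU once per call, then dispatches.
pub fn scale(data: &mut [f32], gain: f32) {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::is_x86_feature_detected!("avx2") {
            // SAFETY: the runtime check above confirmed AVX2 is available.
            unsafe { scale_avx2(data, gain) };
            return;
        }
    }
    scale_fallback(data, gain);
}
```

Because the feature check happens once per call, not once per element, its cost is spread over the whole slice. On non-x86 targets the `cfg` block disappears and only the fallback is compiled.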
Recently on a project I wrote some audio processing code in Rust. In the past I've used C++ to write audio processing code for situations where performance was critical. I wanted to take that C++ optimisation experience and see what is possible using Rust.
We're going to take one small piece of audio processing code and take a look at how we can optimize it in Rust. Along the way we're going to learn about optimisation using Single Instruction Multiple Data CPU instructions, how to quickly check the assembler output of the compiler, and simple changes we can make to our Rust code to produce faster programs.