Recently I have been working on Rust datastructures once again. In the process I wanted to test how my work performed compared to a standard library RwLock and Mutex. On my home laptop the RwLock was 5 times faster, the Mutex 2 times faster than my work.
So checking out my code on my workplace workstation and running my bench marks I noticed the Mutex was the same - 2 times faster. However, the RwLock was 4000 times slower.
All things high performance Rust.
Recently a colleague of mine told me about a small bottleneck with url quoting since we are quoting a lot of storage keys at least once when loading or storing a dataset. To speed it up, we are going to write a C-Library in Rust and invoke it from Python.
When I first started building the dtparse crate, my intention was to mirror as closely as possible the equivalent Python library. Python, as you may know, is garbage collected. Very rarely is memory usage considered in Python, and I likewise wasn’t paying too much attention when dtparse was first being built.
This lackadaisical approach to memory works well enough, and I’m not planning on making dtparse hyper-efficient. But every so often, I’ve wondered: “what exactly is going on in memory?”
Rust makes writing parallel code safe. Rayon makes it easy.
But today (October 4th, 2018), the pest website featured a very misleading graph. Yes, a pest 2.0 parser that does not convert the input to Rust types is indeed faster than a nom 4.0 parser that does convert the input to Rust types. But what happens if I write a nom 4.0 parser that does not convert its input to Rust types?
lolbench compiles ~350 benchmarks with every Rust nightly. It then runs them and highlights potential performance regressions in the standard library and the output of the compiler. Each toolchain’s run is summarized with a list of likely candidates, as seen in the image below, and we’re now getting started using these to safeguard the performance of Rust programs. Come help!
I reimplemented a body of C software in Rust, and it performed better for the same task; what’s going on? And is there anything broader we can say about these results?
To explore this, I ran some statemap rendering tests on SmartOS on a single-socket Haswell server (Xeon E3-1270 v3) running at 3.50GHz. The C version was compiled with GCC 7.3.0 with -O2 level optimizations; the Rust version was compiled with 1.29.0 with --release. All of the tests were run bound to a processor set containing a single core; all were bound to one logical CPU within that core, with the other logical CPU forced to be idle. cpustat was used to gather CPU performance counter data, with one number denoting one run with pic0 programmed to that CPU performance counter. The input file (~30MB compressed) contains 3.5M state changes, and in the default config will generate a ~6MB SVG.
For a small side project I’m working on, I’m using a Sudoku puzzle solver and puzzle generator that I’ve written in Rust. The experience was fun, so I thought I’d write up a little bit about the algorithm I’ve used and some interesting stats about how it performs.
Today I released ppbert 0.8.4. This release also marks the first time that one of my original test files can be pretty printed in less than a second. I’ll use this occasion to look back on ppbert and how I was able to improve its performance, little by little.
Alacritty, the OpenGL terminal emulator written in Rust, now supports scrollback! Performance has improved, and we've got benchmarks to share.
Rust is all about paying only for what you use, and gives us plenty tools to eliminate unneeded allocation. One of the tools that is used in a lot of crates (crates.io shows 98 dependent crates) is SmallVec. It is also used in the Rust compiler. I recently got around to speed up the operation of getting a SmallVec from a slice of copyable data. In short, they’re awesome.
I already covered some inner-loop optimization tricks for low-level Rust code in mtpng, but how do you check how fast bits of your code are anyway?
The way to go is to use a sampling-based profiler native to your operating system. I’ve done most of my detailed profiling on Linux, using the “perf” tool.
In my last post I wrapped up the patches to improve perceived performance of screenshots on the Linux GNOME desktop. With that done, why not implement my crazy plan for parallel PNG encoding to speed the actual save time?
It’s been a while since I’ve been playing the benchmarksgame with Rust. But I recently found an interesting crate called packed_simd which had a SIMD-ified version of some benchmarks, so as Rust stable now has stdsimd, we should be able to speed up our benchmarks quite a bit.
Aim: Measure how fast a fetch from L1 cache is when compared to a fetch from memory. Instead of writing pure assembly code, we will use Rust's inline assembly feature.
I set out out my goal 9 for Rustnish: Write benchmark code that compares runtime performance of Rustnish against Varnish. Use cargo bench to execute the benchmarks.
The basic idea of a performance test here is to send many HTTP requests to the web service (the reverse proxy in this case) and measure how fast the responses arrive back. Comparing the results from Rustnish and Varnish should give us an idea if our performance expectations are holding up.
This document is a compilation of various benchmarks and comparisons between code counters, namely tokei, cloc, scc, and loc. This document seeks to compare performance, and accuracy of the code counters. polyglot is not currently included as it was unabled to be installed on the machine at the time of writing.
We leverage the elegance of kdb+ and the power of Rust to create data applications that can process data at the rate of tens of GB/second on consumer grade hardware.
With the latest release of 1.27 of Rust (SIMD support) the code counters written in Rust were suddenly a lot faster in Linux. In fact it meant that the fastest one tokei was suddenly faster than my scc for almost all tests. In addition a new project polyglot written in a language I have never heard of ATS popped up which is also now faster than my Go program for any repository when running on a machine with less than 8 cores.
Rust 1.27.0 has brought SIMD (Single Instruction Multiple Data), also known as vectorization, to stable Rust. If you read the announcement, you will see that SIMD should bring performance enhancements to our applications if we learn how to use it properly. But, for that let's first dive into how SIMD works.
In my last post on optimising my Rust path tracer with SIMD I had got withing 10% of my performance target, that is Aras’s C++ SSE4.1 path tracer. From profiling I had determined that the main differences were MSVC using SSE versions of sinf and cosf and differences between Rayon and enkiTS thread pools. The first thing I tried was implement an SSE2 version of sin_cos based off of Julien Pommier’s code that I found via a bit of googling. This was enough to get my SSE4.1 implementation to match the performance of Aras’s SSE4.1 code. I had a slight advantage in that I just call sin_cos as a single function versus separate sin and cos functions, but meh, I’m calling my performance target reached.
The other part of this post is about Rust’s runtime and compile time CPU feature detection and some wrong turns I took along the way.
Following on from path tracing in parallel with Rayon I had a lot of other optimisations I wanted to try. In particular I want to see if I could match the CPU performance of @aras_p’s C++ path tracer in Rust. He’d done a fair amount of optimising so it seemed like a good target to aim for. To get a better comparison I copied his scene and also added his light sampling approach which he talks about here. I also implemented a live render loop mimicking his.
Since my last post, rustc-perf — the benchmark suite, harness and visualizer — has seen some improvements. First, some new benchmarks were added: cargo, ripgrep, sentry-cli, and webrender. Also, the parser benchmark has been removed because it was a toy program and thus not a good benchmark.
It’s often said1 that the slowest code is that which has been optimised without benchmarks. You wouldn’t expect your code to work if you never ran it, so why should you expect it to be fast if you never benchmarked it? Writing good benchmarks is a bit of an art, because it’s really easy to accidentally write benchmarks that make your code seem fast, when really the compiler is applying some optimisations that work in the side-effect-free world of the benchmark but can no longer get applied when you put it out into the wild.
smallvec is a library by the Servo team for reducing the number of allocations for dynamic arrays in the case that most of those arrays are below a certain size. Because malloc is fast, for many cases it’s actually slower to use SmallVec than just using Vec because the one-time cost of the initial allocation is dwarfed by the lifetime cost of SmallVec’s increased complexity. You can see that switching to Vec actually improves speed on many of SmallVec’s own benchmarks.
Recently, a benchmark made it to the top of /r/programming, featuring Rust among other languages, and I was a bit surprised to see that the idiomatic Rust program was not competitive with the best-tuned C++ solution. The benchmark implements a binary tree, and the C++ solution leverages raw pointers while Rust would use an Option
> to represent its tree. Since Option knows that Box is non-nullable, it should compile down to a raw pointer. Quickly inspecting the Rust and C++ versions would not let me find where the performance difference came from.
A few weeks ago, I set out to convert bytecount’s benchmarks to criterion, a statistics-driven benchmarking framework started by Jorge Aparicio and maintained by Brook Heisler.
Before, bytecount used bencher for its benchmarks, which is a straight port of the unstable, nightly-only std::test benchmark framework, extended to work with stable Rust. This was a great benefit compared to std::test, because now we could benchmark on all Rust versions (stable, beta, nightly, some specific version) without needing to fear regressions.
TL;DR: The Rust compiler has gotten 1.06x–4x faster over the past month.
Once upon a time, I wrote an interpreter for Stratego Core in Rust, which I named strs. Stratego Core is the core language that Stratego is compiled to before the compiler goes further (to Java, or previously to C). A core language is an intermediate representation that is a subset of the surface language.
While I optimised that interpreter quite a bit, I noticed that the CTree (Stratego Core Abstract Syntax Tree) that the compiler spit out for me to interpret was very unoptimised. Therefore one the plans I described at the end of the blog post was a little tool for Copy Propagation on CTree files. This post is about that tool, and the optimisations in the interpreter that made it obsolete again.
I’d like to share a quick story about the sheer power of LLVM and the benefits of using higher-level languages over assembly.
I’ve decided to test the second a bit and see how far I could go. I’ve chosen matrix multiplication as a case study, for several reasons. I’ve played with it before (in my master’s thesis), it’s relatively simple and the effects of optimizing it can be great. For simplicity, I’ve decided to multiply only square matrices with power-of-two sizes, but these restrictions can be lifted in a real implementation without significantly loosing performance ‒ only the code gets somewhat more complex and hairy.
Seeing Nick Nethercote’s blog post about speeding up the compiler, I started wondering just how fast could a Rust compiler be? How fast could we compile a simple example? How fast can we compile a Rust hello world?
18 months ago I wrote about some work I did to speed up the Rust compiler (rustc). I’ve recently taken this work up again. Also, in the meantime rustc’s build system has been replaced and its benchmark suite has been overhauled. So it’s a good time for an update.
A few months ago, Bünz, Bootle, Boneh, Poelstra, Wuille, and Maxwell published Bulletproofs, which dramatically improves proof performance both in terms of proof size and verification time. In addition, it allows proving a much wider class of statements than just range proofs.
At Chain, we (Henry de Valence, Cathie Yun and Oleg Andreev) have been working on a pure-Rust Bulletproofs implementation, whose initial version we are publishing today, together with a set of notes.
This new version comes with great performance improvements. We're talking about 3x faster on macos, 2x faster on linux and 3x faster on windows (the benchmarks are at the end of the post).
For one of our customers at Centricular we were working on a quite interesting project. Their use-case was basically to receive an as-high-as-possible number of audio RTP streams over UDP, transcode them, and then send them out via UDP again. Due to how GStreamer usually works, they were running into some performance issues.
This blog post will describe the first set of improvements that were implemented for this use-case, together with a minimal benchmark and the results. My colleague Mathieu will follow up with one or two other blog posts with the other improvements and a more full-featured benchmark.
The short version is that CPU usage decreased by about 65-75%, i.e. allowing 3-4x more streams with the same CPU usage. Also parallelization works better and usage of different CPU cores is more controllable, allowing for better scalability. And a fixed, but configurable number of threads is used, which is independent of the number of streams.
Writing a debugger for C++ on Linux, you spend a lot of time examining pretty-printed DWARF debug information using tools like readelf, objdump or dwarfdump. Unfortunately this can be quite slow.
I decided to try to speed dwarfdump up. TL;DR: I reduced the dump time from 506s to 26s by fixing some simple issues and taking advantage of Rust "fearless parallelism". I think there are interesting opportunities for speeding up many kinds of command-line tools using Rust and parallelism.
Blazing fast, low requirements, computationally intensive operations on Node.js using Rust
When I built Finda, I wanted it to be fast — specifically, to respond to all user input within 16 milliseconds.
Given this goal, you might be surprised to learn that Finda is built with Electron, a framework that’s often decried for being the opposite of fast.
Recently, I came across an ad for a job that had a precondition for application: it required you to first solve a ✨programming challenge✨:
Criterion.rs is a statistics-driven benchmarking library for Rust. It provides precise measurements of changes in the performance of benchmarked code, and gives strong statistical confidence that apparent performance changes are real and not simply noise. Clear output, a simple API and reasonable defaults make it easy to use even for developers without a background in statistics. Unlike the benchmarking harness provided by Rust, Criterion.rs can be used with stable versions of the compiler.
The story about Rust’s async is still a bit in flux. There’s a bunch of libraries with their pros and cons and different approaches. Even I’m a bit to blame for that, as I’m writing one of my own, called Corona.
faster began as a yak shave, created to aid base💯 in its quest to become the fastest meme on Github. Writing an explicit AVX2-accelerated version of base💯's encoder and decoder, then realizing I'd have to do the same thing again to see the speedups on my Ivy Bridge desktop, pushed me to make this library. Months later, it has blossomed into its own project, and has eclipsed base💯 in both popularity and promise.