When optimizing code, one thing I’m always looking for is memory layout and access patterns. One such pattern is an arena: Reserve some sufficiently large space to put your objects in, then allocate by incrementing a pointer. If your objects are of a uniform type, you can basically simplify this to a Vec of that type.
All things high performance Rust.
I recently decided to learn more about Rust, and wrote a high performance RaptorQ (RFC6330) library. RaptorQ is a fountain code, and the core of the algorithm is a lot of matrix math over GF(256) – which translates into lots of XORs and reads from lookup tables. After getting the initial implementation working, I set about optimizing it. Below is a journal of the steps I took to profile and optimize the implementation.
This post covers: Benchmark reports for contributors, Benchmark reports for users, Profiling with valgrind / kcachegrind, Reproducible benchmarks and graphics, and Tips for benchmark behavior and benchmarking other languages.
With Rust, you can do low-level number-crunching and bit-by-bit processing, while enjoying memory safety and concurrency features. With Helix, you can use your Rust code inside of a Rails project.
I recently finished a detailed review of hashbrown, which will likely become the new implementation for rust's std::collections::HashMap. One of the most surprising things I found was in the implementation of insert. It was doing something that was so offensive to people who care about collection performance that we had designed an entire API to help people avoid it: it did two lookups in the map. However, after some more discussion and review, I concluded that this implementation was reasonable. This post will try to cover why that is.
Criterion allows you to benchmark against Rust stable, but it’s also providing a set of awesome features: Statistics: Statistical analysis detects if, and by how much, performance has changed since the last benchmark run. Charts: Uses gnuplot to generate detailed graphs of benchmark results. Stable-compatible: Benchmark your code without installing nightly Rust.
In this post I’ll talk about how to port a short raytracer written in C#/C++ codebase to Rust, then applying some simple optimizations by leveraging some features in Rust.
several models that make it easier to reason about concurrent programs have been envisioned over time. In this article, we'll have a quick look at a few of them, from new to not-so-new languages. I don't intend to give an extensive analysis of each solution, or make a formal comparison between them. My intention is to simply explain the basics of each solution and how they can be used in practice (with code samples that show off what the result of using the models might look like), so that other developers may have an easier time understanding them and deciding which solution, or language, might be better applicable to their particular problems.
A week or so ago, I saw the inferno project mentioned on the Rust subreddit. It was a rewrite of the great FlameGraph library into Rust. All of the work was being livestreamed by Jon Gjengset. I ended up watching some of the livestreams and had the idea of porting the stackcollapse-xdebug.php file to Rust, potentially so it could be included in the project in the future.
This is a follow-up post to Lock-freedom without garbage collection from 2015, which introduced Crossbeam, a Rust library that implements efficient lock-free data structures without relying on a tracing garbage collector.
I’ve been getting into bioinformatics algorithms lately and ran across an interesting pull request that improved performance by changing a Rust match expression to a lookup. This felt quite surprising to me since, well, the match is so simple — why isn’t the compiler already generating optimal code for it?
So, I have this STM32VLDISCOVERY dev board. It has the STM32F100RBT6B MCU, capable of running at 24MHz. On the board, there is a 8MHz crystal. Naturally, when you are new to microcontrollers (like me), you may have a few questions: When we upload a program on this development board, at what speed it is actually running? Is it using this external crystal? Why is this crystal 8MHz if the MCU is capable of 24MHz? If our program is not running at the maximum speed, how do we make it run at the maximum speed?
Algorithms and optimization to solve all Advent of Code 2018 puzzles in under one total second.
I rewrote a Python project in Rust. The rewrite took a fair bit longer than expected, but the results were good (about 9 times faster and ½ the memory usage). In the process, I learned a fair bit about Rust.
You may have recently encountered and/or read this blog post criticising a possible C++20 implementation of Pythagorean triples using ranges. In it the author benchmarks different implemetations of the problem, comparing readability, compile times, run times and binary sizes. My main language these days is D, and given that D also has ranges (and right now, as opposed to a future version of the language), I almost immediately reached for my keyboard. By that time there were already some D and Rust versions floating about as a result of the reddit thread, so fortunately for lazy me “all” I had to next was to benchmark the lot of them.
For the first time, I took part in the Advent of Code this year. If you haven't heard of it, it's a daily programming challenge that can be solved in any programming language. Rust was very present in the Advent of Code community with people contributing a ton of Rust-related content. In the daily solutions thread on the /r/aoc subreddit, there were always several Rust solutions posted. Advent of Code really helps show off the things that make Rust shine, demonstrating the power and utility of many community-created crates as well as the language itself.
A gentle comparison between Rust & Python from multiple perspectives against a small, relatively simple problem.
Previously, I wrote about how Rust parsing is atypically slow comparing Rust's libcore implementation to a rudimentary parser I wrote. However, as others noted, the comparison was fairly limited. It didn't compare Rust's implementation to other implementations, such as glibc's strtod or Go's ParseFloat. The parser I implemented wasn't correct, it led to rounding error for most representations, by using floats for intermediate values. Furthermore, the comparisons used data unlikely to be encountered in real-world datasets, overstating the performance differences by forcing Rust to use slower algorithms. So, naturally, I aimed to address all these concerns. And finally, I forgot to disable CPU scaling, meaning CPU throttling could have led to inconsistent benchmarks.
A while back, there was a discussion comparing the performance of using the hashbrown crate (based on Google’s SwissTable implementation) in the Rust compiler. In the last RustFest, Amanieu was experimenting on integrating his crate into stdlib, which turned out to have some really promising results. As a result, it’s being planned to move the crate into stdlib.
While the integration is still ongoing, there’s currently no blog post out there explaining SwissTable at the moment. So, I thought I’d dig deeper into the Rust implementation to try and explain how its (almost) identical twin hashbrown::HashMap works.
A friend recently told me about a puzzle, which is a great excuse to explore programming craft. My Rust solution was a simple port of my second Clojure solution. The only major difference is that it takes advantage of mutability (which is idiomatic in Rust, unlike in Clojure). The Rust solution runs in about 4.22 ± 0.05 ms, or about 5x faster than the fast Clojure solution.
Niko Matsakis recently blogged about the Rust compiler’s new borrow checker, which implements non-lexical lifetimes (NLL). The new borrow checker is a really nice improvement to Rust, because it accepts many sound programs that the old borrow checker rejected.
Perhaps surprisingly, one of the most challenging things about operating RubyGems.org is the logs. A single day of request logs is usually around 500 gigabytes on disk. So every day, we generate about 500 files that are 85MB on disk, and contain about a million streaming JSON objects that take up 1GB when uncompressed. What we want out of those files is incredibly tiny—a few thousand integers, labelled with names and version numbers. Without any real idea of how to get those counts out of S3, I started by writing a proof of concept Ruby script that could parse one of the 500 log files and print out stats from it. Even on my super-fast laptop, my prototype script would take more than 16 hours to parse 24 hours worth of logs.
Recently I have been working on Rust datastructures once again. In the process I wanted to test how my work performed compared to a standard library RwLock and Mutex. On my home laptop the RwLock was 5 times faster, the Mutex 2 times faster than my work.
So checking out my code on my workplace workstation and running my bench marks I noticed the Mutex was the same - 2 times faster. However, the RwLock was 4000 times slower.
Recently a colleague of mine told me about a small bottleneck with url quoting since we are quoting a lot of storage keys at least once when loading or storing a dataset. To speed it up, we are going to write a C-Library in Rust and invoke it from Python.
When I first started building the dtparse crate, my intention was to mirror as closely as possible the equivalent Python library. Python, as you may know, is garbage collected. Very rarely is memory usage considered in Python, and I likewise wasn’t paying too much attention when dtparse was first being built.
This lackadaisical approach to memory works well enough, and I’m not planning on making dtparse hyper-efficient. But every so often, I’ve wondered: “what exactly is going on in memory?”
Rust makes writing parallel code safe. Rayon makes it easy.
But today (October 4th, 2018), the pest website featured a very misleading graph. Yes, a pest 2.0 parser that does not convert the input to Rust types is indeed faster than a nom 4.0 parser that does convert the input to Rust types. But what happens if I write a nom 4.0 parser that does not convert its input to Rust types?
lolbench compiles ~350 benchmarks with every Rust nightly. It then runs them and highlights potential performance regressions in the standard library and the output of the compiler. Each toolchain’s run is summarized with a list of likely candidates, as seen in the image below, and we’re now getting started using these to safeguard the performance of Rust programs. Come help!
I reimplemented a body of C software in Rust, and it performed better for the same task; what’s going on? And is there anything broader we can say about these results?
To explore this, I ran some statemap rendering tests on SmartOS on a single-socket Haswell server (Xeon E3-1270 v3) running at 3.50GHz. The C version was compiled with GCC 7.3.0 with -O2 level optimizations; the Rust version was compiled with 1.29.0 with --release. All of the tests were run bound to a processor set containing a single core; all were bound to one logical CPU within that core, with the other logical CPU forced to be idle. cpustat was used to gather CPU performance counter data, with one number denoting one run with pic0 programmed to that CPU performance counter. The input file (~30MB compressed) contains 3.5M state changes, and in the default config will generate a ~6MB SVG.
For a small side project I’m working on, I’m using a Sudoku puzzle solver and puzzle generator that I’ve written in Rust. The experience was fun, so I thought I’d write up a little bit about the algorithm I’ve used and some interesting stats about how it performs.
Today I released ppbert 0.8.4. This release also marks the first time that one of my original test files can be pretty printed in less than a second. I’ll use this occasion to look back on ppbert and how I was able to improve its performance, little by little.
Alacritty, the OpenGL terminal emulator written in Rust, now supports scrollback! Performance has improved, and we've got benchmarks to share.
Rust is all about paying only for what you use, and gives us plenty tools to eliminate unneeded allocation. One of the tools that is used in a lot of crates (crates.io shows 98 dependent crates) is SmallVec. It is also used in the Rust compiler. I recently got around to speed up the operation of getting a SmallVec from a slice of copyable data. In short, they’re awesome.
I already covered some inner-loop optimization tricks for low-level Rust code in mtpng, but how do you check how fast bits of your code are anyway?
The way to go is to use a sampling-based profiler native to your operating system. I’ve done most of my detailed profiling on Linux, using the “perf” tool.
In my last post I wrapped up the patches to improve perceived performance of screenshots on the Linux GNOME desktop. With that done, why not implement my crazy plan for parallel PNG encoding to speed the actual save time?
It’s been a while since I’ve been playing the benchmarksgame with Rust. But I recently found an interesting crate called packed_simd which had a SIMD-ified version of some benchmarks, so as Rust stable now has stdsimd, we should be able to speed up our benchmarks quite a bit.
Aim: Measure how fast a fetch from L1 cache is when compared to a fetch from memory. Instead of writing pure assembly code, we will use Rust's inline assembly feature.
I set out out my goal 9 for Rustnish: Write benchmark code that compares runtime performance of Rustnish against Varnish. Use cargo bench to execute the benchmarks.
The basic idea of a performance test here is to send many HTTP requests to the web service (the reverse proxy in this case) and measure how fast the responses arrive back. Comparing the results from Rustnish and Varnish should give us an idea if our performance expectations are holding up.
This document is a compilation of various benchmarks and comparisons between code counters, namely tokei, cloc, scc, and loc. This document seeks to compare performance, and accuracy of the code counters. polyglot is not currently included as it was unabled to be installed on the machine at the time of writing.
We leverage the elegance of kdb+ and the power of Rust to create data applications that can process data at the rate of tens of GB/second on consumer grade hardware.
With the latest release of 1.27 of Rust (SIMD support) the code counters written in Rust were suddenly a lot faster in Linux. In fact it meant that the fastest one tokei was suddenly faster than my scc for almost all tests. In addition a new project polyglot written in a language I have never heard of ATS popped up which is also now faster than my Go program for any repository when running on a machine with less than 8 cores.
Rust 1.27.0 has brought SIMD (Single Instruction Multiple Data), also known as vectorization, to stable Rust. If you read the announcement, you will see that SIMD should bring performance enhancements to our applications if we learn how to use it properly. But, for that let's first dive into how SIMD works.
In my last post on optimising my Rust path tracer with SIMD I had got withing 10% of my performance target, that is Aras’s C++ SSE4.1 path tracer. From profiling I had determined that the main differences were MSVC using SSE versions of sinf and cosf and differences between Rayon and enkiTS thread pools. The first thing I tried was implement an SSE2 version of sin_cos based off of Julien Pommier’s code that I found via a bit of googling. This was enough to get my SSE4.1 implementation to match the performance of Aras’s SSE4.1 code. I had a slight advantage in that I just call sin_cos as a single function versus separate sin and cos functions, but meh, I’m calling my performance target reached.
The other part of this post is about Rust’s runtime and compile time CPU feature detection and some wrong turns I took along the way.
Following on from path tracing in parallel with Rayon I had a lot of other optimisations I wanted to try. In particular I want to see if I could match the CPU performance of @aras_p’s C++ path tracer in Rust. He’d done a fair amount of optimising so it seemed like a good target to aim for. To get a better comparison I copied his scene and also added his light sampling approach which he talks about here. I also implemented a live render loop mimicking his.
Since my last post, rustc-perf — the benchmark suite, harness and visualizer — has seen some improvements. First, some new benchmarks were added: cargo, ripgrep, sentry-cli, and webrender. Also, the parser benchmark has been removed because it was a toy program and thus not a good benchmark.
It’s often said1 that the slowest code is that which has been optimised without benchmarks. You wouldn’t expect your code to work if you never ran it, so why should you expect it to be fast if you never benchmarked it? Writing good benchmarks is a bit of an art, because it’s really easy to accidentally write benchmarks that make your code seem fast, when really the compiler is applying some optimisations that work in the side-effect-free world of the benchmark but can no longer get applied when you put it out into the wild.
smallvec is a library by the Servo team for reducing the number of allocations for dynamic arrays in the case that most of those arrays are below a certain size. Because malloc is fast, for many cases it’s actually slower to use SmallVec than just using Vec because the one-time cost of the initial allocation is dwarfed by the lifetime cost of SmallVec’s increased complexity. You can see that switching to Vec actually improves speed on many of SmallVec’s own benchmarks.
Recently, a benchmark made it to the top of /r/programming, featuring Rust among other languages, and I was a bit surprised to see that the idiomatic Rust program was not competitive with the best-tuned C++ solution. The benchmark implements a binary tree, and the C++ solution leverages raw pointers while Rust would use an Option
> to represent its tree. Since Option knows that Box is non-nullable, it should compile down to a raw pointer. Quickly inspecting the Rust and C++ versions would not let me find where the performance difference came from.
A few weeks ago, I set out to convert bytecount’s benchmarks to criterion, a statistics-driven benchmarking framework started by Jorge Aparicio and maintained by Brook Heisler.
Before, bytecount used bencher for its benchmarks, which is a straight port of the unstable, nightly-only std::test benchmark framework, extended to work with stable Rust. This was a great benefit compared to std::test, because now we could benchmark on all Rust versions (stable, beta, nightly, some specific version) without needing to fear regressions.
TL;DR: The Rust compiler has gotten 1.06x–4x faster over the past month.
Once upon a time, I wrote an interpreter for Stratego Core in Rust, which I named strs. Stratego Core is the core language that Stratego is compiled to before the compiler goes further (to Java, or previously to C). A core language is an intermediate representation that is a subset of the surface language.
While I optimised that interpreter quite a bit, I noticed that the CTree (Stratego Core Abstract Syntax Tree) that the compiler spit out for me to interpret was very unoptimised. Therefore one the plans I described at the end of the blog post was a little tool for Copy Propagation on CTree files. This post is about that tool, and the optimisations in the interpreter that made it obsolete again.
I’d like to share a quick story about the sheer power of LLVM and the benefits of using higher-level languages over assembly.
I’ve decided to test the second a bit and see how far I could go. I’ve chosen matrix multiplication as a case study, for several reasons. I’ve played with it before (in my master’s thesis), it’s relatively simple and the effects of optimizing it can be great. For simplicity, I’ve decided to multiply only square matrices with power-of-two sizes, but these restrictions can be lifted in a real implementation without significantly loosing performance ‒ only the code gets somewhat more complex and hairy.
Seeing Nick Nethercote’s blog post about speeding up the compiler, I started wondering just how fast could a Rust compiler be? How fast could we compile a simple example? How fast can we compile a Rust hello world?
18 months ago I wrote about some work I did to speed up the Rust compiler (rustc). I’ve recently taken this work up again. Also, in the meantime rustc’s build system has been replaced and its benchmark suite has been overhauled. So it’s a good time for an update.
A few months ago, Bünz, Bootle, Boneh, Poelstra, Wuille, and Maxwell published Bulletproofs, which dramatically improves proof performance both in terms of proof size and verification time. In addition, it allows proving a much wider class of statements than just range proofs.
At Chain, we (Henry de Valence, Cathie Yun and Oleg Andreev) have been working on a pure-Rust Bulletproofs implementation, whose initial version we are publishing today, together with a set of notes.
This new version comes with great performance improvements. We're talking about 3x faster on macos, 2x faster on linux and 3x faster on windows (the benchmarks are at the end of the post).
For one of our customers at Centricular we were working on a quite interesting project. Their use-case was basically to receive an as-high-as-possible number of audio RTP streams over UDP, transcode them, and then send them out via UDP again. Due to how GStreamer usually works, they were running into some performance issues.
This blog post will describe the first set of improvements that were implemented for this use-case, together with a minimal benchmark and the results. My colleague Mathieu will follow up with one or two other blog posts with the other improvements and a more full-featured benchmark.
The short version is that CPU usage decreased by about 65-75%, i.e. allowing 3-4x more streams with the same CPU usage. Also parallelization works better and usage of different CPU cores is more controllable, allowing for better scalability. And a fixed, but configurable number of threads is used, which is independent of the number of streams.
Writing a debugger for C++ on Linux, you spend a lot of time examining pretty-printed DWARF debug information using tools like readelf, objdump or dwarfdump. Unfortunately this can be quite slow.
I decided to try to speed dwarfdump up. TL;DR: I reduced the dump time from 506s to 26s by fixing some simple issues and taking advantage of Rust "fearless parallelism". I think there are interesting opportunities for speeding up many kinds of command-line tools using Rust and parallelism.
Blazing fast, low requirements, computationally intensive operations on Node.js using Rust
When I built Finda, I wanted it to be fast — specifically, to respond to all user input within 16 milliseconds.
Given this goal, you might be surprised to learn that Finda is built with Electron, a framework that’s often decried for being the opposite of fast.
Recently, I came across an ad for a job that had a precondition for application: it required you to first solve a ✨programming challenge✨:
Criterion.rs is a statistics-driven benchmarking library for Rust. It provides precise measurements of changes in the performance of benchmarked code, and gives strong statistical confidence that apparent performance changes are real and not simply noise. Clear output, a simple API and reasonable defaults make it easy to use even for developers without a background in statistics. Unlike the benchmarking harness provided by Rust, Criterion.rs can be used with stable versions of the compiler.
The story about Rust’s async is still a bit in flux. There’s a bunch of libraries with their pros and cons and different approaches. Even I’m a bit to blame for that, as I’m writing one of my own, called Corona.
faster began as a yak shave, created to aid base💯 in its quest to become the fastest meme on Github. Writing an explicit AVX2-accelerated version of base💯's encoder and decoder, then realizing I'd have to do the same thing again to see the speedups on my Ivy Bridge desktop, pushed me to make this library. Months later, it has blossomed into its own project, and has eclipsed base💯 in both popularity and promise.