sourmash 3 was released last week, finally landing the Rust backend. But, what changes when developing new features in sourmash? I was thinking about how to best document this process, and since PR #826 is a short example touching all the layers I decided to do a small walkthrough.
In a previous post I introduced the MNIST dataset and the problem of classifying handwritten digits. In this post I’ll be using the code I wrote in that post to port a simple neural network implementation to rust. My goal is to explore performance and ergonomics for data science workflows in Rust.
In 2018, the Mercurial project decided to use Rust to improve performance and maintainability of previous high-performance code. We have faced some interesting challenges when bridging the Python implementation with the new Rust code, and this is one that I have not found any literature about.
This is a subjective, primarily developer-ergonomics-based comparison of the three languages from the perspective of a Python developer, but you can skip the prose and go to the code samples, the performance comparison if you want some hard numbers, the takeaway for the tl;dr, or the Python, Go, and Rust diffimg implementations.
Python is a great programming language but sometimes it can be a bit of slowcoach when it comes to performing certain tasks. That’s why developers have been building C/C++ extensions and integrating them with Python to speed up the performance. However, writing these extensions is a bit difficult because these low-level languages are not type-safe, so doesn’t guarantee a defined behavior. This tends to introduce bugs with respect to memory management. Rust ensures memory safety and hence can easily prevent these kinds of bugs.
Following the previous post, here I am going to introduce the key concepts of Rust — Ownership and Borrowing.
Recently a colleague of mine told me about a small bottleneck with url quoting since we are quoting a lot of storage keys at least once when loading or storing a dataset. To speed it up, we are going to write a C-Library in Rust and invoke it from Python.
It’s not immediately obvious that calling min(squares) modifies squares. If squares were a list or even a range, we would be able to call min and max on it with no problem. It would be nice if the language prevented us from trying to use something twice that can only be used once. Almost all modern languages, both statically and dynamically typed, will fail at runtime in these situations.
Last December I decided to give Rust a run: I spent some time porting the C++ bits of sourmash to Rust. The main advantage here is that it's a problem I know well, so I know what the code is supposed to do and can focus on figuring out syntax and the mental model for the language. I started digging into the symbolic codebase and understanding what they did, and tried to mirror or improve it for my use cases.
I've been writing some code in Rust recently, and I thought it would be cool if I could take some of this Rust code and provide it as a native extension that I can call from Python. It turns out there are some amazing tools like PyO3 that make it easy to write fully featured Python extensions in Rust, with considerably less effort than writing a CPython extension manually.
To test out PyO3 I wrote a small Python extension in Rust, and I thought I would share some of the tips and tricks I encountered in getting this going. This post aims to serve as a quick tutorial showing how to write extensions in Rust, talking about why you might want to use something more powerful than just exposing a C library called using CFFI, and how PyO3 lets you write Python aware extensions in Rust.
What happens when a data collection is copied and then the new copy is changed? Does the original remain the same, or does it change too?
If you think of copying as creating a completely new object, of course you expect that any change to the new copy does not affect the original object. But if you think of copying as creating a new name for the same, single object, then you expect that any change to the object through the new name appears also when you access the same object through the old name.
Prior to adding Python performance monitoring, we'd written monitoring agents for Ruby and Elixir. Our Ruby and Elixir agents had duplicated much of their code between them, and we didn't want to add a third copy of the agent-plumbing code. The overlapping code included things like JSON payload format, SQL statement parsing, temporary data storage and compaction, and a number of internal business logic components.
This plumbing code is about 80% of the agent code! Only 20% is the actual instrumentation of application code.
So, starting with Python, our goal became "how do we prevent more duplication". In order to do that, we decided to split the agent into two components. A language agent and a core agent. The language agent is the Python component, and the core agent is a standalone executable that contains most of the shared logic.
View all tags