BLAKE3 Joins the Family

Created on August 05, 2025

Last updated on August 05, 2025


So it was that time of the day/week/month (take your pick), and I had an itch to scratch for the BLAKE3 hashing algorithm. Why? Well, when I take over the world I'll post about it here. BLAKE3 made a prime target for .NET wrapping because the reference implementation is highly optimized, and it has a parallel extension that no one in the .NET ecosystem has touched with a 10-foot pole.

I'll admit I both love and hate projects like this. I really do just want to help people find what they need, and hopefully deliver it in a way that is correct and ergonomic to use. I had a slight bug in my BLAKE2b code that lingered for a couple of weeks, and I want to make sure that doesn't happen again. This time I was wrapping someone else's code, so I wasn't quite so worried about the algorithm's correctness, but I was laser-focused on getting the interop and memory safety right.

With BLAKE3, you first need to build oneTBB, a library for exploiting parallelism. One of BLAKE3's big selling points is that it is fast and parallelizable, so if I was going to wrap this library, I was going to do it correctly from the outset, even if I had to delve into the depths of a library I had never heard of before yesterday. Honestly, it wasn't too bad. I had to build for my 4 supported platforms (Win-x64, Linux-x64, macOS x64 & arm64), and it went as expected. The nice thing about oneTBB was that its CMake build system was on point, allowing me to use MSVC, GCC & Clang across all 3 operating systems, with no MinGW-w64 tweaks needed to get libraries compiled on Windows.

Now, BLAKE3 on the other hand wasn't quite so nice, so I was building mostly manually. The only real snafu I hit was that the dispatcher used C11 features, but the library needed to be built as C++. After racking my brain for a bit, and being tempted to turn off oneTBB parallelism to get it to work, I remembered that I could just compile that C file ahead of time using the proper standard and that it would link just fine with the C++ code (duh). Once I cleared that hurdle, it was on to the library loading and native interop features. I've been trying to expose mostly span-based zero-copy APIs to make things go fast, and this one is no different. The dispatcher aside, I also had to be careful because BLAKE3 ships both C and ASM SIMD paths, and not all platforms support the same intrinsics or assembler directives. One platform used the AVX2 C code, another used the .S AVX2 assembler fallback, and so on. I made sure each one was consistent and optimized, but it definitely wasn't copy-paste build logic.
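For anyone who hasn't run into the "C11 file in a C++ build" problem before, here is a minimal sketch of the idea. It is not BLAKE3's dispatcher; the file name, the function names, and the feature bits are all hypothetical. The file uses a C11-only header (stdatomic.h), so it gets compiled on its own as C11, and the resulting object file is then linked into the C++ build.

```c
/* feature_cache.c (hypothetical name): compile as C11, link into C++.
 *
 *   cc  -std=c11 -c feature_cache.c        -> feature_cache.o
 *   c++ app.cpp feature_cache.o            -> links fine
 */
#include <stdatomic.h>
#include <stdint.h>

/* In a real project this declaration would live in a shared header.
 * The guard lets C++ callers see the function with C linkage. */
#ifdef __cplusplus
extern "C" {
#endif
uint32_t demo_get_cpu_features(void);
#ifdef __cplusplus
}
#endif

/* UINT32_MAX is used as the "not probed yet" sentinel. */
static _Atomic uint32_t g_cached_features = UINT32_MAX;

/* Hypothetical stand-in for real CPU feature detection. */
static uint32_t demo_probe_features(void) {
    return 0x5u; /* made-up feature bits for illustration */
}

/* Probe once, then serve the cached value. Two threads may both probe
 * on first use; that is harmless because they store the same value. */
uint32_t demo_get_cpu_features(void) {
    uint32_t v = atomic_load_explicit(&g_cached_features,
                                      memory_order_relaxed);
    if (v == UINT32_MAX) {
        v = demo_probe_features();
        atomic_store_explicit(&g_cached_features, v,
                              memory_order_relaxed);
    }
    return v;
}
```

The point is simply that the C++ compiler never has to parse the C11 source. It only needs a declaration with C linkage and the finished object file.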

The only other major issue I hit was that I had specced the state struct on the fly at 192 bytes with alignment. During testing I was getting some runs that did not produce any results, and code that should have been hit was being missed. Turns out there was internal heap corruption, although it really only manifested itself on Linux and Windows (macOS took it in stride). What the heck could have been corrupting the heap? How about an actual state struct size of 1912 bytes, or roughly 10 times what I was allocating? Truth be told, I'm surprised it even ran to completion at all, with me trampling all over the heap like a drunken elephant, but that's a story for another day.
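One way to avoid guessing a native struct's size is to let the native side report it. The sketch below is an assumption-laden illustration, not the wrapper's actual code: the demo_* types are stand-ins whose field layout is modeled on the general shape of a BLAKE3 hasher state, and the exported helper functions are hypothetical. On common 64-bit ABIs this layout works out to 1912 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in types (hypothetical names). The layout is an assumption
 * modeled on the general shape of a BLAKE3 hasher state. */
typedef struct {
    uint32_t cv[8];
    uint64_t chunk_counter;
    uint8_t  buf[64];
    uint8_t  buf_len;
    uint8_t  blocks_compressed;
    uint8_t  flags;
} demo_chunk_state;

typedef struct {
    uint32_t         key[8];
    demo_chunk_state chunk;
    uint8_t          cv_stack_len;
    uint8_t          cv_stack[(54 + 1) * 32]; /* the bulk of the size */
} demo_hasher;

/* Exported so the managed side can ask instead of hard-coding. */
size_t demo_hasher_size(void) {
    return sizeof(demo_hasher);
}

size_t demo_hasher_align(void) {
    return _Alignof(demo_hasher);
}

/* Returns 1 if a caller-allocated buffer can hold the state, else 0. */
int demo_buffer_fits(size_t caller_bytes) {
    return caller_bytes >= sizeof(demo_hasher);
}
```

A wrapper could call something like demo_hasher_size() once at startup and refuse to run if its own allocation is smaller, which turns silent heap corruption into an immediate, obvious failure. A 192-byte buffer would fail that check right away.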

Edit: So I probably wrote too soon here, because I was determined to fix the bug where running things from the GitHub repo (like tests or benchmarks) on platforms other than Windows caused library-not-found errors. Turns out my problem was twofold. I had been failing to provide an explicit Library Resolver, and I didn't notice because I thought that's just the way it was with cross-platform binaries. Once I corrected that problem (and my previous blog post on the subject), I encountered a new one: the BLAKE3 shared libraries have a dependency of their own on the TBB library, which provides the crazy-fast parallelism. How do you tell .NET where to find something it doesn't have direct control over loading (because it's a dependency, .NET is not loading it, the operating system's dynamic loader is)? Turns out that through some rpath patching magic on Linux and install_name_tool on macOS, I was able to specify where dependencies should load from (the same directory as the original .dll/.so/.dylib, i.e. $ORIGIN on Linux). Anyway, now dotnet run works everywhere and users on other platforms don't have to copy the libraries around manually. The worst part? Realizing I had 10 or so libraries that used the same buggy library loading code that didn't include the Resolver. I got those all fixed and sent them up to GitHub and NuGet. Honestly, I feel much better about the whole thing, knowing that it's working for everyone (hopefully).

So let's take stock: approximately 12 to 14 hours invested, and now we have a new (and the fastest) entrant for doing all sorts of things using BLAKE3 on cross-platform .NET. A small price to pay if it helps even one developer out of a jam. I won't even talk about the 10 or so hours I spent on bringing SHA2 & SHA3 into the fold this past week. Why would I do such a thing when they're built in? My versions are faster, and they support a sane streaming interface for large hashes. Are they critical? Nope, I did it just because. But that's also why this article is about BLAKE3 and not SHA2 & SHA3. I'm proud of this one, guys. It really does go above and beyond, and it met all of my expectations.

Hope you guys enjoy!

GitHub

NuGet