After reading a comment from Simon on my previous post, I wanted to try out what he suggested. I was both interested to see how it compares to the original results, but more interested to see how it works.
A quick recap of the benchmark, can we improve upon array iteration performance by leveraging things other than the generic
foreach over an array of
long values. The original code can be found on my previous post. Lets take a look back at our original results:
We can see here that using Vectors can really improve the performance of this synthetic benchmark compared to a standard
for loop. The original post I read from Michael Shpilt on How to Beat Array Iteration Performance with Parallelism in C# .NET really breaks down different methods utilizing things like parallelization. Here, I want to focus on single thread performance.
Here is the code Simon suggested:
Simon is making use of
MemoryMarshal.Cast that is able to convert one struct to another at runtime in an unsafe manner. This is because a
Vector<long> is just a memory location of fixed length that is broken up into 4 longs. I had to do some digging into the source of
MemoryMarshal.Cast to really understand how this is working. But lets start with, does it work.
Yes, it does work, and you’ll notice it is actually a little bit faster than using an existing list of Vectors.
So looking at how this works is quite interesting. The key code (extracted from the source) is
What this is code is essentially doing is checking if one type can “fit” into the other. In our case, they all divide evenly:
This means our 1000
long values will fit into 250
Vector<long> values. The last line of the method is actually performing the cast, but this happens inside the
Unsafe class as part of CLR platform intrinsics code.
The reason this cast even works has to do with how the memory for an array of
struct values is allocated. This is not like a linked list, so there are no pointers to the next value, they simply sit right next to each other in memory. If you allocate an array of 1000
long values, 8000 bytes of memory (plus some overhead) will be allocated for it. Normally when accessing the
n-th item in an array of
long, its just a matter of multiplying
n by the length of the
struct and advancing that far from the start location. So converting the array of
long to an array of
Vector<long> is just telling the CLR “I know you thought this was a bunch of
long values, but its really a bunch of
Vector<long>. So accessing each
Vector<long> in the array is the same thing as accessing the
n-th item in an array of
long except when you want the
Vector<long>, you advance 4 x length of a
long (depending on the platform specific size of
It is important to note that this won’t work for just any
Vector<long> had extra fields on it, this would not work. In this way
Vector<T> is a special type.
Because of the nature of this cast, if the input Span does not divide evenly into the destination, there will be data loss, with no error reported.
Here’s a quick example in LINQPad:
No errors, but the cast from the incomplete
long array to a vector resulted in an empty vector.
I learned a lot while researching for this post. A big thanks to Michael for their original post and to Simon for their comment that pushed me down this path.
As always, I hope this helps and happy programming.