How To Beat Array Iteration Performance: Part 2
After reading a comment from Simon on my previous post, I wanted to try out what he suggested. I was both interested to see how it compares to the original results, but more interested to see how it works.
A quick recap of the benchmark, can we improve upon array iteration performance by leveraging things other than the generic for
or foreach
over an array of long
values. The original code can be found on my previous post. Lets take a look back at our original results:
|
|
We can see here that using Vectors can really improve the performance of this synthetic benchmark compared to a standard for
loop. The original post I read from Michael Shpilt on How to Beat Array Iteration Performance with Parallelism in C# .NET really breaks down different methods utilizing things like parallelization. Here, I want to focus on single thread performance.
Here is the code Simon suggested:
|
|
Simon is making use of MemoryMarshal.Cast
that is able to convert one struct to another at runtime in an unsafe manner. This is because a Vector<long>
is just a memory location of fixed length that is broken up into 4 longs. I had to do some digging into the source of MemoryMarshal.Cast
to really understand how this is working. But lets start with, does it work.
|
|
Yes, it does work, and you’ll notice it is actually a little bit faster than using an existing list of Vectors.
So looking at how this works is quite interesting. The key code (extracted from the source) is
|
|
What this is code is essentially doing is checking if one type can “fit” into the other. In our case, they all divide evenly:
|
|
This means our 1000 long
values will fit into 250 Vector<long>
values. The last line of the method is actually performing the cast, but this happens inside the Unsafe
class as part of CLR platform intrinsics code.
The reason this cast even works has to do with how the memory for an array of struct
values is allocated. This is not like a linked list, so there are no pointers to the next value, they simply sit right next to each other in memory. If you allocate an array of 1000 long
values, 8000 bytes of memory (plus some overhead) will be allocated for it. Normally when accessing the n
-th item in an array of long
, its just a matter of multiplying n
by the length of the struct
and advancing that far from the start location. So converting the array of long
to an array of Vector<long>
is just telling the CLR “I know you thought this was a bunch of long
values, but its really a bunch of Vector<long>
. So accessing each Vector<long>
in the array is the same thing as accessing the n
-th item in an array of long
except when you want the n
-th Vector<long>
, you advance 4 x length of a long
(depending on the platform specific size of Vector<long>
).
It is important to note that this won’t work for just any struct
or class
, if Vector<long>
had extra fields on it, this would not work. In this way Vector<T>
is a special type.
Because of the nature of this cast, if the input Span does not divide evenly into the destination, there will be data loss, with no error reported.
Here’s a quick example in LINQPad:
|
|
Output:
|
|
No errors, but the cast from the incomplete long
array to a vector resulted in an empty vector.
I learned a lot while researching for this post. A big thanks to Michael for their original post and to Simon for their comment that pushed me down this path.
As always, I hope this helps and happy programming.