How To Beat Array Iteration Performance: Part 2

After reading a comment from Simon on my previous post, I wanted to try out what he suggested. I was both interested to see how it compares to the original results, but more interested to see how it works.

A quick recap of the benchmark, can we improve upon array iteration performance by leveraging things other than the generic for or foreach over an array of long values. The original code can be found on my previous post. Lets take a look back at our original results:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
|                   Method | Count |        Mean |      Error |     StdDev |
|------------------------- |------ |------------:|-----------:|-----------:|
| RegularIterationLocalArr |  1000 |    511.4 ns |  12.572 ns |  21.686 ns |
|             VectorSumFor |  1000 |    152.4 ns |   3.066 ns |   3.149 ns |
|          VectorFromSlice |  1000 |    841.6 ns |  36.061 ns |  33.731 ns |
|------------------------- |------ |------------:|-----------:|-----------:|
| RegularIterationLocalArr |  5000 |  2,500.5 ns |  66.647 ns |  84.287 ns |
|             VectorSumFor |  5000 |    700.6 ns |  13.849 ns |  13.602 ns |
|          VectorFromSlice |  5000 |  4,205.3 ns |  80.670 ns |  99.070 ns |
|------------------------- |------ |------------:|-----------:|-----------:|
| RegularIterationLocalArr | 10000 |  4,988.3 ns |  86.900 ns |  81.286 ns |
|             VectorSumFor | 10000 |  1,408.9 ns |  31.184 ns |  32.023 ns |
|          VectorFromSlice | 10000 |  8,327.3 ns | 160.826 ns | 191.452 ns |

We can see here that using Vectors can really improve the performance of this synthetic benchmark compared to a standard for loop. The original post I read from Michael Shpilt on How to Beat Array Iteration Performance with Parallelism in C# .NET really breaks down different methods utilizing things like parallelization. Here, I want to focus on single thread performance.

Here is the code Simon suggested:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
[Benchmark]
public long VectorFromMemoryMarshal()
{
    var vectorSpan = System.Runtime.InteropServices.MemoryMarshal.Cast<long, Vector<long>>(Values);
    var previousVector = Vector<long>.Zero;
    for (var i = 0; i < vectorSpan.Length; i++)
    {
        previousVector += vectorSpan[i];
    }

    var finalResult = 0L;
    for (var i = vectorSpan.Length * Vector<long>.Count; i < Count; i++)
    {
        finalResult += Values[i];
    }

    for (var i = 0; i < Vector<long>.Count; i++)
    {
        finalResult += previousVector[i];
    }
    return finalResult;
}

Simon is making use of MemoryMarshal.Cast that is able to convert one struct to another at runtime in an unsafe manner. This is because a Vector<long> is just a memory location of fixed length that is broken up into 4 longs. I had to do some digging into the source of MemoryMarshal.Cast to really understand how this is working. But lets start with, does it work.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
|                   Method | Count |       Mean |      Error |     StdDev |
|------------------------- |------ |-----------:|-----------:|-----------:|
| RegularIterationLocalArr |  1000 |   440.2 ns |  2.5355 ns |  2.2476 ns |
|             VectorSumFor |  1000 |   129.2 ns |  0.7036 ns |  0.6582 ns |
|          VectorFromSlice |  1000 |   727.5 ns |  5.0348 ns |  4.7096 ns |
|  VectorFromMemoryMarshal |  1000 |   113.8 ns |  0.6595 ns |  0.5149 ns |
|------------------------- |------ |-----------:|-----------:|-----------:|
| RegularIterationLocalArr |  5000 | 2,183.7 ns | 26.8137 ns | 23.7696 ns |
|             VectorSumFor |  5000 |   626.3 ns |  6.0960 ns |  5.4039 ns |
|          VectorFromSlice |  5000 | 3,613.0 ns | 55.9070 ns | 52.2954 ns |
|  VectorFromMemoryMarshal |  5000 |   553.3 ns |  7.8000 ns |  6.9145 ns |
|------------------------- |------ |-----------:|-----------:|-----------:|
| RegularIterationLocalArr | 10000 | 4,366.9 ns | 56.8970 ns | 47.5116 ns |
|             VectorSumFor | 10000 | 1,237.2 ns | 24.3092 ns | 30.7434 ns |
|          VectorFromSlice | 10000 | 7,494.0 ns | 65.1237 ns | 60.9167 ns |
|  VectorFromMemoryMarshal | 10000 | 1,094.9 ns |  8.6731 ns |  8.1128 ns |

Yes, it does work, and you’ll notice it is actually a little bit faster than using an existing list of Vectors.

So looking at how this works is quite interesting. The key code (extracted from the source) is

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
uint fromSize = (uint)Unsafe.SizeOf<TFrom>();
uint toSize = (uint)Unsafe.SizeOf<TTo>();
uint fromLength = (uint)span.Length;

ulong toLengthUInt64 = (ulong)fromLength * (ulong)fromSize / (ulong)toSize;
int toLength = checked((int)toLengthUInt64);

return new ReadOnlySpan<TTo>(
    ref Unsafe.As<TFrom, TTo>(ref MemoryMarshal.GetReference(span)),
    toLength);

What this is code is essentially doing is checking if one type can “fit” into the other. In our case, they all divide evenly:

1
2
3
4
5
fromSize:        8
toSize:          32
fromLength:      1000
toLengthUInt64:  250
toLength:        250

This means our 1000 long values will fit into 250 Vector<long> values. The last line of the method is actually performing the cast, but this happens inside the Unsafe class as part of CLR platform intrinsics code.

The reason this cast even works has to do with how the memory for an array of struct values is allocated. This is not like a linked list, so there are no pointers to the next value, they simply sit right next to each other in memory. If you allocate an array of 1000 long values, 8000 bytes of memory (plus some overhead) will be allocated for it. Normally when accessing the n-th item in an array of long, its just a matter of multiplying n by the length of the struct and advancing that far from the start location. So converting the array of long to an array of Vector<long> is just telling the CLR “I know you thought this was a bunch of long values, but its really a bunch of Vector<long>. So accessing each Vector<long> in the array is the same thing as accessing the n-th item in an array of long except when you want the n-th Vector<long>, you advance 4 x length of a long (depending on the platform specific size of Vector<long>).

It is important to note that this won’t work for just any struct or class, if Vector<long> had extra fields on it, this would not work. In this way Vector<T> is a special type.

Because of the nature of this cast, if the input Span does not divide evenly into the destination, there will be data loss, with no error reported.

Here’s a quick example in LINQPad:

1
2
3
4
5
6
var arr = new long[] { 1, 2, 3, 4 };
var vector = System.Runtime.InteropServices.MemoryMarshal.Cast<long, Vector<long>>(arr);
var arr2 = new long[] { 1, 2, 3};
var vector2 = System.Runtime.InteropServices.MemoryMarshal.Cast<long, Vector<long>>(arr2);
Console.WriteLine("Full Vector: " + string.Join(",", vector.ToArray()));
Console.WriteLine("Incomplete Vector: " + string.Join(",", vector2.ToArray()));

Output:

1
2
Full Vector: <1, 2, 3, 4>
Incomplete Vector: 

No errors, but the cast from the incomplete long array to a vector resulted in an empty vector.

I learned a lot while researching for this post. A big thanks to Michael for their original post and to Simon for their comment that pushed me down this path.

As always, I hope this helps and happy programming.

comments powered by Disqus