I read an excellent post today by Michael Shpilt on How to Beat Array Iteration Performance with Parallelism in C# .NET, and one of the comments got me thinking. .NET Core has platform intrinsics that let us use specialized instructions like SIMD (Single Instruction, Multiple Data). In the very hypothetical case of summing the values in a large array, can we leverage these to get faster results? The short answer is: sort of.
Let's start with some sample code.
```csharp
using System;
using System.Linq;
using System.Numerics;
using BenchmarkDotNet.Attributes;

public class IteratorSum
{
    [Params(1000, 5000, 10_000)]
    public long Count { get; set; }

    public long[] Values;
    public Vector<long>[] VectorValues;

    [GlobalSetup]
    public void Setup()
    {
        Values = new long[Count];
        var rnd = new Random();
        for (long i = 0; i < Count; i++)
        {
            Values[i] = rnd.Next(1000);
        }

        // Pre-build the vectors outside the measured benchmark body.
        // Partition is a helper extension method (more on this below).
        VectorValues = Values.Partition(Vector<long>.Count)
            .Select(x => new Vector<long>(x.ToArray()))
            .ToArray();
    }

    // Baseline: a plain scalar loop over the array.
    [Benchmark]
    public long RegularIterationLocalArr()
    {
        long total = 0;
        for (long i = 0; i < Count; i++)
        {
            total += Values[i];
        }
        return total;
    }

    // SIMD sum via LINQ's Aggregate.
    [Benchmark]
    public long VectorSumLinq()
    {
        var previousVector = VectorValues.Aggregate(Vector<long>.Zero,
            (current, vector) => Vector.Add(current, vector));

        // Horizontal sum: add up the lanes of the accumulated vector.
        var finalResult = 0L;
        for (var i = 0; i < Vector<long>.Count; i++)
        {
            finalResult += previousVector[i];
        }
        return finalResult;
    }

    // SIMD sum via a plain for loop.
    [Benchmark]
    public long VectorSumFor()
    {
        var previousVector = Vector<long>.Zero;
        for (var i = 0; i < VectorValues.Length; i++)
        {
            previousVector = Vector.Add(previousVector, VectorValues[i]);
        }

        var finalResult = 0L;
        for (var i = 0; i < Vector<long>.Count; i++)
        {
            finalResult += previousVector[i];
        }
        return finalResult;
    }
}
```
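If you want to reproduce these numbers, a minimal BenchmarkDotNet entry point (assuming the BenchmarkDotNet NuGet package is installed) looks like this:

```csharp
using BenchmarkDotNet.Running;

public class Program
{
    public static void Main(string[] args)
    {
        // BenchmarkDotNet runs each [Benchmark] method in a Release build,
        // once per [Params] value, and reports Mean/Error/StdDev.
        BenchmarkRunner.Run<IteratorSum>();
    }
}
```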
This yields some interesting results (at least interesting for me).
| Method | Count | Mean | Error | StdDev |
|------------------------- |------ |------------:|-----------:|-----------:|
| RegularIterationLocalArr | 1000 | 511.4 ns | 12.572 ns | 21.686 ns |
| VectorSumLinq | 1000 | 1,408.7 ns | 26.552 ns | 27.267 ns |
| VectorSumFor | 1000 | 152.4 ns | 3.066 ns | 3.149 ns |
|------------------------- |------ |------------:|-----------:|-----------:|
| RegularIterationLocalArr | 5000 | 2,500.5 ns | 66.647 ns | 84.287 ns |
| VectorSumLinq | 5000 | 7,291.5 ns | 144.388 ns | 281.617 ns |
| VectorSumFor | 5000 | 700.6 ns | 13.849 ns | 13.602 ns |
|------------------------- |------ |------------:|-----------:|-----------:|
| RegularIterationLocalArr | 10000 | 4,988.3 ns | 86.900 ns | 81.286 ns |
| VectorSumLinq | 10000 | 15,729.6 ns | 314.279 ns | 641.988 ns |
| VectorSumFor | 10000 | 1,408.9 ns | 31.184 ns | 32.023 ns |
First, we have our baseline result of 511.4 ns (nanoseconds) per operation. This is the time it takes to sum the array of 1000 items with a plain loop. The next 2 results are from code that should make use of SIMD. These results yield 2 interesting bits of information:
- Using LINQ instead of `for` loops yields very slow results, roughly 900 ns more per operation than the baseline. For those performance junkies out there, this probably isn't terribly surprising.
- The results scale in a slightly better than linear fashion: VectorSumFor takes 152.4 ns for 1000 items but only 1,408.9 ns, rather than 1,524 ns, for 10,000.
So if we ignore the fancy language features for now and let the compiler do its best with a plain `for` loop, we get much faster times: VectorSumFor is over three times as fast as the baseline (152.4 ns vs 511.4 ns).
This seems like a no-brainer: if you need to sum a lot of numbers, use `Vector<T>` and move on. But these tests are hiding something from you. The key to their speed is the proper creation of the vectors ahead of time.
```csharp
VectorValues = Values.Partition(Vector<long>.Count)
    .Select(x => new Vector<long>(x.ToArray()))
    .ToArray();
```
What this code does is ask the platform how many `long` values fit in a `Vector<long>`. This may be 2, 4, or more, depending on what the processor can execute with a single SIMD instruction; in my case, it is 4. So I split the array of random values into groups of 4, then construct a `Vector<long>` from each group. If this construction were included in the measured time, it would add significant overhead to each run. However, since we don't include the construction of the source array in the other benchmarks either, it is fair to exclude it. The key here is that the operation is only faster if you already have vector data.
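One caveat: `Partition` is not part of the BCL, so the benchmark assumes a small helper extension (on .NET 6+, `Enumerable.Chunk` does the same job). A minimal sketch of what it might look like:

```csharp
using System.Collections.Generic;

public static class PartitionExtensions
{
    // Splits a sequence into consecutive batches of the given size.
    // A hypothetical stand-in for the Partition helper used above.
    public static IEnumerable<IEnumerable<T>> Partition<T>(this IEnumerable<T> source, int size)
    {
        var batch = new List<T>(size);
        foreach (var item in source)
        {
            batch.Add(item);
            if (batch.Count == size)
            {
                yield return batch;
                batch = new List<T>(size);
            }
        }

        // A trailing partial batch would make the Vector<long> constructor
        // throw; the benchmark counts (1000, 5000, 10_000) happen to be
        // multiples of Vector<long>.Count on my machine.
        if (batch.Count > 0)
        {
            yield return batch;
        }
    }
}
```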
But wait, you say, `Vector<long>` takes a `Span<long>` as a constructor parameter; couldn't we create the vectors very quickly in a `for` loop while slicing the `Span`? Why yes, yes you can. Let us see how that works.
```csharp
[Benchmark]
public long VectorFromSlice()
{
    var vectorSpan = new Span<long>(Values);
    var previousVector = Vector<long>.Zero;

    // Build each vector on the fly from a slice of the span.
    // Note: this assumes Values.Length is a multiple of Vector<long>.Count.
    for (var i = 0; i < Values.Length; i += Vector<long>.Count)
    {
        previousVector = Vector.Add(previousVector,
            new Vector<long>(vectorSpan.Slice(i, Vector<long>.Count)));
    }

    var finalResult = 0L;
    for (var i = 0; i < Vector<long>.Count; i++)
    {
        finalResult += previousVector[i];
    }
    return finalResult;
}
```
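Note that slicing in fixed-size steps only works cleanly when the array length is an exact multiple of `Vector<long>.Count`. For arbitrary lengths, the usual pattern is to vectorize the bulk and finish the remainder with a scalar loop. A sketch, using the same `System` and `System.Numerics` imports as above (the method name is mine, not from the benchmark):

```csharp
public static long VectorSumWithTail(long[] values)
{
    var span = new Span<long>(values);
    var acc = Vector<long>.Zero;

    // Index of the first element that doesn't fit in a full vector.
    var lastBlock = values.Length - values.Length % Vector<long>.Count;

    for (var i = 0; i < lastBlock; i += Vector<long>.Count)
    {
        acc = Vector.Add(acc, new Vector<long>(span.Slice(i, Vector<long>.Count)));
    }

    // Horizontal sum of the accumulator lanes.
    var total = 0L;
    for (var i = 0; i < Vector<long>.Count; i++)
    {
        total += acc[i];
    }

    // Scalar loop for the leftover tail elements.
    for (var i = lastBlock; i < values.Length; i++)
    {
        total += values[i];
    }
    return total;
}
```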
Running this through our trusty benchmark tool gives us:
| Method | Count | Mean | Error | StdDev |
|------------------------- |------ |------------:|-----------:|-----------:|
| RegularIterationLocalArr | 1000 | 511.4 ns | 12.572 ns | 21.686 ns |
| VectorSumFor | 1000 | 152.4 ns | 3.066 ns | 3.149 ns |
| VectorFromSlice | 1000 | 841.6 ns | 36.061 ns | 33.731 ns |
|------------------------- |------ |------------:|-----------:|-----------:|
| RegularIterationLocalArr | 5000 | 2,500.5 ns | 66.647 ns | 84.287 ns |
| VectorSumFor | 5000 | 700.6 ns | 13.849 ns | 13.602 ns |
| VectorFromSlice | 5000 | 4,205.3 ns | 80.670 ns | 99.070 ns |
|------------------------- |------ |------------:|-----------:|-----------:|
| RegularIterationLocalArr | 10000 | 4,988.3 ns | 86.900 ns | 81.286 ns |
| VectorSumFor | 10000 | 1,408.9 ns | 31.184 ns | 32.023 ns |
| VectorFromSlice | 10000 | 8,327.3 ns | 160.826 ns | 191.452 ns |
As you can see, it is still faster than using LINQ, but it takes roughly 1.7x as long as the regular iteration. Again, this is mainly due to the overhead of constructing all of the vectors. So, only if we ignore the construction time can we safely say that using SIMD greatly improves our performance in this synthetic benchmark.
As always, I hope this helps, and happy programming.
I created a follow-up to this here.