High Performance Log parsing in C# - Second(ish) Attempt
Every few months, I would learn something new about .NET through the normal course of my reading. I invested in a copy of R# Ultimate, getting me access to dotTrace and dotMemory. Through profiling my application I learned some very interesting things, which, in hindsight, make perfect sense.
- Regex is slooooooooooooooooooooooooooooooowwwwwwwwwwwwwwwwwww
- Calling Substring allocates a completely new string
- Garbage Collection can be really expensive, I mean really expensive
Regex
So it turns out regular expressions are really slow, especially when compared to String.Contains and String.IndexOf. I won’t get into the nitty-gritty details; if you want to see numbers, theburningmonk.com has a great blog post on the subject. So I decided, hey, I’m going to get rid of Regex and start using String.IndexOf and String.Substring everywhere.
This was an excellent idea. I improved my parsing performance by ~3x. So instead of 3-5 MB/s, I was getting 9-15 MB/s. I patted myself on the back, called myself a programming genius, and grabbed a glass of whiskey from the bar in the office.
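To make the swap concrete, here is a minimal sketch of pulling one field out of a log line both ways. The log format, the LevelParser class, and both method names are made up for illustration; my real parser handled a different format and more fields. Note the two versions are not strictly equivalent: the regex only accepts word characters between the brackets, while the IndexOf version takes whatever is there.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LevelParser
{
    // Assumed (hypothetical) log format:
    // "2016-03-01 12:00:00 [INFO] Service started"
    private static readonly Regex LevelPattern =
        new Regex(@"\[(\w+)\]", RegexOptions.Compiled);

    // Regex version: captures the word between the first pair of brackets.
    // Returns an empty string when no level is found.
    public static string LevelWithRegex(string line)
    {
        Match match = LevelPattern.Match(line);
        return match.Success ? match.Groups[1].Value : string.Empty;
    }

    // IndexOf/Substring version: finds the brackets by position and
    // slices out the text between them.
    public static string LevelWithIndexOf(string line)
    {
        int open = line.IndexOf('[');
        if (open < 0) return string.Empty;

        int close = line.IndexOf(']', open + 1);
        if (close < 0) return string.Empty;

        return line.Substring(open + 1, close - open - 1);
    }
}
```

For the sample line in the comment, both methods return the same level string. The second one just gets there without running the regex engine.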
Substring / GC
I settled in and ran my profiler to see if there was anything else I could improve. dotTrace showed my GC time had tripled and I was spending about 10% of execution time in full GC. Well, that’s less than ideal. WHAT HAPPENED?!
Looking at the allocations, I could see hundreds of millions of strings. Uhm, well, that seems like a lot. Then it dawned on me: strings are immutable! Every time I sliced up a string, I got a whole new string and all the glorious memory allocations that come with it. No wonder my heap had turned into Swiss cheese!
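Here is a small sketch of why the string count explodes. SubstringDemo and its methods are hypothetical names for illustration, not code from my parser. The first method splits a line into fields with IndexOf and Substring, producing one new string per field. The second shows that slicing the same range twice gives two separate objects with equal contents.

```csharp
using System;
using System.Collections.Generic;

public static class SubstringDemo
{
    // Splits a line on a delimiter using IndexOf + Substring.
    // Each non-empty field that is shorter than the whole line
    // is a freshly allocated string on the heap.
    public static List<string> SplitFields(string line, char delimiter)
    {
        var fields = new List<string>();
        int start = 0;
        while (true)
        {
            int end = line.IndexOf(delimiter, start);
            if (end < 0)
            {
                // Last field: everything after the final delimiter.
                fields.Add(line.Substring(start));
                break;
            }
            fields.Add(line.Substring(start, end - start));
            start = end + 1;
        }
        return fields;
    }

    // Slices the same range twice and reports whether both slices are
    // the same object. Assumes the line is longer than 4 characters.
    public static bool SameSliceIsSameObject(string line)
    {
        string first = line.Substring(0, 4);
        string second = line.Substring(0, 4);
        return object.ReferenceEquals(first, second);
    }
}
```

Multiply one allocation per field by the number of fields per line and the number of lines in a big log file, and the garbage collector suddenly has a lot of short-lived strings to clean up.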
At this point I was a bit baffled. I honestly had no idea what to do about this new problem. But was it really a problem? After all, I got a speed improvement and things were humming along happily.
Yes, it was still a problem. The hacker inside me said, there must be a better way. But what was it…