High Performance Log parsing in C# - Second(ish) Attempt

Every few months, I would learn something new about .NET through the normal course of my reading. I invested in a copy of R# Ultimate, getting me access to dotTrace and dotMemory. Through profiling my application I learned some very interesting things, which, in hindsight, make perfect sense.

  • Regex is slooooooooooooooooooooooooooooooowwwwwwwwwwwwwwwwwww
  • Calling Substring allocates a completely new string
  • Garbage Collection can be really expensive, I mean really expensive

Regex

So it turns out, Regular Expressions are really slow, especially when compared to String.Contains and String.IndexOf. I won’t get into the nitty-gritty details; if you want to see numbers, theburningmonk.com has a great blog post on the subject. So I decided, hey, I’m going to get rid of Regex and start using String.IndexOf and String.Substring everywhere.

This was an excellent idea. I improved my parsing performance by ~3x. So instead of 3-5 MB/s, I was getting 9-15 MB/s. I patted myself on the back, called myself a programming genius, and grabbed a glass of whiskey from the bar in the office.
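To make the swap concrete, here is a minimal sketch of the kind of change involved. It is not the actual parser code: the log line format, the class name LevelExtractor, and both method names are made up for illustration. The two methods pull the same bracketed field out of a line, one with Regex and one with IndexOf and Substring.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LevelExtractor
{
    // Assumed (hypothetical) line format:
    //   "2015-06-01 12:00:00 [ERROR] Something broke"
    private static readonly Regex LevelPattern =
        new Regex(@"\[(\w+)\]", RegexOptions.Compiled);

    // Regex version: matches the first "[word]" and returns the word,
    // or an empty string when nothing matches.
    public static string WithRegex(string line)
    {
        Match match = LevelPattern.Match(line);
        return match.Success ? match.Groups[1].Value : string.Empty;
    }

    // IndexOf/Substring version: finds the first '[' and the next ']'
    // and returns whatever sits between them, or an empty string.
    public static string WithIndexOf(string line)
    {
        int open = line.IndexOf('[');
        if (open < 0) return string.Empty;

        int close = line.IndexOf(']', open + 1);
        if (close < 0) return string.Empty;

        return line.Substring(open + 1, close - open - 1);
    }
}
```

The two versions are not strictly equivalent (the regex insists on word characters between the brackets, the IndexOf version takes anything), which is the usual trade-off when hand-rolling a parser for a format you control.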

Substring / GC

I settled in and ran my profiler to see if there was anything else I could improve. dotTrace showed my GC time tripled and I was spending about 10% of execution time in full GC. Well that’s less than ideal. WHAT HAPPENED?!

Looking at the allocations, I could see hundreds of millions of strings. Uhm, well, that seems like a lot. Then it dawned on me: strings are immutable! Every time I sliced up a string, I got a whole new string and all the glorious memory allocations that come with it. No wonder my heap had turned into Swiss cheese!
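If you want to see this for yourself without a profiler, a rough sketch like the one below should do it. The class and method names are hypothetical, and it assumes a runtime where GC.GetAllocatedBytesForCurrentThread is available (.NET Core 3.0 or later, or .NET Framework 4.8). It counts the bytes allocated by a loop of Substring calls and checks that two identical slices are separate objects.

```csharp
using System;

public static class SubstringAllocations
{
    // Returns the bytes allocated on the current thread by `count`
    // Substring calls. Assumes `line` is longer than 10 characters,
    // so each call has to produce a new, shorter string.
    public static long MeasureBytes(string line, int count)
    {
        long before = GC.GetAllocatedBytesForCurrentThread();

        for (int i = 0; i < count; i++)
        {
            string slice = line.Substring(0, 10);
            GC.KeepAlive(slice); // keep the slice from being optimized away
        }

        return GC.GetAllocatedBytesForCurrentThread() - before;
    }

    // Two slices of the same range have equal contents but are
    // distinct objects on the heap.
    public static bool SliceIsNewObject(string line)
    {
        string a = line.Substring(0, 10);
        string b = line.Substring(0, 10);
        return a == b && !object.ReferenceEquals(a, b);
    }
}
```

Calling MeasureBytes with a typical log line and a large count gives a number that grows with the count, which is the same story dotTrace was telling, just without the pretty graphs.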

At this point I was a bit baffled. I honestly had no idea what to do about this new problem. But was it really a problem? After all, I got a speed improvement and things were humming along happily.

Yes, it was still a problem. The hacker inside me said, there must be a better way. But what was it…
