I recently worked on a new UI-less application. The application read some data then sent it to another server over a TCP connection. There could be lots of data and several servers so the application was multithreaded.
Every few hours the application would stop responding. Windows reported the service was running so I know it didn’t crash.
This didn’t make much sense because an exception or deadlock on one thread shouldn’t stop other threads. I pushed that thought to the back of my mind. There was definitely a bug somewhere.
My initial thought was some deadlock in my tcp connection. I was using an asynchronous API that I wasn’t familiar with so that seems like a good starting point. I double checked the documentation and made a few changes. I pushed my changes to the server and… it deadlocked in a few hours.
Next idea – maybe the application isn’t unregistering all the event handlers. Maybe I somehow ended up with thousands of timer events. Sure enough the timers were not stopped correctly. I updated my timer stop methods to unregister the events and… no luck, it still deadlocks.
This time I used Process Explorer to inspect the application while it was in the deadlock state. Cpu usage was 0% and it was using about 100MB of RAM. Neither of those seemed like an issue. Then I noticed the threads. It processes owned over 2000 threads. This seemed outrageous. Each thread showed the start address as a timer elapsed event. This isn’t too helpful because most of my logic happened in the timer event.
Using Process Explorer I was able to see over 2000 threads but I couldn’t see a stack track. I needed a different tool to do that.
I thought about using Process Monitor. It is a really powerful tool, but I don’t use it often and it’s usually difficult for me to filter through everything.
Then I thought about creating a minidump and digging through that. I’ve only done that once before. I couldn’t remember the exact steps but I remember it being somewhat difficult to get setup.
Then I remembered a new tool I just got a license for – dotTrace. It is part of the Resharper Ultimate package. I decided to give that a try. In minutes I had a beautiful snapshot that show the stack trace for all 2000 of my threads. I started looking through it.
I started expanding the stack traces for each thread. Finally it made sense! All threads were stuck at the same spot. It wasn’t my timer. It wasn’t my event handlers. It wasn’t the TCP connections.
It was the logging library it was using! LOGS! Logs are supposted to save the devs! At that moment I had to laugh because all I can think about was Obi-Wan talking to the logging server.
It was said that you would destroy the bugs, not create them!-I’m pretty sure that’s what Obi-Wan said
The logging library I was using is an open source project on github. I got lucky with this one. There was an issue fix two months ago that fixed my exact issue! All I had to do was update my nuget package and hope for the best.
…And it worked! After days of dissecting, optimizing and reworking code my solution was to click one button in Visual Studio.
As a developer, you need to be able to laugh at these situations. People go down the wrong path all the time. Going down the wrong path seems like a good blog post…
Why didn’t you update your nugets in the first place?
Great question. The logging library I am using was wrapped in another private my company uses to standardize logs across the enterprise. That was on the latest version. To Visual Studio everything was up to date.