Thursday, July 23, 2009

Windows Workflows that restart themselves?

A few weeks ago I got assigned a fun bug. We have a WF-based order system, and it seemed that occasionally orders that had been sent earlier would start all over again by themselves.
And sure enough, when I looked at the logs I saw orders starting over. So how did this happen?

First off, I noticed it only happened with orders that required a callback in their process: orders that, at one point or another, would be idle and waiting for an external event. All other orders never gave a problem.
Then I also noticed that the orders restarted themselves after the system itself had been idle for around 30 minutes.
So workflows that were idle were restarting after 30 minutes of system inactivity. They were re-processed when a new order was sent (you'd then see a whole slew of old orders restarting).

Whenever something happens on IIS after around 30 minutes of inactivity, I assume it's related to the fact that after 20 minutes of inactivity (by default) IIS will shut down the web service/application to spare resources. The next request that comes in will cause IIS to start up the web service/application again. So it's pretty much safe to assume the problem is related to the web service stopping and restarting.
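
For reference, this is roughly where that idle timeout lives. The fragment below is a sketch that assumes IIS 7 or later, where the setting is part of applicationHost.config (on IIS 6 it is set in the application pool's properties instead). The pool name "OrderServicePool" is made up for illustration.

```xml
<!-- Sketch of a fragment of applicationHost.config (IIS 7+). -->
<system.applicationHost>
  <applicationPools>
    <!-- "OrderServicePool" is a hypothetical pool name -->
    <add name="OrderServicePool">
      <!-- idleTimeout is hh:mm:ss; 00:20:00 is the default,
           00:00:00 disables the idle shutdown -->
      <processModel idleTimeout="00:20:00" />
    </add>
  </applicationPools>
</system.applicationHost>
```

Raising or disabling the timeout would only hide a problem like this one: the application still stops on recycles, deployments and reboots.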

So why would an order start all over again after the web service has been restarted? The most likely reason would be that the Workflow Runtime was unable to save its state. Without knowing where the workflow left off, but knowing that the workflow does exist, it seems logical the workflow would just restart.

We checked the config and it did have the WorkflowPersistenceService configured and loaded. However, its "UnloadOnIdle" setting was missing, which means it defaults to false, which in turn means workflows don't unload when they are idle. And more importantly: a workflow is only persisted when you explicitly tell it to, or when it is unloaded.
Since neither happened on our system the workflows never stored their state and restarted when the web service restarted.
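
To make that concrete, here is a minimal sketch of what the relevant part of a web.config could look like with the setting in place. It assumes the out-of-the-box SqlWorkflowPersistenceService from WF 3.x; the runtime name, the connection string and the LoadIntervalSeconds value are placeholders, not the values from our system.

```xml
<configuration>
  <configSections>
    <section name="WorkflowRuntime"
             type="System.Workflow.Runtime.Configuration.WorkflowRuntimeSection, System.Workflow.Runtime, Version=3.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35" />
  </configSections>

  <!-- "OrderWorkflowRuntime" is a hypothetical name -->
  <WorkflowRuntime Name="OrderWorkflowRuntime">
    <Services>
      <!-- UnloadOnIdle="true" makes the runtime persist and unload a workflow
           as soon as it goes idle. Leave the attribute out and it is false.
           The connection string is a placeholder. -->
      <add type="System.Workflow.Runtime.Hosting.SqlWorkflowPersistenceService, System.Workflow.Runtime, Version=3.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35"
           ConnectionString="Data Source=(local);Initial Catalog=WorkflowPersistence;Integrated Security=True"
           UnloadOnIdle="true"
           LoadIntervalSeconds="5" />
    </Services>
  </WorkflowRuntime>
</configuration>
```

With this in place, an order waiting for its callback is written to the persistence store the moment it goes idle, so a restart of the web service can pick it up where it left off.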

Of course, we did not figure this out that quickly. I assumed the operational engineers would use the configuration we'd sent them, so I never bothered to check it. If I had, I would have solved this bug in a matter of minutes. Instead it took us days of prodding, testing and praying, until another developer mentioned he had noticed the config looked almost the same as ours, but not quite. *sigh*.
Bugtracking rule #1: Always check the config!