Monday, October 19, 2015

[IndieDev] How a Series of Sane Decisions Created An Esoteric Bug

If you follow me on Facebook or Twitter, you may have noticed me crowing last week about finally nailing a bug that's been vexing me for over a year. The bug itself is annoying and confusing, but not game-breaking, so we never really put much time into hunting it down until recently (since I'm finally nearly caught up on all of my critical bugs for Eon Altar, yay!). I've looked at it on and off over the past year and kept coming away confused.

You can see the bug in action in the animated gif above this paragraph. Simply put, sometimes when an enemy died, they would partially get up and then get stuck somewhere between lying on the ground and standing straight. Sometimes the enemy would get all the way back up off the ground.

As you can imagine, this was confusing in gameplay, as it made it rather difficult at times to tell whether an enemy was dead or alive. It also looked pretty amateurish, though a number of our Early Access players have told me they've seen similar bugs in shipped games, so I guess it wasn't the worst bug in the world. Still, I really wanted to fix it.

For the longest time we thought this was an animation bug of sorts. Our death animations were set to loop, and changing them to not loop did sort of almost solve the issue in some cases. We also had issues with the animation system we were using not having animation events, so we couldn't fire events based on the animation frame, meaning we had to guess as to when the animation was complete based on the running time and elapsed time. But even setting that to cut it off at 98% or even 95% of total time still didn't fix the issue.
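For illustration, the "guess when the animation is done" workaround might look something like the sketch below. The function name and the 0.95 default are hypothetical stand-ins; our real code lives in the engine's animation layer, not in Python.

```python
def death_animation_done(elapsed_seconds, clip_length_seconds, cutoff=0.95):
    """Guess whether a clip has finished when no animation events are available."""
    # Without frame events, compare elapsed time against a fraction of the clip length.
    return elapsed_seconds >= clip_length_seconds * cutoff


# A 2-second death clip checked at two points in time.
print(death_animation_done(1.0, 2.0))   # False
print(death_animation_done(1.95, 2.0))  # True
```

A check like this only decides when to stop the clip. It says nothing about which animation the system picks next, which is why tuning the cutoff never made the bug go away.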

The other problem was that the bug didn't seem to be deterministic. That is, we hadn't found a repro case where we could get the bug to show up 100% of the time. Anybody who's had to debug something that doesn't repro consistently is probably shuddering right now. Non-deterministic bugs are terrible to figure out. Sometimes the death animation would work, sometimes it wouldn't and they'd get back up. The only clue we had was that the issue affected human NPCs almost exclusively.

Getting More Clues

Over the course of the project, I'd probably spent a good 5 programmer days hunting this damn thing down, investigating a bunch of systems that I knew and some I didn't know offhand to try and figure this out. In the end, it turned out to be a confluence of a number of different design/engineering decisions coming together and creating a vexing environment where it became very difficult to troubleshoot the issue.

The decisions in question were:
  • The "stance" animation system (what determined the default idle state of the animation) was a data-driven state machine. When certain data points on the actor changed (down, concealed, combat, exploration), the stance animation system would switch the default idle animation based on a priority system, Down/Dead->Concealed->Combat->Exploration. The highest priority active state would dictate what the idle animation was.
  • Down/Dead was data driven, rather than state driven. It was purely based on whether the actor had more damage than they had maximum health.
  • To create a "corpse", once an NPC actor is dead, a set of scripts run and delete everything off the actor, except for the actor's visual. This would leave a model with no AI, no animator, and so on.
  • Status effects (HoTs, DoTs, buffs, traits, etc.) would clear on down/death.
  • Design created an enemy spawning system where, to create variety in enemies, it started with a template and randomly added buffs/debuffs for differentiation. For example, a Wounded Sellsword would have -5 Fortitude compared to a Sellsword Initiate (the base template). Design currently only uses this differentiation on a subset of NPCs.
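
As a rough sketch of the priority rule in the first bullet, assuming the stances can be modeled as an ordered enum (the names here are made up for the example and are not our actual code):

```python
from enum import IntEnum


class Stance(IntEnum):
    # Lower value = higher priority: Down/Dead -> Concealed -> Combat -> Exploration.
    DOWN = 0
    CONCEALED = 1
    COMBAT = 2
    EXPLORATION = 3


def pick_idle_stance(active):
    """Return the highest-priority stance among the currently active ones."""
    # Falling back to Exploration when nothing is active is an assumption.
    return min(active, default=Stance.EXPLORATION)


print(pick_idle_stance({Stance.COMBAT, Stance.DOWN}).name)  # DOWN
print(pick_idle_stance({Stance.COMBAT}).name)               # COMBAT
```

The important property is that the idle animation is recomputed from whatever is active right now, so if Down ever stops being active, the next-highest stance takes over immediately.
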
Using all of that data, can you take a guess at what the actual bug was?

If you guessed that a differentiation status effect with a health reduction was getting cleared on death, "healing" the actor, then congrats, you nailed it :) With the debuff gone, the stance animation system thought the actor was alive again and started standing it up, until the concurrent death script deleted the animator and froze the enemy mid-stand. If not, don't feel too bad, since it took me a while to piece everything together.
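
Here is a minimal reproduction of that interaction, as a sketch only. The class names, the numbers, and the idea that the debuff maps directly onto maximum health are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class StatusEffect:
    name: str
    max_health_delta: int  # negative for a debuff (hypothetical field)


@dataclass
class Actor:
    base_max_health: int
    damage: int = 0
    effects: list = field(default_factory=list)

    @property
    def max_health(self):
        return self.base_max_health + sum(e.max_health_delta for e in self.effects)

    @property
    def is_down(self):
        # Data driven: down/dead is derived purely from damage versus max health.
        return self.damage > self.max_health

    def on_death(self):
        # Status effects clear on death, including the differentiation debuff.
        self.effects.clear()


sellsword = Actor(base_max_health=100, effects=[StatusEffect("Wounded", -20)])
sellsword.damage += 85
print(sellsword.is_down)  # True: 85 damage against 80 max health
sellsword.on_death()
print(sellsword.is_down)  # False: 85 damage against 100 max health
```

Nothing in the sketch is wrong on its own. The "resurrection" only appears when the derived death check and the effect clean-up run against the same actor.
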

The Solution

The "easy" solution from there was to allow certain buffs/debuffs to persist through down/death. For the human and dog NPCs, their randomized differentiation buffs/debuffs need to use this option so that the actor's maximum health doesn't change when they die.
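
A sketch of what that option could look like, assuming a per-effect flag (called `persists_through_death` here, which is a made-up name) that the death clean-up respects:

```python
from dataclasses import dataclass, field


@dataclass
class StatusEffect:
    name: str
    max_health_delta: int = 0
    persists_through_death: bool = False  # hypothetical flag name


@dataclass
class Actor:
    base_max_health: int
    damage: int = 0
    effects: list = field(default_factory=list)

    @property
    def max_health(self):
        return self.base_max_health + sum(e.max_health_delta for e in self.effects)

    @property
    def is_down(self):
        return self.damage > self.max_health

    def on_death(self):
        # Keep only the effects flagged to survive down/death.
        self.effects = [e for e in self.effects if e.persists_through_death]


wounded = StatusEffect("Wounded", max_health_delta=-20, persists_through_death=True)
regen = StatusEffect("Regeneration")
npc = Actor(base_max_health=100, damage=85, effects=[wounded, regen])
npc.on_death()
print([e.name for e in npc.effects])  # ['Wounded']
print(npc.is_down)                    # True
```

Ordinary effects still clear as before; only the differentiation debuff sticks around, so maximum health stays put and the actor stays down.
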

Another possible solution might have been to modify down/dead to be state based rather than data driven (i.e., regardless of your health pool, you can still be dead or alive), but frankly that comes with a whole different set of issues, not to mention the risk that changing something so fundamental would create in the project at this juncture.

I hope that was an interesting look at how a bunch of different systems can act together in ways one might not expect. Even when you try to keep your systems isolated, at the end of the day they still need to interact somehow, and those interactions are where bugs tend to crop up.

And bonus, in our next patch, actors should now stay dead when they die, rather than trying to re-enact Thriller. #IndieDev, #EonAltar, #GameDesign


  1. Well, I guessed it, but the way you wrote things helped.
    Even if I'm not a professional programmer, I know that having two things take care of/derive from one set of things is dangerous (and in the case of multiple threads, suicidal). I noticed two dangerous things in your list. First is the "scripts which delete everything except the visual": I don't know if you're object-oriented or what, but I tend to encapsulate everything as much as possible. The actor's methods should take care of all the data (and visuals), as otherwise you risk that decisions about the animation state are made based on uninitialized/unallocated data. Second, storing information in two places is dangerous: if you have a state machine, ALL the information should be in the state. Thinking "ok, if it's alive but with health < 0 then it's not really alive, so I'll save a state by using a single one for dead/alive and checking health" sounds nice until someone changes the code of one of the two parts, oblivious to (or having forgotten) the fact that the consequences spread beyond the few lines that were changed.

    I've had my share of this kind of bug, since my code is rarely planned :) (but it's usually much, much smaller than a "real" project). This is why some kind of debug log which contains EVERYTHING (like all state transitions) can be very handy in nailing down the problem, even if, when activated, it makes the code run 10 times slower and fills your disk.

    1. I was being a little misleading in that the vast majority of the game code is on the core loop. All of the game logic is on the core loop, in fact (standard practice, current CPU throughput is more than sufficient to handle it), so when I say "concurrently" it's only pseudo-concurrently. More accurately, the process is dead->flag a bunch of things to start handling death, and when they get to their own update loops they handle their own logic, because each component is encapsulated ;)

      Each component is its own object (animator, AI, visual, etc.), but the actor object is the owner of all of these components, so it controls the lifetime of the objects. There's no risk of unallocated/uninitialized shenanigans.

      As to the animator state machine, it can't hold the state. It's driven by the actor's data, deliberately. Regardless of patterns that we use to keep each component encapsulated, at the end of the day the purpose of the animator state machine is to set the idle animation based on data from the actor. It itself doesn't check for the health of the actor, it just asks the owning actor object "Are you dead?". The actor itself, in a single spot in the code, determines what "dead" is.

      Also, agreed entirely on logging. Our logging system is incredibly thorough and robust, sometimes too thorough, but that's how I realized eventually that the actor was coming back to life. The animation state machine logged that it was going from down->combat, which shouldn't have been possible. That's what got me started on the correct path to finally nailing the bug.
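
For the curious, a minimal sketch of that kind of transition logging, using Python's standard `logging` module. The class, the stance names as strings, and the set of "impossible" transitions are assumptions for the example, not our actual logging system.

```python
import logging

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(message)s")
log = logging.getLogger("stance")

# Transitions an NPC should never make; this set is an assumption for the example.
IMPOSSIBLE = {("down", "combat"), ("down", "concealed"), ("down", "exploration")}


class StanceLogger:
    def __init__(self, actor_name, initial="exploration"):
        self.actor_name = actor_name
        self.current = initial

    def transition(self, new_stance):
        """Record a stance change and return True if it looks impossible."""
        old, self.current = self.current, new_stance
        suspicious = (old, new_stance) in IMPOSSIBLE
        level = logging.WARNING if suspicious else logging.DEBUG
        log.log(level, "%s: %s -> %s", self.actor_name, old, new_stance)
        return suspicious


stances = StanceLogger("Wounded Sellsword", initial="combat")
stances.transition("down")    # logged at DEBUG, expected
stances.transition("combat")  # logged at WARNING, the tell-tale line
```
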

    2. Basically, at the risk of sounding condescending (which I don't mean to), I am quite familiar with object-oriented coding practices. I just use a lot of shorthand in the description of the problem itself because the implementation details of our actor object hierarchy are largely immaterial to the description of this particular issue. However, that said, I do agree with most of your points!

    3. I suppose I should clarify.

      We have two options for organizing the object hierarchy. The actor can have a mega-state machine that drives everything directly, or the actor can just inform objects it owns of changes they may be interested in, and each object handles its own logic. The former gets unwieldy incredibly quickly (a single actor has something like 30+ components it owns: exploration AI, combat AI, vitals, attributes, skills, animator, visuals, movement marker interaction, actions, possibly networking and playable character-specific stuff, etc.).

      It's far neater just to ensure that you have a strict hierarchy for object lifetime, and let each sub-component register for events of interest in the owning actor, and have them handle their own logic. Yeah, deleting everything and orphaning the visual on death breaks that model a fair bit, but since that's an end-of-actor lifetime moment, that ends up being fine in practice.
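
A stripped-down sketch of that second option, with the actor owning its components and each component registering for the events it cares about. All class, method, and event names are invented for the example.

```python
from collections import defaultdict


class Actor:
    """Owns its components and tells them about events they registered for."""

    def __init__(self):
        self._handlers = defaultdict(list)
        self.components = []  # the actor controls component lifetime

    def add_component(self, component):
        self.components.append(component)
        component.attach(self)

    def subscribe(self, event, handler):
        self._handlers[event].append(handler)

    def emit(self, event):
        # Copy the list so handlers can safely subscribe during dispatch.
        for handler in list(self._handlers[event]):
            handler()


class Animator:
    def __init__(self):
        self.idle = "exploration"

    def attach(self, actor):
        actor.subscribe("died", self.on_died)

    def on_died(self):
        self.idle = "down"


class CombatAI:
    def __init__(self):
        self.enabled = True

    def attach(self, actor):
        actor.subscribe("died", self.on_died)

    def on_died(self):
        self.enabled = False


actor = Actor()
animator, ai = Animator(), CombatAI()
actor.add_component(animator)
actor.add_component(ai)
actor.emit("died")
print(animator.idle, ai.enabled)  # down False
```

The actor never needs to know what each component does on death; it only announces the event, which is what keeps 30+ components manageable.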