Saturday, November 15, 2014

[WoW] Server Traffic: Queues, Phasing, and Some Numbers

Recently Blizzard released a little product known as Warlords of Draenor, and it turns out their engineers were taken by surprise by the number of people logging into the game. Between that and the ongoing DDoS attack that pretty well every MMO launch in the past year has dealt with (because apparently there're some folks out there who just hate fun, I guess), Blizzard's servers have basically melted.

So why is this such a big deal? Why couldn't Blizzard just throw more hardware at the issue? Would an extra 1,000 people on each server make that big a difference?

The answer, it turns out, is yes. Even a 10% increase over what they expected can result in server meltage. Why, though?

Let's pretend I've made a new MMO, Talarian World, and I've just stood my servers up and some folks start logging in. For every player, I have about 100 bytes of data coming in to my server every second (to be honest, the exact amount is largely immaterial once we hit a certain number of people, but just play along for a moment). That data contains information about my new world location, what abilities I used, commands, whatever.

Now, to prevent cheating, my server is Authoritative. This means that any action I take in my game client has to be vetted by the server, and confirmed. So the server responds with information to tell me the actual world state. For the sake of simplicity, we'll say the server returns to me 100 bytes. So we have for a single player logged in, 100 bytes in, 100 bytes out. Maybe per second, or every half second, or whatever.
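To make "vetted by the server" a bit more concrete, here's a minimal sketch of one such check: a movement sanity test. Everything in it is made up for illustration (the `MAX_SPEED` value, the `Position` type, the `validate_move` name); it's not how WoW or any real server does it, just the general shape of the idea.

```python
import math
from dataclasses import dataclass

MAX_SPEED = 7.0  # hypothetical top movement speed, in yards per second


@dataclass
class Position:
    x: float
    y: float


def validate_move(current: Position, requested: Position, dt: float) -> Position:
    """Return the position the server accepts after dt seconds have passed."""
    distance = math.hypot(requested.x - current.x, requested.y - current.y)
    if distance <= MAX_SPEED * dt:
        return requested  # plausible: accept what the client claims
    return current  # implausible (speed hack, bad data): keep the server's state


print(validate_move(Position(0, 0), Position(3, 4), dt=1.0))    # accepted
print(validate_move(Position(0, 0), Position(30, 40), dt=1.0))  # rejected
```

Whatever the server decides is what gets sent back to the client, which is where the "100 bytes out" comes from.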

Other developers are probably yelling at me right now, as 100 bytes in/100 bytes out is terrible, but again, simplicity. Hold your horses for a moment.

So now we have a second player logged in, and they have their own 100 bytes in, and their own 100 bytes out, but we're an MMO; we need to let other players see you. So both player 1 and player 2 now have 100 bytes in, 200 bytes out, for a total of 200 in, 400 out. For 3 players, you get 300 in, 900 out (updates for 3 players, being communicated to 3 players). When you start scaling that number up, well, the results are dramatic:

[Chart: incoming and outgoing data per communication cycle, by number of players]

Yes, I know there are 1,024 bytes in a kilobyte. I only care about magnitude, not precision, in this case.
For 1,000 players, you're looking at ~100 KB in, but ~100 MB out! And for 5,000 players you're now looking at 2.5 GB. Heck, going from 5,000 to 6,000 (+20%) players means an increase of ~1.1 GB (+44%) of outgoing data per communication cycle.
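If you'd like to check the arithmetic yourself, here's a small sketch of the naive Talarian World model: every player sends one update in, and every update gets sent back out to every player. The function names and the decimal unit formatting are my own choices for illustration; the 100-byte payload is the simplification used above.

```python
BYTES_PER_UPDATE = 100  # the simplified per-player payload from the text


def traffic_per_cycle(players: int, payload: int = BYTES_PER_UPDATE) -> tuple[int, int]:
    """Return (bytes_in, bytes_out) for one communication cycle, naive model."""
    bytes_in = players * payload             # each player sends one update
    bytes_out = players * players * payload  # every update goes to every player
    return bytes_in, bytes_out


def human(n: float) -> str:
    """Format a byte count with decimal units (magnitude, not precision)."""
    for unit in ("B", "KB", "MB", "GB"):
        if n < 1000 or unit == "GB":
            return f"{n:g} {unit}"
        n /= 1000


for players in (1, 2, 3, 1_000, 5_000, 6_000):
    bytes_in, bytes_out = traffic_per_cycle(players)
    print(f"{players:>5} players: {human(bytes_in):>7} in, {human(bytes_out):>7} out")
```

Incoming data grows linearly with the player count, while outgoing data grows with its square, which is why a 20% bump in players turns into a 44% bump in outgoing traffic.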

That's clearly unsustainable, and totally insane. Note that CPU and RAM usage also go up significantly. The more players, the more database hits you need to make, the more RAM needs to be used, and the more CPU used for maintaining all of the information and communicating it out.

Now, Talarian World was implemented naively. World of Warcraft is not. They have coping strategies for dealing with that number of people:

Locality

The first is, of course, locality. There's no point in sending information about a given player in Stormwind when you're in Ironforge. Granted, there's still processing to be done to figure out if you're in the same locality or not, so it's not free per se, but it's a lot cheaper than the quadratic growth in the chart above. Also note that Blizzard clearly is capable of scaling their locality checks. When I was out at Blasted Lands, people would disappear from my screen if I was more than 40 yards away from them (which isn't very far in-game).
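A locality check can be as simple as a distance filter. Here's a rough sketch, assuming flat 2D coordinates in yards and the 40-yard cutoff I saw in Blasted Lands; the `visible_to` function and the sample players are hypothetical, and a real server would more likely use some kind of spatial grid instead of comparing everyone against everyone.

```python
import math

VIEW_RADIUS = 40.0  # yards; the cutoff observed in Blasted Lands


def visible_to(observer: str, players: dict, radius: float = VIEW_RADIUS) -> list:
    """Return the names of the other players within `radius` of `observer`."""
    ox, oy = players[observer]
    return [
        name
        for name, (x, y) in players.items()
        if name != observer and math.hypot(x - ox, y - oy) <= radius
    ]


# Hypothetical positions, in yards.
players = {"alice": (0, 0), "bob": (30, 0), "carol": (0, 45), "dave": (500, 500)}
print(visible_to("alice", players))
```

Only updates for the players in that list need to go out to the observer. Shrinking `radius` under load is one way the check could be scaled, at the cost of people popping in and out of view.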

Phasing/Instancing

Still, that's not enough. When everyone is swarming a single spot, you're all going to be in the same locality, even if it's as small as 20 yards. So the next trick is something that other game companies have done for a long time, and that Blizzard just finally rolled out to Draenor in general, which is separate instances of the same area. Break the population up into their own sub-worlds such that they don't see each other. Blizzard calls their version of this tech phasing, but it's not really new (though to be fair, Blizzard's ability to do it seamlessly and dynamically rather than having a drop-down to select your instance is actually super-slick).

Using our chart above, if we have 1,000 people in an area, we have ~100 MB of data going out. Splitting that into two chunks of 500 people who cannot interact in the world means we have ~25 MB x 2, so ~50 MB of data total. By splitting the population, we've reduced the amount of data by half! Splitting it into 10 chunks of 100 people instead reduces that further, to ~10 MB total. Again, like calculations for locality, this isn't free computationally, so we're still chewing up some RAM and CPU, and there's likely an inflection point somewhere where the chunks of people cost more to maintain than just leaving them in the same instance.
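The splitting math is easy to play with. This sketch reuses the naive 100-bytes-per-update model and assumes players are divided as evenly as possible between phases; both function names are invented for the example.

```python
def outgoing_bytes(players: int, payload: int = 100) -> int:
    """Outgoing bytes per cycle when everyone can see everyone."""
    return players * players * payload


def phased_outgoing_bytes(players: int, phases: int, payload: int = 100) -> int:
    """Total outgoing bytes per cycle with players split evenly across phases."""
    base, extra = divmod(players, phases)
    sizes = [base + 1] * extra + [base] * (phases - extra)
    return sum(outgoing_bytes(size, payload) for size in sizes)


for phases in (1, 2, 10):
    total = phased_outgoing_bytes(1_000, phases)
    print(f"{phases:>2} phase(s): ~{total / 1_000_000:g} MB out per cycle")
```

With an even split into k phases, the outgoing total drops by roughly a factor of k. The model ignores the bookkeeping cost of running each phase, which is where that inflection point would come from.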

Of course, instancing/phasing breaks immersion to an extent, because now you might not be able to see your friend nearby. Though, to be frank, a bajillion people and 2 - 5 seconds of server lag breaks immersion even more, so the trade-off is probably fine.

Level Design

Another technique is level design. Ironically, the folks in the World of Warcraft Looking For Group documentary mention that putting a tonne of people in the same spot in the world causes all sorts of issues, and then they went and did it again anyhow by having a single point of entry into Draenor with Khadgar. They mitigated this by posting Khadgar in three other spots, but then they had the exact same problem on the other side of Tanaan Jungle, where everyone was doing the same quests to get their garrisons started.

It's a bit strange that they did all this (really great!) work with Tanaan, instancing it like they did. Once you got into Tanaan, the experience was quite smooth. Then once on Draenor itself, bam, complete meltdown. Especially on an Alliance- or Horde-heavy server, where 80% of your population is now in the same zone. Previous expacs had new races or classes to roll so the population was spread out. This expac, there was only Tanaan, then Shadowmoon Valley or Frostfire Ridge.

And it makes me wonder if their level designers even told the engineers what they were doing with the Khadgar thing. Did a server engineer have a meltdown but was ignored anyhow? In a product as large as WoW, I'd not be surprised if either communication internally wasn't sufficient, or someone's concerns were waved off. In a team as small as 15 I've seen that happen, let alone a team of hundreds.

Other level design mitigations can include, say, removing all the squirrels in Nagrand to save on CPU/RAM.


Queues

And finally, we have queues. When all else fails and you've maxed your server resources, just limit the number of people who can be on at a time. Most of the above techniques still cost resources in the form of CPU time and RAM. Eventually you'll still hit some limits of your hardware/software, and while you can solve some of it by adding hardware, there's still a ceiling where it just cannot help. So at that point, like the local club, you put up a line. Not the happiest of solutions, but probably the most immediately effective. People are sad because they can't get in, but that's likely better than being frustrated because they're constantly being disconnected or every ability takes 5 to 10 seconds to go off.
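The club-line idea fits in a few lines of code. This is a toy sketch with an invented `LoginQueue` class, not anything resembling Blizzard's actual login service: a hard cap on who's online, and a first-come, first-served line for everyone else.

```python
from collections import deque


class LoginQueue:
    """Cap the number of online players; everyone else waits in line."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.online = set()
        self.waiting = deque()

    def connect(self, player: str) -> int:
        """Admit the player (returns 0) or return their position in the queue."""
        if len(self.online) < self.capacity:
            self.online.add(player)
            return 0
        self.waiting.append(player)
        return len(self.waiting)

    def disconnect(self, player: str) -> None:
        """Free a slot and let in whoever is at the front of the line."""
        self.online.discard(player)
        if self.waiting and len(self.online) < self.capacity:
            self.online.add(self.waiting.popleft())


queue = LoginQueue(capacity=2)
for name in ("alice", "bob", "carol", "dave"):
    print(name, "-> queue position", queue.connect(name))
queue.disconnect("alice")  # carol gets in, dave moves to the front
print(sorted(queue.online), list(queue.waiting))
```

The nice property is that the cap puts a ceiling on the players-squared term from earlier, no matter how many people show up at the door.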



Did Blizzard drop the ball? 


Yes and no.

As per Lore's tweet above, Blizzard did expect more people to come back, but then were surprised by the actual number. As per the chart way above, it's not a linear increase, and going from 100 to 200 players is manageable, but going from 5000 to 5100 is not an equivalent increase; it's far, far worse in terms of resources consumed (hooray exponents!).

But they had the technology in Tanaan to apply phasing to an area to break the population up. They knew that was a bottleneck, yet didn't consider either the Khadgar scenario or the Garrison-creation quest scenario. So instead, in an emergency fix, they applied the tech to all of Draenor. If I were a dev on that team when that fix went out, I'd be shitting my pants, to put it mildly. On the other hand, it was either that or watch the servers melt, so not like they had much of a choice.

They also had the chops to realize when they're designing a bottleneck in the experience (as indicated by the Ahn'Qiraj comments in the Looking For Group documentary), so putting Khadgar in a single spot just absolutely flabbergasted me.

On the other hand, the sustained DDoS attack made a bad situation worse. It's hard to account for that kind of malicious traffic, which also eats resources (network, CPU, RAM, etc.) trying to figure out what's legit and what is not, not to mention clogging up the routers and such on the way to Blizzard's servers.

In some ways, this was both the smoothest launch they've had, and also one of the worst in a long time. Tanaan was executed beautifully: it had phasing and few bottleneck quests, and it withstood the lag storms amazingly in practice. Everything else? Bollocks. Hopefully next launch they'll take these lessons and apply them (in some cases, apply them again).

#Blizzard, #WorldOfWarcraft, #SoftwareDevelopment

1 comment:

  1. Very nice to see a writeup of launch issues from a tech savvies perspective. Explained so the rest of us can understand. Thanks
