Saturday, November 15, 2014

[WoW] Server Traffic: Queues, Phasing, and Some Numbers

Recently Blizzard released a little product known as Warlords of Draenor, and it turns out their engineers were taken by surprise by the number of people logging into the game. Between that and the ongoing DDoS attack that pretty well every MMO launch in the past year has dealt with (because apparently there're some folks out there who just hate fun, I guess), Blizzard's servers have basically melted.

So why is this such a big deal? Why couldn't Blizzard just throw more hardware at the issue? Would an extra 1,000 people on each server make that big a difference?

The answer, it turns out, is yes. Even a 10% increase over what they expected can result in server meltage. Why, though?

Let's pretend I've made a new MMO, Talarian World, and I've just propped my servers up and some folks start logging in. For every player, I have about 100 bytes of data coming in to my server every second (to be honest, the exact amount is largely immaterial once we hit a certain number of people, but just play along for a moment). That data contains information about my new world location, what abilities I used, commands, whatever.

Now, to prevent cheating, my server is Authoritative. This means that any action I take in my game client has to be vetted by the server, and confirmed. So the server responds with information to tell me the actual world state. For the sake of simplicity, we'll say the server returns to me 100 bytes. So we have for a single player logged in, 100 bytes in, 100 bytes out. Maybe per second, or every half second, or whatever.

Other developers are probably yelling at me right now, as 100 bytes in/100 bytes out is terrible, but again, simplicity. Hold your horses for a moment.

So now we have a second player logged in, and they have their own 100 bytes in, and their own 100 bytes out, but we're an MMO; we need to let other players see you. So both player 1 and player 2 now have 100 bytes in, 200 bytes out, for a total of 200 in, 400 out. For 3 players, you get 300 in, 900 out (updates for 3 players, being communicated to 3 players). When you start scaling that number up, well, the results are dramatic:

Yes, I know there's 1024 bytes in a kilobyte. I only care about magnitude, not precision in this case.
For 1,000 players, you're looking at ~100 KB in, but ~100 MB out! And for 5,000 players you're now looking at 2.5 GB. Heck, going from 5,000 to 6,000 (+20%) players means an increase of ~1.1 GB (+44%) of outgoing data per communication cycle.
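If you'd rather check the arithmetic than squint at a chart, here's a minimal Python sketch of naive Talarian World. The 100-byte update size is the made-up figure from above, the function names are mine and purely illustrative, and I'm still only after magnitude, so a kilobyte is 1,000 bytes here.

```python
BYTES_PER_UPDATE = 100  # assumed size of one player's update, per the example above


def bytes_in(players: int) -> int:
    """Each player sends one update to the server per communication cycle."""
    return players * BYTES_PER_UPDATE


def bytes_out(players: int) -> int:
    """Naive broadcast: every player receives every player's update."""
    return players * players * BYTES_PER_UPDATE


for n in (1, 2, 3, 1_000, 5_000, 6_000):
    print(f"{n:>5} players: {bytes_in(n):>9,} bytes in, {bytes_out(n):>13,} bytes out")

# Marginal cost of 1,000 extra players on an already-busy server.
extra = bytes_out(6_000) - bytes_out(5_000)
print(f"5,000 -> 6,000 players adds {extra:,} bytes out per cycle")
```

Incoming traffic grows in a straight line with the player count, while outgoing traffic grows with its square, which is why the last thousand players hurt so much more than the first thousand.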

That's clearly unsustainable, and totally insane. Note that CPU and RAM usage also go up significantly. The more players, the more database hits you need to make, the more RAM needs to be used, and the more CPU is used for maintaining all of the information and communicating it out.

Now, Talarian World was implemented naively. World of Warcraft is not. They have coping strategies for dealing with that number of people:


The first is, of course, locality. There's no point in sending information about a given player in Stormwind when you're in Ironforge. Granted, there's still processing to be done to figure out if you're in the same locality or not, so it's not free per se, but it's a lot cheaper than our quadratic data graph above. Also note that Blizzard clearly is capable of scaling their locality checks. When I was out at Blasted Lands, people would disappear from my screen if I was more than 40 yards away from them (which isn't very far in-game).
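A locality check can be sketched in a few lines of Python. To be clear, this is my own toy version and not how Blizzard does it: the `Player` class, the `visible_to` function, and the flat 2D coordinates are all assumptions for illustration, and the 40-yard radius is just borrowed from my Blasted Lands anecdote.

```python
import math
from dataclasses import dataclass

VISIBILITY_RADIUS = 40.0  # yards; assumed, taken from the anecdote above


@dataclass
class Player:
    name: str
    x: float
    y: float


def visible_to(viewer: Player, players: list[Player], radius: float = VISIBILITY_RADIUS) -> list[Player]:
    """Return the other players close enough that the viewer needs their updates."""
    return [
        p for p in players
        if p is not viewer and math.hypot(p.x - viewer.x, p.y - viewer.y) <= radius
    ]


players = [
    Player("Alice", 0, 0),
    Player("Bob", 10, 10),      # roughly 14 yards from Alice
    Player("Carol", 500, 500),  # far away, think Stormwind versus Ironforge
]

for viewer in players:
    names = [p.name for p in visible_to(viewer, players)]
    print(f"{viewer.name} receives updates about: {names}")
```

Note that this toy version still compares every player against every other player, which is exactly the "not free" processing cost mentioned above. A real server would presumably bucket players into a grid or similar structure first, so it only has to compare neighbours.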


Still, that's not enough. When everyone is swarming a single spot, you're all going to be in the same locality, even if it's as small as 20 yards. So the next trick is something that other game companies have done for a long time, and that Blizzard just finally rolled out to Draenor in general, which is separate instances of the same area. Break the population up into their own sub-worlds such that they don't see each other. Blizzard calls their version of this tech phasing, but it's not really new (though to be fair, Blizzard's ability to do it seamlessly and dynamically rather than having a drop-down to select your instance is actually super-slick).

Using our chart above, if we have 1,000 people in an area, we have ~100 MB of data going out. Splitting that in half into two chunks of 500 people that cannot interact in the world means we have ~25 MB x 2, so ~50 MB of data total. By splitting the population, we've reduced the amount of data by half! Splitting it into 10 chunks of 100 people instead reduces that further, to ~10 MB total. Again, like calculations for locality, this isn't free computationally, so we're still chewing up some RAM and CPU, and there's likely an inflection point somewhere where the chunks of people cost more to maintain than just leaving them in the same instance.
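Here's the same splitting arithmetic as a small Python sketch, again using my made-up 100-byte updates and illustrative function names. It assumes the population divides evenly across instances, and it deliberately ignores the extra CPU and RAM each instance costs.

```python
BYTES_PER_UPDATE = 100  # assumed update size, as in the Talarian World example


def bytes_out(players: int) -> int:
    """Naive broadcast cost for one instance: everyone hears about everyone."""
    return players * players * BYTES_PER_UPDATE


def bytes_out_sharded(players: int, shards: int) -> int:
    """Total outgoing bytes when the population is split evenly across instances."""
    per_shard = players // shards  # assumes an even split, for simplicity
    return shards * bytes_out(per_shard)


for shards in (1, 2, 10):
    total = bytes_out_sharded(1_000, shards)
    print(f"{shards:>2} instance(s) of {1_000 // shards} players: {total / 1_000_000:.0f} MB out")
```

With an even split, dividing the crowd into k instances divides the outgoing data by k, since each instance only pays for the square of its own, smaller population.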

Of course, instancing/phasing breaks immersion to an extent, because now you might not be able to see your friend nearby. Though, to be frank, a bajillion people and 2 - 5 seconds of server lag breaks immersion even more, so the trade-off is probably fine.

Level Design

Another technique is level design. Ironically, the folks in the World of Warcraft Looking For Group documentary mention that putting a tonne of people in the same spot in the world causes all sorts of issues, and then they went and did it again anyhow by having a single point of entry into Draenor with Khadgar. They mitigated this by posting Khadgar in three other spots, but then they had the exact same problem on the other side of Tanaan Jungle, where everyone was doing the same quests to get their garrisons started.

It's a bit strange that they did all this (really great!) work with Tanaan, instancing it like they did. Once you got into Tanaan, the experience was quite smooth. Then once on Draenor itself, bam, complete meltdown. Especially on an Alliance- or Horde-heavy server, where 80% of your population is now in the same zone. Previous expacs had new races or classes to roll so the population was spread out. This expac, there was only Tanaan, then Shadowmoon Valley or Frostfire Ridge.

And it makes me wonder if their level designers even told the engineers what they were doing with the Khadgar thing. Did a server engineer have a meltdown but was ignored anyhow? In a product as large as WoW, I'd not be surprised if either communication internally wasn't sufficient, or someone's concerns were waved off. In a team as small as 15 I've seen that happen, let alone a team of hundreds.

Other level design mitigations can include, say, removing all the squirrels in Nagrand to save on CPU/RAM.


And finally, we have queues. When all else fails and you've maxed your server resources, just limit the number of people who can be on at a time. Most of the above techniques still cost resources in the form of CPU time and RAM. Eventually you'll still hit some limits of your hardware/software, and while you can solve some of it by adding hardware, there's still a ceiling where it just cannot help. So at that point, like the local club, you put up a line. Not the happiest of solutions, but probably the most immediately effective. People are sad because they can't get in, but that's likely better than being frustrated because they're constantly being disconnected or every ability takes 5 to 10 seconds to go off.
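The bouncer-at-the-club idea is simple enough to sketch in Python. This is a toy under my own assumptions (a hypothetical `LoginQueue` class, a fixed capacity, strict first-come-first-served), not a description of Blizzard's actual login servers.

```python
from collections import deque


class LoginQueue:
    """Admit players up to a fixed capacity; everyone else waits in line (FIFO)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.online = set()
        self.waiting = deque()

    def login(self, player):
        """Return 0 if admitted immediately, else the player's position in line."""
        if len(self.online) < self.capacity:
            self.online.add(player)
            return 0
        self.waiting.append(player)
        return len(self.waiting)

    def logout(self, player):
        """Free a slot and admit whoever has been waiting longest."""
        self.online.discard(player)
        if self.waiting and len(self.online) < self.capacity:
            self.online.add(self.waiting.popleft())


queue = LoginQueue(capacity=2)
for name in ("Alice", "Bob", "Carol", "Dave"):
    print(name, "->", queue.login(name))

queue.logout("Alice")  # Carol, first in line, takes the free slot
print(sorted(queue.online), list(queue.waiting))
```

The point of the cap is that the expensive part of the server never sees more than `capacity` players, so everyone who does get in has a playable experience.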

Did Blizzard drop the ball? 

Yes and no.

As per Lore's tweet above, Blizzard did expect more people to come back, but then were surprised by the actual number. As per the chart way above, it's not a linear increase, and going from 100 to 200 players is manageable, but going from 5000 to 5100 is not an equivalent increase; it's far, far worse in terms of resources consumed (hooray exponents!).

But they had the technology in Tanaan to apply phasing to an area to break the population up. They knew that was a bottleneck, yet didn't consider either the Khadgar scenario or the Garrison-creation quest scenario. So instead, in an emergency fix, they applied the tech to all of Draenor. If I were a dev on that team when that fix went out, I'd be shitting my pants, to put it mildly. On the other hand, it was either that or watch the servers melt, so it's not like they had much of a choice.

They also had the chops to realize when they're designing a bottleneck in the experience (as indicated by the Ahn'Qiraj comments in the Looking For Group documentary), so putting Khadgar in a single spot just absolutely flabbergasted me.

On the other hand, the sustained DDoS attack made a bad situation worse. It's hard to account for that kind of malicious traffic, which also eats resources (network, CPU, RAM, etc.) trying to figure out what's legit and what is not, not to mention clogging up the routers and such on the way to Blizzard's servers.

In some ways, this was both the smoothest launch they've had, and also one of the worst in a long time. Tanaan was executed beautifully: it had phasing and few bottleneck quests, and it withstood the lag storms amazingly in practice. Everything else? Bollocks. Hopefully next launch they'll take these lessons and apply them (in some cases, apply them again). #Blizzard, #WorldOfWarcraft, #SoftwareDevelopment

Friday, November 7, 2014

Overwatch: Diversity Done Better

Blizzard announced their first new IP in 17 years. Seventeen! That's older than a good chunk of their fanbase--I know when I see comments about how people grew up on WoW it makes me feel old. But the last new IP Blizzard had was StarCraft in 1998. Though they've beefed up their franchises with spin-offs (Hearthstone and Heroes of the Storm), at some point the Diablo/StarCraft/WarCraft triumvirate was going to give out. So to see a new thing is pretty sweet.

I'm not a huge fan of modern FPS games. I was big into GoldenEye, Perfect Dark, and Quake 3 Arena back in the day, but more recent titles like Halo or Gears of War haven't really interested me. I played Mass Effect despite the shooter aspect of it (though to be fair, Mass Effect 3 did a really, really good job of making it much more fun). But Overwatch seems like it's an interesting enough take on the team-based shooter genre that I want to give it a whirl.

The superhero-esque powers, slightly faster-paced gameplay than many other shooters, and a strong aesthetic all really solidify it in my head as something I want to play. What also helps, however, is that the characters are actually quite interesting.

Twelve characters have been revealed so far, and there are more to be shown, but of the ones we've seen, we have two robots, nine humans, and a gorilla. Of those nine humans, five are women. And not only that, but we have people of different skin tones beyond white (which is typically either European, Australian, or North American representation in games). Symmetra is Indian, Pharah is Egyptian, and Hanzo is Japanese.

Mind you, some folks have already accused Blizzard of "appropriating stereotypical aspects of other cultures to layer on top of its white-dude-fantasy-world." I'm not really in much of a position to argue for or against that, being pasty white and all, but at the same time I'm finding it hard to think of other major games where an Indian or Middle Eastern character is shown in a positive light, or at all, so it's probably a positive step overall.

While they mostly have similar body types, there's still a fair bit of diversity within the set of ladies.
As for the ladies in the game, while I think they could use more diversity in their silhouettes (as they largely all have the exact same body type), @Moxiedoodle summed it better than I could:

So kudos, Blizzard. Folks took you to task, and then you stepped up to the plate and maybe didn't hit a home run, but frankly still did a lot better than you have in the past. And a lot better than many other developers do today. And you did so in a way that shows the game as not being any worse for wear by being inclusive.

There are still white characters, and male-power-fantasy characters, and boob plate and fan service, but there are also covered characters, folks who aren't white, and lots of ladies. As @Moxiedoodle mentions above: there's something for maybe not everyone, but a lot more than there was in prior games. #Blizzard, #Overwatch, #Diversity

BlizzCon Imminent

I've nothing to add other than re-posting my previous prediction from a month and a half ago:

We'll likely see if I'm correct. Setting aside some space here for me to post links to interesting information as it happens.

In the meantime, I have some networked finite state machines to build, rather than being at BlizzCon :(

Monday, November 3, 2014

Luck versus Skill

A fascinating concept in game design is that of luck versus skill. A lot of folks tend to present them as opposing ends of a dichotomy. More random means less skill, and more skill means there can't be as much random. This is a false dichotomy; the relationship between the two is far more interesting than that.

Richard Garfield, the creator of Magic: The Gathering, has a talk where he pontificates about luck in games and how it affects the outcomes of these games. One of the examples he uses to show that skill and luck aren't diametrically opposed is "Rando-Chess".

Imagine a game of chess where at the end of the game you roll a die, and if it's a 1, the winner becomes the loser and the loser actually wins. Now, ignore that voice in your head screaming that's unfair for a moment. Does it actually reduce the skill required to play the game? Everything about chess is still applicable: opening gambits, strategies, knowledge of the rules. Having all of that skill still increases your win-rate over time. It didn't make skill useless at all; however, it does moderate skill disparities.

In a game that's all skill, you'd expect to win 50% of the time against an equally skilled opponent. If you were more skilled, you'd expect to win most, if not all, of the time. Adding a random roll at the end in the Rando-Chess game means that the weaker player now has a chance at winning, despite being the poorer player.
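To put some numbers on Rando-Chess, here's a quick Python sketch. It assumes a six-sided die and ignores draws entirely, and the function name is just something I made up for illustration.

```python
from fractions import Fraction

FLIP_CHANCE = Fraction(1, 6)  # rolling a 1 on a six-sided die swaps winner and loser


def rando_chess_win_rate(chess_win_rate: Fraction) -> Fraction:
    """Chance of winning Rando-Chess given your chance of winning the chess game itself."""
    keep = chess_win_rate * (1 - FLIP_CHANCE)    # you win on the board and the die leaves it alone
    steal = (1 - chess_win_rate) * FLIP_CHANCE   # you lose on the board but the die flips it
    return keep + steal


for pct in (50, 75, 90, 100):
    p = Fraction(pct, 100)
    print(f"chess win rate {pct:>3}% -> Rando-Chess win rate {float(rando_chess_win_rate(p)):.1%}")
```

Evenly matched players stay at 50%, while a player who would win every straight chess game drops to 5 in 6. The better player still wins more often; the die just squeezes everyone's win rate toward the middle.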

Now, Rando-Chess wouldn't be very satisfying to play. I'm pretty sure I'd flip a table or two when I lost due to the direct result of the random roll. Instead, most game designers embed the randomness in their games in other ways. Accuracy in TRPGs or tabletop strategy games, so even if you play the most perfect tactical game ever, you can still get hosed by missing that 99%-chance-to-hit shot. Starting positions in games like Civilization, where you may end up with different resources necessitating different strategies. Deck building in games, where even if you've built the deck and know what's in it, you won't necessarily get the cards in the order you want them in.

But here's a twist where that interesting relationship between the two comes into play again: you can often reduce the effects of randomness by applying skill.

Here are some prime examples.

In a game like Magic: The Gathering or Hearthstone, when building your deck, you want to make sure said deck is as relatively focused as possible. When you draw your next card, you want to increase the chances that said card will be applicable to your overall strategy. This is also often why Card Advantage is very, very important to these kinds of games: because you're cycling through your deck faster, there's a higher chance you'll get the cards that you want, thus reducing the effect of randomness.
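Both halves of that (a focused deck, and seeing more cards) fall out of one small formula, sketched here in Python. The 60-card deck and the card counts are example numbers I picked for illustration, and the function name is my own.

```python
from math import comb


def chance_of_at_least_one(deck_size: int, useful_cards: int, cards_seen: int) -> float:
    """Probability that at least one useful card is among the cards you've seen."""
    misses = comb(deck_size - useful_cards, cards_seen)  # ways to see no useful card at all
    return 1 - misses / comb(deck_size, cards_seen)


DECK = 60  # assumed deck size for the example

# Card advantage: same deck, but you've seen more of it.
for seen in (7, 10, 15):
    print(f"4 useful cards, {seen:>2} seen: {chance_of_at_least_one(DECK, 4, seen):.1%}")

# Focus: same number of cards seen, but more of the deck fits your strategy.
for useful in (4, 8, 12):
    print(f"{useful:>2} useful cards, 7 seen: {chance_of_at_least_one(DECK, useful, 7):.1%}")
```

With four useful cards in that 60-card deck, seeing 7 cards gives you about a 40% chance of having one, and seeing 10 pushes that to about 53%. Neither skill removes the shuffle, but both tilt it in your favour.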

Purple has 5, 6, 9, and 10, whereas Red has 3, 4, 6, 8, 9, 10, and 11. Despite Purple having more points than Red, Red is arguably in the stronger long term position because they're less at the mercy of luck.
In Settlers of Catan, by diversifying the values your cities are adjacent to, you can reduce the effects of luck on your resource intake. If you only ever get resources on a 3, 6, and 8, you'll get hosed if the dice continually roll 9s and 4s. If you have as many different numbers as you can possibly get, then no matter what numbers show up on the dice, you're getting resources, which can be traded in for other resources.
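Here's a Python sketch of the Purple versus Red comparison. It only asks whether a roll of two six-sided dice pays out anything at all, not how many resources you'd collect, and the function names are my own.

```python
from fractions import Fraction
from itertools import product

# Count how many of the 36 two-dice outcomes produce each total.
WAYS = {}
for a, b in product(range(1, 7), repeat=2):
    WAYS[a + b] = WAYS.get(a + b, 0) + 1


def chance_of_income(numbers: set[int]) -> Fraction:
    """Probability that a single roll lands on one of your numbers."""
    return Fraction(sum(WAYS[n] for n in numbers), 36)


def chance_of_dry_streak(numbers: set[int], rolls: int) -> Fraction:
    """Probability of getting nothing at all for the given number of rolls in a row."""
    return (1 - chance_of_income(numbers)) ** rolls


purple = {5, 6, 9, 10}
red = {3, 4, 6, 8, 9, 10, 11}

for name, numbers in (("Purple", purple), ("Red", red)):
    income = chance_of_income(numbers)
    dry = chance_of_dry_streak(numbers, 3)
    print(f"{name}: pays out on {income} of rolls, three dry rolls in a row {float(dry):.1%}")
```

Purple collects on 16 of 36 rolls (4/9) and Red on 24 of 36 (2/3), so Red is far less likely to sit through a long stretch of nothing.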

In an MMORPG with trinity-based real-time combat, like World of Warcraft, the things that usually kill tanks are unpredictable spikes of damage. As a tank or a healer, the best thing you could ever do for your tank's survival is to reduce the effects of randomness as much as possible and ensure the rate of incoming damage is as smooth as possible. Things like Active Mitigation and external cooldowns such as Hand of Sacrifice allow the tanks and healers to have some control over the variability of incoming damage.

Spreading out is a pretty standard response to "bosses use targeted AoE abilities"
A more visual example is the case where you know that a boss will use abilities seemingly at random. Let's assume you have 10 players facing a boss who will sometimes put a big circle of doom under a player. To reduce the amount of risk to the group as a whole, obviously the solution (skill) to apply is spreading out. At worst, only a single player will get nailed.
Our left-most player gets hit by a circle of doom. Of course, she can just walk out of it here, but since nobody else was around, the potential for damage is reduced.
Make that harder by having the circles stay for the duration of combat, and now where you move to becomes more important. If you position yourself in a way that doesn't allow you an escape route (or another player does that), then you've left yourself at the mercy of RNG rather than using your skills to pick a better position to wait for the incoming attack.

Our left-most player gets boxed in by someone else moving nearby them. If either player had more awareness of their surroundings (a skill), they could have prevented trapping the left-most player.
Interestingly, this is why I'm extremely hesitant to say I got screwed by RNG in most boss fights. That's not to say there aren't badly designed fights out there where it is truly the case where randomness can screw you, but the careful application of skill can often mitigate or remove that "bad luck" entirely.

On the other side of the equation, you have games that are entirely luck: Chutes and Ladders (or Snakes and Ladders depending on where you live) has precisely zero decisions and precisely zero factors that are influenced by the individual.

If you made a game where the point was to kick a soccer ball the furthest, you have both physical skills (such as strength and accuracy), as well as mental skills (which way is the prevailing wind headed? What spin should I put on the ball?), so while at first blush it might not seem like there's any "skill" involved because it's a feat of physical prowess, there are definitely decisions occurring that could make or break a win even if the players aren't physically equally capable. Basically, a skilled player would use the wind to their advantage. An unskilled player would say they lost because they were unlucky due to the wind. To be fair, however, a strong gust of wind might actually alter the outcome of the match.

Randomness is a tool like any other in a designer's kit. It can be used to muddy the skill disparities between players, or to ensure that players don't get stuck in a rut where the exact same strategy applies every single time. Players can fight randomness by applying skill, but ultimately they will likely never overcome it entirely (it's possible, if unlikely, for that 11 to be rolled in Settlers over and over again and it's the only number you don't have), so the skill-muddying effects can still apply.