How Much Do Dominant Players Affect StarCraft Balance Data?
How much can one player skew the data?
About a year back, I wrote on Protoss’s multi-year underperformance in high-level professional StarCraft II. I really enjoyed writing that article, in part because it made me realize how challenging it is to define a trustworthy metric around underperformance. This is especially true in StarCraft II, which (regrettably) does not have an infinite number of professional players and professional tournaments. All it takes is one or two players retiring, one player dominating, or a streak of bad luck in an already sparse professional calendar, and the data can end up lopsided.
From my perspective, this doesn’t make the data useless; and anyway, I’m open about my view that both design and balance require more qualitative judgment than hard-nosed data-driven decision making. But the data does have meaningful asterisks, and I was curious to find out how big those asterisks actually are. And so today I thought I’d take a stab at digging into one of ‘em.
The Serral Factor
An interesting train of thought that emerged from the Protoss underperformance discussion is whether lopsided tournament results are partly the result of one or two players being really, really good. Maybe the game is balanced - but one race appears unstoppable because that’s what the best player in the world is playing.
I think if we had enough professional players, enough professional tournaments, and enough competitive stability, we could solve this problem in reverse, by simulating the probability of various outcomes:
Given the 2022 calendar year and final tournament results, what’s the probability of arriving at these results if the three races were perfectly balanced?
That would be a cool statistical analysis! But I don’t think there’s enough data for this to be meaningful; the margin of error is too large.
I figured I would try something else. I created a forward-looking simulation that distributes a bunch of random professional players across ELO ratings, with the option of throwing in 1 or more players with substantially higher ratings. And I added some simple hooks to compute the round-of-16 racial distribution (the preferred metric of my previous analysis), as well as racial distribution among tournament winners.
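To make the setup concrete, here’s a minimal sketch of how such a player pool might be generated. The function name and dictionary fields are mine; only the 100-200 rating range for “normal” players, the three races, and the option of adding higher-rated players come from the article:

```python
import random

def make_player_pool(n_normal, dominant_players, rng):
    """Build a pool of 'normal' players at 100-200 ELO, plus any
    dominant entries (illustrative sketch; field names are mine)."""
    races = ("Zerg", "Protoss", "Terran")
    pool = [{"race": rng.choice(races), "elo": rng.uniform(100, 200)}
            for _ in range(n_normal)]
    # dominant_players: e.g. [("Zerg", 600)] for the nearly unbeatable case
    pool.extend({"race": race, "elo": elo} for race, elo in dominant_players)
    return pool
```

For a 128-player tournament with one dominant Zerg, you’d call something like `make_player_pool(127, [("Zerg", 600)], random.Random(seed))`.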
Methodology
I ran simulations of 128-player tournaments featuring randomly distributed players and ratings, with up to N players with unusually high ratings. I decided to use the ELO rating system due to its simplicity and plentiful documentation; I also decided early on not to get into the weeds of computing how ratings move up or down after each game, assuming that player skill level would not meaningfully change within the context of a single tournament. “Normal” (i.e. non-dominant) players were distributed between ELO ratings of 100 and 200, offering a range of single-game win probabilities from 36% to 64% depending on the two players’ ratings.
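The single-game probabilities quoted here fall out of the standard ELO expected-score formula; a minimal sketch (the function name is mine):

```python
def win_probability(rating_a, rating_b):
    """Standard Elo expected score: probability that player A
    beats player B in a single game."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
```

At the extremes of the 100-200 range, `win_probability(200, 100)` comes out to about 0.64 and `win_probability(100, 200)` to about 0.36, matching the 36-64% spread above; the 600-vs-200 case later in the article works out to about 0.91.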
The tournament structure was a tricky question. To use two strawman examples - a randomly shuffled 128-player bracket consisting of only best-of-1s is going to produce more random results, whereas a 128-player round-robin consisting of only best-of-7s is going to produce more predictable results. To try to avoid making this an overly academic question, I strove for a tournament format somewhat similar to real professional tournaments, which I think offer a fairly decent blend of skill and random chance:
A round-of-128, consisting of 16 groups with 8 players each, best-of-1 round-robin, with 4 players advancing from each group.
A round-of-64 followed by a round-of-32, consisting of groups of 4 players (randomly shuffled from the prior round), GSL-style groups with 2 players advancing from each group.
A 16-player seeded bracket based on win counts from the previous rounds, consisting of best-of-3s (round-of-16), best-of-3s (round of 8), best-of-5s (round of 4), and a best-of-7 (finals).
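A sketch of the first stage above, a best-of-1 round robin within one group. The function is illustrative (a real tournament would also need proper tiebreakers); advancement is just taking the top slice of the returned ordering:

```python
import random

def round_robin_group(players, rng):
    """Play a best-of-1 round robin within one group and return the
    players sorted by win count (ties broken arbitrarily here).
    Each player is a dict with an 'elo' key; illustrative sketch."""
    wins = {id(p): 0 for p in players}
    for i, a in enumerate(players):
        for b in players[i + 1:]:
            # standard Elo expected score for a single game
            p = 1.0 / (1.0 + 10 ** ((b["elo"] - a["elo"]) / 400))
            winner = a if rng.random() < p else b
            wins[id(winner)] += 1
    return sorted(players, key=lambda p: wins[id(p)], reverse=True)
```

The round-of-128 stage would then be `round_robin_group(group, rng)[:4]` for each of the 16 groups.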
Each time I simulated a tournament, I stored which race won, and the racial representation in the round of 16. (Not rocket science for sure, and plenty of room for more work in this area).
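Two more pieces I’d expect a simulation like this to need: playing out a best-of-N series, and aggregating the per-tournament results described above. Both function names and shapes are my own guesses, not the author’s code:

```python
import random
from collections import Counter

def play_series(p_win, best_of, rng):
    """True if the favorite (single-game win probability p_win) wins a
    best-of-N series; first player to (N // 2 + 1) game wins takes it."""
    need = best_of // 2 + 1
    wins = losses = 0
    while wins < need and losses < need:
        if rng.random() < p_win:
            wins += 1
        else:
            losses += 1
    return wins == need

def tally(results):
    """Aggregate per-tournament results: each result is the winning
    race plus the list of races in that tournament's round of 16."""
    winners, ro16 = Counter(), Counter()
    for winner_race, ro16_races in results:
        winners[winner_race] += 1
        ro16.update(ro16_races)
    return winners, ro16
```

Note that longer series amplify small skill edges, which is the intuition behind the strawman comparison above: a 60% favorite wins a best-of-7 roughly 71% of the time.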
Sanity Test - No Dominant Player
The first thing I simulated was a random distribution of players and races across 1000 tournaments with no single dominant player. This is mostly a gut check to verify that my code isn’t doing anything crazy:
Winners - 337 Zerg, 346 Protoss, 317 Terran
Overrepresented count - 122 Zerg, 116 Protoss, 127 Terran
Underrepresented count - 141 Zerg, 153 Protoss, 179 Terran
(I picked a stricter bar for under- or over-representation than in my previous analysis, because now I have a lot more data - if a race has 3 or fewer or 8 or more representatives in the round-of-16 bracket, it’s considered under- or over-represented, respectively.)
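The stricter bar in parentheses amounts to a tiny classifier (a perfectly even split would put each race at 5-6 of the 16 slots):

```python
def representation(ro16_count):
    """Classify one race's round-of-16 head count under the stricter
    bar described above: 3 or fewer is under-represented, 8 or more
    is over-represented, out of 16 total slots across 3 races."""
    if ro16_count <= 3:
        return "under"
    if ro16_count >= 8:
        return "over"
    return "normal"
```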
It’s worth noting that under- and over-representation are quite common, appearing in a little over half of brackets. (The counts sum to more than the number of skewed brackets because of double-counting - once one race is over-represented in a bracket, it’s more likely that one or both of the other two races are under-represented). But over a large data set, they even out.
One Nearly Unbeatable Player
Next, I ran the opposite simulation, again mostly to sanity test the code. I added a Zerg player at 600 ELO, giving them a 91% win probability against the next highest-rated player in the pool (with an ELO of at most 200):
Winners - 973 Zerg, 16 Protoss, 11 Terran
Overrepresented count - 206 Zerg, 88 Protoss, 82 Terran
Underrepresented count - 78 Zerg, 191 Protoss, 195 Terran
Not super surprisingly, Zerg wins 97.3% of the tournaments. Notably this is higher than the dominant player’s highest possible single game win probability against any other player (91%), because other Zerg players win, too.
Notably, the over- and under-represented counts don’t end up skewed that much. It’s still the case that only about half of all brackets feature under- or over-representation. I think this makes sense - the nearly unbeatable player reserves one slot in the bracket for themselves, but they can only eliminate so many other players, and those eliminations are spread evenly across races. The other 15 slots are still up for grabs.
One Dominant Player
Next, I ran the more interesting case of one player at an ELO of 300, giving them a 64-76% win rate against the other players in the pool. This time I ran 10,000 trials, because, why not:
Winners - 4994 Zerg, 2482 Protoss, 2524 Terran
Overrepresented count - 1763 Zerg, 1031 Protoss, 1000 Terran
Underrepresented count - 1096 Zerg, 1924 Protoss, 1905 Terran
I was surprised by this, so much so that I spent quite a bit of time double checking the code. (Yes, this is my latest rationalization for not publishing anything for weeks). I also ran another simulation at 250 ELO, offering a 57-70% single game win rate against other players:
Winners - 4093 Zerg, 2988 Protoss, 2919 Terran
Overrepresented count - 1647 Zerg, 1082 Protoss, 1071 Terran
Underrepresented count - 1282 Zerg, 1828 Protoss, 1831 Terran
It’s still skewed, although the dominant player’s tournament win rate went down significantly (winning just 11% of tournaments instead of 25.3%). Notably, while that player can be upset, they’re just as likely to be upset by a player of their own race as by a player of a different race. The result is that Zerg comes out on top around 41% of the time, and is over-represented in the round-of-16 over 50% more often than the other two races.
Three Noticeably Better Players
I want to highlight at this point that these win rates are fairly realistic. Aligulac predicts that the current #1 player (Clem) has a 57% win rate against the current #10 player (GuMiho). Serral (who’s no longer listed as active, despite winning the last major tournament) would have an 83% win rate in that match-up!
Nonetheless, I wanted to see how skewed the data ends up with only modest differences. I put in three “noticeably better” players, at 225 ELO each - 2 Zergs, 1 Terran. This trio (Serral, Reynor, and Maru) are frequently cited as the reason that Protoss underperforms:
Winners - 3864 Zerg, 2793 Protoss, 3343 Terran
Overrepresented count - 1646 Zerg, 871 Protoss, 1239 Terran
Underrepresented count - 1192 Zerg, 2128 Protoss, 1597 Terran
The skew is noticeable; Protoss is under-represented about twice as often as Zerg in the round-of-16.
Analysis
I’ll be the first to say that this analysis has limitations. One important factor it excludes is players with particularly good or bad match-ups (e.g. a TvT expert) and match-ups with specific balance problems (e.g. TvP) that may not extend to the races’ other asymmetric match-ups (TvZ and ZvP, respectively). It also uses a randomized ELO distribution instead of sourcing its data from a canonical source like Aligulac; real-life rating distribution of professionals is likely not as even.
All of this and more is stuff that I’d love to generalize and throw in a web app somewhere so folks can play around with the data. But specifically with regard to this article’s goal - assessing the impact of a dominant player - what surprised me most as I ran simulations is the extent to which dominance can skew the data. Serral’s predicted win rates against other top-10 players suggest he would perform at least as well as (and likely better than) the slightly dominant player (+50-150 ELO), a scenario which produces a ~41% tournament win rate for Zerg. Lower his advantage slightly, but add in a couple of other dominant players, and you still see a large skew.
The degree of underperformance, to be fair, is less significant than in real life, where my previous analysis put Protoss’s over-to-underperformance ratio at 1:3 in one year and 1:6 in another. But the amount of skew is larger than my intuition suggested. I’ll be honest here and say that I didn’t give this notion enough credit, particularly the impact of a dominant player of one race retiring.
From my perspective (and feel free to call it confirmation bias), this doesn’t change the conclusions of my last article:
And as Protoss goes, I think many people will look at this data and conclude that the tough years they had beginning in 2021 - and continuing on to this day - have more to do with player retirements than anything else, given that things were probably OK back in 2020. And I think others will quibble with the choice to exclude the EPT Circuit. Hey, at least Protoss does well in mid- to high-GM, right?
But another way of framing that is that multiple years of underperformance at the game’s highest-level global tournaments for one of the game’s three races is a perfectly normal and acceptable thing.
Which, well… maybe not?
If you believe that racial representation is important (as I do), and you believe that a handful of players can skew it significantly (as I now do, too), then the need for proactive re-designs and re-balances becomes more urgent, not less, because good racial representation becomes harder to land with minor tweaking alone. I find this especially convincing when viewed as a human problem of perception - if professional players think they are at a disadvantage, they may respond by practicing less, and underperformance becomes a self-fulfilling prophecy. The region locking debate featured a similar train of thought, and I concluded at the time (sadly, the link is broken now) that the system had a meaningful impact on foreigner performance in professional StarCraft II.
To be sure, I think a fair alternative interpretation is that, given the ability of one or two dominant players to skew things significantly, we ought to be more conservative and not read too much into trends in the data. And I think especially in games with a really large professional player pool, the right answer may often be to assume that dominance and retirements across races will cancel each other out over time. And I don’t want to discount that perspective! Part of the reason I offer up the data first is to be transparent as to what it shows.
But personally, I see this exercise also as highlighting the limitations within the data; that while the data can provide useful signal, it shouldn’t override qualitative judgment about the state of the meta. If we’re able to game out build orders and reactions that on paper look very hard for one side of a match-up, and in practice the meta aligns with that theorycrafting and isn’t moving for a significant amount of time, is it sufficient to rest on the laurels of incomplete data and hope someone figures it out? My gut still says no.
StarCraft II is an incredibly competitive game. And yet, it still has its Serrals and its Marus. The impact of such players is larger than I previously thought; and that means (to me, at least) that it’s even more important to shake things up to keep the meta fresh and dynamic. The art of tuning, as far as I’m concerned, still has its place in the design and balance of professional play.
Until next time,
brownbear
If you’d like, you can follow me on Twitter, Facebook, and Instagram, and check out my YouTube and Twitch channels.
P.S. As always, apologies for the delays in publication! My pride refuses to let me remove the “weekly” modifier from the subscription prompt. I’ll keep on plugging away to get back on track.