If you’ve ever participated in some sort of competitive game, you’ve probably seen terms like “Elo” mentioned, referring to a player’s performance rating in a competition. You might also have heard that many of those competitions (especially online games!) don’t actually use the original Elo algorithm. Why is that? What are the advantages and disadvantages of the different rating algorithms, and which one should you use?
TL;DR – Advantages and disadvantages of rating algorithms
| | Elo | Glicko-1 | Glicko-2 |
| --- | --- | --- | --- |
| Complexity | Very simple | Moderate | High |
| New players | Start at 1500 and converge slowly | High initial deviation makes them converge quickly | Same as Glicko-1 |
| Returning players | Stale rating is trusted as-is | Deviation grows with inactivity | Deviation grows with inactivity, scaled by volatility |
| Main pitfalls | Needs manual k-value tuning | Less transparent to players | Least transparent; volatility can be exploited (e.g. by intentional losses) |
What are rating algorithms and why are they important?
In 1970, the world chess organisation FIDE introduced the Elo rating. FIDE had already awarded titles like “grandmaster” or “international master” according to players’ accomplishments, but using a rating system allowed for a mathematical approach to ranking all competitors in relation to each other, and Elo did this better than other algorithms at the time. Over 50 years later, Elo is most commonly referred to in the context of online games that offer ranked matchmaking – a mechanism that lets players compete against others of similar skill. However, almost all of these games use more sophisticated algorithms instead of actual Elo. Why is that the case?
The Elo algorithm and why you probably shouldn’t use it
In Elo, the rating difference between two players determines the predicted outcome of a match. A player whose rating is 100 or 200 points greater than their opponent’s is expected to “score” 64% or 76% respectively. This doesn’t necessarily mean that they will win that percentage of their games if the game allows for ties: wins, ties and losses are represented by scores of 100%, 50% and 0% respectively. After every game, ratings are adjusted according to the difference between the players’ actual scores and their predicted scores. A player with a predicted score of 64% will gain less rating from a win than their opponent with a predicted score of 36% would gain from winning; conversely, the favourite loses more rating from a loss than the underdog does.
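In code, the standard Elo formulas look roughly like this – a minimal Python sketch, where the default K-factor of 20 is just one common choice and varies between implementations:

```python
def expected_score(rating, opponent_rating):
    """Standard Elo expectation: a 100-point edge predicts ~64%,
    a 200-point edge ~76%."""
    return 1 / (1 + 10 ** ((opponent_rating - rating) / 400))

def update_rating(rating, opponent_rating, score, k=20):
    """Adjust a rating after one game.
    score: 1.0 for a win, 0.5 for a tie, 0.0 for a loss.
    k (the K-factor) caps how much a single result can move the rating."""
    return rating + k * (score - expected_score(rating, opponent_rating))

# The favourite gains little from a win...
print(round(update_rating(1700, 1500, 1.0)))  # 1705
# ...but loses much more when the upset happens.
print(round(update_rating(1700, 1500, 0.0)))  # 1685
```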
Since its introduction, a couple of practical issues with this relatively simple algorithm have emerged – some conceptual ones, like the assumption that skill is transitive, as well as cases in which ratings have simply proved too inaccurate. If John Doe joins as a new player, his rating is always assumed to be 1500, the average. But there’s absolutely no indication that John Doe actually deserves a 1500 rating; realistically, because he is new, his true rating is likely lower. During his first couple of rating periods, his rating will probably decrease, and the ratings of players winning against him will increase more than they should.
Now, let’s say John keeps at it and improves significantly, eventually reaching a rating of 2000, but then he takes a break and doesn’t play chess for a year. It’s very unlikely that he will still play at the level of a 2000-rated player on his return; it’s much more likely that he’s rusty and will underperform for at least a couple of rating periods. Again, his true rating is lower than Elo represents, yielding the same inaccuracies as with a bad new player.
FIDE still uses actual Elo. Switching to a different algorithm would likely be a political nightmare, as it would directly affect the rankings and, depending on the algorithm used, some additional indicators would have to be picked arbitrarily for existing players. At that point, the added precision and consistency of a better algorithm might not be worth the hassle. For video game matchmaking, however, it usually is.
Why Glicko?
If the rating is used for matchmaking, having a genuinely precise and consistent rating becomes much more important than in the FIDE example, because the quality of the players’ matches is directly affected by it. Bad ratings will result in unfair games in which one side has no reasonable chance to win. This is especially frustrating in games that take a while to complete, leaving players stuck in a game they cannot win, and in team games, where overrated players also ruin their teammates’ games. A better algorithm will create better matches, especially if there’s a large pool of players trying to find a game at any given time.
Elo’s k-value (or K-factor) – which controls how much rating a single result can move – is commonly adjusted to alleviate some of Elo’s worst issues. It’s usually increased for new players so that they reach the ballpark of their actual rating faster. In some cases, like in the FIDE ratings, it’s also reduced for very good players to prevent their ratings from fluctuating too much. This crutch isn’t necessary if you simply use Glicko instead of Elo, because Glicko automates something similar in a more precise manner.
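As a sketch, such a K-factor schedule might look like the following – the thresholds roughly follow FIDE’s published rules, simplified by leaving out the extra rule for junior players:

```python
def fide_k_factor(rating, rated_games):
    """Rough sketch of FIDE-style K-factor tiers: new players move fast,
    elite players move slowly. (Junior-player special case omitted.)"""
    if rated_games < 30:
        return 40   # new players converge to their real strength faster
    if rating >= 2400:
        return 10   # top ratings are kept stable
    return 20       # everyone else
```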
In Glicko, each player has not only a rating but also a deviation value, which represents how certain the system is that the rating is actually correct. This deviation is set very high for new players (usually 350), reflecting that it’s rather unlikely that they play at the level of a genuinely 1500-rated player. When players with a high deviation play, their rating gets adjusted more strongly, while the deviation itself decreases. Therefore, new players reach an accurate rating more quickly. This has a significant impact on a game’s new player experience, especially if smurfing (experienced players making new accounts) is common: players who are far from average will be matched accordingly after relatively few games. Deviation can also increase again if a player sits out rating periods. The longer a player doesn’t play, the less likely it is that they’re still at the playing strength their rating represents – they might have gotten rusty, or even improved because they’ve been competing outside your system. Therefore, returning players are also re-rated more quickly.
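For the mathematically inclined, here is a minimal Python sketch of a Glicko-1 rating-period update, following the formulas from Glickman’s original paper; the inactivity constant c below is the illustrative value from that paper, not a universal default:

```python
import math

Q = math.log(10) / 400   # Glicko scaling constant, ~0.0057565
MAX_RD = 350             # deviation assigned to brand-new players

def g(rd):
    """Dampens the impact of opponents whose own rating is uncertain."""
    return 1 / math.sqrt(1 + 3 * (Q * rd / math.pi) ** 2)

def expected(r, r_j, rd_j):
    """Predicted score against one opponent, analogous to Elo's expectation."""
    return 1 / (1 + 10 ** (-g(rd_j) * (r - r_j) / 400))

def glicko1_update(r, rd, results):
    """One rating-period update for a single player.
    results: list of (opponent_rating, opponent_rd, score) tuples,
    with score 1.0 / 0.5 / 0.0 as in Elo."""
    d2_inv = Q ** 2 * sum(
        g(rd_j) ** 2 * expected(r, r_j, rd_j) * (1 - expected(r, r_j, rd_j))
        for r_j, rd_j, _ in results
    )
    denom = 1 / rd ** 2 + d2_inv
    r_new = r + (Q / denom) * sum(
        g(rd_j) * (s - expected(r, r_j, rd_j)) for r_j, rd_j, s in results
    )
    rd_new = math.sqrt(1 / denom)   # playing always shrinks the deviation
    return r_new, rd_new

def inflate_rd(rd, idle_periods, c=34.6):
    """Deviation drifts back toward MAX_RD while a player is inactive;
    c controls how fast (34.6 is the example value from Glickman's paper)."""
    return min(math.sqrt(rd ** 2 + c ** 2 * idle_periods), MAX_RD)

# A brand-new player (1500, RD 350) beats a stable 1600-rated player:
print(glicko1_update(1500, 350, [(1600, 50, 1.0)]))
# -> roughly (1732, 253): a huge jump, because the rating was so uncertain
```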
Apart from that, Glicko works very similarly to Elo. But is it actually better in every case?
Pitfalls!
So far, we’ve only been talking about Glicko-1. The more complicated iteration, Glicko-2, adds yet another indicator for each player: their volatility. It measures the degree of expected fluctuation in a player’s rating, based on how erratic the player’s performances are. This affects how rating deviations develop and further increases rating precision over time. Most rating systems you will encounter in online games are based on Glicko-2 or attempt something similar.
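The full Glicko-2 update involves an iterative volatility calculation that’s beyond the scope of this post, but a single step shows the core idea: at the start of each rating period, a player’s deviation grows with their volatility. A sketch of that step (step 6 in Glickman’s Glicko-2 paper):

```python
import math

SCALE = 173.7178   # conversion between rating points and the Glicko-2 scale

def pre_period_deviation(rd, volatility):
    """At the start of a rating period, a player's deviation widens
    according to their volatility, so erratic players become 'uncertain'
    faster than consistent ones."""
    phi = rd / SCALE                                   # deviation on the Glicko-2 scale
    phi_star = math.sqrt(phi ** 2 + volatility ** 2)   # volatility widens it
    return phi_star * SCALE                            # back to rating points
```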
However, you shouldn’t base leaderboards, ranks and matchmaking algorithms purely on a player’s current Glicko-2 rating. Players of the mobile game Pokémon Go have found out that losing games on purpose can make it easier to achieve very high ratings: intentional losses inflate the player’s volatility, causing the following games to be weighted too heavily during rating calculation. If players follow up the intentional losses with a win streak, they often end up with a rating higher than they could achieve by just playing normally.
Glicko-2 also amplifies the transparency issues that Glicko-1 already comes with: players might not understand why they gain or lose more rating from certain games than from others. In the popular MOBA League of Legends, players are ranked in leagues and divisions as an abstraction layer on top of their actual matchmaking rating. Systems like this have become the norm whenever relatively complicated rating algorithms are used. For one, it’s much easier for a player to grasp the concept of being “Gold” than of being, say, “somewhere between 1610 and 1780 rating”. An abstraction layer can also alleviate the previously mentioned issue that Pokémon Go players ran into – the system would simply not rank you up from Silver to Gold while your volatility and/or rating deviation are too high. Instead, you’d have to play some more games to prove that your unusual performance wasn’t a fluke, which stabilizes leaderboards. Still, a system like this can feel artificial, especially if a player’s rating and their displayed rank are unusually far apart. Your players might ask themselves: “Why does the system not simply rank me where I truly belong, if it already knows my proper rating?”
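To make this concrete, a promotion gate on top of a Glicko rating could look roughly like the following sketch – note that the tier floors and the deviation cutoff are invented for this example, not taken from any real game:

```python
# Hypothetical rank gate: tier floors and the deviation cutoff are
# invented for illustration, not taken from any real game.
TIERS = [("Bronze", 0), ("Silver", 1200), ("Gold", 1600), ("Platinum", 2000)]
MAX_RD_FOR_PROMOTION = 80   # "prove it" threshold: no promotion while uncertain

def tier_for(rating):
    """Highest tier whose floor the rating has reached."""
    name = TIERS[0][0]
    for tier_name, floor in TIERS:
        if rating >= floor:
            name = tier_name
    return name

def displayed_rank(rating, rd, current_tier):
    """Promote only when the rating is both high enough *and* certain enough;
    a high deviation keeps the player in place, it never demotes them."""
    earned = tier_for(rating)
    order = [name for name, _ in TIERS]
    if order.index(earned) > order.index(current_tier) and rd > MAX_RD_FOR_PROMOTION:
        return current_tier
    return earned

# A 1650-rated player with a shaky rating stays in Silver for now...
print(displayed_rank(1650, 120, "Silver"))   # -> "Silver"
# ...and ranks up once their deviation has settled.
print(displayed_rank(1650, 60, "Silver"))    # -> "Gold"
```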
It’s important to note that the complexity of League of Legends’ rating system likely goes much further than the simple gate sketched above. Making players feel like they’re “climbing” the ranks motivates them to play more. Seasonal resets, placement and promotional matches could, first and foremost, be tools to increase player engagement.
Takeaways
Rating systems are complicated, and things can go wrong. However, if you want to set up a rating system for a game you’re making or a competition you’re managing, it’s very likely that you’ll be fine using Glicko-1. Using Elo has no real benefit except being dead simple. Glicko-2 can yield even better precision, but it also means more work and more things to watch out for, which might or might not be worth it depending on your use case.
If you don’t want to start from scratch, consider using the Competier App to manage your competition. It features Elo, Glicko-1 and Glicko-2 calculations out of the box, as well as full entry management, invites, result rollbacks and more. Benefit from already implemented best practices and solutions to common challenges in rating systems, for example support for games with more than 2 players. All of this is completely free! And if you’re a developer, take a look at the Competier API documentation and use it to power your next project.