What Statistics Never Told You About Ratings

Mar 23, 2022

In the past few days, I’ve been playing Mario Bros, specifically Super Mario 64. It was a great game in the 90s, but nowadays gamers might see it as a relic.

Anyway, while I was playing, I remembered my old days when I used to play not only Super Mario 64 but also Pokemon from Game Boy Color.

Suddenly, I asked myself: Are Mario Games better or worse than Pokemon Games for gamer communities?

That question stayed in my head until I decided to find out whether there was a significant difference between Mario and Pokemon.

At first glance, asking anyone whether one game is better than another is subjective, because people attach memories and feelings to a game, a trip, or anything else.

Fortunately, we have tons of data nowadays, and we can take a shot at answering this question with ratings.

How much data do we need to solve our current issue? For example, is a sample enough to detect if a game is better than another? And last but not least, are ratings the best way to validate if something is better or not?

My first guess is that Mario Bros has better ratings than Pokemon just because it’s older and because I’m biased.

I will use Python and R (yes, I’m happy to use the best of both worlds) to figure out whether Mario or Pokemon is rated higher, using the API from RAWG.io.

RAWG is one of the largest video game databases and, as far as I know, is used by gaming communities to discuss games.

Thankfully, RAWG has a giant Video Games Database API. I quote from their website: “And we are gladly sharing our 500,000+ games, search, and machine learning recommendations with the world. Learn what the RAWG games database API can do and build something cool with it!”

So, first, I will use Python to load the packages and load my API key:

# Import libraries
import os
import requests
import pandas as pd
import time
import json
from dotenv import load_dotenv

load_dotenv() # take environment variables from .env

# API from dot env
API_KEY = os.getenv("API_KEY")

If you wonder how and where to get your API key from RAWG, see the API documentation page on RAWG.io.

After loading the libraries and the API key, I tested whether everything was working correctly:

url = f"https://api.rawg.io/api/games?key={API_KEY}&search=mario"
df = requests.get(url).json() # Grab data

If you print out the data, you will get it in an ugly format, but luckily, there are ways to view it more neatly. Because I’m using RStudio as the IDE for my data analysis, I can inspect how the requested data is structured.

As you can see, there’s a list called “results” where you can obtain data like id, name, platforms, genres, etc.

How can you access a specific piece of information? Because the response is JSON, you have to follow its keys to navigate and parse the data. For example, if you want to obtain the background image of the first result:

df["results"][0]["background_image"]

Which game is this one? I can’t call myself a Mario Fan :(

So with that in mind, I need to create a for loop of the data I need. But, first, let’s test if a simple loop is working correctly:
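A minimal sketch of such a test loop, using a small hand-made dictionary in place of the real API response. The sample values and the helper name `summarize_results` are invented for illustration; only the `results`, `id`, `name`, and `rating` keys come from the response described above.

```python
# Hand-made stand-in for the parsed API response; the values are invented.
sample_response = {
    "results": [
        {"id": 1, "name": "Example Game A", "rating": 4.2},
        {"id": 2, "name": "Example Game B", "rating": 0.0},
    ]
}


def summarize_results(response):
    """Return one 'id | name | rating' line per game in the response."""
    lines = []
    for game in response["results"]:
        # .get() returns None instead of raising KeyError if a field is missing
        lines.append(f'{game.get("id")} | {game.get("name")} | {game.get("rating")}')
    return lines


for line in summarize_results(sample_response):
    print(line)
```

If every entry prints the fields you expect, the same loop body can be pointed at the real `df["results"]` list.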

Everything is working as expected. Once I had tested enough, I wrote a function to grab all the data, which basically:

Keeps looking for data (page 1, 2, and so on) until there’s no data left
Stores the data into a pandas data frame
Grabs a “sample” by indicating the title of the game and creator(s)

# Function for game details (Second API call inside loop)
def get_game_details(game_id):
  
  # URL and request
  detail_url = f"https://api.rawg.io/api/games/{game_id}?key={API_KEY}"
  r = requests.get(detail_url).json()
  
  # Variables
  playtime = r["playtime"]
  description = r["description"]
  updated = r["updated"]
  
  return playtime, description, updated
# =========================
# Main Function
# =========================
def get_games(game_data, creator, title_game):
  
  # Make RAWG API call & store game info
  url = f"https://api.rawg.io/api/games?key={API_KEY}&creators={creator}&search={title_game}&ordering=released"
  df = requests.get(url).json() # Grab data
  time.sleep(1) # Just a sec 
  game_info = df["results"]
  
  
  # Reach until last page
  while df["next"]:
    df = requests.get(df["next"]).json()
    time.sleep(1) # Just a sec
    game_info.extend(df["results"])
  
  # Data For Loop
  for game in game_info:
    
    # Basic data
    game_id = game["id"]
    slug = game["slug"]
    released = game["released"]
    image = game["background_image"]
    rating = game["rating"]
    rating_top = game["rating_top"]
    ratings_count = game["ratings_count"]
    
    # Details per game (second function)
    playtime, description, updated = get_game_details(game_id)
    
    # Save data in pandas data frame
    # (DataFrame.append was removed in pandas 2.0, so concatenate a one-row frame instead)
    row = pd.DataFrame([{
    "game_id":game_id,
    "slug":slug,
    "released":released,
    "image":image,
    "rating":rating,
    "rating_top":rating_top,
    "ratings_count":ratings_count,
    "playtime":playtime,
    "description":description,
    "updated":updated
    }])
    game_data = pd.concat([game_data, row], ignore_index=True)
  
  return game_data

# =========================
# Storing Data frame
# =========================

# Build our data frame
game_data = pd.DataFrame(
  columns = ["game_id", "slug", "released", "image", "rating",
  "rating_top", "ratings_count", "playtime","description", "updated"]
  )

# Call function
game_data_mario = get_games(
  game_data,
  "shigeru-miyamoto,hideki-konno,shinji-hatano",
  "mario")

game_data_pokemon = get_games(
  game_data,
  "shigeru-miyamoto,junichi-masuda,hitoshi-yamagami",
  "pokemon")

For the scope of this article, I did a quick Google search to see which creators were the most involved in Mario and Pokemon games and took their games as the “sample” data.

Now I finally have the data, so the next step is to merge both data sets, and I decided to use R for this and the rest of this article.

library(tidyverse)
library(reticulate)
library(flextable)
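library(infer)

# Assumption: the pandas data frames built above are read from the
# Python session through reticulate's `py` object
mario_data <- py$game_data_mario
pokemon_data <- py$game_data_pokemon

# Bind data set
df <- mario_data |>
  mutate(title_game = "Mario") |> # Mario data
  mutate(
    across(
      where(is.list),
      # Turn "NULL" entries into NA and flatten list columns into vectors
      function(x) ifelse(x == "NULL", NA, x) |> unlist()
    )
  ) |> 
  # Union all
  bind_rows(
    pokemon_data |> 
      mutate(title_game = "Pokemon") |> # Pokemon data
      mutate(
        across(
          where(is.list),
          function(x) ifelse(x == "NULL", NA, x) |> unlist()
          )
        )
    )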

When I print the data in the console, this is the output

Before diving into solving the current question, you might be wondering why I decided to use statistics instead of machine learning. Why not use a fancy decision tree algorithm instead of going with college concepts?

The key distinction between machine learning and statistics is one of purpose: machine learning is driven by the observed data and aims at accurate predictions, while statistics draws inferences under explicit assumptions such as normality.

Machine learning is concerned with utilizing mathematical models to get a general knowledge of the data to make predictions. In contrast, statistics is concerned with creating a representation of the data and then doing analysis to uncover insights.

For this occasion, I will use Hypothesis Testing because the question we are asking of our data is whether Mario games are rated higher than Pokemon games.

We have to consider whether Mario is significantly better or worse than Pokemon.

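Hypothesis Testing, also known as Statistical Hypothesis Testing, is a technique for deciding whether data support a claim, in our case a claim about the difference between two groups. Because it’s a statistical inference approach, you use a sample to draw a conclusion about the whole population.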

Ok, back to our question. Is there any reason to suppose that the mean rating for Mario games differs significantly from the mean rating for Pokemon games?

ggplot(data = df, aes(x = title_game, y = rating)) +
  geom_boxplot() +
  labs(y = "Game Rating")

The difference in average ratings is 3.70 - 2.86 = 0.84. As a result, Mario games appear to have a 0.84-star advantage. But, more importantly, are these findings indicative of a real difference across all Mario and Pokemon games? Or is it possible that this discrepancy is due to random sampling variation?

Note that the box for Pokemon is much taller than Mario’s, and the reason is that there are more Pokemon games with a rating of 0.
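As a rough illustration of how the difference in average ratings is obtained, here is a sketch with the standard library only. The ratings below are invented toy values, not the RAWG data, and `diff_in_means` is a hypothetical helper name.

```python
from statistics import mean

# Invented toy ratings, NOT the RAWG data used in this article
ratings = {
    "Mario": [4.0, 3.5, 4.5, 3.0],
    "Pokemon": [3.0, 0.0, 4.0, 3.0],
}


def diff_in_means(groups, first, second):
    """Mean rating of `first` minus mean rating of `second`."""
    return mean(groups[first]) - mean(groups[second])


observed = diff_in_means(ratings, "Mario", "Pokemon")
print(round(observed, 2))
```

The order of the two groups matters: swapping them flips the sign of the statistic, which is why the order is stated explicitly when the statistic is calculated.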

Next, I will generate 1,000 replicates of the data by “shuffling,” to test independence under the null hypothesis that Mario and Pokemon games have, on average, the same ratings on RAWG.

If you wonder why “shuffling,” let me explain.

Imagine a world where there’s no difference in ratings between Mario and Pokemon games. In that world, the label “Mario” or “Pokemon” attached to each rating would be irrelevant, right? So if I shuffle the labels, there wouldn’t be any consequence.

I’m trying to shuffle the data randomly because I want to test our hypothesis of no difference between ratings.

With this in mind, I’ll compute the appropriate summary statistic for each of these 1,000 shuffles, which gives the distribution of that statistic under the null hypothesis that Mario and Pokemon games have the same average RAWG rating.

Because the difference in population means is the unknown population parameter of interest, the relevant test statistic here is the difference in sample means.

df |> 
  specify(formula = rating ~ title_game) |> 
  hypothesise(null = "independence") |> 
  generate(reps = 1000, type = "permute") |> 
  calculate(stat = "diff in means", order = c("Mario", "Pokemon")) |> 
  visualize(bins = 10) +
  shade_p_value(obs_stat = 0.84, direction = "both")

Let’s go through the plot’s aspects one by one.

First, the null distribution is represented by the histogram.

Second, the solid line represents the observed test statistic or the sample mean difference.

Third, the p-value is represented by the two shaded regions of the histogram: the probability of getting a test statistic as extreme as or more extreme than the observed test statistic if the null hypothesis is true.
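The three pieces described above (null distribution, observed statistic, p-value) can also be reproduced by hand. The following is a sketch of a two-sided permutation test using only the Python standard library; the ratings are invented, `permutation_p_value` is a hypothetical helper, and the way it counts "extreme" shuffles (absolute value at least as large as the observed one) is one common convention that may differ slightly from what the infer package does internally.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, reps=1000, seed=42):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference, the shuffled differences
    (the null distribution), and the estimated p-value.
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    null_stats = []
    for _ in range(reps):
        rng.shuffle(pooled)  # break any link between label and rating
        null_stats.append(mean(pooled[:n_a]) - mean(pooled[n_a:]))

    # Share of shuffles at least as extreme as the observed difference
    extreme = sum(abs(stat) >= abs(observed) for stat in null_stats)
    return observed, null_stats, extreme / reps


# Invented toy ratings, NOT the RAWG data used in this article
mario = [4.0, 3.5, 4.5, 3.0, 4.2, 3.8]
pokemon = [3.0, 0.0, 4.0, 3.0, 0.0, 2.5]

observed, null_stats, p_value = permutation_p_value(mario, pokemon)
print(f"observed diff = {observed:.2f}, p-value = {p_value:.3f}")
```

Fixing the seed makes the shuffles reproducible; with a different seed the p-value will move a little, which is expected for a simulation-based estimate.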

What is the numerical value of the p-value? To compute this value, we utilize the get_p_value() function:

The p-value of 0.012 is small. In other words, in a hypothetical universe where there was no difference in ratings, there would be only a small chance of seeing a difference at least as large as 3.70 - 2.86 = 0.84.

As a result, we may conclude that, on average, Mario and Pokemon game ratings differ across all of RAWG.io.

Hold on a second, can we conclude there’s a difference?

I hope you have spotted the mistake: the first thing you have to do in hypothesis testing is to know what distribution your data follows.
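One hedged way to get a first feel for the shape of a set of ratings is to look at a few simple summaries before choosing a test. The sketch below uses only the standard library; the values are invented, `describe_ratings` is a hypothetical helper, and these summaries are a quick screen rather than a formal normality test.

```python
from statistics import mean, median, pstdev


def describe_ratings(values):
    """Summaries that hint at how far a sample is from a bell shape."""
    m = mean(values)
    sd = pstdev(values)
    # Skewness: 0 for a symmetric sample, negative with a long left tail
    skew = mean([((v - m) / sd) ** 3 for v in values]) if sd > 0 else 0.0
    return {
        "share_of_zeros": sum(v == 0 for v in values) / len(values),
        "mean": m,
        "median": median(values),
        "skewness": skew,
    }


# Invented toy ratings with many zeros, NOT the RAWG data
toy = [0.0, 0.0, 0.0, 3.5, 4.0, 4.0, 4.5]
summary = describe_ratings(toy)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

A large share of zeros, a mean well below the median, and a clearly negative skewness all suggest that the ratings are far from normally distributed, which matters when deciding which test to trust.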

In addition, are we sure which statistical test to use? I stumbled upon this superb article from Carolina Bento that explains absolutely everything.

Conclusion

Probability and statistics are not easy to interpret. There are a lot of heuristics and biases at play and, most importantly, we do not get tangible feedback.

If you are learning to drive, you know that when you turn the wheel to the right, the car turns right as well. So sooner or later, your own machine learning (a.k.a. your brain) will learn how to drive a car.

Unfortunately, things are different in environments with a lot of uncertainty. As humans, we spend a lot of energy on them, and our energy is a limited resource.

I’m mentioning this because statistics might show you one thing while your brain shows you another. Think about it: you go to an eCommerce website, see a couple of reviews, and, without any data, infer whether a product is good or not.

Statistics can help you, but it is not always the solution. However, by the law of large numbers, the more data you have, the more selective you can be about what you are dealing with.

In conclusion, if you want to be more objective about ratings, give statistics and hypothesis testing a shot, but if you don’t care, go with the game you love and be biased.

Written by thedatafixer