Written by Dinakar Chappa, Prashant Krishnan, and Omkar Konaraddi
In this tutorial, we’ll glean rich insights from the Pokedex, a database of all Pokemon and their attributes, using various techniques and tools from CMSC320.
First we’ll scrape data on Pokemon from https://pokemondb.net/pokedex/all
library(dplyr)
library(rvest)
url <- "https://pokemondb.net/pokedex/all"
pokedex.scraped <- url %>%
read_html() %>%
html_node("table#pokedex") %>%
html_table() %>%
as.data.frame() %>%
rename("Sp.Atk" = "Sp. Atk") %>%
rename("Sp.Def" = "Sp. Def")
pokedex.scraped %>% head()
## # Name Type Total HP Attack Defense Sp.Atk
## 1 1 Bulbasaur GrassPoison 318 45 49 49 65
## 2 2 Ivysaur GrassPoison 405 60 62 63 80
## 3 3 Venusaur GrassPoison 525 80 82 83 100
## 4 3 VenusaurMega Venusaur GrassPoison 625 80 100 123 122
## 5 4 Charmander Fire 309 39 52 43 60
## 6 5 Charmeleon Fire 405 58 64 58 80
## Sp.Def Speed
## 1 65 45
## 2 80 60
## 3 100 80
## 4 120 80
## 5 50 65
## 6 65 80
We’ll filter out the Mega Pokemon because they’re modified versions of existing Pokemon and are almost duplicates.
pokedex.scraped <- pokedex.scraped %>%
filter(grepl("Mega", Name) == FALSE)
pokedex.scraped %>% head()
## # Name Type Total HP Attack Defense Sp.Atk Sp.Def Speed
## 1 1 Bulbasaur GrassPoison 318 45 49 49 65 65 45
## 2 2 Ivysaur GrassPoison 405 60 62 63 80 80 60
## 3 3 Venusaur GrassPoison 525 80 82 83 100 100 80
## 4 4 Charmander Fire 309 39 52 43 60 50 65
## 5 5 Charmeleon Fire 405 58 64 58 80 65 80
## 6 6 Charizard FireFlying 534 78 84 78 109 85 100
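One caveat with matching on "Mega": it is a plain substring match, so it can also catch ordinary names that merely contain those letters. The small sketch below uses a handful of names typed in by hand (an assumption for illustration, not the scraped data) to compare matching "Mega" with matching "Mega " followed by a space, which is how the Mega forms appear in the scraped Name column.

```r
# Hand-typed sample names (illustrative, not the scraped data)
sample.names <- c("Venusaur", "VenusaurMega Venusaur", "Meganium", "Yanmega")

# Plain substring match: flags the Mega form, but also Meganium
loose <- grepl("Mega", sample.names)

# Requiring the trailing space flags only the Mega form
strict <- grepl("Mega ", sample.names, fixed = TRUE)

data.frame(Name = sample.names, loose = loose, strict = strict)
```

If keeping Pokemon like Meganium matters for your analysis, the stricter pattern may be the safer choice.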
Some Pokemon have two types, so we’ll split the Type column into Type1 and Type2. How do we split the Type column into two columns? Note that whenever a Pokemon has two types, the type names are concatenated and both are capitalized. We’ll use a regular expression and the tidyr package’s extract function to split Type into Type1 and Type2.
library(tidyr)
pokedex.table <- pokedex.scraped %>%
extract(Type, into = c("Type1", "Type2"),
"([A-Z][a-z]+)([A-Z][a-z]+)")
pokedex.table %>% head()
## # Name Type1 Type2 Total HP Attack Defense Sp.Atk Sp.Def Speed
## 1 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45
## 2 2 Ivysaur Grass Poison 405 60 62 63 80 80 60
## 3 3 Venusaur Grass Poison 525 80 82 83 100 100 80
## 4 4 Charmander <NA> <NA> 309 39 52 43 60 50 65
## 5 5 Charmeleon <NA> <NA> 405 58 64 58 80 65 80
## 6 6 Charizard Fire Flying 534 78 84 78 109 85 100
Oh no, the Pokemon that have only one type now have NA listed for both Type1 and Type2, because our pattern requires two capitalized words. We’ll fix this by iterating over our pokedex table and filling Type1 back in from our initially scraped data.
for(i in 1:nrow(pokedex.table)) {
if (is.na(pokedex.table$Type1[i])) {
pokedex.table$Type1[i] <- pokedex.scraped$Type[i]
}
}
pokedex.table$Type1 <- factor(pokedex.table$Type1)
pokedex.table$Type2 <- factor(pokedex.table$Type2)
pokedex.table %>% head()
## # Name Type1 Type2 Total HP Attack Defense Sp.Atk Sp.Def Speed
## 1 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45
## 2 2 Ivysaur Grass Poison 405 60 62 63 80 80 60
## 3 3 Venusaur Grass Poison 525 80 82 83 100 100 80
## 4 4 Charmander Fire <NA> 309 39 52 43 60 50 65
## 5 5 Charmeleon Fire <NA> 405 58 64 58 80 65 80
## 6 6 Charizard Fire Flying 534 78 84 78 109 85 100
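The split and the fix-up can also be done in one pass by making the second capitalized word optional in the pattern. This is only a sketch in base R on a hand-typed vector (the values are assumed to mimic the scraped Type column), not a replacement for the tidyr approach above.

```r
# Toy input mimicking the scraped Type column (hand-typed, not scraped)
type <- c("GrassPoison", "Fire", "FireFlying")

# First capitalized word is Type1; an optional second capitalized word is Type2
pattern <- "^([A-Z][a-z]+)([A-Z][a-z]+)?$"
type1 <- sub(pattern, "\\1", type)
type2 <- sub(pattern, "\\2", type)

# Single-type Pokemon leave the second group empty, so mark it as missing
type2[type2 == ""] <- NA

data.frame(Type1 = type1, Type2 = type2)
```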
We’ll also need the pokemon height, weight, and capture rate. Let’s scrape the height and weight data from https://pokemondb.net/pokedex/stats/height-weight
url <- "https://pokemondb.net/pokedex/stats/height-weight"
height.weight.scraped <- url %>%
read_html() %>%
html_node("table.data-table") %>%
html_table() %>%
as.data.frame() %>%
# we can get rid of Type because we already have Type1 and Type2
within(rm(Type)) %>%
rename("Height_m" = "Height (m)") %>%
rename("Weight_kgs"= "Weight (kgs)")
height.weight.scraped %>% head()
## # Name Height (ft) Height_m Weight (lbs) Weight_kgs
## 1 1 Bulbasaur 2′04″ 0.7 15.2 6.9
## 2 2 Ivysaur 3′03″ 1.0 28.7 13.0
## 3 3 Venusaur 6′07″ 2.0 220.5 100.0
## 4 3 VenusaurMega Venusaur 7′10″ 2.4 342.8 155.5
## 5 4 Charmander 2′00″ 0.6 18.7 8.5
## 6 5 Charmeleon 3′07″ 1.1 41.9 19.0
## BMI
## 1 14.1
## 2 13.0
## 3 25.0
## 4 27.0
## 5 23.6
## 6 15.7
Next, we’ll join the height and weight columns to our existing pokedex table.
pokedex.table <- height.weight.scraped %>%
inner_join(pokedex.table, by = "Name") %>%
within(rm("#.y")) %>%
rename("#" = "#.x")
pokedex.table %>% head()
## # Name Height (ft) Height_m Weight (lbs) Weight_kgs BMI Type1
## 1 1 Bulbasaur 2′04″ 0.7 15.2 6.9 14.1 Grass
## 2 2 Ivysaur 3′03″ 1.0 28.7 13.0 13.0 Grass
## 3 3 Venusaur 6′07″ 2.0 220.5 100.0 25.0 Grass
## 4 4 Charmander 2′00″ 0.6 18.7 8.5 23.6 Fire
## 5 5 Charmeleon 3′07″ 1.1 41.9 19.0 15.7 Fire
## 6 6 Charizard 5′07″ 1.7 199.5 90.5 31.3 Fire
## Type2 Total HP Attack Defense Sp.Atk Sp.Def Speed
## 1 Poison 318 45 49 49 65 65 45
## 2 Poison 405 60 62 63 80 80 60
## 3 Poison 525 80 82 83 100 100 80
## 4 <NA> 309 39 52 43 60 50 65
## 5 <NA> 405 58 64 58 80 65 80
## 6 Flying 534 78 84 78 109 85 100
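An inner join silently drops any Pokemon whose Name is spelled differently in the two tables, so it can be worth checking what was lost. The sketch below uses two tiny hand-made tables with a deliberate spelling mismatch (both tables and their values are assumptions for illustration), with base R’s merge() standing in for inner_join().

```r
# Two tiny hand-made tables; the "Mr. Mime" / "Mr Mime" mismatch is deliberate
stats.toy <- data.frame(Name = c("Bulbasaur", "Mr. Mime", "Mew"),
                        Total = c(318, 460, 600),
                        stringsAsFactors = FALSE)
size.toy <- data.frame(Name = c("Bulbasaur", "Mr Mime", "Mew"),
                       Height_m = c(0.7, 1.3, 0.4),
                       stringsAsFactors = FALSE)

# merge() with default arguments keeps only names present in both tables
joined <- merge(size.toy, stats.toy, by = "Name")

# Names in the stats table that found no partner and were dropped
dropped <- setdiff(stats.toy$Name, size.toy$Name)

nrow(joined)
dropped
```

Running the same kind of check on the real tables would show whether any Pokemon disappear at this step.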
We’ll also want to get the capture rate of each pokemon which we can obtain from https://bulbapedia.bulbagarden.net/wiki/List_of_Pokémon_by_catch_rate
url <- "https://bulbapedia.bulbagarden.net/wiki/List_of_Pokemon_by_catch_rate"
capture.rate.scraped <- url %>%
read_html() %>%
html_node("table.sortable") %>%
html_table() %>%
as.data.frame() %>%
rename("Catch_rate" = "Catch rate")
capture.rate.scraped$Catch_rate <- as.numeric(capture.rate.scraped$Catch_rate)
## Warning: NAs introduced by coercion
# we'll keep only the columns we need
capture.rate.scraped <- capture.rate.scraped[c("Name", "Catch_rate")]
capture.rate.scraped %>% head()
## Name Catch_rate
## 1 Bulbasaur 45
## 2 Ivysaur 45
## 3 Venusaur 45
## 4 Charmander 45
## 5 Charmeleon 45
## 6 Charizard 45
Next, we’ll join our capture rate with our pokedex table.
pokedex.table <- capture.rate.scraped %>%
inner_join(pokedex.table, by = "Name")
pokedex.table %>% head()
## Name Catch_rate # Height (ft) Height_m Weight (lbs) Weight_kgs
## 1 Bulbasaur 45 1 2′04″ 0.7 15.2 6.9
## 2 Ivysaur 45 2 3′03″ 1.0 28.7 13.0
## 3 Venusaur 45 3 6′07″ 2.0 220.5 100.0
## 4 Charmander 45 4 2′00″ 0.6 18.7 8.5
## 5 Charmeleon 45 5 3′07″ 1.1 41.9 19.0
## 6 Charizard 45 6 5′07″ 1.7 199.5 90.5
## BMI Type1 Type2 Total HP Attack Defense Sp.Atk Sp.Def Speed
## 1 14.1 Grass Poison 318 45 49 49 65 65 45
## 2 13.0 Grass Poison 405 60 62 63 80 80 60
## 3 25.0 Grass Poison 525 80 82 83 100 100 80
## 4 23.6 Fire <NA> 309 39 52 43 60 50 65
## 5 15.7 Fire <NA> 405 58 64 58 80 65 80
## 6 31.3 Fire Flying 534 78 84 78 109 85 100
We’ll also add the Generation number for each Pokemon. We can use this to explore the differences between newer and older Pokemon generations. We can figure out which Generation each Pokemon belongs to by checking which Generation’s range of Pokedex numbers its own Pokedex number falls in.
Source: https://en.wikipedia.org/wiki/List_of_Pok%C3%A9mon#Detailed_lists_by_generation
ranges = c(1,152,252,387,494,650,722)
for (i in 1:nrow(pokedex.table)) {
for (g in length(ranges):1) {
if (pokedex.table$`#`[i] >= ranges[g]) {
pokedex.table$Gen[i] <- g
break
}
}
}
pokedex.table$Gen <- factor(pokedex.table$Gen)
pokedex.table %>% head()
## Name Catch_rate # Height (ft) Height_m Weight (lbs) Weight_kgs
## 1 Bulbasaur 45 1 2′04″ 0.7 15.2 6.9
## 2 Ivysaur 45 2 3′03″ 1.0 28.7 13.0
## 3 Venusaur 45 3 6′07″ 2.0 220.5 100.0
## 4 Charmander 45 4 2′00″ 0.6 18.7 8.5
## 5 Charmeleon 45 5 3′07″ 1.1 41.9 19.0
## 6 Charizard 45 6 5′07″ 1.7 199.5 90.5
## BMI Type1 Type2 Total HP Attack Defense Sp.Atk Sp.Def Speed Gen
## 1 14.1 Grass Poison 318 45 49 49 65 65 45 1
## 2 13.0 Grass Poison 405 60 62 63 80 80 60 1
## 3 25.0 Grass Poison 525 80 82 83 100 100 80 1
## 4 23.6 Fire <NA> 309 39 52 43 60 50 65 1
## 5 15.7 Fire <NA> 405 58 64 58 80 65 80 1
## 6 31.3 Fire Flying 534 78 84 78 109 85 100 1
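The nested loop above can also be written as a single vectorized lookup. The sketch below reuses the same boundaries and applies base R’s findInterval() to a few hand-picked Pokedex numbers (the numbers are chosen only to illustrate the edges of each range).

```r
# Same generation boundaries as above (first Pokedex number of each generation)
ranges <- c(1, 152, 252, 387, 494, 650, 722)

# A few hand-picked Pokedex numbers to illustrate the lookup
dex.numbers <- c(1, 151, 152, 386, 722, 800)

# findInterval() returns, for each number, the index of the last boundary
# that is less than or equal to it, which is the generation
gen <- findInterval(dex.numbers, ranges)

data.frame(Number = dex.numbers, Gen = gen)
```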
Great! We have everything we need for the rest of this tutorial.
Are Pokemon across all generations similar to one another? Do newer generations have better stats than previous generations? It’s intuitive to think that newer Pokemon would be better than older Pokemon but we’ll test our intuition using an ANOVA/F hypothesis test.
We can treat the Pokedex as a sample of the population because the in-game Pokemon have some variation in their stats (due to items they hold, what level they are at, where they are caught, how many battles they have won, etc.) and most Pokemon are not unique in the Pokemon games (i.e. you can catch multiple Pokemon of the same Pokedex name and number).
Do newer generations have more HP (Health Points) than previous generations?
First, let’s take a look at the average HP for each generation:
pokedex.table %>%
group_by(Gen) %>%
summarise(average_hp = mean(HP))
## # A tibble: 7 x 2
## Gen average_hp
## <fct> <dbl>
## 1 1 64.2
## 2 2 70.5
## 3 3 65.8
## 4 4 72.2
## 5 5 69.5
## 6 6 68.6
## 7 7 71.4
We can see some variation in HP across generations; Generation IV Pokemon have, on average, around 8 more HP than Generation I Pokemon. But is this difference statistically significant? Let’s use the ANOVA F-test at the 5% level of significance.
\(H_0\): there is no difference in true average HP across generations
\(H_a\): at least two generations have different true average HP
res.aov <- aov(HP ~ Gen, data = pokedex.table)
summary(res.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## Gen 6 6452 1075.3 1.584 0.149
## Residuals 779 528799 678.8
The p-value tells us that, if the null hypothesis were true, there would be a 14.9% chance of seeing an F statistic at least as extreme as the one above, which is greater than our 5% level of significance. Therefore, we do not reject the null hypothesis (\(H_0\)). We do not have sufficient evidence that average HP varies across generations at the 5% level of significance.
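To see where the F value and p-value come from, we can redo the arithmetic by hand. The sketch below copies the sums of squares and degrees of freedom from the ANOVA table for HP; because those inputs are rounded, the results should land close to, but not exactly on, the 1.584 and 0.149 reported by summary().

```r
# Values copied from the ANOVA table for HP
ss.gen <- 6452      # sum of squares between generations
df.gen <- 6
ss.res <- 528799    # residual (within-generation) sum of squares
df.res <- 779

ms.gen <- ss.gen / df.gen   # mean square between generations
ms.res <- ss.res / df.res   # mean square within generations

# The F statistic is the ratio of the two mean squares
f.value <- ms.gen / ms.res

# The p-value is the upper tail of the F distribution with (6, 779) df
p.value <- pf(f.value, df.gen, df.res, lower.tail = FALSE)

c(F = f.value, p = p.value)
```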
Do newer generations have more Attack than previous generations?
First, let’s take a look at the average attack for each generation:
library(dplyr)
pokedex.table %>%
group_by(Gen) %>%
summarise(average_attack = mean(Attack))
## # A tibble: 7 x 2
## Gen average_attack
## <fct> <dbl>
## 1 1 72.9
## 2 2 67.5
## 3 3 72.5
## 4 4 80.0
## 5 5 79.9
## 6 6 72.1
## 7 7 85.7
We can already see some variation in average Attack across generations: Generation VII’s average is more than 10 points higher than Generation I’s. At the same time, Generations I, III, and VI have almost exactly the same average Attack. The question remains whether this variation is statistically significant. Let’s use the ANOVA F-test at the 5% level of significance.
\(H_0\): there is no difference in true average Attack across generations
\(H_a\): at least two generations have different true average Attack
res.aov <- aov(Attack ~ Gen, data = pokedex.table)
summary(res.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## Gen 6 22901 3817 4.525 0.000161 ***
## Residuals 779 657084 843
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We obtained an F-value of 4.525, which corresponds to an extremely small p-value (0.000161). Since this is below our 5% level of significance, we have sufficient evidence to reject the null hypothesis (\(H_0\)). We conclude that at least two generations differ in their true average Attack.
Do newer generations have more Defense than previous generations?
First, let’s take a look at the average Defense for each generation:
pokedex.table %>%
group_by(Gen) %>%
summarise(average_Defense = mean(Defense))
## # A tibble: 7 x 2
## Gen average_Defense
## <fct> <dbl>
## 1 1 68.2
## 2 2 69.2
## 3 3 69.1
## 4 4 74.4
## 5 5 71.1
## 6 6 73.0
## 7 7 79.4
We can already see some variation in average Defense across generations, although the differences look fairly modest. The largest gap is again between Generation VII and Generation I, at about 11 points. The question remains whether this variation is statistically significant. Let’s use the ANOVA F-test at the 5% level of significance.
\(H_0\): there is no difference in true average Defense across generations
\(H_a\): at least two generations have different true average Defense
res.aov <- aov(Defense ~ Gen, data = pokedex.table)
summary(res.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## Gen 6 9254 1542.3 1.748 0.107
## Residuals 779 687530 882.6
We obtained an F-value of 1.748, which corresponds to a p-value of 0.107. Since our p-value is greater than our 0.05 level of significance, we have insufficient evidence to reject the null hypothesis. We cannot conclude that average Defense differs across generations.
Now, we’re gonna answer the following question: how do the height and weight (in other words, BMI) of a Pokemon correlate with its base stats?
To do so, we’re gonna fit a linear regression model of the Total on BMI. The Total is defined as the sum of all the base stats: HP, Attack, Defense, Sp. Atk, Sp. Def, and Speed. BMI, or Body Mass Index, is calculated as weight in kilograms divided by the square of height in meters (kg/m^2). Let’s start by plotting BMI vs. the Total base stats using ggplot().
library(tidyverse)
library(broom)
pokedex <- pokedex.table
pokedex %>%
ggplot(aes(x=as.numeric(BMI), y=Total)) +
geom_point() +
geom_smooth(method=lm) +
labs(title="BMI vs. Total Base Stats",
x = "BMI",
y = "Total Base Stats")
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
As you can see, most of the data points are clustered on the left side, and there appears to be no linear trend. A Pokemon’s BMI seems to have little to no correlation with its Total base stats. Still, let’s see what we get for the linear regression equation; we’ll worry about how much to trust it afterwards.
auto_fit <- lm(Total~as.numeric(BMI), data=pokedex.table)
## Warning in eval(predvars, data, env): NAs introduced by coercion
auto_fit
##
## Call:
## lm(formula = Total ~ as.numeric(BMI), data = pokedex.table)
##
## Coefficients:
## (Intercept) as.numeric(BMI)
## 425.6167 -0.1614
As you can see, the linear regression equation is Total = 425.6167 - 0.1614(x), where x is the BMI. To see how much confidence we can put in the slope, let’s set up a 95% confidence interval.
auto_fit_stats <- auto_fit %>%
tidy() %>%
select(term, estimate, std.error)
auto_fit_stats
## # A tibble: 2 x 3
## term estimate std.error
## <chr> <dbl> <dbl>
## 1 (Intercept) 426. 5.38
## 2 as.numeric(BMI) -0.161 0.0912
# 1.96 is the standard normal critical value for a 95% interval
confidence_interval_offset <- 1.96 * auto_fit_stats$std.error[2]
confidence_interval <- round(c(auto_fit_stats$estimate[2] - confidence_interval_offset,
auto_fit_stats$estimate[2],
auto_fit_stats$estimate[2] + confidence_interval_offset), 4)
Given our confidence interval, we are 95% confident that, on average, each additional kg/m^2 of BMI changes the Total base stats by somewhere between about -0.34 and 0.02 points. Since this interval contains 0, we can’t conclude that BMI has a linear relationship with Total.
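R can also produce this kind of interval directly with confint(), which uses a t critical value instead of a fixed normal one. The sketch below uses the built-in cars data set as a stand-in (an assumption for illustration; it is not Pokemon data) to show that confint() matches the estimate plus or minus the critical value times the standard error.

```r
# cars ships with base R; it stands in for the Pokedex here
fit <- lm(dist ~ speed, data = cars)

# Slope estimate and its standard error from the coefficient table
est <- coef(summary(fit))["speed", "Estimate"]
se  <- coef(summary(fit))["speed", "Std. Error"]

# t critical value for a 95% interval with the model's residual df
t.crit <- qt(0.975, df = df.residual(fit))
manual <- c(est - t.crit * se, est + t.crit * se)

# Built-in interval for the same coefficient
builtin <- confint(fit, "speed", level = 0.95)

manual
builtin
```

With several hundred Pokemon, the t critical value is very close to the normal one, so the two approaches should give nearly the same interval on the Pokedex data.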
Our goal now is to investigate how the height and weight (BMI) of a Pokemon correlate with its capture rate.
We will assess this correlation by fitting a linear regression model of capture rate on BMI. As before, BMI (body mass index) is calculated as kg/m^2. We will start by plotting BMI vs. capture rate using the ggplot() function.
Setup:
library(tidyr)
library(dplyr)
library(rvest)
library(tidyverse)
library(broom)
knitr::opts_chunk$set(echo = TRUE)
pokedex <- pokedex.table
pokedex %>% ggplot(aes(x=as.numeric(BMI), y=Catch_rate)) + geom_point() + geom_smooth(method=lm) +
labs(title="BMI vs. Capture Rate", x="BMI", y = "Capture Rate")
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning: Removed 20 rows containing non-finite values (stat_smooth).
## Warning: Removed 20 rows containing missing values (geom_point).
As the plot above shows, there does not seem to be a strong relationship between the two variables (BMI and catch rate). It is hard to distinguish any linear trend in the data, hence the almost horizontal line of best fit.
Let us still see what we get for the equation after generating a linear regression model.
auto_fit1 <- lm(Catch_rate ~ as.numeric(BMI), data=pokedex.table)
## Warning in eval(predvars, data, env): NAs introduced by coercion
auto_fit1
##
## Call:
## lm(formula = Catch_rate ~ as.numeric(BMI), data = pokedex.table)
##
## Coefficients:
## (Intercept) as.numeric(BMI)
## 98.87623 0.04687
The linear regression equation generated is y = 98.88 + 0.047(x), where, once again, x represents the BMI of the Pokemon (our explanatory variable) and y is its catch rate (our response variable).
Next we’ll pull out the estimates and standard errors we need for the 95% confidence interval.
auto_fit_stats2 <- auto_fit1 %>% tidy() %>% select(term, estimate, std.error)
auto_fit_stats2
## # A tibble: 2 x 3
## term estimate std.error
## <chr> <dbl> <dbl>
## 1 (Intercept) 98.9 3.77
## 2 as.numeric(BMI) 0.0469 0.0634
Now we’ll compute the 95% confidence interval for the BMI coefficient.
# 1.96 is the standard normal critical value for a 95% interval
confidence_interval_offset <- 1.96 * auto_fit_stats2$std.error[2]
confidence_interval <- round(c(auto_fit_stats2$estimate[2] - confidence_interval_offset, auto_fit_stats2$estimate[2],
auto_fit_stats2$estimate[2] +
confidence_interval_offset), 2)
confidence_interval
## [1] -0.08 0.05 0.17
Given the confidence interval, we are 95% confident that, on average, each additional kg/m^2 of BMI changes the catch rate by somewhere between about -0.08 and 0.17. Since this interval contains 0, we can’t conclude that BMI has a linear relationship with catch rate.
Here we will pick the optimal team: the best set of Pokemon to use in battle. Each Pokemon team is composed of six Pokemon. To get the best team, we will pick the Pokemon with the highest Total stat score in different categories.
Our first three Pokemon will come from the three standard types: Fire, Water, and Grass. The next three are not tied to a particular type and will simply be the remaining Pokemon with the highest Total. We add the requirement that every Pokemon on the team must be a different type. Although Pokemon can have dual types, for the purposes of this analysis we focus on Type1. Additionally, we are choosing to omit legendary Pokemon. Before we start filtering, we must prepare the dataframe: in this section of R code, we mark each Pokemon with a new boolean column called isLegendary, using a previously built list of legendary Pokemon.
library(tidyverse)
library(broom)
knitr::opts_chunk$set(echo = TRUE)
pokedex <- pokedex.table
legendaries = c("Zapdos", "Articuno", "Moltres", "Mewtwo",
"Mew", "Raikou", "Entei", "Suicune",
"Lugia", "Celebi",
"Regirock", "Regice", "Registeel",
"Latias", "Latios", "Kyogre", "Groudon", "Rayquaza",
"Jirachi", "Uxie", "Mesprit", "Azelf", "Dialga",
"Palkia", "Heatran", "Regigigas", "Giratina",
"Cresselia", "Phione", "Manaphy", "Darkrai",
"Arceus", "Victini", "Reshiram", "Zekrom",
"Kyurem", "Genesect",
"Cobalion" , "Terrakion", "Virizion", "Volcanion", "Solgaleo", "Tapu Fini", "Tapu Bulu", "Xerneas", "Yveltal")
df <- pokedex
df$isLegendary <- FALSE
for (i in 1:length(legendaries)) {
df$isLegendary[df$Name == legendaries[i]] <- TRUE
}
df$isLegendary <- factor(df$isLegendary)
df %>% head()
## Name Catch_rate # Height (ft) Height_m Weight (lbs) Weight_kgs
## 1 Bulbasaur 45 1 2′04″ 0.7 15.2 6.9
## 2 Ivysaur 45 2 3′03″ 1.0 28.7 13.0
## 3 Venusaur 45 3 6′07″ 2.0 220.5 100.0
## 4 Charmander 45 4 2′00″ 0.6 18.7 8.5
## 5 Charmeleon 45 5 3′07″ 1.1 41.9 19.0
## 6 Charizard 45 6 5′07″ 1.7 199.5 90.5
## BMI Type1 Type2 Total HP Attack Defense Sp.Atk Sp.Def Speed Gen
## 1 14.1 Grass Poison 318 45 49 49 65 65 45 1
## 2 13.0 Grass Poison 405 60 62 63 80 80 60 1
## 3 25.0 Grass Poison 525 80 82 83 100 100 80 1
## 4 23.6 Fire <NA> 309 39 52 43 60 50 65 1
## 5 15.7 Fire <NA> 405 58 64 58 80 65 80 1
## 6 31.3 Fire Flying 534 78 84 78 109 85 100 1
## isLegendary
## 1 FALSE
## 2 FALSE
## 3 FALSE
## 4 FALSE
## 5 FALSE
## 6 FALSE
In this section, we filter the dataframe. The first step is to remove all the legendary Pokemon by keeping only the rows where isLegendary is FALSE.
We then create a new dataframe for each pick by filtering out certain Pokemon. For the first three types, these dataframes are named after their type, such as fire_dex. For the last three picks, the dataframe is named typeless_pokedex, as it is what remains of the Pokedex after removing the Fire, Grass, and Water types as well as the Type1 of each Pokemon already added to our team.
We arrange each of these dataframes by Total in descending order and then slice off the first entry (i.e. the Pokemon with the highest Total of that type). We then combine these rows to form a new dataframe.
We display the final team below. It is stored in the variable called total.
unlegendary_pokedex <- df %>% filter(isLegendary == FALSE)
fire_dex <- unlegendary_pokedex %>% select(Name, Type1, Total) %>% filter(Type1 == "Fire") %>% arrange(desc(Total))
fire_pokemon <- fire_dex %>% slice(1)
water_dex <- unlegendary_pokedex %>% select(Name, Type1, Total) %>% filter(Type1 == "Water") %>% arrange(desc(Total))
water_pokemon <- water_dex %>% slice(1)
grass_dex <- unlegendary_pokedex %>% select(Name, Type1, Total) %>% filter(Type1 == "Grass") %>% arrange(desc(Total))
grass_pokemon <- grass_dex %>% slice(1)
total <- rbind(fire_pokemon, water_pokemon)
total <- rbind(total, grass_pokemon)
typeless_pokedex <- unlegendary_pokedex %>% select(Name, Type1, Total) %>% filter(Type1 != "Water")%>% filter(Type1 != "Fire")%>% filter(Type1 != "Grass") %>% arrange(desc(Total))
total <- rbind(total, slice(typeless_pokedex, 1))
temp <- slice(typeless_pokedex, 1)
previous_type1 <- temp$Type1[1]
typeless_pokedex <- unlegendary_pokedex %>% select(Name, Type1, Total) %>% filter(Type1 != "Water")%>% filter(Type1 != "Fire")%>% filter(Type1 != "Grass") %>% filter(Type1 != previous_type1) %>% arrange(desc(Total))
total <- rbind(total, slice(typeless_pokedex, 1))
temp <- slice(typeless_pokedex, 1)
previous_type2 <- temp$Type1[1]
typeless_pokedex <- unlegendary_pokedex %>% select(Name, Type1, Total) %>% filter(Type1 != "Water")%>% filter(Type1 != "Fire")%>% filter(Type1 != "Grass") %>% filter(Type1 != previous_type1) %>% filter(Type1 != previous_type2) %>% arrange(desc(Total))
total <- rbind(total, slice(typeless_pokedex, 1))
temp <- slice(typeless_pokedex, 1)
previous_type <- temp$Type1[1]
total
## Name Type1 Total
## 1 Blacephalon Fire 570
## 2 Gyarados Water 540
## 3 Kartana Grass 570
## 4 Lunala Psychic 680
## 5 Slaking Normal 670
## 6 Dragonite Dragon 600
In the Pokemon games, our strongest opponents often have these Pokemon. Dragonite is particularly popular amongst trainers for its stats.
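The same selection rule (the strongest Fire, Water, and Grass Pokemon, then the strongest Pokemon of other, mutually distinct types) can be sketched more compactly. The helper pick_team below is hypothetical, written in base R, and run on a made-up toy Pokedex rather than the real table.

```r
# Toy Pokedex (made-up rows, for illustration only)
dex <- data.frame(
  Name  = c("A", "B", "C", "D", "E", "F", "G"),
  Type1 = c("Fire", "Fire", "Water", "Grass", "Dragon", "Dragon", "Normal"),
  Total = c(500, 530, 540, 520, 600, 590, 480),
  stringsAsFactors = FALSE
)

# Hypothetical helper: assumes every required type occurs in dex
pick_team <- function(dex, required = c("Fire", "Water", "Grass"), size = 6) {
  # Sort by Total, strongest first
  ranked <- dex[order(-dex$Total), ]
  # Keep only the strongest Pokemon of each Type1
  best.by.type <- ranked[!duplicated(ranked$Type1), ]
  is.required <- best.by.type$Type1 %in% required
  # Required types come first, in the order they were requested
  starters <- best.by.type[is.required, ]
  starters <- starters[match(required, starters$Type1), ]
  # Remaining slots go to the strongest Pokemon of the other types
  others <- best.by.type[!is.required, ]
  head(rbind(starters, others), size)
}

team <- pick_team(dex, size = 5)
team
```

Because each type appears at most once in best.by.type, the requirement that every team member has a different Type1 is satisfied automatically.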
Can we use a Random Forest to predict a Pokemon’s Type1? We’ll use various attributes of each Pokemon to see if we can predict their Type1.
First we split our data into training and testing sets. Our training set will be Generations 2, 4, and 6. Our test set will be Generations 1, 3, and 5.
df <- pokedex.table
trainingData <- df[df$Gen == 2 | df$Gen == 4 | df$Gen == 6, ]
trainingData %>% head()
## Name Catch_rate # Height (ft) Height_m Weight (lbs) Weight_kgs
## 152 Chikorita 45 152 2′11″ 0.9 14.1 6.4
## 153 Bayleef 45 153 3′11″ 1.2 34.8 15.8
## 154 Cyndaquil 45 155 1′08″ 0.5 17.4 7.9
## 155 Quilava 45 156 2′11″ 0.9 41.9 19.0
## 156 Typhlosion 45 157 5′07″ 1.7 175.3 79.5
## 157 Totodile 45 158 2′00″ 0.6 20.9 9.5
## BMI Type1 Type2 Total HP Attack Defense Sp.Atk Sp.Def Speed Gen
## 152 7.9 Grass <NA> 318 45 49 65 49 65 45 2
## 153 11.0 Grass <NA> 405 60 62 80 63 80 60 2
## 154 31.6 Fire <NA> 309 39 52 43 60 50 65 2
## 155 23.5 Fire <NA> 405 58 64 58 80 65 80 2
## 156 27.5 Fire <NA> 534 78 84 78 109 85 100 2
## 157 26.4 Water <NA> 314 50 65 64 44 48 43 2
testData <- df[df$Gen == 1 | df$Gen == 3 | df$Gen == 5, ]
testData %>% head()
## Name Catch_rate # Height (ft) Height_m Weight (lbs) Weight_kgs
## 1 Bulbasaur 45 1 2′04″ 0.7 15.2 6.9
## 2 Ivysaur 45 2 3′03″ 1.0 28.7 13.0
## 3 Venusaur 45 3 6′07″ 2.0 220.5 100.0
## 4 Charmander 45 4 2′00″ 0.6 18.7 8.5
## 5 Charmeleon 45 5 3′07″ 1.1 41.9 19.0
## 6 Charizard 45 6 5′07″ 1.7 199.5 90.5
## BMI Type1 Type2 Total HP Attack Defense Sp.Atk Sp.Def Speed Gen
## 1 14.1 Grass Poison 318 45 49 49 65 65 45 1
## 2 13.0 Grass Poison 405 60 62 63 80 80 60 1
## 3 25.0 Grass Poison 525 80 82 83 100 100 80 1
## 4 23.6 Fire <NA> 309 39 52 43 60 50 65 1
## 5 15.7 Fire <NA> 405 58 64 58 80 65 80 1
## 6 31.3 Fire Flying 534 78 84 78 109 85 100 1
Our random forest will consider all variables except BMI (since we’re already considering Height and Weight), Type2 (we’re only trying to predict Type1, and some Pokemon don’t have a Type2), and Total (since we’re using each individual base stat).
library(randomForest)
set.seed(1234)
rf <- randomForest(Type1 ~ Catch_rate + Height_m + Weight_kgs + HP + Attack + Defense + Sp.Atk+ Sp.Def + Speed + Gen, importance = TRUE, mtry=1, data = trainingData, na.action=na.exclude)
After we train the model on our training data, we check whether we can predict the Type1 in our test set.
actuals <- testData[, "Type1"]
predictions <- predict(rf, testData)
pre.act.df <- as.data.frame(table(predictions, actuals))
pre.act.df %>% filter(Freq > 0) %>% head()
## predictions actuals Freq
## 1 Bug Bug 7
## 2 Fairy Bug 4
## 3 Fire Bug 10
## 4 Flying Bug 1
## 5 Grass Bug 3
## 6 Normal Bug 8
correct.predictions <- nrow(pre.act.df[pre.act.df$predictions == pre.act.df$actuals,])
accuracy <- correct.predictions / nrow(pre.act.df[pre.act.df$Freq > 0,])
accuracy
## [1] 0.1666667
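The ratio above comes out to about 17%. Note that it compares the number of diagonal cells in the prediction table with the number of non-empty cells, so it is only a rough proxy for the share of Pokemon whose Type1 was guessed correctly. Either way, the table shows predictions scattered across many wrong types, so it seems like we can’t consistently predict a Pokemon’s Type1 using a random forest on its attributes.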
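For reference, the share of correctly classified Pokemon can be computed from a prediction table as the sum of the diagonal counts divided by the total count. The sketch below uses a tiny made-up set of labels and predictions (not output from the random forest above) to show the calculation, along with the per-type hit rate.

```r
# Made-up labels and predictions, for illustration only
actuals     <- factor(c("Bug", "Bug", "Fire", "Fire", "Water", "Water"))
predictions <- factor(c("Bug", "Fire", "Fire", "Fire", "Bug", "Water"),
                      levels = levels(actuals))

# Rows are predictions, columns are actual types
confusion <- table(predictions, actuals)

# Accuracy: correctly classified observations over all observations
accuracy <- sum(diag(confusion)) / sum(confusion)

# Per-type hit rate: correct predictions over the actual count of each type
per.type <- diag(confusion) / colSums(confusion)

accuracy
per.type
```

The same accuracy also falls out of mean(predictions == actuals) when both are factors with identical levels.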
We’ll use a random forest model to see if we can predict each Pokemon’s Type2 from its stats. To do so, we’ll filter out all of the Pokemon that don’t have a Type2, so that the model doesn’t get confused by the NA values.
pokedex.table %>% head()
## Name Catch_rate # Height (ft) Height_m Weight (lbs) Weight_kgs
## 1 Bulbasaur 45 1 2′04″ 0.7 15.2 6.9
## 2 Ivysaur 45 2 3′03″ 1.0 28.7 13.0
## 3 Venusaur 45 3 6′07″ 2.0 220.5 100.0
## 4 Charmander 45 4 2′00″ 0.6 18.7 8.5
## 5 Charmeleon 45 5 3′07″ 1.1 41.9 19.0
## 6 Charizard 45 6 5′07″ 1.7 199.5 90.5
## BMI Type1 Type2 Total HP Attack Defense Sp.Atk Sp.Def Speed Gen
## 1 14.1 Grass Poison 318 45 49 49 65 65 45 1
## 2 13.0 Grass Poison 405 60 62 63 80 80 60 1
## 3 25.0 Grass Poison 525 80 82 83 100 100 80 1
## 4 23.6 Fire <NA> 309 39 52 43 60 50 65 1
## 5 15.7 Fire <NA> 405 58 64 58 80 65 80 1
## 6 31.3 Fire Flying 534 78 84 78 109 85 100 1
df <- pokedex.table
df <- df %>% filter(!is.na(Type2))
df %>% head()
## Name Catch_rate # Height (ft) Height_m Weight (lbs) Weight_kgs
## 1 Bulbasaur 45 1 2′04″ 0.7 15.2 6.9
## 2 Ivysaur 45 2 3′03″ 1.0 28.7 13.0
## 3 Venusaur 45 3 6′07″ 2.0 220.5 100.0
## 4 Charizard 45 6 5′07″ 1.7 199.5 90.5
## 5 Butterfree 45 12 3′07″ 1.1 70.5 32.0
## 6 Weedle 255 13 1′00″ 0.3 7.1 3.2
## BMI Type1 Type2 Total HP Attack Defense Sp.Atk Sp.Def Speed Gen
## 1 14.1 Grass Poison 318 45 49 49 65 65 45 1
## 2 13.0 Grass Poison 405 60 62 63 80 80 60 1
## 3 25.0 Grass Poison 525 80 82 83 100 100 80 1
## 4 31.3 Fire Flying 534 78 84 78 109 85 100 1
## 5 26.4 Bug Flying 395 60 45 50 90 80 70 1
## 6 35.6 Bug Poison 195 40 35 30 20 20 50 1
Now that we’ve got that settled, we’ll use Generations 2, 4, and 6 as our training data set. The main reason for picking these generations is that they ensure all types of Pokemon are covered: for example, Generation 2 introduced Dark and Steel type Pokemon, while Generation 6 introduced Fairy type Pokemon. Our test data will be Generations 1, 3, and 5.
library(randomForest)
trainingData <- df[df$Gen == 2 | df$Gen == 4 | df$Gen == 6, ]
trainingData %>% head()
## Name Catch_rate # Height (ft) Height_m Weight (lbs) Weight_kgs
## 68 Hoothoot 255 163 2′04″ 0.7 46.7 21.2
## 69 Noctowl 90 164 5′03″ 1.6 89.9 40.8
## 70 Ledyba 255 165 3′03″ 1.0 23.8 10.8
## 71 Ledian 90 166 4′07″ 1.4 78.5 35.6
## 72 Spinarak 255 167 1′08″ 0.5 18.7 8.5
## 73 Ariados 90 168 3′07″ 1.1 73.9 33.5
## BMI Type1 Type2 Total HP Attack Defense Sp.Atk Sp.Def Speed Gen
## 68 43.3 Normal Flying 262 60 30 30 36 56 50 2
## 69 15.9 Normal Flying 452 100 50 50 86 96 70 2
## 70 10.8 Bug Flying 265 40 20 30 40 80 55 2
## 71 18.2 Bug Flying 390 55 35 50 55 110 85 2
## 72 34.0 Bug Poison 250 40 60 40 40 40 30 2
## 73 27.7 Bug Poison 400 70 90 70 60 70 40 2
testData <- df[df$Gen == 1 | df$Gen == 3 | df$Gen == 5, ]
testData %>% head()
## Name Catch_rate # Height (ft) Height_m Weight (lbs) Weight_kgs
## 1 Bulbasaur 45 1 2′04″ 0.7 15.2 6.9
## 2 Ivysaur 45 2 3′03″ 1.0 28.7 13.0
## 3 Venusaur 45 3 6′07″ 2.0 220.5 100.0
## 4 Charizard 45 6 5′07″ 1.7 199.5 90.5
## 5 Butterfree 45 12 3′07″ 1.1 70.5 32.0
## 6 Weedle 255 13 1′00″ 0.3 7.1 3.2
## BMI Type1 Type2 Total HP Attack Defense Sp.Atk Sp.Def Speed Gen
## 1 14.1 Grass Poison 318 45 49 49 65 65 45 1
## 2 13.0 Grass Poison 405 60 62 63 80 80 60 1
## 3 25.0 Grass Poison 525 80 82 83 100 100 80 1
## 4 31.3 Fire Flying 534 78 84 78 109 85 100 1
## 5 26.4 Bug Flying 395 60 45 50 90 80 70 1
## 6 35.6 Bug Poison 195 40 35 30 20 20 50 1
set.seed(1234)
Here, we’ll tell the model to look at Catch Rate, Height, Weight, HP, Attack, Defense, Sp. Atk, Sp. Def, Speed, Generation, and Type1 to try to classify Type2.
rf <- randomForest(Type2 ~ Catch_rate + Height_m + Weight_kgs + HP + Attack + Defense + Sp.Atk+ Sp.Def + Speed + Gen + Type1, importance = TRUE, mtry=1, data = trainingData, na.action=na.exclude)
actuals <- testData[, "Type2"]
predictions <- predict(rf, testData)
pre.act.df <- as.data.frame(table(predictions, actuals))
pre.act.df %>% filter(Freq > 0) %>% head()
## predictions actuals Freq
## 1 Dragon Bug 1
## 2 Flying Bug 1
## 3 Fighting Dark 1
## 4 Flying Dark 7
## 5 Steel Dark 1
## 6 Flying Dragon 4
correct.predictions <- nrow(pre.act.df[pre.act.df$predictions == pre.act.df$actuals,])
accuracy <- correct.predictions / nrow(pre.act.df[pre.act.df$Freq > 0,])
accuracy
## [1] 0.28125
The ratio above comes out to about 28%. Note that it compares the number of diagonal cells in the prediction table with the number of non-empty cells rather than counting Pokemon, so treat it as a rough proxy. It seems like we can’t consistently predict a Pokemon’s Type2 based on its stats, although the ratio is higher than it was for Type1. Oddly enough, the model got 35-38 out of 43 Flying types right, so some of the data may be correlated.
Can we use a Random Forest to classify legendaries?
First, we’ll label our data set with legendaries using the list here: https://bulbapedia.bulbagarden.net/wiki/Legendary_Pokémon
legendaries = c("Zapdos", "Articuno", "Moltres", "Mewtwo",
"Mew", "Raikou", "Entei", "Suicune",
"Lugia", "Celebi",
"Regirock", "Regice", "Registeel",
"Latias", "Latios", "Kyogre", "Groudon", "Rayquaza",
"Jirachi", "Uxie", "Mesprit", "Azelf", "Dialga",
"Palkia", "Heatran", "Regigigas", "Giratina",
"Cresselia", "Phione", "Manaphy", "Darkrai",
"Arceus", "Victini", "Reshiram", "Zekrom",
"Kyurem", "Genesect",
"Cobalion" , "Terrakion", "Virizion")
df <- pokedex.table
df$isLegendary <- FALSE
for (i in 1:length(legendaries)) {
df$isLegendary[df$Name == legendaries[i]] <- TRUE
}
df$isLegendary <- factor(df$isLegendary)
Next, we’ll split our data set into training and testing data. We’ll train the random forest on Generations 2, 4, and 6 then test it out on 1, 3, and 5.
library(randomForest)
trainingData <- df[df$Gen == 2 | df$Gen == 4 | df$Gen == 6, ]
trainingData %>% head()
## Name Catch_rate # Height (ft) Height_m Weight (lbs) Weight_kgs
## 152 Chikorita 45 152 2′11″ 0.9 14.1 6.4
## 153 Bayleef 45 153 3′11″ 1.2 34.8 15.8
## 154 Cyndaquil 45 155 1′08″ 0.5 17.4 7.9
## 155 Quilava 45 156 2′11″ 0.9 41.9 19.0
## 156 Typhlosion 45 157 5′07″ 1.7 175.3 79.5
## 157 Totodile 45 158 2′00″ 0.6 20.9 9.5
## BMI Type1 Type2 Total HP Attack Defense Sp.Atk Sp.Def Speed Gen
## 152 7.9 Grass <NA> 318 45 49 65 49 65 45 2
## 153 11.0 Grass <NA> 405 60 62 80 63 80 60 2
## 154 31.6 Fire <NA> 309 39 52 43 60 50 65 2
## 155 23.5 Fire <NA> 405 58 64 58 80 65 80 2
## 156 27.5 Fire <NA> 534 78 84 78 109 85 100 2
## 157 26.4 Water <NA> 314 50 65 64 44 48 43 2
## isLegendary
## 152 FALSE
## 153 FALSE
## 154 FALSE
## 155 FALSE
## 156 FALSE
## 157 FALSE
testData <- df[df$Gen == 1 | df$Gen == 3 | df$Gen == 5, ]
testData %>% head()
## Name Catch_rate # Height (ft) Height_m Weight (lbs) Weight_kgs
## 1 Bulbasaur 45 1 2′04″ 0.7 15.2 6.9
## 2 Ivysaur 45 2 3′03″ 1.0 28.7 13.0
## 3 Venusaur 45 3 6′07″ 2.0 220.5 100.0
## 4 Charmander 45 4 2′00″ 0.6 18.7 8.5
## 5 Charmeleon 45 5 3′07″ 1.1 41.9 19.0
## 6 Charizard 45 6 5′07″ 1.7 199.5 90.5
## BMI Type1 Type2 Total HP Attack Defense Sp.Atk Sp.Def Speed Gen
## 1 14.1 Grass Poison 318 45 49 49 65 65 45 1
## 2 13.0 Grass Poison 405 60 62 63 80 80 60 1
## 3 25.0 Grass Poison 525 80 82 83 100 100 80 1
## 4 23.6 Fire <NA> 309 39 52 43 60 50 65 1
## 5 15.7 Fire <NA> 405 58 64 58 80 65 80 1
## 6 31.3 Fire Flying 534 78 84 78 109 85 100 1
## isLegendary
## 1 FALSE
## 2 FALSE
## 3 FALSE
## 4 FALSE
## 5 FALSE
## 6 FALSE
Our random forest will use the Total and Catch_rate columns to predict whether a Pokemon is a legendary Pokemon. We use Total and Catch_rate because we know legendaries are usually very powerful Pokemon that are difficult to catch.
set.seed(1234)
rf <- randomForest(isLegendary ~ Total + Catch_rate, importance = TRUE, mtry=2, data = trainingData, na.action=na.exclude)
rf
##
## Call:
## randomForest(formula = isLegendary ~ Total + Catch_rate, data = trainingData, importance = TRUE, mtry = 2, na.action = na.exclude)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 2.26%
## Confusion matrix:
## FALSE TRUE class.error
## FALSE 247 4 0.01593625
## TRUE 2 13 0.13333333
Now we’ll check out the performance of our random forest model
test.labels <- testData[, "isLegendary"]
predictions <- predict(rf, testData)
table(predictions, test.labels)
## test.labels
## predictions FALSE TRUE
## FALSE 404 1
## TRUE 3 16
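The table above can be boiled down to a few standard ratios. The sketch below simply copies the four counts from the table (rows are predictions, columns are the true labels) and computes recall, precision, and overall accuracy for the legendary class.

```r
# Counts copied from the table above
true.neg  <- 404   # predicted FALSE, actually FALSE
false.neg <- 1     # predicted FALSE, actually TRUE
false.pos <- 3     # predicted TRUE,  actually FALSE
true.pos  <- 16    # predicted TRUE,  actually TRUE

# Share of actual legendaries that the model found
recall <- true.pos / (true.pos + false.neg)

# Share of predicted legendaries that really are legendary
precision <- true.pos / (true.pos + false.pos)

# Share of all test Pokemon classified correctly
accuracy <- (true.pos + true.neg) /
  (true.pos + true.neg + false.pos + false.neg)

round(c(recall = recall, precision = precision, accuracy = accuracy), 3)
```

Because legendaries are rare, accuracy alone would look high even for a model that never predicts TRUE, which is why recall and precision are worth reporting alongside it.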
Our random forest was able to predict 16/17 legendaries in the test set. That’s pretty good! Let’s see which legendary Pokemon we missed, our only false negative, and which Pokemon were our 3 false positives.
testData$Prediction <- predictions
testData.subset <- testData %>%
select(Name, Total, Catch_rate, isLegendary, Prediction)
# filter for false negative
testData.subset %>%
filter(Prediction == FALSE & isLegendary == TRUE)
## Name Total Catch_rate isLegendary Prediction
## 1 Mew 600 45 TRUE FALSE
# filter for false positives
testData.subset %>%
filter(Prediction == TRUE & isLegendary == FALSE)
## Name Total Catch_rate isLegendary Prediction
## 1 Beldum 300 3 FALSE TRUE
## 2 Metang 420 3 FALSE TRUE
## 3 Metagross 600 3 FALSE TRUE
Mew was our false negative. Mew is hard to classify because it is unusually easy to catch for a legendary Pokemon, which is probably why the random forest didn’t predict it was legendary.
Beldum, Metang, and Metagross are the first, second, and third evolution stages of the same Pokemon. Metagross, the final evolution, has battle stats on par with legendary Pokemon but is technically not considered legendary; it is one of the pseudo-legendary Pokemon.
Overall, our random forest proved to be good at classifying legendaries based on Total and Catch_rate.