Scraping the data
Testing Similarity across Generations with the ANOVA/F-test
- HP
- Attack
- Defense
Linear Regression
- BMI vs Total Base Stat
- BMI vs Catch Rate
The Optimal Team
Random Forest

written by Dinakar Chappa, Prashant Krishnan and Omkar Konaraddi

In this tutorial, we’ll glean rich insights from the Pokedex, a database of all Pokemon and their attributes, using various techniques and tools from CMSC320.

Scraping the data

First we’ll scrape data on Pokemon from https://pokemondb.net/pokedex/all

library(dplyr)
library(rvest)

url <- "https://pokemondb.net/pokedex/all"

pokedex.scraped <- url %>%
  read_html() %>%
  html_node("table#pokedex") %>%
  html_table() %>%
  as.data.frame() %>%
  rename("Sp.Atk" = "Sp. Atk") %>%
  rename("Sp.Def" = "Sp. Def")

pokedex.scraped %>% head()

##   #                  Name        Type Total HP Attack Defense Sp.Atk
## 1 1             Bulbasaur GrassPoison   318 45     49      49     65
## 2 2               Ivysaur GrassPoison   405 60     62      63     80
## 3 3              Venusaur GrassPoison   525 80     82      83    100
## 4 3 VenusaurMega Venusaur GrassPoison   625 80    100     123    122
## 5 4            Charmander        Fire   309 39     52      43     60
## 6 5            Charmeleon        Fire   405 58     64      58     80
##   Sp.Def Speed
## 1     65    45
## 2     80    60
## 3    100    80
## 4    120    80
## 5     50    65
## 6     65    80

We’ll filter out the Mega Pokemon because they’re modified versions of existing Pokemon and would act as near-duplicates. Note that this simple pattern also drops Meganium (#154), whose name happens to contain “Mega”.

pokedex.scraped <- pokedex.scraped %>%
  filter(grepl("Mega", Name) == FALSE)
pokedex.scraped %>% head()

##   #       Name        Type Total HP Attack Defense Sp.Atk Sp.Def Speed
## 1 1  Bulbasaur GrassPoison   318 45     49      49     65     65    45
## 2 2    Ivysaur GrassPoison   405 60     62      63     80     80    60
## 3 3   Venusaur GrassPoison   525 80     82      83    100    100    80
## 4 4 Charmander        Fire   309 39     52      43     60     50    65
## 5 5 Charmeleon        Fire   405 58     64      58     80     65    80
## 6 6  Charizard  FireFlying   534 78     84      78    109     85   100
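To see why a bare "Mega" pattern is a little too greedy, here is a minimal sketch on a few hand-typed names. The vector `pokemon_names` is made up for illustration, and the stricter pattern assumes every Mega form is scraped as "<Name>Mega <Name>" (with a space after "Mega"), as in the output above.

```r
# Hypothetical sample of names, including one scraped-style Mega form
pokemon_names <- c("Venusaur", "VenusaurMega Venusaur", "Meganium", "Yanmega")

# Broad pattern: also matches Meganium, because grepl() is case-sensitive
# and "Meganium" starts with "Mega"
broad_match <- grepl("Mega", pokemon_names)

# Stricter pattern: requires a space right after "Mega"
strict_match <- grepl("Mega ", pokemon_names)

data.frame(Name = pokemon_names, broad = broad_match, strict = strict_match)
```

We keep the broad filter in the rest of this tutorial so that the outputs shown below stay consistent.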

Some Pokemon have two types, so we’ll split the Type column into Type1 and Type2. How do we split up the Type column into two columns? Note that whenever a Pokemon has two types, their types are concatenated and both types are capitalized. We’ll use RegEx and the tidyr package’s extract function to split Type into Type1 and Type2.

library(tidyr)
pokedex.table <- pokedex.scraped %>% 
  extract(Type, into = c("Type1", "Type2"), 
          "([A-Z][a-z]+)([A-Z][a-z]+)")

pokedex.table %>% head()

##   #       Name Type1  Type2 Total HP Attack Defense Sp.Atk Sp.Def Speed
## 1 1  Bulbasaur Grass Poison   318 45     49      49     65     65    45
## 2 2    Ivysaur Grass Poison   405 60     62      63     80     80    60
## 3 3   Venusaur Grass Poison   525 80     82      83    100    100    80
## 4 4 Charmander  <NA>   <NA>   309 39     52      43     60     50    65
## 5 5 Charmeleon  <NA>   <NA>   405 58     64      58     80     65    80
## 6 6  Charizard  Fire Flying   534 78     84      78    109     85   100

Oh no, the Pokemon with only one type have NA listed for both Type1 and Type2, because the pattern requires two capitalized words. We’ll fix this by iterating over our pokedex table and adding the Type1 back from our initially scraped data.

for(i in 1:nrow(pokedex.table)) {
    if (is.na(pokedex.table$Type1[i])) {
      pokedex.table$Type1[i] <- pokedex.scraped$Type[i]
    }
}

pokedex.table$Type1 <- factor(pokedex.table$Type1)
pokedex.table$Type2 <- factor(pokedex.table$Type2)

pokedex.table %>% head()

##   #       Name Type1  Type2 Total HP Attack Defense Sp.Atk Sp.Def Speed
## 1 1  Bulbasaur Grass Poison   318 45     49      49     65     65    45
## 2 2    Ivysaur Grass Poison   405 60     62      63     80     80    60
## 3 3   Venusaur Grass Poison   525 80     82      83    100    100    80
## 4 4 Charmander  Fire   <NA>   309 39     52      43     60     50    65
## 5 5 Charmeleon  Fire   <NA>   405 58     64      58     80     65    80
## 6 6  Charizard  Fire Flying   534 78     84      78    109     85   100
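As an aside, the split can also be done without the NA patch-up step. The following is a minimal base R sketch, not the approach used above: it works on a made-up `types` vector and uses `sub()` to take the first capitalized word as Type1 and whatever is left as Type2.

```r
# Hypothetical sample of concatenated type strings
types <- c("GrassPoison", "Fire", "FireFlying")

# Type1: keep only the first capitalized word
type1 <- sub("^([A-Z][a-z]+).*$", "\\1", types)

# Type2: remove the first capitalized word; an empty remainder means no second type
type2 <- sub("^[A-Z][a-z]+", "", types)
type2[type2 == ""] <- NA

data.frame(Type = types, Type1 = type1, Type2 = type2)
```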

We’ll also need the pokemon height, weight, and capture rate. Let’s scrape the height and weight data from https://pokemondb.net/pokedex/stats/height-weight

url <- "https://pokemondb.net/pokedex/stats/height-weight"

height.weight.scraped <- url %>%
  read_html() %>%
  html_node("table.data-table") %>%
  html_table() %>%
  as.data.frame() %>%
  # we can get rid of Type because we already have Type1 and Type2
  within(rm(Type))  %>%
  rename("Height_m" = "Height (m)") %>%
  rename("Weight_kgs"= "Weight (kgs)") 

height.weight.scraped %>% head()

##   #                  Name Height (ft) Height_m Weight (lbs) Weight_kgs
## 1 1             Bulbasaur       2′04″      0.7         15.2        6.9
## 2 2               Ivysaur       3′03″      1.0         28.7       13.0
## 3 3              Venusaur       6′07″      2.0        220.5      100.0
## 4 3 VenusaurMega Venusaur       7′10″      2.4        342.8      155.5
## 5 4            Charmander       2′00″      0.6         18.7        8.5
## 6 5            Charmeleon       3′07″      1.1         41.9       19.0
##    BMI
## 1 14.1
## 2 13.0
## 3 25.0
## 4 27.0
## 5 23.6
## 6 15.7
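The BMI column on this page appears to be weight in kilograms divided by the square of height in meters. A quick sanity check, with the first few rows typed in by hand (the `check` data frame is only for illustration):

```r
# First rows of the scraped table, copied by hand
check <- data.frame(
  Name       = c("Bulbasaur", "Ivysaur", "Venusaur", "Charmander"),
  Height_m   = c(0.7, 1.0, 2.0, 0.6),
  Weight_kgs = c(6.9, 13.0, 100.0, 8.5),
  BMI        = c(14.1, 13.0, 25.0, 23.6)
)

# Recompute BMI as kg / m^2 and round to one decimal like the site does
check$BMI_recomputed <- round(check$Weight_kgs / check$Height_m^2, 1)
check
```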

Next, we’ll join the height and weight columns to our existing pokedex table.

pokedex.table <- height.weight.scraped %>%
  inner_join(pokedex.table, by = "Name") %>%
  within(rm("#.y")) %>%
  rename("#" = "#.x")

pokedex.table %>% head()

##   #       Name Height (ft) Height_m Weight (lbs) Weight_kgs  BMI Type1
## 1 1  Bulbasaur       2′04″      0.7         15.2        6.9 14.1 Grass
## 2 2    Ivysaur       3′03″      1.0         28.7       13.0 13.0 Grass
## 3 3   Venusaur       6′07″      2.0        220.5      100.0 25.0 Grass
## 4 4 Charmander       2′00″      0.6         18.7        8.5 23.6  Fire
## 5 5 Charmeleon       3′07″      1.1         41.9       19.0 15.7  Fire
## 6 6  Charizard       5′07″      1.7        199.5       90.5 31.3  Fire
##    Type2 Total HP Attack Defense Sp.Atk Sp.Def Speed
## 1 Poison   318 45     49      49     65     65    45
## 2 Poison   405 60     62      63     80     80    60
## 3 Poison   525 80     82      83    100    100    80
## 4   <NA>   309 39     52      43     60     50    65
## 5   <NA>   405 58     64      58     80     65    80
## 6 Flying   534 78     84      78    109     85   100

We’ll also want to get the capture rate of each pokemon which we can obtain from https://bulbapedia.bulbagarden.net/wiki/List_of_Pokémon_by_catch_rate

url <- "https://bulbapedia.bulbagarden.net/wiki/List_of_Pokemon_by_catch_rate"

capture.rate.scraped <- url %>%
  read_html() %>%
  html_node("table.sortable") %>%
  html_table() %>%
  as.data.frame() %>%
  rename("Catch_rate" = "Catch rate")
  
  
capture.rate.scraped$Catch_rate <- as.numeric(capture.rate.scraped$Catch_rate)

## Warning: NAs introduced by coercion

# we'll keep only the columns we need
capture.rate.scraped <- capture.rate.scraped[c("Name", "Catch_rate")]
capture.rate.scraped %>% head()

##         Name Catch_rate
## 1  Bulbasaur         45
## 2    Ivysaur         45
## 3   Venusaur         45
## 4 Charmander         45
## 5 Charmeleon         45
## 6  Charizard         45

Next, we’ll join our capture rate with our pokedex table.

pokedex.table <- capture.rate.scraped %>%
  inner_join(pokedex.table, by = "Name")
pokedex.table %>% head()

##         Name Catch_rate # Height (ft) Height_m Weight (lbs) Weight_kgs
## 1  Bulbasaur         45 1       2′04″      0.7         15.2        6.9
## 2    Ivysaur         45 2       3′03″      1.0         28.7       13.0
## 3   Venusaur         45 3       6′07″      2.0        220.5      100.0
## 4 Charmander         45 4       2′00″      0.6         18.7        8.5
## 5 Charmeleon         45 5       3′07″      1.1         41.9       19.0
## 6  Charizard         45 6       5′07″      1.7        199.5       90.5
##    BMI Type1  Type2 Total HP Attack Defense Sp.Atk Sp.Def Speed
## 1 14.1 Grass Poison   318 45     49      49     65     65    45
## 2 13.0 Grass Poison   405 60     62      63     80     80    60
## 3 25.0 Grass Poison   525 80     82      83    100    100    80
## 4 23.6  Fire   <NA>   309 39     52      43     60     50    65
## 5 15.7  Fire   <NA>   405 58     64      58     80     65    80
## 6 31.3  Fire Flying   534 78     84      78    109     85   100

We’ll also add the Generation number for each Pokemon. We can use this to explore the differences between newer and older Pokemon generations. Each Pokemon’s Generation follows from its Pokedex number, using the ranges below.

Generation I: #001-151
Generation II: #152-251
Generation III: #252-386
Generation IV: #387-493
Generation V: #494-649
Generation VI: #650-721
Generation VII: #722-809

Source: https://en.wikipedia.org/wiki/List_of_Pok%C3%A9mon#Detailed_lists_by_generation

ranges = c(1,152,252,387,494,650,722)

for (i in 1:nrow(pokedex.table)) {
  for (g in length(ranges):1) {
    if (pokedex.table$`#`[i] >= ranges[g]) {
      pokedex.table$Gen[i] <- g
      break
    }
  }
}
pokedex.table$Gen <- factor(pokedex.table$Gen)

pokedex.table %>% head()

##         Name Catch_rate # Height (ft) Height_m Weight (lbs) Weight_kgs
## 1  Bulbasaur         45 1       2′04″      0.7         15.2        6.9
## 2    Ivysaur         45 2       3′03″      1.0         28.7       13.0
## 3   Venusaur         45 3       6′07″      2.0        220.5      100.0
## 4 Charmander         45 4       2′00″      0.6         18.7        8.5
## 5 Charmeleon         45 5       3′07″      1.1         41.9       19.0
## 6  Charizard         45 6       5′07″      1.7        199.5       90.5
##    BMI Type1  Type2 Total HP Attack Defense Sp.Atk Sp.Def Speed Gen
## 1 14.1 Grass Poison   318 45     49      49     65     65    45   1
## 2 13.0 Grass Poison   405 60     62      63     80     80    60   1
## 3 25.0 Grass Poison   525 80     82      83    100    100    80   1
## 4 23.6  Fire   <NA>   309 39     52      43     60     50    65   1
## 5 15.7  Fire   <NA>   405 58     64      58     80     65    80   1
## 6 31.3  Fire Flying   534 78     84      78    109     85   100   1
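The nested loop above can also be written as a single vectorized lookup. This is a minimal sketch using base R’s `findInterval()`, which returns, for each number, the index of the last breakpoint that is less than or equal to it. The `dex_numbers` vector is a made-up sample of Pokedex numbers.

```r
# Lower bound of each generation's Pokedex range (same as `ranges` above)
ranges <- c(1, 152, 252, 387, 494, 650, 722)

# Hypothetical sample of Pokedex numbers, chosen to hit the boundaries
dex_numbers <- c(1, 151, 152, 386, 387, 721, 722, 809)

# Index of the interval each number falls into is its generation
gen <- findInterval(dex_numbers, ranges)
data.frame(dex = dex_numbers, Gen = gen)
```

Applied to the real table, this would look like `factor(findInterval(pokedex.table$"#", ranges))`.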

Great! We have everything we need for the rest of this tutorial.

Testing Similarity across Generations with the ANOVA/F-test

Are Pokemon across all generations similar to one another? Do newer generations have better stats than previous generations? It’s intuitive to think that newer Pokemon would be better than older Pokemon but we’ll test our intuition using an ANOVA/F hypothesis test.

We can treat the Pokedex as a sample of the population because the in-game Pokemon have some variation in their stats (due to items they hold, what level they are at, where they are caught, how many battles they have won, etc.) and most Pokemon are not unique in the Pokemon games (i.e. you can catch multiple Pokemon of the same Pokedex name and number).

HP

Do newer generations have more HP (Health Points) than previous generations?

First, let’s take a look at the average HP for each generation:

pokedex.table %>%
  group_by(Gen) %>%
  summarise(average_hp = mean(HP))

## # A tibble: 7 x 2
##   Gen   average_hp
##   <fct>      <dbl>
## 1 1           64.2
## 2 2           70.5
## 3 3           65.8
## 4 4           72.2
## 5 5           69.5
## 6 6           68.6
## 7 7           71.4

We can see some variation in the HP across generations; Generation IV Pokemon have, on average, around 8 more HP than the Generation I Pokemon. But is this difference statistically significant? Let’s use the ANOVA/F-test at the 5% level of significance.

\(H_o =\) no difference between true average HP across generations

\(H_a =\) at least two generations’ average HP are different

res.aov <- aov(HP ~ Gen, data = pokedex.table)
summary(res.aov)

##              Df Sum Sq Mean Sq F value Pr(>F)
## Gen           6   6452  1075.3   1.584  0.149
## Residuals   779 528799   678.8

We can see that, if the null hypothesis were true, there’s a 14.9% chance of seeing an F value at least as extreme as the one above, which is greater than our 5% level of significance. Therefore, we do not reject the null hypothesis (\(H_o\)): we do not have sufficient evidence that average HP differs across generations at the 5% level of significance.
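To make the link between the F value and the p-value concrete, the Pr(>F) column is the upper tail of an F distribution with the two degrees of freedom from the table. A minimal sketch, plugging in the rounded numbers printed above (the variable names are ours):

```r
# Values read off the ANOVA table for HP
f_value  <- 1.584  # F value
df_gen   <- 6      # degrees of freedom for Gen (7 generations - 1)
df_resid <- 779    # residual degrees of freedom

# Probability of an F statistic at least this large under the null hypothesis
p_value <- pf(f_value, df_gen, df_resid, lower.tail = FALSE)
round(p_value, 3)
```

Because the F value is itself rounded, the result should agree with the table only up to rounding.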

Attack

Do newer generations have more Attack than previous generations?

First, let’s take a look at the average attack for each generation:

library(dplyr)

pokedex.table %>%
  group_by(Gen) %>%
  summarise(average_attack = mean(Attack))

## # A tibble: 7 x 2
##   Gen   average_attack
##   <fct>          <dbl>
## 1 1               72.9
## 2 2               67.5
## 3 3               72.5
## 4 4               80.0
## 5 5               79.9
## 6 6               72.1
## 7 7               85.7

We can already tell that there seems to be some variation in the average Attack across generations. The average Attack of Generation VII is more than 10 points higher than the average of Generation I. At the same time, Generations I, III, and VI have almost exactly the same average Attack. The question remains whether this variation is statistically significant. Let’s use the ANOVA/F-test at the 5% level of significance.

\(H_o =\) no difference between true average attack across generations

\(H_a =\) at least two generations’ average attack are different

res.aov <- aov(Attack ~ Gen, data = pokedex.table)
summary(res.aov)

##              Df Sum Sq Mean Sq F value   Pr(>F)    
## Gen           6  22901    3817   4.525 0.000161 ***
## Residuals   779 657084     843                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see here that we have an F value of 4.525, which gives an extremely small p-value (0.000161). Since this is below our 5% level of significance, we have sufficient evidence to reject the null hypothesis (\(H_o\)). We conclude that at least two generations differ in average Attack.
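The F-test tells us that some generations differ, but not which ones. One common follow-up, which is not part of the analysis above, is Tukey’s HSD test on the fitted `aov` object. Here is a minimal sketch on a small made-up dataset (`toy`, `group`, and `y` are hypothetical names) where only the third group is different:

```r
# Hypothetical data: groups 1 and 2 share a mean, group 3 is clearly higher
toy <- data.frame(
  group = factor(rep(c("1", "2", "3"), each = 5)),
  y     = c(10, 11, 9, 10, 10,
            10, 9, 11, 10, 10,
            20, 21, 19, 20, 20)
)

fit   <- aov(y ~ group, data = toy)
tukey <- TukeyHSD(fit)

# One row per pair of groups: difference, interval bounds, adjusted p-value
tukey$group
```

On the real data, `TukeyHSD(res.aov)` would give one row per pair of generations.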

Defense

Do newer generations have more Defense than previous generations?

First, let’s take a look at the average Defense for each generation:

pokedex.table %>%
  group_by(Gen) %>%
  summarise(average_Defense = mean(Defense))

## # A tibble: 7 x 2
##   Gen   average_Defense
##   <fct>           <dbl>
## 1 1                68.2
## 2 2                69.2
## 3 3                69.1
## 4 4                74.4
## 5 5                71.1
## 6 6                73.0
## 7 7                79.4

We can already tell that there seems to be some variation in the average Defense across generations, although the differences look fairly marginal. The largest gap is (again) between Generation VII and Generation I, at about 11 points. The question remains whether this variation is statistically significant. Let’s use the ANOVA/F-test at the 5% level of significance.

\(H_o =\) no difference between true average Defense across generations

\(H_a =\) at least two generations’ average Defense are different

res.aov <- aov(Defense ~ Gen, data = pokedex.table)
summary(res.aov)

##              Df Sum Sq Mean Sq F value Pr(>F)
## Gen           6   9254  1542.3   1.748  0.107
## Residuals   779 687530   882.6

We can see here that we have an F value of 1.748, which gives a p-value of 0.107. Since our p-value is greater than our level of significance of 0.05, we have insufficient evidence to reject the null hypothesis. We cannot conclude that average Defense differs across generations.

Linear Regression

BMI vs Total Base Stat

Now we’re gonna answer the following question: how do the height and weight (in other words, the BMI) of a Pokemon correlate with its base stats?

To do so, we’re gonna fit a linear regression model of Total on BMI. Total is the sum of all the base stats: HP, Attack, Defense, Sp. Atk, Sp. Def, and Speed. BMI, or Body Mass Index, is weight in kilograms divided by the square of height in meters (kg/m^2). Let’s plot BMI vs. the Total Base Stats using ggplot().

  library(tidyverse)
  library(broom)
  pokedex <- pokedex.table
  
  pokedex %>%
  ggplot(aes(x=as.numeric(BMI), y=Total)) +
    geom_point() +
    geom_smooth(method=lm) +
    labs(title="BMI vs. Total Base Stats",
         x = "BMI",
         y = "Total Base Stats")

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning: Removed 1 rows containing non-finite values (stat_smooth).

## Warning: Removed 1 rows containing missing values (geom_point).

As you can see here, a lot of the data points are clustered on the left side, and there appears to be no linear trend among Pokemon. The BMI of a Pokemon has little to no correlation with its Total Base Stats. Still, let’s see what we get as the linear regression equation; we’ll worry about whether it is accurate later.

  auto_fit <- lm(Total~as.numeric(BMI), data=pokedex.table)

## Warning in eval(predvars, data, env): NAs introduced by coercion

  auto_fit

## 
## Call:
## lm(formula = Total ~ as.numeric(BMI), data = pokedex.table)
## 
## Coefficients:
##     (Intercept)  as.numeric(BMI)  
##        425.6167          -0.1614

As you can see, the linear regression equation is y = 425.6167 - 0.1614x, where x is the BMI and y is the Total Base Stats. To see how much confidence we can put in the slope, let’s set up a 95% confidence interval.

auto_fit_stats <- auto_fit %>%
  tidy() %>%
  select(term, estimate, std.error)
auto_fit_stats

## # A tibble: 2 x 3
##   term            estimate std.error
##   <chr>              <dbl>     <dbl>
## 1 (Intercept)      426.       5.38  
## 2 as.numeric(BMI)   -0.161    0.0912

confidence_interval_offset <- 1.95 * auto_fit_stats$std.error[2]
confidence_interval <- round(c(auto_fit_stats$estimate[2] - confidence_interval_offset,
                               auto_fit_stats$estimate[2],
                               auto_fit_stats$estimate[2] + confidence_interval_offset), 4)

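Given our confidence interval, for every one-unit increase in BMI (1 kg/m^2), we are 95% confident that the average change in Total Base Stats lies between -0.3392 and 0.0164. Since this interval contains 0, we cannot conclude that BMI has any linear effect on Total Base Stats. (The multiplier 1.95 in the code is a rough stand-in for the usual normal critical value of about 1.96; the difference only shows up in the third decimal place.)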
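For reference, here is a minimal sketch of the same interval using the exact normal critical value from `qnorm()` instead of a hard-coded multiplier. The estimate and standard error are the rounded values printed above, typed in by hand, so the result is only approximate.

```r
# Slope estimate and standard error for BMI, read off the tidy() output above
estimate  <- -0.1614
std_error <- 0.0912

# Two-sided 95% interval: the 97.5th percentile of the standard normal
z  <- qnorm(0.975)
ci <- estimate + c(-1, 1) * z * std_error
round(ci, 4)
```

With the fitted model at hand, `confint(auto_fit)` should give a very similar interval; it uses the t distribution rather than the normal, which makes little difference with this many observations.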

BMI vs Catch Rate

Our goal now is to investigate how the height and weight (BMI) of a Pokemon correlate with its capture rate.

We will assess this correlation by fitting a linear regression model that will plot BMI vs Capture rate. BMI (body mass index) will be calculated as kg/m^2. We will start by plotting the graph of BMI vs capture rate using the ggplot() function.

Setup:

library(tidyr)
library(dplyr)
library(rvest)
library(tidyverse)
library(broom)
knitr::opts_chunk$set(echo = TRUE)

pokedex <- pokedex.table
pokedex %>% ggplot(aes(x=as.numeric(BMI), y=Catch_rate)) + geom_point() + geom_smooth(method=lm) + 
  labs(title="BMI vs. Capture Rate", x="BMI", y = "Capture Rate")

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning: Removed 20 rows containing non-finite values (stat_smooth).

## Warning: Removed 20 rows containing missing values (geom_point).

As can be seen in the plot above, there does not seem to be a strong relationship between the two variables (BMI and catch rate). It is hard to distinguish any linear trend in the data, hence the almost horizontal line of best fit.

Let us still see what we get for the equation after generating a linear regression model.

auto_fit1 <- lm(Catch_rate ~ as.numeric(BMI), data=pokedex.table)

## Warning in eval(predvars, data, env): NAs introduced by coercion

auto_fit1

## 
## Call:
## lm(formula = Catch_rate ~ as.numeric(BMI), data = pokedex.table)
## 
## Coefficients:
##     (Intercept)  as.numeric(BMI)  
##        98.87623          0.04687

The linear regression equation generated is y = 98.88 + 0.047x, where, once again, x represents the BMI of the Pokemon (our explanatory variable) and y is the catch rate of the Pokemon (our response variable).

Now we will compute the 95% confidence interval for the slope.

auto_fit_stats2 <- auto_fit1 %>% tidy() %>% select(term, estimate, std.error)
auto_fit_stats2

## # A tibble: 2 x 3
##   term            estimate std.error
##   <chr>              <dbl>     <dbl>
## 1 (Intercept)      98.9       3.77  
## 2 as.numeric(BMI)   0.0469    0.0634

confidence_interval_offset <- 1.95 * auto_fit_stats2$std.error[2]
confidence_interval <- round(c(auto_fit_stats2$estimate[2] - confidence_interval_offset, auto_fit_stats2$estimate[2],
auto_fit_stats2$estimate[2] + 
confidence_interval_offset), 4)
confidence_interval

## [1] -0.0768  0.0469  0.1705

Given the confidence interval, for each additional kg/m^2 of BMI, we are 95% confident that the average change in catch rate lies within (-0.0768, 0.1705). Since this interval contains 0, we cannot conclude that BMI has any linear effect on catch rate.

The Optimal Team

Here we will be picking the optimal team: the best set of Pokemon to use in battle. Each Pokemon team is made up of six Pokemon. In order to get the best team, we will be picking the Pokemon with the highest “Total” stat score in different categories.

Our first three Pokemon will be based on the three standard types: Fire, Water, and Grass. The next three are variable and are simply the remaining Pokemon with the highest base total stats. We add the requirement that every Pokemon be a different type. Although Pokemon can have dual types, for the purposes of this analysis we focus on Type1 only. Before we start filtering our lists, we must prepare the dataframe. We also choose to omit the legendary Pokemon: in this section of R code, we mark each Pokemon with a new boolean column called isLegendary, using a previously built list of legendary Pokemon.

library(tidyverse)
library(broom)
knitr::opts_chunk$set(echo = TRUE)
pokedex <- pokedex.table

legendaries = c("Zapdos", "Articuno", "Moltres", "Mewtwo",
                "Mew", "Raikou", "Entei", "Suicune", 
                "Lugia", "Celebi", 
                "Regirock", "Regice", "Registeel",
                "Latias", "Latios", "Kyogre", "Groudon", "Rayquaza",
                "Jirachi", "Uxie", "Mesprit", "Azelf", "Dialga", 
                "Palkia", "Heatran", "Regigigas", "Giratina", 
                "Cresselia", "Phione", "Manaphy", "Darkrai",
                "Arceus", "Victini", "Reshiram", "Zekrom", 
                "Kyurem", "Genesect", 
                "Cobalion" , "Terrakion", "Virizion", "Volcanion", "Solgaleo", "Tapu Fini", "Tapu Bulu", "Xerneas", "Yveltal")

df <- pokedex
df$isLegendary <- FALSE
for (i in 1:length(legendaries)) {
  df$isLegendary[df$Name == legendaries[i]] <- TRUE
}
df$isLegendary <- factor(df$isLegendary)
df %>% head()

##         Name Catch_rate # Height (ft) Height_m Weight (lbs) Weight_kgs
## 1  Bulbasaur         45 1       2′04″      0.7         15.2        6.9
## 2    Ivysaur         45 2       3′03″      1.0         28.7       13.0
## 3   Venusaur         45 3       6′07″      2.0        220.5      100.0
## 4 Charmander         45 4       2′00″      0.6         18.7        8.5
## 5 Charmeleon         45 5       3′07″      1.1         41.9       19.0
## 6  Charizard         45 6       5′07″      1.7        199.5       90.5
##    BMI Type1  Type2 Total HP Attack Defense Sp.Atk Sp.Def Speed Gen
## 1 14.1 Grass Poison   318 45     49      49     65     65    45   1
## 2 13.0 Grass Poison   405 60     62      63     80     80    60   1
## 3 25.0 Grass Poison   525 80     82      83    100    100    80   1
## 4 23.6  Fire   <NA>   309 39     52      43     60     50    65   1
## 5 15.7  Fire   <NA>   405 58     64      58     80     65    80   1
## 6 31.3  Fire Flying   534 78     84      78    109     85   100   1
##   isLegendary
## 1       FALSE
## 2       FALSE
## 3       FALSE
## 4       FALSE
## 5       FALSE
## 6       FALSE
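The loop that sets isLegendary can also be written as one vectorized membership test with `%in%`. A minimal sketch on a made-up table (`toy` and `legendary_names` are hypothetical stand-ins for `df` and `legendaries`):

```r
# Hypothetical miniature pokedex and legendary list
toy <- data.frame(
  Name = c("Bulbasaur", "Mewtwo", "Pikachu", "Lugia"),
  stringsAsFactors = FALSE
)
legendary_names <- c("Mewtwo", "Lugia")

# TRUE wherever the name appears in the list, FALSE otherwise
toy$isLegendary <- toy$Name %in% legendary_names
toy
```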

In this section, we filter the dataframe. The first step is to remove all the legendary Pokemon by keeping only the rows where isLegendary is FALSE.

We then create a new dataframe for each pick. For the first three types, these dataframes are named after their type, such as fire_dex. For the last three picks we reuse the name typeless_pokedex: the pokedex that remains after removing Fire, Grass, and Water as well as the Type1 of each Pokemon already added to our team.

We arrange each of these dataframes by Total in descending order and slice off the first entry (i.e. the Pokemon with the highest Total of that type), then combine the picks into a new dataframe.

We display the final team below. It is stored in the variable called total.

unlegendary_pokedex <- df %>% filter(isLegendary == FALSE)

fire_dex <- unlegendary_pokedex %>% select(Name, Type1, Total) %>% filter(Type1 == "Fire") %>% arrange(desc(Total))

fire_pokemon <- fire_dex %>% slice(1)


water_dex <- unlegendary_pokedex %>% select(Name, Type1, Total) %>% filter(Type1 == "Water") %>% arrange(desc(Total))

water_pokemon <- water_dex %>% slice(1)


grass_dex <- unlegendary_pokedex %>% select(Name, Type1, Total) %>% filter(Type1 == "Grass") %>% arrange(desc(Total))

grass_pokemon <- grass_dex %>% slice(1)

total <- rbind(fire_pokemon, water_pokemon)
total <- rbind(total, grass_pokemon)

typeless_pokedex <- unlegendary_pokedex %>% select(Name, Type1, Total) %>% filter(Type1 != "Water")%>% filter(Type1 != "Fire")%>% filter(Type1 != "Grass") %>% arrange(desc(Total))

total <- rbind(total, slice(typeless_pokedex, 1))
temp <-  slice(typeless_pokedex, 1)
previous_type1 <- temp$Type1[1]

typeless_pokedex <- unlegendary_pokedex %>% select(Name, Type1, Total) %>% filter(Type1 != "Water")%>% filter(Type1 != "Fire")%>% filter(Type1 != "Grass") %>% filter(Type1 != previous_type1) %>% arrange(desc(Total))

total <- rbind(total, slice(typeless_pokedex, 1))
temp <-  slice(typeless_pokedex, 1)
previous_type2 <- temp$Type1[1]

typeless_pokedex <- unlegendary_pokedex %>% select(Name, Type1, Total) %>% filter(Type1 != "Water")%>% filter(Type1 != "Fire")%>% filter(Type1 != "Grass") %>% filter(Type1 != previous_type1) %>% filter(Type1 != previous_type2) %>% arrange(desc(Total))

total <- rbind(total, slice(typeless_pokedex, 1))
temp <-  slice(typeless_pokedex, 1)
previous_type <- temp$Type1[1]
total

##          Name   Type1 Total
## 1 Blacephalon    Fire   570
## 2    Gyarados   Water   540
## 3     Kartana   Grass   570
## 4      Lunala Psychic   680
## 5     Slaking  Normal   670
## 6   Dragonite  Dragon   600

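In the Pokemon games, our strongest opponents often have these Pokemon. Dragonite is particularly popular amongst trainers for its stats. Note that Lunala is itself a Legendary Pokemon; it made the team only because it is missing from our legendaries list.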
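The selection procedure above repeats the same filter several times. As a compact illustration of the same greedy idea, here is a minimal base R sketch: take the best Pokemon of each required type, then keep adding the best remaining Pokemon whose Type1 is not yet on the team. The function `pick_team()` and the `toy` data are hypothetical, and ties are broken by row order.

```r
# Greedy team selection: one pick per required type, then best of unused types
pick_team <- function(dex, required = c("Fire", "Water", "Grass"), size = 6) {
  dex  <- dex[order(-dex$Total), ]  # highest Total first
  team <- dex[0, ]                  # empty data frame with the same columns
  for (type in required) {
    team <- rbind(team, head(dex[dex$Type1 == type, ], 1))
  }
  while (nrow(team) < size) {
    rest <- dex[!(dex$Type1 %in% team$Type1), ]  # drop types already used
    if (nrow(rest) == 0) break
    team <- rbind(team, head(rest, 1))
  }
  team
}

# Hypothetical miniature pokedex
toy <- data.frame(
  Name  = c("a", "b", "c", "d", "e", "f", "g", "h"),
  Type1 = c("Fire", "Fire", "Water", "Grass", "Normal", "Normal", "Dragon", "Psychic"),
  Total = c(500, 550, 540, 530, 670, 600, 600, 680),
  stringsAsFactors = FALSE
)

team <- pick_team(toy)
team
```

On the real data, the input would be something like `unlegendary_pokedex[, c("Name", "Type1", "Total")]` with Type1 converted to character.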

Random Forest

Predicting each Pokemon’s Type1

Can we use a Random Forest to predict a Pokemon’s Type1? We’ll use various attributes of each Pokemon to see if we can predict their Type1.

First we split our data into training and testing sets. Our training set will be Generations 2, 4, and 6. Our test set will be Generations 1, 3, and 5.

df <- pokedex.table
trainingData <- df[df$Gen == 2 | df$Gen == 4 | df$Gen == 6, ]
trainingData %>% head()

##           Name Catch_rate   # Height (ft) Height_m Weight (lbs) Weight_kgs
## 152  Chikorita         45 152       2′11″      0.9         14.1        6.4
## 153    Bayleef         45 153       3′11″      1.2         34.8       15.8
## 154  Cyndaquil         45 155       1′08″      0.5         17.4        7.9
## 155    Quilava         45 156       2′11″      0.9         41.9       19.0
## 156 Typhlosion         45 157       5′07″      1.7        175.3       79.5
## 157   Totodile         45 158       2′00″      0.6         20.9        9.5
##      BMI Type1 Type2 Total HP Attack Defense Sp.Atk Sp.Def Speed Gen
## 152  7.9 Grass  <NA>   318 45     49      65     49     65    45   2
## 153 11.0 Grass  <NA>   405 60     62      80     63     80    60   2
## 154 31.6  Fire  <NA>   309 39     52      43     60     50    65   2
## 155 23.5  Fire  <NA>   405 58     64      58     80     65    80   2
## 156 27.5  Fire  <NA>   534 78     84      78    109     85   100   2
## 157 26.4 Water  <NA>   314 50     65      64     44     48    43   2

testData <- df[df$Gen == 1 | df$Gen == 3 | df$Gen == 5, ]
testData %>% head()

##         Name Catch_rate # Height (ft) Height_m Weight (lbs) Weight_kgs
## 1  Bulbasaur         45 1       2′04″      0.7         15.2        6.9
## 2    Ivysaur         45 2       3′03″      1.0         28.7       13.0
## 3   Venusaur         45 3       6′07″      2.0        220.5      100.0
## 4 Charmander         45 4       2′00″      0.6         18.7        8.5
## 5 Charmeleon         45 5       3′07″      1.1         41.9       19.0
## 6  Charizard         45 6       5′07″      1.7        199.5       90.5
##    BMI Type1  Type2 Total HP Attack Defense Sp.Atk Sp.Def Speed Gen
## 1 14.1 Grass Poison   318 45     49      49     65     65    45   1
## 2 13.0 Grass Poison   405 60     62      63     80     80    60   1
## 3 25.0 Grass Poison   525 80     82      83    100    100    80   1
## 4 23.6  Fire   <NA>   309 39     52      43     60     50    65   1
## 5 15.7  Fire   <NA>   405 58     64      58     80     65    80   1
## 6 31.3  Fire Flying   534 78     84      78    109     85   100   1

Our random forest will use the catch rate, the metric height and weight, the six individual base stats, and the generation. We leave out BMI (since we’re already considering height and weight), Type2 (we’re only trying to predict Type1, and some Pokemon don’t have a Type2), and Total (since we’re using each individual base stat).

library(randomForest)
set.seed(1234)

rf <- randomForest(Type1 ~ Catch_rate + Height_m + Weight_kgs + HP + Attack + Defense + Sp.Atk+ Sp.Def + Speed + Gen, importance = TRUE, mtry=1, data = trainingData, na.action=na.exclude)

After we train the model on our training data, we check whether we can predict the Type1 in our test set.

actuals <- testData[, "Type1"]
predictions <- predict(rf, testData)

pre.act.df <- as.data.frame(table(predictions, actuals))
pre.act.df %>% filter(Freq > 0) %>% head()

##   predictions actuals Freq
## 1         Bug     Bug    7
## 2       Fairy     Bug    4
## 3        Fire     Bug   10
## 4      Flying     Bug    1
## 5       Grass     Bug    3
## 6      Normal     Bug    8

correct.predictions <- nrow(pre.act.df[pre.act.df$predictions == pre.act.df$actuals,])
accuracy <- correct.predictions / nrow(pre.act.df[pre.act.df$Freq > 0,])
accuracy

## [1] 0.1666667

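By this measure, our random forest gets the Type1 right only about 15-17% of the time. Keep in mind that the calculation above counts cells of the prediction/actual table (diagonal cells over non-empty cells) rather than individual Pokemon, so it is only a rough proxy for accuracy. It seems like we can’t consistently predict a Pokemon’s Type1 using a random forest on its attributes.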
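For comparison, per-Pokemon accuracy is usually computed as the number of correct predictions divided by the total number of predictions, i.e. the sum of the diagonal of the confusion table over the sum of all its cells. A minimal sketch on made-up labels (`toy_actuals` and `toy_predictions` are hypothetical, not model output):

```r
# Hypothetical true and predicted types for six Pokemon
toy_actuals     <- factor(c("Bug", "Bug", "Fire", "Water", "Water", "Grass"))
toy_predictions <- factor(c("Bug", "Fire", "Fire", "Water", "Grass", "Grass"),
                          levels = levels(toy_actuals))

# Rows are predictions, columns are actual types; both share the same levels,
# so the diagonal holds the correct predictions
confusion <- table(toy_predictions, toy_actuals)

toy_accuracy <- sum(diag(confusion)) / sum(confusion)
toy_accuracy
```

This only works when both factors share the same levels in the same order, which is why the levels are set explicitly here.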

Predicting each Pokemon’s Type2

We’re gonna use a random forest to see if we can predict each Pokemon’s Type2 from its stats. To do so, we first filter out all of the Pokemon that don’t have a Type2, so that the model isn’t thrown off by the NA values.

pokedex.table %>% head()

##         Name Catch_rate # Height (ft) Height_m Weight (lbs) Weight_kgs
## 1  Bulbasaur         45 1       2′04″      0.7         15.2        6.9
## 2    Ivysaur         45 2       3′03″      1.0         28.7       13.0
## 3   Venusaur         45 3       6′07″      2.0        220.5      100.0
## 4 Charmander         45 4       2′00″      0.6         18.7        8.5
## 5 Charmeleon         45 5       3′07″      1.1         41.9       19.0
## 6  Charizard         45 6       5′07″      1.7        199.5       90.5
##    BMI Type1  Type2 Total HP Attack Defense Sp.Atk Sp.Def Speed Gen
## 1 14.1 Grass Poison   318 45     49      49     65     65    45   1
## 2 13.0 Grass Poison   405 60     62      63     80     80    60   1
## 3 25.0 Grass Poison   525 80     82      83    100    100    80   1
## 4 23.6  Fire   <NA>   309 39     52      43     60     50    65   1
## 5 15.7  Fire   <NA>   405 58     64      58     80     65    80   1
## 6 31.3  Fire Flying   534 78     84      78    109     85   100   1

df <- pokedex.table
df <- df %>% filter(!is.na(Type2))
df %>% head()

##         Name Catch_rate  # Height (ft) Height_m Weight (lbs) Weight_kgs
## 1  Bulbasaur         45  1       2′04″      0.7         15.2        6.9
## 2    Ivysaur         45  2       3′03″      1.0         28.7       13.0
## 3   Venusaur         45  3       6′07″      2.0        220.5      100.0
## 4  Charizard         45  6       5′07″      1.7        199.5       90.5
## 5 Butterfree         45 12       3′07″      1.1         70.5       32.0
## 6     Weedle        255 13       1′00″      0.3          7.1        3.2
##    BMI Type1  Type2 Total HP Attack Defense Sp.Atk Sp.Def Speed Gen
## 1 14.1 Grass Poison   318 45     49      49     65     65    45   1
## 2 13.0 Grass Poison   405 60     62      63     80     80    60   1
## 3 25.0 Grass Poison   525 80     82      83    100    100    80   1
## 4 31.3  Fire Flying   534 78     84      78    109     85   100   1
## 5 26.4   Bug Flying   395 60     45      50     90     80    70   1
## 6 35.6   Bug Poison   195 40     35      30     20     20    50   1

Now that we’ve got that settled, we’ll use Generations 2, 4, and 6 as our training data set. The main reason for picking these generations is that they help ensure all types of Pokemon are represented: for example, Generation 2 introduces Dark and Steel type Pokemon, while Generation 6 introduces Fairy type Pokemon. For testing, we’ll use Generations 1, 3, and 5.

library(randomForest)

trainingData <- df[df$Gen == 2 | df$Gen == 4 | df$Gen == 6, ]
trainingData %>% head()

##        Name Catch_rate   # Height (ft) Height_m Weight (lbs) Weight_kgs
## 68 Hoothoot        255 163       2′04″      0.7         46.7       21.2
## 69  Noctowl         90 164       5′03″      1.6         89.9       40.8
## 70   Ledyba        255 165       3′03″      1.0         23.8       10.8
## 71   Ledian         90 166       4′07″      1.4         78.5       35.6
## 72 Spinarak        255 167       1′08″      0.5         18.7        8.5
## 73  Ariados         90 168       3′07″      1.1         73.9       33.5
##     BMI  Type1  Type2 Total  HP Attack Defense Sp.Atk Sp.Def Speed Gen
## 68 43.3 Normal Flying   262  60     30      30     36     56    50   2
## 69 15.9 Normal Flying   452 100     50      50     86     96    70   2
## 70 10.8    Bug Flying   265  40     20      30     40     80    55   2
## 71 18.2    Bug Flying   390  55     35      50     55    110    85   2
## 72 34.0    Bug Poison   250  40     60      40     40     40    30   2
## 73 27.7    Bug Poison   400  70     90      70     60     70    40   2

testData <- df[df$Gen == 1 | df$Gen == 3 | df$Gen == 5, ]
testData %>% head()

##         Name Catch_rate  # Height (ft) Height_m Weight (lbs) Weight_kgs
## 1  Bulbasaur         45  1       2′04″      0.7         15.2        6.9
## 2    Ivysaur         45  2       3′03″      1.0         28.7       13.0
## 3   Venusaur         45  3       6′07″      2.0        220.5      100.0
## 4  Charizard         45  6       5′07″      1.7        199.5       90.5
## 5 Butterfree         45 12       3′07″      1.1         70.5       32.0
## 6     Weedle        255 13       1′00″      0.3          7.1        3.2
##    BMI Type1  Type2 Total HP Attack Defense Sp.Atk Sp.Def Speed Gen
## 1 14.1 Grass Poison   318 45     49      49     65     65    45   1
## 2 13.0 Grass Poison   405 60     62      63     80     80    60   1
## 3 25.0 Grass Poison   525 80     82      83    100    100    80   1
## 4 31.3  Fire Flying   534 78     84      78    109     85   100   1
## 5 26.4   Bug Flying   395 60     45      50     90     80    70   1
## 6 35.6   Bug Poison   195 40     35      30     20     20    50   1
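Whether the training generations really cover every type can be checked directly. This is a minimal sketch on a made-up six-row table (the `toy` data frame and all of its values are hypothetical). On the real data, the same `setdiff()` call would compare `testData$Type2` with `trainingData$Type2`, wrapped in `as.character()` if the column is a factor.

```r
# Hypothetical miniature of pokedex.table with only the columns we need.
toy <- data.frame(
  Name  = c("A", "B", "C", "D", "E", "F"),
  Gen   = c(1, 2, 3, 4, 5, 6),
  Type2 = c("Poison", "Steel", "Flying", "Flying", "Ghost", "Fairy"),
  stringsAsFactors = FALSE
)

# Same split as above, written with %in% instead of chained == comparisons.
train.gens <- c(2, 4, 6)
toy.train  <- toy[toy$Gen %in% train.gens, ]
toy.test   <- toy[!(toy$Gen %in% train.gens), ]

# Types that show up in the test set but never in the training set.
unseen.types <- setdiff(unique(toy.test$Type2), unique(toy.train$Type2))
unseen.types
```

Any type listed in `unseen.types` is one the model has no chance of predicting, since it never saw an example of it during training.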

set.seed(1234)

Here, we’ll tell the model to look at Catch Rate, Height, Weight, HP, Attack, Defense, Sp. Atk, Sp. Def, Speed, Generation, and Type1 to try to classify each Pokemon’s Type2.

rf <- randomForest(Type2 ~ Catch_rate + Height_m + Weight_kgs + HP + Attack + Defense + Sp.Atk+ Sp.Def + Speed + Gen + Type1, importance = TRUE, mtry=1, data = trainingData, na.action=na.exclude)

actuals <- testData[, "Type2"]
predictions <- predict(rf, testData)

pre.act.df <- as.data.frame(table(predictions, actuals))
pre.act.df %>% filter(Freq > 0) %>% head()

##   predictions actuals Freq
## 1      Dragon     Bug    1
## 2      Flying     Bug    1
## 3    Fighting    Dark    1
## 4      Flying    Dark    7
## 5       Steel    Dark    1
## 6      Flying  Dragon    4

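# Same cell-based score as before: matching (prediction, actual) cells
# divided by the cells that occurred at least once.
correct.predictions <- nrow(pre.act.df[pre.act.df$predictions == pre.act.df$actuals,])
accuracy <- correct.predictions / nrow(pre.act.df[pre.act.df$Freq > 0,])
accuracy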

## [1] 0.28125

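By this score, our random forest gets the Type2 right in roughly 28-32% of cases (0.281 in the run above). Note that the score is computed over cells of the prediction table rather than over individual Pokemon, so it is only a rough indicator. It seems like we can’t consistently predict a Pokemon’s Type2 based on its stats, although the score is noticeably higher than it was for Type1. Oddly enough, the model got 35-38 out of 43 Flying types right, so the stats may carry some real signal for Flying. The table above also shows Flying being predicted for Dark, Dragon, and Bug Pokemon, though, so part of that success may simply come from the model guessing Flying very often.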
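Per-type numbers like the Flying count can be read off a confusion matrix as recall (how many true Flying Pokemon were found) and precision (how many Flying predictions were right). The following is a sketch on made-up labels, with every `.toy` name hypothetical. With the real data, `table(predictions, actuals)` would supply the matrix.

```r
# Made-up labels in which the model predicts "Flying" most of the time.
type.levels     <- c("Flying", "Poison", "Steel")
actuals.toy     <- factor(c("Flying", "Flying", "Flying", "Flying",
                            "Poison", "Poison", "Steel", "Steel"),
                          levels = type.levels)
predictions.toy <- factor(c("Flying", "Flying", "Flying", "Poison",
                            "Flying", "Poison", "Flying", "Flying"),
                          levels = type.levels)

# Rows are predictions, columns are actual types.
conf.matrix <- table(predictions.toy, actuals.toy)

# Recall: correct predictions of a type / Pokemon that truly have that type.
recall <- diag(conf.matrix) / colSums(conf.matrix)

# Precision: correct predictions of a type / all predictions of that type.
# This is NaN for a type the model never predicts (0 / 0).
precision <- diag(conf.matrix) / rowSums(conf.matrix)

recall
precision
```

In this toy example Flying has a recall of 0.75 but a precision of only 0.5, which is the pattern to look for when a model leans heavily on one class.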

Classifying Legendaries

Can we use a Random Forest to classify legendaries?

First, we’ll label our data set with legendaries using the list here: https://bulbapedia.bulbagarden.net/wiki/Legendary_Pokémon

legendaries = c("Zapdos", "Articuno", "Moltres", "Mewtwo",
                "Mew", "Raikou", "Entei", "Suicune", 
                "Lugia", "Celebi", 
                "Regirock", "Regice", "Registeel",
                "Latias", "Latios", "Kyogre", "Groudon", "Rayquaza",
                "Jirachi", "Uxie", "Mesprit", "Azelf", "Dialga", 
                "Palkia", "Heatran", "Regigigas", "Giratina", 
                "Cresselia", "Phione", "Manaphy", "Darkrai",
                "Arceus", "Victini", "Reshiram", "Zekrom", 
                "Kyurem", "Genesect", 
                "Cobalion" , "Terrakion", "Virizion")

df <- pokedex.table
# Flag every Pokemon whose name appears in the legendaries list.
df$isLegendary <- factor(df$Name %in% legendaries)

Next, we’ll split our data set into training and testing data. We’ll train the random forest on Generations 2, 4, and 6 then test it out on 1, 3, and 5.

library(randomForest)

trainingData <- df[df$Gen == 2 | df$Gen == 4 | df$Gen == 6, ]
trainingData %>% head()

##           Name Catch_rate   # Height (ft) Height_m Weight (lbs) Weight_kgs
## 152  Chikorita         45 152       2′11″      0.9         14.1        6.4
## 153    Bayleef         45 153       3′11″      1.2         34.8       15.8
## 154  Cyndaquil         45 155       1′08″      0.5         17.4        7.9
## 155    Quilava         45 156       2′11″      0.9         41.9       19.0
## 156 Typhlosion         45 157       5′07″      1.7        175.3       79.5
## 157   Totodile         45 158       2′00″      0.6         20.9        9.5
##      BMI Type1 Type2 Total HP Attack Defense Sp.Atk Sp.Def Speed Gen
## 152  7.9 Grass  <NA>   318 45     49      65     49     65    45   2
## 153 11.0 Grass  <NA>   405 60     62      80     63     80    60   2
## 154 31.6  Fire  <NA>   309 39     52      43     60     50    65   2
## 155 23.5  Fire  <NA>   405 58     64      58     80     65    80   2
## 156 27.5  Fire  <NA>   534 78     84      78    109     85   100   2
## 157 26.4 Water  <NA>   314 50     65      64     44     48    43   2
##     isLegendary
## 152       FALSE
## 153       FALSE
## 154       FALSE
## 155       FALSE
## 156       FALSE
## 157       FALSE

testData <- df[df$Gen == 1 | df$Gen == 3 | df$Gen == 5, ]
testData %>% head()

##         Name Catch_rate # Height (ft) Height_m Weight (lbs) Weight_kgs
## 1  Bulbasaur         45 1       2′04″      0.7         15.2        6.9
## 2    Ivysaur         45 2       3′03″      1.0         28.7       13.0
## 3   Venusaur         45 3       6′07″      2.0        220.5      100.0
## 4 Charmander         45 4       2′00″      0.6         18.7        8.5
## 5 Charmeleon         45 5       3′07″      1.1         41.9       19.0
## 6  Charizard         45 6       5′07″      1.7        199.5       90.5
##    BMI Type1  Type2 Total HP Attack Defense Sp.Atk Sp.Def Speed Gen
## 1 14.1 Grass Poison   318 45     49      49     65     65    45   1
## 2 13.0 Grass Poison   405 60     62      63     80     80    60   1
## 3 25.0 Grass Poison   525 80     82      83    100    100    80   1
## 4 23.6  Fire   <NA>   309 39     52      43     60     50    65   1
## 5 15.7  Fire   <NA>   405 58     64      58     80     65    80   1
## 6 31.3  Fire Flying   534 78     84      78    109     85   100   1
##   isLegendary
## 1       FALSE
## 2       FALSE
## 3       FALSE
## 4       FALSE
## 5       FALSE
## 6       FALSE

Our random forest will use the Total and Catch_rate columns to predict whether a Pokemon is a legendary Pokemon. We use Total and Catch_rate because we know legendaries are usually very powerful Pokemon that are difficult to catch.

set.seed(1234)

rf <- randomForest(isLegendary ~ Total + Catch_rate, importance = TRUE, mtry=2, data = trainingData, na.action=na.exclude)
rf

## 
## Call:
##  randomForest(formula = isLegendary ~ Total + Catch_rate, data = trainingData,      importance = TRUE, mtry = 2, na.action = na.exclude) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 2.26%
## Confusion matrix:
##       FALSE TRUE class.error
## FALSE   247    4  0.01593625
## TRUE      2   13  0.13333333

Now we’ll check out the performance of our random forest model on the test set:

test.labels <- testData[, "isLegendary"]
predictions <- predict(rf, testData)
table(predictions, test.labels)

##            test.labels
## predictions FALSE TRUE
##       FALSE   404    1
##       TRUE      3   16
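The table above can be condensed into a few standard metrics. This is a minimal sketch that re-enters the counts by hand (the `test.table` matrix and the metric names are our own, introduced only for illustration) and compares the result with a baseline that never predicts a legendary.

```r
# Test-set table copied from above: rows are predictions, columns are true labels.
test.table <- matrix(c(404,  1,
                         3, 16),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(predictions = c("FALSE", "TRUE"),
                                     test.labels = c("FALSE", "TRUE")))

tp <- test.table["TRUE", "TRUE"]    # legendaries correctly flagged
fp <- test.table["TRUE", "FALSE"]   # non-legendaries flagged as legendary
fn <- test.table["FALSE", "TRUE"]   # legendaries that were missed
tn <- test.table["FALSE", "FALSE"]  # non-legendaries correctly ignored

test.accuracy  <- (tp + tn) / sum(test.table)
test.precision <- tp / (tp + fp)
test.recall    <- tp / (tp + fn)

# Baseline: always predict "not legendary".
baseline.accuracy <- sum(test.table[, "FALSE"]) / sum(test.table)

round(c(accuracy = test.accuracy, precision = test.precision,
        recall = test.recall, baseline = baseline.accuracy), 3)
```

Because legendaries are rare, the always-FALSE baseline already scores about 96% accuracy, so precision and recall say more about the model than accuracy alone.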

Our random forest was able to predict 16/17 legendaries in the test set. That’s pretty good! Let’s see which legendary Pokemon we missed, our only false negative, and which Pokemon were our 3 false positives.

testData$Prediction <- predictions
testData.subset <- testData %>%
  select(Name, Total, Catch_rate, isLegendary, Prediction)

# filter for false negative
testData.subset %>%
  filter(Prediction == FALSE & isLegendary == TRUE)

##   Name Total Catch_rate isLegendary Prediction
## 1  Mew   600         45        TRUE      FALSE

# filter for false positives
testData.subset %>%
  filter(Prediction == TRUE & isLegendary == FALSE)

##        Name Total Catch_rate isLegendary Prediction
## 1    Beldum   300          3       FALSE       TRUE
## 2    Metang   420          3       FALSE       TRUE
## 3 Metagross   600          3       FALSE       TRUE

Mew was our false negative. With a Catch_rate of 45, Mew is unusually easy to catch for a legendary Pokemon, so to the model it looks more like an ordinary strong Pokemon. That is probably why the random forest didn’t flag it as legendary.

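Beldum, Metang, and Metagross are the first, second, and third evolution stages of the same Pokemon. All three have a Catch_rate of 3, which is likely why even Beldum, with a Total of only 300, was flagged. Metagross, the final evolution, has a Total of 600, which puts its battle stats on par with legendary Pokemon (it matches Mew’s Total), but it is technically not considered legendary. Metagross is one of the pseudo-legendary Pokemon.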
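To see how close calls like these look to a model, one option is to ask the forest for class vote fractions instead of hard labels, using `predict(..., type = "prob")` from the randomForest package. The sketch below trains on made-up data, not on our Pokedex: `toy.train`, `toy.rf`, and `borderline` are hypothetical, and the simulated stats are only loosely modeled on the rows printed above.

```r
library(randomForest)
set.seed(1234)

# Made-up training data: 60 ordinary Pokemon and 10 legendary-like ones.
toy.train <- data.frame(
  Total       = c(round(runif(60, 200, 550)), round(runif(10, 580, 700))),
  Catch_rate  = c(sample(c(45, 90, 190, 255), 60, replace = TRUE), rep(3, 10)),
  isLegendary = factor(c(rep(FALSE, 60), rep(TRUE, 10)))
)

toy.rf <- randomForest(isLegendary ~ Total + Catch_rate,
                       data = toy.train, mtry = 2)

# Two borderline cases: high Total with a high vs. a low catch rate.
borderline <- data.frame(
  Name       = c("Mew-like", "Metagross-like"),
  Total      = c(600, 600),
  Catch_rate = c(45, 3)
)

# Fraction of trees voting for each class, one row per Pokemon.
vote.fractions <- predict(toy.rf, borderline[, c("Total", "Catch_rate")],
                          type = "prob")
rownames(vote.fractions) <- borderline$Name
vote.fractions
```

On the real model, the same call with `rf` and `testData` would show whether Mew was a narrow miss or a confident one.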

Overall, our random forest proved to be good at classifying legendaries based on the Total and Catch_rate.

CMSC320 Final Project