Chapter 3 Data transformation

For our scoresheet data, when we were done tabulating all the statistics for all the players in Excel, we exported each tab into a CSV file and then read those CSV files into R Studio. We made one data frame to hold the hitting statistics and one data frame to hold the pitching statistics.

There was some minor data cleanup to do for these data frames. There were many cells in Excel that were purposely left empty. For example, if a hitter never had a homerun in APBA baseball, that cell would have been left blank during the tabulation process, but the blank cell really represented a “0”. When you import into R Studio from a CSV however, these blanks cells turn into “NA”s, so we needed to use R Studio commands to turn these NAs into the 0s they represent.

3.1 Web Scrapped Batting Table

We need to add the following calculated fields into the web scrapped batting data to better measure the performance of the hitters:

\(\text{AVG}=\dfrac{\text{hits(H)}}{\text{at bats(AB)}}\)
\(\text{SLG}=\dfrac{\text{first base(1B)}+\text{second base(2B)}\times2+\text{third base(3B)}\times3+\text{home runs(HR)}\times4}{\text{at bats(AB)}}\)
\(\text{OBA}=\dfrac{\text{hits(H)}+\text{bases on balls(BB)}+\text{times hit by a pitch(HBP)}}{\text{at bats(AB)}+\text{bases on balls(BB)}+\text{times hit by a pitch(HBP)}}\)
\(\text{SAC}=\text{sacrifice hits(SH)}+\text{sacrifice flies(SF)}\)

The resulting real-life batting table looks like:

## tibble [211 x 21] (S3: tbl_df/tbl/data.frame)
##  $ playerID: chr [1:211] "abbotku01" "abreubo01" "aguilri01" "alfoned01" ...
##  $ name    : chr [1:211] "Kurt Abbott" "Bobby Abreu" "Rick Aguilera" "Edgardo Alfonzo" ...
##  $ hits    : chr [1:211] "Right" "Left" "Right" "Right" ...
##  $ AVG     : num [1:211] 0.263 0.312 NaN 0.278 0.274 ...
##  $ SLG     : num [1:211] 0.418 0.497 NaN 0.427 0.425 ...
##  $ OBA     : num [1:211] 0.312 0.411 NaN 0.357 0.375 ...
##  $ G       : num [1:211] 77 151 5 144 101 159 73 92 56 1 ...
##  $ AB      : num [1:211] 194 497 0 557 259 584 134 107 133 3 ...
##  $ R       : num [1:211] 26 68 0 94 51 104 25 7 17 0 ...
##  $ H       : num [1:211] 51 155 0 155 71 182 37 20 39 0 ...
##  $ RBI     : num [1:211] 24 74 0 78 33 124 4 10 16 0 ...
##  $ 2B      : num [1:211] 13 29 0 28 15 34 6 5 8 0 ...
##  $ 3B      : num [1:211] 1 6 0 2 3 5 0 0 0 0 ...
##  $ HR      : num [1:211] 5 17 0 17 6 38 1 1 1 0 ...
##  $ SAC     : num [1:211] 4 8 0 5 7 6 1 4 2 0 ...
##  $ SB      : num [1:211] 2 19 0 8 4 11 11 0 2 0 ...
##  $ CS      : num [1:211] 1 10 0 3 3 3 1 0 0 0 ...
##  $ SO      : num [1:211] 53 133 0 77 40 87 24 15 18 2 ...
##  $ BB      : num [1:211] 12 84 0 65 37 84 13 6 13 0 ...
##  $ HBP     : num [1:211] 2 0 0 3 5 5 1 0 1 0 ...
##  $ TB      : num [1:211] 81 247 0 238 110 340 46 28 50 0 ...

3.2 Web Scrapped Pitching Table

Then we need to add the following calculated fields into the web scrapped pitching data to better measure the performance of the pitchers:

\(\text{AVG_against}=\dfrac{\text{hits(H)}}{\text{hits(H)}+\text{innings pitched(IP)}\times3}\)

The resulting real-life pitching table looks like:

## tibble [83 x 19] (S3: tbl_df/tbl/data.frame)
##  $ playerID   : chr [1:83] "aguilri01" "arrojro01" "ashbyan01" "assenpa01" ...
##  $ name       : chr [1:83] "Rick Aguilera" "Rolando Arrojo" "Andy Ashby" "Paul Assenmacher" ...
##  $ throws     : chr [1:83] "Right" "Right" "Right" "Left" ...
##  $ ERA        : num [1:83] 4.24 3.56 3.34 3.26 5.02 7.29 3.02 5.15 4.27 3.23 ...
##  $ W          : num [1:83] 4 14 17 2 10 1 3 3 14 4 ...
##  $ L          : num [1:83] 9 12 9 5 7 10 4 9 14 8 ...
##  $ SV         : num [1:83] 38 0 0 3 0 8 51 0 0 1 ...
##  $ AVG_against: num [1:83] 0.252 0.243 0.247 0.277 0.257 ...
##  $ G          : num [1:83] 68 32 33 69 34 62 81 21 34 40 ...
##  $ GS         : num [1:83] 0 32 33 0 23 0 0 21 34 0 ...
##  $ CG         : num [1:83] 0 2 5 0 0 0 0 0 2 0 ...
##  $ IP         : num [1:83] 74.1 202 226.2 47 123.2 ...
##  $ ER         : num [1:83] 35 80 84 17 69 61 27 67 111 22 ...
##  $ H          : num [1:83] 75 195 223 54 128 100 86 126 247 46 ...
##  $ SO         : num [1:83] 57 152 151 43 57 68 81 113 130 57 ...
##  $ BB         : num [1:83] 15 65 58 19 64 26 20 63 73 28 ...
##  $ HBP        : num [1:83] 1 19 7 1 4 1 2 4 7 1 ...
##  $ WP         : num [1:83] 1 3 7 0 7 4 2 8 6 3 ...
##  $ BK         : num [1:83] 0 1 0 0 0 0 0 0 1 0 ...

3.3 Cleaning the Combined Data

Once real data was scraped into CSV files (again, one file for hitting and one for pitching), these were also put into their own data frames. We then consolidated the information from the game hitting and real hitting data frames into the same data frame such that every row in the consolidated hitting data frame had all the relevant information for the same player. For example, the first row of the hitting data frame had both game and real hitting data for “Derek Jeter”.

We then proceeded to remove all players from the consolidated hitting data frame that had either 0 at-bats (AB) in the game or 0 AB in real life. For these players, we had no good way to analyze or evaluate how well APBA was representing them.

For players with 0 AB in the game, there was nothing to compare against real life. In the game, these players were mainly relief pitchers. Relief pitchers do hit, but it’s rare, and such players didn’t pitch in enough games to eventually get the chance to hit.

For players with 0 AB in real life, there was nothing to compare game data to. In real life, these players were all pitchers in the American League since, in the American League, pitchers never hit and instead have another player designated to hit for them (a so-called “designated hitter”).

Once these rows were removed from the consolidated hitting data frame, we looked at the remaining rows (each row representing a player) and proceeded to calculate how many of each interesting statistic the player should have had in the board game in order to be completely real-life accurate. Statistics we thought were interesting were hits (H), total bases (TB), runs (R), runs batted in (RBI), doubles (2B), triples (3B), homeruns (HR), and stolen bases (SB).

Batting average (AVG) is a derived statistic equal to H / AB while slugging percentage (SLG) is a derived statistic equal to TB / AB. This means that AVG\(\times\)AB = H and SLG\(\times\)AB = TB, so, to get the number of H a player should have had in the game, we multiply his real-life AVG with the number of AB he had in the game. Similarly, to get the number of TB a player should have had in the game, we multiply his real-life SLG with the number of AB he had in the game. Here’s the lines from our code that do this (where rhdf2 is our data frame):

rhdf2$H_should_have = rhdf2$AVG * rhdf2$GameAB
rhdf2$TB_should_have = rhdf2$SLG * rhdf2$GameAB

We can make our own version of AVG and SLG for the other statistics, based on either AB or games played (G) as seemed appropriate (examples: (real-life doubles) / (real-life at-bats) and (real-life stolen bases) / (real-life games)). Once again, here’s the lines from our code:

rhdf2$R_should_have = rhdf2$R / rhdf2$G * rhdf2$GameG
rhdf2$RBI_should_have = rhdf2$RBI / rhdf2$G * rhdf2$GameG
rhdf2$"2B_should_have" = rhdf2$"2B" / rhdf2$AB * rhdf2$GameAB
rhdf2$"3B_should_have" = rhdf2$"3B" / rhdf2$AB * rhdf2$GameAB
rhdf2$HR_should_have = rhdf2$HR / rhdf2$AB * rhdf2$GameAB
rhdf2$SB_should_have = rhdf2$SB / rhdf2$G * rhdf2$GameG

Now, for each hitter, we have the expected number of H, TB, R, RBI, 2B, 3B, HR, and SB he should have had in the game and the actual number of H, TB, R, RBI, 2B, 3B, HR, and SB he had in the game. Therefore, we can find the deviation from reality for each of these statistics by taking the difference between game and expected versions as follows:

rhdf2$H_deviation = rhdf2$GameH - rhdf2$H_should_have
rhdf2$TB_deviation = rhdf2$GameTB - rhdf2$TB_should_have
rhdf2$R_deviation = rhdf2$GameR - rhdf2$R_should_have
rhdf2$RBI_deviation = rhdf2$GameRBI - rhdf2$RBI_should_have
rhdf2$"2B_deviation" = rhdf2$Game2B - rhdf2$"2B_should_have"
rhdf2$"3B_deviation" = rhdf2$Game3B - rhdf2$"3B_should_have"
rhdf2$HR_deviation = rhdf2$GameHR - rhdf2$HR_should_have
rhdf2$SB_deviation = rhdf2$GameSB - rhdf2$SB_should_have

Now for pitchers, the entire process is quite similar. We consolidated the information from the game pitching and real pitching data frames into the same data frame such that every row in the consolidated data frame had all the relevant information for the same pitcher. For example, the first row of the pitching data frame had both game and real pitching data for “Brian Bohanon”.

For pitchers, we look at the number of innings pitched (IP) or games pitched (G) in the game to determine how many of the other statistics pitchers should have. The standard baseball derived statistic, “earned run average” (ERA), can be used with IP to determine how many earned runs (ER) the player should have. Once again, we make up our own similarly derived statistics to determine how many of the other pitching statistics pitchers should have. Here’s the lines from our code:

rpdf$ER_should_have = rpdf$ERA * rpdf$GameIP / 9
rpdf$Hits_should_have = rpdf$H / rpdf$IP * rpdf$GameIP
rpdf$SO_should_have = rpdf$SO / rpdf$IP * rpdf$GameIP
rpdf$W_should_have = rpdf$W / rpdf$G * rpdf$GameG
rpdf$L_should_have = rpdf$L / rpdf$G * rpdf$GameG
rpdf$SV_should_have = rpdf$SV / rpdf$G * rpdf$GameG
rpdf$BB_should_have = rpdf$BB / rpdf$IP * rpdf$GameIP

And here’s the lines from our code for calculating the deviation for each pitcher:

rpdf$ER_deviation = rpdf$GameER - rpdf$ER_should_have
rpdf$Hits_deviation = rpdf$GameHits - rpdf$Hits_should_have
rpdf$SO_deviation = rpdf$GameSO - rpdf$SO_should_have
rpdf$W_deviation = rpdf$GameW - rpdf$W_should_have
rpdf$L_deviation = rpdf$GameL - rpdf$L_should_have
rpdf$SV_deviation = rpdf$GameSV - rpdf$SV_should_have
rpdf$BB_deviation = rpdf$GameBB - rpdf$BB_should_have