Preface
The examples in this preface are based on OpenIntro Statistics (Diez, Çetinkaya-Rundel, Barr), Sections 9.4 and 9.5, which provide more background information. You can access the book for free at https://www.openintro.org/book/os/
The main goal of data science is to understand data. Broadly speaking, this involves building a statistical model for predicting, or estimating, a response variable based on one or more predictors. Such models are used in a wide variety of fields such as finance, medicine, public policy, sports, and so on. We will look at a couple of examples.
Examples
Example 1: Mario Kart Auction Prices
In this first example, we will look at eBay auctions of a video game called Mario Kart that is played on the Nintendo Wii. We want to predict the price of an auction based on whether the game is new or used, whether the auction’s main photo is a stock photo, the duration of the auction in days, and the number of Wii wheels included with the auction.
A model that we can use for this example is the linear regression model:
library(openintro)
Data<-mariokart
##fit model
result<-lm(total_pr~cond+stock_photo+duration+wheels, data=Data)
Generally speaking, a linear regression equation takes the following form:
\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_k x_k \]
where \(\hat{y}\) denotes the predicted value of the response variable, the price of the auction in this example, and \(x_1, x_2, \cdots, x_k\) denote the values of the predictors. In this example, we have: \(x_1\) for whether the game is new or used, \(x_2\) for whether the auction’s main photo is a stock photo, \(x_3\) for the duration of the auction in days, and \(x_4\) for the number of Wii wheels included with the auction. \(\hat{\beta}_0, \hat{\beta}_1, \cdots, \hat{\beta}_k\) represent the estimated regression parameters. If we know these values, we can easily plug in the values of the predictors to obtain the predicted price of the auction.
Fitting the model in R, we obtain the estimated regression parameters:
##get estimated regression parameters
result
##
## Call:
## lm(formula = total_pr ~ cond + stock_photo + duration + wheels,
## data = Data)
##
## Coefficients:
## (Intercept) condused stock_photoyes duration wheels
## 43.5201 -2.5816 -6.7542 0.3788 9.9476
so we have:
\[ \hat{y} = 43.5201 - 2.5816 x_1 - 6.7542 x_2 + 0.3788 x_3 + 9.9476 x_4 \]
So for an auction of a used copy of the game, with a stock photo as the main photo, listed for 2 days, and including 0 wheels (so \(x_1 = 1\), \(x_2 = 1\), \(x_3 = 2\), and \(x_4 = 0\)), the predicted price is \(\hat{y} = 43.5201 - 2.5816 - 6.7542 + 0.3788 \times 2 = 34.94\), or about 35 dollars.
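As a quick check, the same prediction can be obtained with the predict() function. The sketch below assumes the factor coding used in the mariokart data (cond coded as "new"/"used" and stock_photo as "no"/"yes"):
##check the prediction with predict()
##(assumes cond is coded "new"/"used" and stock_photo "no"/"yes" in mariokart)
new_auction<-data.frame(cond="used", stock_photo="yes", duration=2, wheels=0)
predict(result, newdata=new_auction)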
Example 2: Job Application Callback Rates {#eg2}
In this example, we look at data from an experiment that sought to evaluate the effect of race and gender on job application callback rates. For the experiment, researchers submitted fake resumes to job postings in Boston and Chicago to see which resumes resulted in a callback. The fake resumes included relevant information such as the applicant’s educational attainment and how many years of experience the applicant had, as well as a first and last name. The names on the fake resumes were meant to imply the applicant’s race and gender. Only two races (Black or White) and two genders (Male or Female) were considered for the experiment.
Prior to the experiment, the researchers conducted surveys to check for racial and gender associations for the names on the fake resumes; only names that passed a certain threshold from the surveys were included in the experiment.
A model that can be used in this example is the logistic regression model
Data2<-resume
##fit model
result2<-glm(received_callback~job_city + college_degree+years_experience+race+gender, family="binomial", data=Data2)
Generally speaking, a logistic regression equation takes the following form
\[ \log (\frac{\hat{\pi}}{1-\hat{\pi}}) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_k x_k \]
where \(\hat{\pi}\) denotes the predicted probability that the applicant receives a callback, and \(x_1, x_2, \cdots, x_k\) denote the values of the predictors. In this example, we have: \(x_1\) for the city in which the job is posted, \(x_2\) for whether the applicant has a college degree or not, \(x_3\) for the years of experience of the applicant, \(x_4\) for the race associated with the applicant’s name, and \(x_5\) for the gender associated with the applicant’s name. Similar to linear regression, \(\hat{\beta}_0, \hat{\beta}_1, \cdots, \hat{\beta}_k\) represent the estimated regression parameters. If we know these values, we can easily plug in the values of the predictors to obtain the predicted probability that an applicant with those characteristics receives a callback.
Fitting the model in R, we obtain the estimated regression parameters
##get estimated regression parameters
result2
##
## Call: glm(formula = received_callback ~ job_city + college_degree +
## years_experience + race + gender, family = "binomial", data = Data2)
##
## Coefficients:
## (Intercept) job_cityChicago college_degree years_experience
## -2.63974 -0.39206 -0.06550 0.03152
## racewhite genderm
## 0.44299 -0.22814
##
## Degrees of Freedom: 4869 Total (i.e. Null); 4864 Residual
## Null Deviance: 2727
## Residual Deviance: 2680 AIC: 2692
so we have
\[ \log (\frac{\hat{\pi}}{1-\hat{\pi}}) = -2.63974 - 0.39206 x_1 - 0.0655 x_2 + 0.03152 x_3 + 0.44299 x_4 - 0.22814 x_5 \]
So for an applicant in Boston, who has a college degree, has 10 years of experience, and has a name associated with being a Black male (so \(x_1 = 0\), \(x_2 = 1\), \(x_3 = 10\), \(x_4 = 0\), and \(x_5 = 1\)), the logistic regression equation becomes \(\log (\frac{\hat{\pi}}{1-\hat{\pi}}) = -2.63974 - 0.0655 + 0.03152 \times 10 - 0.22814 = -2.61818\). Solving for the probability, we get \(\hat{\pi} = \frac{e^{-2.61818}}{1+e^{-2.61818}} = 0.06797751\). Such an applicant has about a 6.8 percent chance of receiving a callback.
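Again, we can check this with predict(), now on the probability scale. This sketch assumes the variable coding used in the resume data (job_city coded as "Boston"/"Chicago", college_degree as 0/1, race as "black"/"white", and gender as "f"/"m"):
##check the predicted probability with predict()
##(assumes the resume data coding described above)
new_applicant<-data.frame(job_city="Boston", college_degree=1, years_experience=10, race="black", gender="m")
predict(result2, newdata=new_applicant, type="response")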
How were Estimated Parameters Calculated?
In the two examples, notice how I used some R functions, supplied the names of the variables, and the functions generated the values of the estimated parameters. One thing you will learn is how these functions actually calculate those numbers. It turns out that these calculations are based on foundational concepts associated with measures of uncertainty, probability, and expected values. We will be learning about these concepts in this class.
Why do we want to know how these calculations are performed? So that we understand the intuition and logic behind how these models are built. It becomes a lot easier to work with these models when we understand their logic (for example, we know when these models can be used or cannot be used, we know what steps to take when we notice our data have certain characteristics, etc), instead of memorizing a bunch of steps.
When we present models and data to others, they may occasionally question our methods and models. Why should we trust the model? Should we trust numbers that seem to come out of some black box?
Notice we used two different models, linear regression and logistic regression, for examples 1 and 2. Why did we use these models? Could we have swapped the type of model used in these examples? The answer is actually no. Knowing the foundational concepts of these models will help explain why certain models can be used in certain situations.
The Course: Understanding Uncertainty
As mentioned in the previous section, we will be learning about foundational concepts associated with measures of uncertainty, probability, and expected values. All of these concepts will then help explain the intuition and how statistical models are built.
At the end of the course, we will apply these concepts and revisit the linear regression model. Linear regression is one of the most widely used models in data science, as it is relatively easy to understand and explain. More modern methods (that you will learn about in future classes), such as decision trees and neural networks, can be viewed as extensions of the linear regression model.
Background Needed for the Course
Mathematical Background
We will be covering the basic mathematics associated with probability and measures of uncertainty, so that you know what these are measuring.
Some knowledge of basic calculus is helpful (i.e. the idea behind differentiation as it relates to optimizing a function, and the idea that integrating a function corresponds to the area under the function), but you will not be performing these by hand, other than the occasional first-order derivative that is simpler than any you have seen in a calculus class.
Being comfortable with matrix notation will also be helpful.
You should also be comfortable with notation regarding sums and products. For example, know what \(\sum_{i=1}^n x_i\) and \(\prod_{i=1}^n x_i\) represent.
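As a reminder, these expand as
\[ \sum_{i=1}^n x_i = x_1 + x_2 + \cdots + x_n \qquad \text{and} \qquad \prod_{i=1}^n x_i = x_1 \times x_2 \times \cdots \times x_n. \]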
R Background
These skills should have been covered in the R Boot Camp or in the programming proficiency test.
- Control structures, especially for() loops.
- Data wrangling.
- Data visualization.
Note that I primarily use base R functions to perform data wrangling and data visualization, although you could perform the same operations using tidyverse functions (dplyr and ggplot2 functions are part of the tidyverse).
Practitioners of data science should be comfortable working with both base R and tidyverse functions. I highly discourage learning only one of the two:
- You will likely have to work with other people, so you may need to be comfortable using either base R or tidyverse functions.
- You will find that you perform some tasks faster with base R, and some faster with tidyverse functions.
- Packages are constantly being added. Depending on the author of the package, the package may only work with base R or only with tidyverse functions. For now, I will say that most packages work with base R functions, but this may change in the future.
The code to produce visualizations with base R tends to be a lot shorter than with ggplot2 functions. I find that base R provides good enough visualizations, especially if all that is needed are some basic visuals for exploration, although ggplot2 functions can produce visualizations that are more complex and beautiful.
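As a quick illustration, here is a sketch of the same scatterplot of auction price against the number of wheels from the mariokart data, first with base R and then with ggplot2 (assuming the ggplot2 package is installed):
##scatterplot with base R
plot(Data$wheels, Data$total_pr, xlab="Wheels", ylab="Total price")
##the same scatterplot with ggplot2
library(ggplot2)
ggplot(Data, aes(x=wheels, y=total_pr))+
  geom_point()+
  labs(x="Wheels", y="Total price")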
You should feel free to use whichever approach you prefer, as long as it performs the needed tasks. You can even use a mix of both approaches if you choose to do so.
Reporting issues with this book
If you find any issues (typos, formatting, etc), please report them at https://github.com/jwooSDS/uncertainty/issues. Please be as specific as you can, including providing the specific section and paragraph where the issue is found.