R Programming and Statistics for Data Science

This document provides an overview of basic statistical tools for data science.

Andrew L. Mackey

Overview

Statistics and probability are important tools for the analysis of data.

Basic Statistics

Sample Mean
$$\bar{x} = \dfrac{1}{n} \sum\limits_{i=1}^{n} x_i$$

	data <- c(1, 2, 3, 4, 5)
	mean(data)                  # built-in sample mean
	sum(data) / length(data)    # same value, computed directly from the formula

Sample Variance
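$$s^2 = \dfrac{1}{n-1} \sum\limits_{i=1}^{n} (x_i - \bar{x})^2$$

	var(data)                                              # built-in sample variance (n - 1 denominator)
	sum( (data - mean(data))^2 ) / (length(data) - 1)      # same value, computed directly from the formula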

Sample Standard Deviation
$$s = \sqrt{s^2}$$

	sd(data)             # built-in sample standard deviation
	sqrt( var(data) )    # same value: square root of the sample variance
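
As a worked check of the three formulas above, take the sample 1, 2, 3, 4, 5. The mean is 3, the deviations from the mean are -2, -1, 0, 1, 2, their squares sum to 10, and dividing by n - 1 = 4 gives a sample variance of 2.5. The sketch below repeats that arithmetic step by step and compares each result with R's built-in functions; the intermediate names (`xbar`, `s2`, `s`) are illustrative only.

```r
# Sample used throughout this section
data <- c(1, 2, 3, 4, 5)
n <- length(data)

xbar <- sum(data) / n                     # sample mean: 15 / 5
devs <- data - xbar                       # deviations from the mean
s2   <- sum(devs^2) / (n - 1)             # sample variance: 10 / 4
s    <- sqrt(s2)                          # sample standard deviation

# Each hand-computed value should agree with the built-in function
all.equal(xbar, mean(data))
all.equal(s2, var(data))
all.equal(s, sd(data))
```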


Linear Regression

Linear regression models assume that a linear relationship exists between a response variable ( $$y$$ ) and its predictor (or independent) variables ( $$\mathbf{x}$$ ), up to an error term $$\epsilon$$:

\begin{align} y &= \mathbf{x} \beta + \epsilon \\[1em] &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \epsilon \end{align}

To estimate the coefficients, we can use the least-squares approach:

\begin{align} \hat{\beta} &= (\mathbf{x}^T \mathbf{x})^{-1} \, \mathbf{x}^T \mathbf{y} \end{align}

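We can use the following code in R to estimate the coefficients manually, where x is the design matrix (including a leading column of ones for the intercept) and y is the vector of observed responses.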

	betahat <- solve( t(x) %*% x ) %*% t(x) %*% y
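
The one-line estimate above needs a design matrix and a response vector to run. The following is a minimal sketch under stated assumptions: the data values are made up for illustration, there is a single predictor named `x1`, and a column of ones is prepended so that the first coefficient plays the role of the intercept. It also compares the manual estimate with R's built-in `lm()`, which should give the same coefficients.

```r
# Hypothetical data (invented for illustration): one predictor and a response
x1 <- c(1, 2, 3, 4, 5)
y  <- c(5.1, 7.9, 11.2, 13.8, 17.0)

# Design matrix: leading column of ones (intercept) followed by the predictor
x <- cbind(1, x1)

# Least-squares estimate, exactly as in the formula
betahat <- solve( t(x) %*% x ) %*% t(x) %*% y

# Equivalent form that avoids forming the explicit inverse
betahat2 <- solve(crossprod(x), crossprod(x, y))

# Built-in fit for comparison; coef(fit) holds the intercept and the slope
fit <- lm(y ~ x1)

all.equal(as.numeric(betahat), unname(coef(fit)))
all.equal(as.numeric(betahat), as.numeric(betahat2))
```

Solving the linear system with `solve(a, b)` rather than inverting the matrix first is generally the more numerically stable choice, although both give the same answer on well-conditioned data like this.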

Linear Regression Model Predictions

Once we obtain the $$\beta$$ weights for our model (either by solving for them in closed form or through gradient descent), the model can estimate (predict) values for given input data:

$$h_\beta(\mathbf{x}) = \hat{y} = \mathbf{x}\beta$$

The hypothesis function $$h_\beta(\mathbf{x})$$ (outputs are often denoted as $$\hat{y}$$) accepts some record $$\mathbf{x} = \begin{bmatrix} 1& x_1 & x_2 & ... & x_p \end{bmatrix}$$ and multiplies it by the corresponding weights $$\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}$$.
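
The prediction step can be sketched in R as a single matrix product. This example assumes the same made-up data as a simple one-predictor model (the values and the names `x1`, `x_new`, and `y_new` are illustrative, not from the original), estimates the weights in closed form, and then predicts for a new record whose first entry is the constant 1. The result is compared with `predict()` on an equivalent `lm()` fit.

```r
# Hypothetical training data (invented for illustration)
x1 <- c(1, 2, 3, 4, 5)
y  <- c(5.1, 7.9, 11.2, 13.8, 17.0)
x  <- cbind(1, x1)                       # design matrix with intercept column

# Closed-form least-squares weights
betahat <- solve( t(x) %*% x ) %*% t(x) %*% y

# Fitted values for the training records: one row of x per prediction
yhat <- x %*% betahat

# A new record [1, x_1] with x_1 = 6; the leading 1 multiplies the intercept
x_new <- matrix(c(1, 6), nrow = 1)
y_new <- x_new %*% betahat

# Cross-check against the built-in model and its predict() method
fit <- lm(y ~ x1)
all.equal(as.numeric(yhat), unname(fitted(fit)))
all.equal(as.numeric(y_new), unname(predict(fit, newdata = data.frame(x1 = 6))))
```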