# R Programming and Statistics for Data Science

This document provides an overview of basic statistical tools for data science.

Andrew L. Mackey

#### Overview

Statistics and probability are important tools for the analysis of data.

#### Basic Statistics

###### Sample Mean
$$\bar{x} = \dfrac{1}{n} \sum\limits_{i=1}^{n} x_i$$

	data <- c(1,2,3,4,5)
	mean(data)
	sum(data) / length(data)

###### Sample Variance
$$s^2 = \dfrac{1}{n-1} \sum\limits_{i=1}^{n} (x_i - \bar{x})^2$$

	var(data)
	sum( (data - mean(data))^2 ) / (length(data) - 1)

###### Sample Standard Deviation
$$s = \sqrt{s^2}$$

	sd(data)
	sqrt( var(data) )


#### Linear Regression

Linear regression models assume some relationship exists between a response variable ( $$y$$ ) and its predictor (or independent) variables ( $$\mathbf{x}$$ ):

\begin{align} y &= \mathbf{x} \beta + \epsilon \\[1em] &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p + \epsilon \end{align}

To estimate the coefficients, we can use the least-squares approach:

$$\hat{\beta} = (\mathbf{x}^T \mathbf{x})^{-1} \, \mathbf{x}^T \mathbf{y}$$

We can use the following R code to estimate the coefficients manually, where `x` is the design matrix (including a leading column of ones for the intercept) and `y` is the response vector.

	betahat <- solve( t(x) %*% x ) %*% t(x) %*% y
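As a minimal sketch, assuming a small simulated dataset (the variable names `x1`, `fit`, and the generating coefficients are illustrative, not from the original), we can check that the manual estimate matches R's built-in `lm()`:

	# Simulated data: n = 50 observations of one predictor (hypothetical example)
	set.seed(42)
	x1 <- runif(50, 0, 10)
	y  <- 2 + 3 * x1 + rnorm(50)

	# Design matrix with an intercept column of ones
	x <- cbind(1, x1)

	# Manual least-squares estimate
	betahat <- solve( t(x) %*% x ) %*% t(x) %*% y

	# Built-in fit for comparison; the coefficients should agree
	fit <- lm(y ~ x1)
	betahat
	coef(fit)

Note that `solve(t(x) %*% x)` can be numerically unstable when predictors are highly correlated; `lm()` uses a QR decomposition internally, which avoids forming $$\mathbf{x}^T \mathbf{x}$$ explicitly.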

###### Linear Regression Model Predictions

Once we obtain the $$\beta$$ weights for our model (either by deriving them analytically or through gradient descent), we can have the model estimate or predict values for given input data:

$$h_\beta(\mathbf{x}) = \hat{y} = \mathbf{x}\beta$$

The hypothesis function $$h_\beta(\mathbf{x})$$ (outputs are often denoted as $$\hat{y}$$) accepts some record $$\mathbf{x} = \begin{bmatrix} 1& x_1 & x_2 & ... & x_p \end{bmatrix}$$ and multiplies it by the corresponding weights $$\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}$$.
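To illustrate, here is a small sketch of a prediction for a single record (the weight values and the predictor value `4` are hypothetical, chosen only for the example):

	# Hypothetical fitted weights: beta_0 = 2, beta_1 = 3
	betahat <- matrix(c(2, 3), ncol = 1)

	# New record: leading 1 for the intercept, then the predictor value x_1 = 4
	xnew <- matrix(c(1, 4), nrow = 1)

	# Prediction: yhat = x %*% beta
	yhat <- xnew %*% betahat
	yhat  # 2 + 3*4 = 14

The leading 1 in $$\mathbf{x}$$ is what pairs with the intercept $$\beta_0$$ in the matrix product; forgetting it is a common source of off-by-one dimension errors.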