# R Programming and Statistics for Data Science

This document provides an overview of basic statistical tools for data science.

Andrew L. Mackey

#### Overview

Statistics and probability are important tools for the analysis of data.

#### Basic Statistics

Sample Mean $$\bar{x} = \dfrac{1}{n} \sum\limits_{i=1}^{n} x_i$$

	data <- c(1, 2, 3, 4, 5)
	mean(data)                  # built-in function: 3
	sum(data) / length(data)    # computed from the definition: 3

Sample Variance $$s^2 = \dfrac{1}{n-1} \sum\limits_{i=1}^{n} (x_i - \bar{x})^2$$

	var(data)                                              # built-in function: 2.5
	sum( (data - mean(data))^2 ) / (length(data) - 1)      # computed from the definition: 2.5

Sample Standard Deviation $$s = \sqrt{s^2}$$

	sd(data)             # built-in function
	sqrt( var(data) )    # square root of the sample variance


#### Linear Regression

\begin{align} y &= \mathbf{x} \beta + \epsilon \\[1em] &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon \end{align}

To estimate the coefficients, we can use the least-squares approach:

\begin{align} \hat{\beta} &= (\mathbf{x}^T \mathbf{x})^{-1} \, \mathbf{x}^T \mathbf{y} \end{align}
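
One way to see where this estimator comes from, sketched here under the assumption that $\mathbf{x}^T \mathbf{x}$ is invertible: least squares picks the $\beta$ that minimizes the residual sum of squares (written $\text{RSS}$ below, a label introduced only for this sketch), and setting its gradient to zero gives the normal equations.

\begin{align}
\text{RSS}(\beta) &= (\mathbf{y} - \mathbf{x} \beta)^T (\mathbf{y} - \mathbf{x} \beta) \\[1em]
\dfrac{\partial \, \text{RSS}}{\partial \beta} &= -2 \, \mathbf{x}^T (\mathbf{y} - \mathbf{x} \beta) = 0 \\[1em]
\mathbf{x}^T \mathbf{x} \, \hat{\beta} &= \mathbf{x}^T \mathbf{y} \\[1em]
\hat{\beta} &= (\mathbf{x}^T \mathbf{x})^{-1} \, \mathbf{x}^T \mathbf{y}
\end{align}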

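We can use the following code in R to estimate the coefficients manually, where `x` is the design matrix (with a leading column of ones for the intercept $\beta_0$) and `y` is the vector of responses.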

	betahat <- solve( t(x) %*% x ) %*% t(x) %*% y
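
As a minimal sketch of how this estimate can be checked, the example below simulates a small data set, builds the design matrix with an intercept column, applies the closed-form expression, and compares the result with R's built-in `lm()`. The sample size, the coefficient values, and the names `x1`, `x2`, `beta`, and `fit` are assumptions made purely for illustration.

```r
set.seed(42)                          # make the simulated data reproducible

# Hypothetical data: n observations and two predictors (illustrative values only)
n    <- 100
x1   <- rnorm(n)
x2   <- rnorm(n)
beta <- c(1, 2, -0.5)                 # assumed "true" beta_0, beta_1, beta_2

# Design matrix with a leading column of ones for the intercept
x <- cbind(1, x1, x2)
y <- x %*% beta + rnorm(n, sd = 0.1)  # response with Gaussian noise

# Closed-form least-squares estimate
betahat <- solve( t(x) %*% x ) %*% t(x) %*% y

# Same model fitted with lm(); the formula adds the intercept automatically
fit <- lm(as.vector(y) ~ x1 + x2)

# The two sets of estimates should agree up to floating-point error
all.equal(as.vector(betahat), unname(coef(fit)))
```

In practice, `lm()` or `solve(crossprod(x), crossprod(x, y))` is usually preferable to forming the inverse explicitly, since inverting $\mathbf{x}^T \mathbf{x}$ can be numerically unstable when predictors are nearly collinear.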