R Programming and Statistics for Data Science

This document provides an overview of basic statistical tools for data science.

Andrew L. Mackey

Overview

Statistics and probability are important tools for the analysis of data.

[Image: rotating regression figure]

Basic Statistics

Sample Mean $$ \bar{x} = \dfrac{1}{n} \sum\limits_{i=1}^{n} x_i$$
data <- c(1,2,3,4,5)
mean(data)
sum(data)/length(data)
Sample Variance $$ s^2 = \dfrac{1}{n-1} \sum\limits_{i=1}^{n} (x_i - \bar{x})^2 $$
var(data)
sum( (data - mean(data))^2 ) / (length(data) - 1)
Sample Standard Deviation $$ s = \sqrt{s^2}$$
sd(data)
sqrt( var(data) )

 

Linear Regression

$$ \begin{align} y &= \mathbf{x} \beta + \epsilon \\[1em] &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \epsilon \end{align} $$

To estimate the coefficients, we can use the least-squares approach:

$$ \begin{align} \hat{\beta} &= (\mathbf{x}^T \mathbf{x})^{-1} \, \mathbf{x}^T \mathbf{y} \end{align} $$
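
As a brief sketch of where this estimator comes from: least squares chooses the coefficients that minimize the residual sum of squares. Writing that criterion as S (a symbol introduced here for illustration) and assuming the matrix of cross-products is invertible, setting the gradient to zero gives the normal equations, which can then be solved for the estimate:

$$ \begin{align} S(\beta) &= (\mathbf{y} - \mathbf{x}\beta)^T (\mathbf{y} - \mathbf{x}\beta) \\[1em] \dfrac{\partial S}{\partial \beta} &= -2 \, \mathbf{x}^T (\mathbf{y} - \mathbf{x}\beta) = 0 \\[1em] \mathbf{x}^T \mathbf{x} \, \hat{\beta} &= \mathbf{x}^T \mathbf{y} \end{align} $$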

Given a design matrix x (with a leading column of ones for the intercept) and a response vector y, we can use the following code in R to estimate the coefficients manually.

	betahat <- solve( t(x) %*% x ) %*% t(x) %*% y
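
To see the manual estimate in action, here is a minimal sketch on simulated data. The sample size, the true coefficients (2, 3, -1.5), and the variable names `x1`, `x2`, and `fit` are all assumptions made for illustration, not values from this document. The result is compared against R's built-in `lm()`.

```r
# Simulated data (hypothetical values, for illustration only)
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 3 * x1 - 1.5 * x2 + rnorm(n, sd = 0.5)

# Design matrix: a leading column of ones provides the intercept term
x <- cbind(1, x1, x2)

# Manual least-squares estimate via the normal equations
betahat <- solve( t(x) %*% x ) %*% t(x) %*% y

# Built-in estimate for comparison
fit <- lm(y ~ x1 + x2)

# The two sets of estimates should agree up to floating-point error
all.equal(as.vector(betahat), unname(coef(fit)))
```

The explicit inverse mirrors the formula and is fine for small, well-conditioned problems. In practice `lm()` is usually preferable, since it relies on a QR decomposition rather than inverting the cross-product matrix directly.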