This document provides an overview of basic statistical tools for data science.
Andrew L. Mackey
Statistics and probability are important tools for the analysis of data.
data <- c(1, 2, 3, 4, 5)

mean(data)      # equivalent to: sum(data) / length(data)
var(data)       # equivalent to: sum( (data - mean(data))^2 ) / (length(data) - 1)
sd(data)        # equivalent to: sqrt( var(data) )
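As a quick sanity check, the manual formulas can be compared against R's built-in functions (the sample values here are the same illustrative vector as above):

```r
# Illustrative sample
data <- c(1, 2, 3, 4, 5)

# Manual computations
m <- sum(data) / length(data)                  # sample mean
v <- sum((data - m)^2) / (length(data) - 1)    # sample variance (n - 1 denominator)
s <- sqrt(v)                                   # sample standard deviation

# Each should agree with the corresponding built-in
all.equal(m, mean(data))    # TRUE
all.equal(v, var(data))     # TRUE
all.equal(s, sd(data))      # TRUE
```

Note that `var()` and `sd()` use the sample (n − 1) denominator, which is why the manual formula divides by `length(data) - 1` rather than `length(data)`.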
The multiple linear regression model has the form
$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \epsilon $$
To estimate the coefficients, we can use the least-squares approach:
$$ \hat{\beta} = (\mathbf{x}^T \mathbf{x})^{-1} \, \mathbf{x}^T \mathbf{y} $$
We can use the following code in R to estimate the coefficients manually.
betahat <- solve( t(x) %*% x ) %*% t(x) %*% y
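This can be checked end to end on a small simulated dataset; the data, coefficients, and variable names below are illustrative, not from the text. The manual normal-equations solution should match the coefficients reported by R's `lm()`:

```r
set.seed(42)

# Hypothetical data: n = 50 observations, p = 2 predictors
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 3 * x1 - 1.5 * x2 + rnorm(n, sd = 0.5)

# Design matrix: a leading column of 1s pairs with the intercept beta_0
x <- cbind(1, x1, x2)

# Normal-equations estimate: (x'x)^(-1) x'y
betahat <- solve( t(x) %*% x ) %*% t(x) %*% y

# Compare against R's built-in least-squares fit
fit <- lm(y ~ x1 + x2)
cbind(manual = as.vector(betahat), lm = unname(coef(fit)))
```

For larger or ill-conditioned problems, `lm()` (which uses a QR decomposition) is numerically preferable to explicitly inverting \( \mathbf{x}^T \mathbf{x} \), but the manual version makes the formula concrete.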
Once we obtain the \( \beta \) weights for our model (either by deriving them analytically or through gradient descent), we can have the model predict values for given input data:
$$ h_\beta(\mathbf{x}) = \hat{y} = \mathbf{x} \beta $$
The hypothesis function \( h_\beta(\mathbf{x}) \), whose outputs are often denoted \( \hat{y} \), accepts some record \( \mathbf{x} = \begin{bmatrix} 1 & x_1 & x_2 & \dots & x_p \end{bmatrix} \) and multiplies it by the corresponding weights \( \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix} \).
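A minimal sketch of the hypothesis function in R, using made-up weights for a model with two predictors (the values are illustrative only):

```r
# Hypothetical weights: beta_0, beta_1, beta_2
beta <- c(2, 3, -1.5)

# A single record; the leading 1 pairs with the intercept beta_0
x <- c(1, 0.5, 2)

# Hypothesis function: yhat = x %*% beta (inner product of record and weights)
h <- function(x, beta) as.vector(x %*% beta)

h(x, beta)    # 2 + 3*0.5 + (-1.5)*2 = 0.5
```

The same function works unchanged when `x` is a matrix of many records (one per row), since `%*%` then performs a matrix-vector product and returns one prediction per row.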