Python for Data Science and Machine Learning

This document provides basic tools and algorithms for data science and machine learning tasks in Python.

Andrew L. Mackey

Overview

Python is a relatively simple language for general programming. It also has a rich set of libraries for data science and machine learning tasks, which makes it a practical choice for machine learning work.

Splitting Data for Training and Testing

The following will split the X and y variables into training and testing splits of 80% and 20%, respectively.

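from sklearn.model_selection import train_test_split
import pandas as pd

# Load the features and the labels (the paths are placeholders).
X = pd.read_csv("/path/to/xdata")
# Assumes the label file holds a single column; squeeze it into a 1-D Series.
y = pd.read_csv("/path/to/ydata").squeeze("columns")

# Hold out 20% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)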
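
As a quick sanity check of the 80/20 split, the sketch below uses a synthetic dataset from make_classification as a stand-in for the CSV files (an assumption, since the paths above are placeholders) and prints the resulting shapes. The names X_demo, y_demo, X_tr, and X_te are illustrative, and the stratify argument is an optional addition that keeps the class proportions similar in both splits.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the CSV data: 100 samples, 5 features, 2 classes.
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=42)

# Same 80/20 split as above; stratify=y_demo is an optional extra that
# preserves the class balance in the training and testing splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)  # (80, 5) (20, 5)
```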

Support Vector Machine (SVM) Classifier

Support Vector Machine (SVM) classifiers provide reasonable accuracy for datasets where there is a clear margin of separation between the classes.

Hyperparameter Options

Kernel - transforms the dataset into a different form in which the classes are easier to separate (e.g. linear, polynomial, radial basis function (RBF), etc.)
Regularization - the C parameter sets the penalty for misclassified training points; a smaller C tolerates more misclassification and yields a hyperplane with a larger margin, whereas a larger C penalizes misclassification more heavily and yields a hyperplane with a smaller margin.
Gamma - larger values of gamma cause the algorithm to fit the training data more exactly (risking overfitting), whereas smaller values force it to fit the training data less exactly; gamma applies to the RBF, polynomial, and sigmoid kernels and is ignored by the linear kernel.
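
One common way to choose among these options is a cross-validated grid search. The following is a minimal sketch, assuming a synthetic dataset and an arbitrary, illustrative grid of candidate values; neither comes from this document, and the grid is not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Candidate values for the three hyperparameters described above.
# gamma is ignored when kernel="linear", so those combinations are redundant.
param_grid = {
    "kernel": ["linear", "rbf"],
    "C": [0.1, 1, 10],
    "gamma": ["scale", 0.01, 0.1],
}

# 5-fold cross-validation on the training split selects the best combination.
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Held-out accuracy:", search.score(X_te, y_te))
```

Because the search only sees the training split, the held-out accuracy remains an honest estimate for the selected model.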

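from sklearn import svm
from sklearn import metrics

# Train a linear-kernel SVM on the training split.
model = svm.SVC(kernel='linear')
model.fit(X_train, y_train)

# Predict labels for the held-out test split.
y_pred = model.predict(X_test)

# precision_score and recall_score assume binary labels by default;
# pass an average argument (e.g. average='macro') for multiclass data.
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred))
print("Recall:", metrics.recall_score(y_test, y_pred))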
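
With their default settings, precision_score and recall_score expect binary labels and raise an error on multiclass targets. The sketch below shows one way to handle that case, assuming the three-class iris dataset bundled with scikit-learn as stand-in data and macro averaging as the aggregation choice; other averaging modes may suit a given problem better.

```python
from sklearn import metrics, svm
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Three-class stand-in dataset (an assumption for illustration).
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf = svm.SVC(kernel="linear")
clf.fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

# average="macro" computes each metric per class and takes the unweighted mean.
print("Accuracy:", metrics.accuracy_score(y_te, y_hat))
print("Precision (macro):", metrics.precision_score(y_te, y_hat, average="macro"))
print("Recall (macro):", metrics.recall_score(y_te, y_hat, average="macro"))

# Per-class breakdown of precision, recall, and F1.
print(metrics.classification_report(y_te, y_hat))
```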