Prepare Data for Classification


Big Ideas

  • Reproducibility puts the science in data science!

  • Docstrings are also your friends.

  • Helper functions work like plastic brick toys. ;)

Objectives

By the end of the prepare lesson and exercises, you will be able to...

  • perform a train, validate, test split on your data:
from sklearn.model_selection import train_test_split

train_validate, test = train_test_split(df, test_size=.2, 
                                        random_state=123, 
                                        stratify=df.target_column)

train, validate = train_test_split(train_validate, test_size=.3, 
                                   random_state=123, 
                                   stratify=train_validate.target_column)
  • use a SimpleImputer:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='desired_strategy')

train['column_name'] = imputer.fit_transform(train[['column_name']])
validate['column_name'] = imputer.transform(validate[['column_name']])
test['column_name'] = imputer.transform(test[['column_name']])
  • perform a simple encoding on a categorical variable:
pd.get_dummies(df.column, drop_first=True)
  • create a prepare.py module and import your functions in a new notebook or script:
from classification_acquire import get_titanic_data, get_iris_data
from classification_prepare import prep_iris, prep_titanic

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, MinMaxScaler

# ignore warnings
import warnings
warnings.filterwarnings("ignore")

from darden_class_acquire import get_titanic_data, get_iris_data

Iris Data


Use the function defined in acquire.py to load the iris data.

In [2]:
iris = get_iris_data()
iris.head()
Out[2]:
species_id species_name sepal_length sepal_width petal_length petal_width
0 1 setosa 5.1 3.5 1.4 0.2
1 1 setosa 4.9 3.0 1.4 0.2
2 1 setosa 4.7 3.2 1.3 0.2
3 1 setosa 4.6 3.1 1.5 0.2
4 1 setosa 5.0 3.6 1.4 0.2

Drop the species_id column.

In [3]:
iris = iris.drop(columns='species_id')
iris.head(2)
Out[3]:
species_name sepal_length sepal_width petal_length petal_width
0 setosa 5.1 3.5 1.4 0.2
1 setosa 4.9 3.0 1.4 0.2

Rename the species_name column to just species.

In [4]:
iris = iris.rename(columns={'species_name': 'species'})
iris.head(2)
Out[4]:
species sepal_length sepal_width petal_length petal_width
0 setosa 5.1 3.5 1.4 0.2
1 setosa 4.9 3.0 1.4 0.2

Create dummy variables of the species name.

In [5]:
species_dummies = pd.get_dummies(iris.species, drop_first=True)
species_dummies.head(3)
Out[5]:
versicolor virginica
0 0 0
1 0 0
2 0 0
In [6]:
iris = pd.concat([iris, species_dummies], axis=1)
iris.head()
Out[6]:
species sepal_length sepal_width petal_length petal_width versicolor virginica
0 setosa 5.1 3.5 1.4 0.2 0 0
1 setosa 4.9 3.0 1.4 0.2 0 0
2 setosa 4.7 3.2 1.3 0.2 0 0
3 setosa 4.6 3.1 1.5 0.2 0 0
4 setosa 5.0 3.6 1.4 0.2 0 0

Create a function named prep_iris that accepts the untransformed iris data, and returns the data with the transformations above applied.

In [7]:
def prep_iris(cached=True):
    '''
    This function acquires and prepares the iris data, reading from a local csv by default.
    Passing cached=False acquires fresh data from Codeup db and writes to csv.
    Returns the iris df with dummy variables encoding species.
    '''
    
    # use my acquire function to read data into a df from a csv file
    df = get_iris_data(cached)
    
    # drop and rename columns
    df = df.drop(columns='species_id').rename(columns={'species_name': 'species'})
    
    # create dummy columns for species
    species_dummies = pd.get_dummies(df.species, drop_first=True)
    
    # add dummy columns to df
    df = pd.concat([df, species_dummies], axis=1)
    
    return df
In [8]:
iris = prep_iris()
iris.sample(7)
Out[8]:
species sepal_length sepal_width petal_length petal_width versicolor virginica
13 setosa 4.3 3.0 1.1 0.1 0 0
60 versicolor 5.0 2.0 3.5 1.0 1 0
75 versicolor 6.6 3.0 4.4 1.4 1 0
69 versicolor 5.6 2.5 3.9 1.1 1 0
95 versicolor 5.7 3.0 4.2 1.2 1 0
131 virginica 7.9 3.8 6.4 2.0 0 1
77 versicolor 6.7 3.0 5.0 1.7 1 0

Titanic Data


Use the function you defined in acquire.py to load the titanic data set.

In [9]:
titanic = get_titanic_data()
titanic.head()
Out[9]:
passenger_id survived pclass sex age sibsp parch fare embarked class deck embark_town alone
0 0 0 3 male 22.0 1 0 7.2500 S Third NaN Southampton 0
1 1 1 1 female 38.0 1 0 71.2833 C First C Cherbourg 0
2 2 1 3 female 26.0 0 0 7.9250 S Third NaN Southampton 1
3 3 1 1 female 35.0 1 0 53.1000 S First C Southampton 0
4 4 0 3 male 35.0 0 0 8.0500 S Third NaN Southampton 1

Handle the missing values in the embark_town and embarked columns.

  • A quick check shows that only two rows are missing values in these columns, so I'll handle them by dropping those rows.
In [10]:
titanic[titanic.embark_town.isnull()]
Out[10]:
passenger_id survived pclass sex age sibsp parch fare embarked class deck embark_town alone
61 61 1 1 female 38.0 0 0 80.0 NaN First B NaN 1
829 829 1 1 female 62.0 0 0 80.0 NaN First B NaN 1
In [11]:
titanic[titanic.embarked.isnull()]
Out[11]:
passenger_id survived pclass sex age sibsp parch fare embarked class deck embark_town alone
61 61 1 1 female 38.0 0 0 80.0 NaN First B NaN 1
829 829 1 1 female 62.0 0 0 80.0 NaN First B NaN 1
In [12]:
# Use the complement operator, ~, to invert the boolean mask above and keep everything except the null values.

titanic = titanic[~titanic.embarked.isnull()]
titanic.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   passenger_id  889 non-null    int64  
 1   survived      889 non-null    int64  
 2   pclass        889 non-null    int64  
 3   sex           889 non-null    object 
 4   age           712 non-null    float64
 5   sibsp         889 non-null    int64  
 6   parch         889 non-null    int64  
 7   fare          889 non-null    float64
 8   embarked      889 non-null    object 
 9   class         889 non-null    object 
 10  deck          201 non-null    object 
 11  embark_town   889 non-null    object 
 12  alone         889 non-null    int64  
dtypes: float64(2), int64(6), object(5)
memory usage: 97.2+ KB
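
The same filter could also be written with .notna(), which some find more readable than negating .isnull(). This is just an alternative phrasing of the cell above, not an additional step:

# equivalent to titanic[~titanic.embarked.isnull()]
titanic = titanic[titanic.embarked.notna()]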

Remove the deck column.

In [13]:
titanic = titanic.drop(columns='deck')
titanic.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   passenger_id  889 non-null    int64  
 1   survived      889 non-null    int64  
 2   pclass        889 non-null    int64  
 3   sex           889 non-null    object 
 4   age           712 non-null    float64
 5   sibsp         889 non-null    int64  
 6   parch         889 non-null    int64  
 7   fare          889 non-null    float64
 8   embarked      889 non-null    object 
 9   class         889 non-null    object 
 10  embark_town   889 non-null    object 
 11  alone         889 non-null    int64  
dtypes: float64(2), int64(6), object(4)
memory usage: 90.3+ KB

Create dummy variables of the embarked column.

In [14]:
titanic_dummies = pd.get_dummies(titanic.embarked, drop_first=True)
titanic_dummies.sample(10)
Out[14]:
Q S
854 0 1
18 0 1
620 0 0
633 0 1
47 1 0
256 0 0
663 0 1
648 0 1
295 0 0
641 0 0
In [15]:
titanic = pd.concat([titanic, titanic_dummies], axis=1)
titanic.head()
Out[15]:
passenger_id survived pclass sex age sibsp parch fare embarked class embark_town alone Q S
0 0 0 3 male 22.0 1 0 7.2500 S Third Southampton 0 0 1
1 1 1 1 female 38.0 1 0 71.2833 C First Cherbourg 0 0 0
2 2 1 3 female 26.0 0 0 7.9250 S Third Southampton 1 0 1
3 3 1 1 female 35.0 1 0 53.1000 S First Southampton 0 0 1
4 4 0 3 male 35.0 0 0 8.0500 S Third Southampton 1 0 1

Split Data

In [16]:
train_validate, test = train_test_split(titanic, test_size=.2, 
                                        random_state=123, 
                                        stratify=titanic.survived)
In [17]:
train, validate = train_test_split(train_validate, test_size=.3, 
                                   random_state=123, 
                                   stratify=train_validate.survived)
In [18]:
print(f'train -> {train.shape}')
print(f'validate -> {validate.shape}')
print(f'test -> {test.shape}')
train -> (497, 14)
validate -> (214, 14)
test -> (178, 14)
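
A quick sanity check on those sizes: with 889 rows, test_size=.2 holds out about 178 rows for test, leaving 711 in train_validate; test_size=.3 of those 711 is about 214 for validate, leaving 497 for train.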

Create an MVP function to split the titanic data

In [19]:
def titanic_split(df):
    '''
    This function performs a train, validate, test split on the titanic data, stratifying on survived.
    Returns train, validate, and test dfs.
    '''
    train_validate, test = train_test_split(df, test_size=.2,
                                            random_state=123,
                                            stratify=df.survived)
    train, validate = train_test_split(train_validate, test_size=.3,
                                       random_state=123,
                                       stratify=train_validate.survived)
    return train, validate, test
In [20]:
train, validate, test = titanic_split(titanic)
In [21]:
print(f'train -> {train.shape}')
print(f'validate -> {validate.shape}')
print(f'test -> {test.shape}')
train -> (497, 14)
validate -> (214, 14)
test -> (178, 14)
In [22]:
train.head(2)
Out[22]:
passenger_id survived pclass sex age sibsp parch fare embarked class embark_town alone Q S
583 583 0 1 male 36.0 0 0 40.125 C First Cherbourg 1 0 0
337 337 1 1 female 41.0 0 0 134.500 C First Cherbourg 1 0 0

Fill the missing values in age.

  • The way you fill these values is up to you. Consider the tradeoffs of different methods; a couple of alternatives are sketched below before I settle on the mean.
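
Before committing to the mean, here is a minimal sketch of a couple of alternative SimpleImputer strategies. These are standard sklearn options; which one is appropriate depends on how skewed the age column is and whether outliers are a concern:

# median is more robust to outliers in a skewed column
imputer = SimpleImputer(strategy='median')

# most_frequent (the mode) also works for categorical columns
imputer = SimpleImputer(strategy='most_frequent')

Either object would be fit on train and used to transform validate and test exactly as shown below with the mean.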
In [23]:
# Create the imputer object.

imputer = SimpleImputer(strategy='mean')
In [24]:
# Fit the imputer to train and transform.

train['age'] = imputer.fit_transform(train[['age']])
In [25]:
# quick check

train['age'].isnull().sum()
Out[25]:
0
In [26]:
# Transform the validate and test df age columns

validate['age'] = imputer.transform(validate[['age']])
test['age'] = imputer.transform(test[['age']])

Build a helper function for imputing

In [27]:
def impute_mean_age(train, validate, test):
    '''
    This function imputes the mean of the age column into
    observations with missing values.
    Returns transformed train, validate, and test dfs.
    '''
    # create the imputer object with mean strategy
    imputer = SimpleImputer(strategy='mean')
    
    # fit on and transform age column in train
    train['age'] = imputer.fit_transform(train[['age']])
    
    # transform age column in validate
    validate['age'] = imputer.transform(validate[['age']])
    
    # transform age column in test
    test['age'] = imputer.transform(test[['age']])
    
    return train, validate, test
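
A quick way to sanity-check the helper is to call it the same way the cell-by-cell version was run above (shown here as a sketch; prep_titanic below calls it identically):

train, validate, test = impute_mean_age(train, validate, test)
train.age.isnull().sum(), validate.age.isnull().sum(), test.age.isnull().sum()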

Create a function named prep_titanic

  • It should accept the untransformed titanic data and return the data with the transformations above applied.
In [28]:
def prep_titanic(cached=True):
    '''
    This function acquires and prepares the titanic data, reading from a local csv by default.
    Returns prepped train, validate, and test dfs.
    '''
    # use my acquire function to read data into a df from a csv file
    df = get_titanic_data(cached)
    
    # drop rows where embarked/embark town are null values
    df = df[~df.embarked.isnull()]
    
    # encode embarked using dummy columns
    titanic_dummies = pd.get_dummies(df.embarked, drop_first=True)
    
    # join dummy columns back to df
    df = pd.concat([df, titanic_dummies], axis=1)
    
    # drop the deck column
    df = df.drop(columns='deck')
    
    # split data into train, validate, test dfs
    train, validate, test = titanic_split(df)
    
    # impute mean of age into null values in age column
    train, validate, test = impute_mean_age(train, validate, test)
    
    return train, validate, test
In [29]:
train, validate, test = prep_titanic()
In [30]:
print(f'train -> {train.shape}')
print(f'validate -> {validate.shape}')
print(f'test -> {test.shape}')
train -> (497, 14)
validate -> (214, 14)
test -> (178, 14)

Test Functions in Testing Notebook

If my MVP functions work properly, and I have time, I can go back and generalize my helper functions so they can be reused in the future. Most importantly, I have functions that I built step by step, testing the code along the way and confirming that I can import and use them from my module in another notebook.
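
A minimal sketch of that testing notebook, assuming the functions above are saved in the classification_prepare module named in the Objectives:

from classification_prepare import prep_iris, prep_titanic

# iris should come back with the species dummy columns
iris = prep_iris()
print(iris.columns.tolist())

# titanic should come back as three already-imputed dataframes
train, validate, test = prep_titanic()
print(train.shape, validate.shape, test.shape)
print(train.age.isnull().sum())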