Prepare Data for Classification


Big Ideas

  • Reproducibility puts the science in data science!

  • Docstrings are also your friends.

  • Helper functions work like plastic brick toys. ;)

Objectives

By the end of the prepare lesson and exercises, you will be able to...

  • perform a train, validate, test split on your data:
from sklearn.model_selection import train_test_split

train_validate, test = train_test_split(df, test_size=.2, 
                                        random_state=123, 
                                        stratify=df.target_column)

train, validate = train_test_split(train_validate, test_size=.3, 
                                   random_state=123, 
                                   stratify=train_validate.target_column)
  • use a SimpleImputer:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='desired_strategy')

train['column_name'] = imputer.fit_transform(train[['column_name']])
validate['column_name'] = imputer.transform(validate[['column_name']])
test['column_name'] = imputer.transform(test[['column_name']])
  • perform a simple encoding on a categorical variable:
pd.get_dummies(df.column, drop_first=True)
  • create a prepare.py module and import your functions in a new notebook or script:
from classification_acquire import get_titanic_data, get_iris_data
from classification_prepare import prep_iris, prep_titanic

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, MinMaxScaler

# ignore warnings
import warnings
warnings.filterwarnings("ignore")

from darden_class_acquire import get_titanic_data, get_iris_data

Iris Data


Use the function defined in acquire.py to load the iris data.

In [2]:
iris = get_iris_data()
iris.head()
Out[2]:
species_id species_name sepal_length sepal_width petal_length petal_width
0 1 setosa 5.1 3.5 1.4 0.2
1 1 setosa 4.9 3.0 1.4 0.2
2 1 setosa 4.7 3.2 1.3 0.2
3 1 setosa 4.6 3.1 1.5 0.2
4 1 setosa 5.0 3.6 1.4 0.2

Drop the species_id column.

In [3]:
iris = iris.drop(columns='species_id')
iris.head(2)
Out[3]:
species_name sepal_length sepal_width petal_length petal_width
0 setosa 5.1 3.5 1.4 0.2
1 setosa 4.9 3.0 1.4 0.2

Rename the species_name column to just species.

In [4]:
iris = iris.rename(columns={'species_name': 'species'})
iris.head(2)
Out[4]:
species sepal_length sepal_width petal_length petal_width
0 setosa 5.1 3.5 1.4 0.2
1 setosa 4.9 3.0 1.4 0.2

Create dummy variables of the species name.

In [5]:
species_dummies = pd.get_dummies(iris.species, drop_first=True)
species_dummies.head(3)
Out[5]:
versicolor virginica
0 0 0
1 0 0
2 0 0
In [6]:
iris = pd.concat([iris, species_dummies], axis=1)
iris.head()
Out[6]:
species sepal_length sepal_width petal_length petal_width versicolor virginica
0 setosa 5.1 3.5 1.4 0.2 0 0
1 setosa 4.9 3.0 1.4 0.2 0 0
2 setosa 4.7 3.2 1.3 0.2 0 0
3 setosa 4.6 3.1 1.5 0.2 0 0
4 setosa 5.0 3.6 1.4 0.2 0 0

Create a function named prep_iris that accepts the untransformed iris data, and returns the data with the transformations above applied.

In [7]:
def prep_iris(cached=True):
    '''
    This function acquires and prepares the iris data, reading from a local csv by default.
    Passing cached=False acquires fresh data from Codeup db and writes to csv.
    Returns the iris df with dummy variables encoding species.
    '''
    
    # use my acquire function to read data into a df from a csv file
    df = get_iris_data(cached)
    
    # drop and rename columns
    df = df.drop(columns='species_id').rename(columns={'species_name': 'species'})
    
    # create dummy columns for species
    species_dummies = pd.get_dummies(df.species, drop_first=True)
    
    # add dummy columns to df
    df = pd.concat([df, species_dummies], axis=1)
    
    return df
In [8]:
iris = prep_iris()
iris.sample(7)
Out[8]:
species sepal_length sepal_width petal_length petal_width versicolor virginica
13 setosa 4.3 3.0 1.1 0.1 0 0
60 versicolor 5.0 2.0 3.5 1.0 1 0
75 versicolor 6.6 3.0 4.4 1.4 1 0
69 versicolor 5.6 2.5 3.9 1.1 1 0
95 versicolor 5.7 3.0 4.2 1.2 1 0
131 virginica 7.9 3.8 6.4 2.0 0 1
77 versicolor 6.7 3.0 5.0 1.7 1 0

Titanic Data


Use the function you defined in acquire.py to load the titanic data set.

In [9]:
titanic = get_titanic_data()
titanic.head()
Out[9]:
passenger_id survived pclass sex age sibsp parch fare embarked class deck embark_town alone
0 0 0 3 male 22.0 1 0 7.2500 S Third NaN Southampton 0
1 1 1 1 female 38.0 1 0 71.2833 C First C Cherbourg 0
2 2 1 3 female 26.0 0 0 7.9250 S Third NaN Southampton 1
3 3 1 1 female 35.0 1 0 53.1000 S First C Southampton 0
4 4 0 3 male 35.0 0 0 8.0500 S Third NaN Southampton 1

Handle the missing values in the embark_town and embarked columns.

  • A quick check shows that only two rows are missing values in these columns, so I'll handle them by dropping those rows.
In [10]:
titanic[titanic.embark_town.isnull()]
Out[10]:
passenger_id survived pclass sex age sibsp parch fare embarked class deck embark_town alone
61 61 1 1 female 38.0 0 0 80.0 NaN First B NaN 1
829 829 1 1 female 62.0 0 0 80.0 NaN First B NaN 1
In [11]:
titanic[titanic.embarked.isnull()]
Out[11]:
passenger_id survived pclass sex age sibsp parch fare embarked class deck embark_town alone
61 61 1 1 female 38.0 0 0 80.0 NaN First B NaN 1
829 829 1 1 female 62.0 0 0 80.0 NaN First B NaN 1
In [12]:
# Use the complement operator, ~, to invert the boolean mask above and keep everything except the null values.

titanic = titanic[~titanic.embarked.isnull()]
titanic.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   passenger_id  889 non-null    int64  
 1   survived      889 non-null    int64  
 2   pclass        889 non-null    int64  
 3   sex           889 non-null    object 
 4   age           712 non-null    float64
 5   sibsp         889 non-null    int64  
 6   parch         889 non-null    int64  
 7   fare          889 non-null    float64
 8   embarked      889 non-null    object 
 9   class         889 non-null    object 
 10  deck          201 non-null    object 
 11  embark_town   889 non-null    object 
 12  alone         889 non-null    int64  
dtypes: float64(2), int64(6), object(5)
memory usage: 97.2+ KB
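
The same filter could also be written with .notna(), which some find more readable than negating .isnull(). This is just an alternative phrasing of the cell above, not an additional step:

# equivalent to titanic[~titanic.embarked.isnull()]
titanic = titanic[titanic.embarked.notna()]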

Remove the deck column.

In [13]:
titanic = titanic.drop(columns='deck')
titanic.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   passenger_id  889 non-null    int64  
 1   survived      889 non-null    int64  
 2   pclass        889 non-null    int64  
 3   sex           889 non-null    object 
 4   age           712 non-null    float64
 5   sibsp         889 non-null    int64  
 6   parch         889 non-null    int64  
 7   fare          889 non-null    float64
 8   embarked      889 non-null    object 
 9   class         889 non-null    object 
 10  embark_town   889 non-null    object 
 11  alone         889 non-null    int64  
dtypes: float64(2), int64(6), object(4)
memory usage: 90.3+ KB

Create dummy variables of the embarked column.

In [14]:
titanic_dummies = pd.get_dummies(titanic.embarked, drop_first=True)
titanic_dummies.sample(10)
Out[14]:
Q S
854 0 1
18 0 1
620 0 0
633 0 1
47 1 0
256 0 0
663 0 1
648 0 1
295 0 0
641 0 0
In [15]:
titanic = pd.concat([titanic, titanic_dummies], axis=1)
titanic.head()
Out[15]:
passenger_id survived pclass sex age sibsp parch fare embarked class embark_town alone Q S
0 0 0 3 male 22.0 1 0 7.2500 S Third Southampton 0 0 1
1 1 1 1 female 38.0 1 0 71.2833 C First Cherbourg 0 0 0
2 2 1 3 female 26.0 0 0 7.9250 S Third Southampton 1 0 1
3 3 1 1 female 35.0 1 0 53.1000 S First Southampton 0 0 1
4 4 0 3 male 35.0 0 0 8.0500 S Third Southampton 1 0 1

Split Data

In [16]:
train_validate, test = train_test_split(titanic, test_size=.2, 
                                        random_state=123, 
                                        stratify=titanic.survived)
In [17]:
train, validate = train_test_split(train_validate, test_size=.3, 
                                   random_state=123, 
                                   stratify=train_validate.survived)
In [18]:
print(f'train -> {train.shape}')
print(f'validate -> {validate.shape}')
print(f'test -> {test.shape}')
train -> (497, 14)
validate -> (214, 14)
test -> (178, 14)
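
A quick sanity check on those sizes: with 889 rows, test_size=.2 holds out about 178 rows for test, leaving 711 in train_validate; test_size=.3 of those 711 is about 214 for validate, leaving 497 for train.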

Create an MVP function to split the titanic data

In [19]:
def titanic_split(df):
    '''
    This function performs a train, validate, test split on the titanic data, stratifying on survived.
    Returns train, validate, and test dfs.
    '''
    train_validate, test = train_test_split(df, test_size=.2,
                                            random_state=123,
                                            stratify=df.survived)
    train, validate = train_test_split(train_validate, test_size=.3,
                                       random_state=123,
                                       stratify=train_validate.survived)
    return train, validate, test
In [20]:
train, validate, test = titanic_split(titanic)
In [21]:
print(f'train -> {train.shape}')
print(f'validate -> {validate.shape}')
print(f'test -> {test.shape}')
train -> (497, 14)
validate -> (214, 14)
test -> (178, 14)
In [22]:
train.head(2)
Out[22]:
passenger_id survived pclass sex age sibsp parch fare embarked class embark_town alone Q S
583 583 0 1 male 36.0 0 0 40.125 C First Cherbourg 1 0 0
337 337 1 1 female 41.0 0 0 134.500 C First Cherbourg 1 0 0

Fill the missing values in age.

  • The way you fill these values is up to you. Consider the tradeoffs of different methods; a couple of alternatives are sketched below before I settle on the mean.
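
Before committing to the mean, here is a minimal sketch of a couple of alternative SimpleImputer strategies. These are standard sklearn options; which one is appropriate depends on how skewed the age column is and whether outliers are a concern:

# median is more robust to outliers in a skewed column
imputer = SimpleImputer(strategy='median')

# most_frequent (the mode) also works for categorical columns
imputer = SimpleImputer(strategy='most_frequent')

Either object would be fit on train and used to transform validate and test exactly as shown below with the mean.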
In [23]:
# Create the imputer object.

imputer = SimpleImputer(strategy='mean')
In [24]:
# Fit the imputer to train and transform.

train['age'] = imputer.fit_transform(train[['age']])
In [25]:
# quick check

train['age'].isnull().sum()
Out[25]:
0
In [26]:
# Transform the validate and test df age columns

validate['age'] = imputer.transform(validate[['age']])
test['age'] = imputer.transform(test[['age']])

Build a helper function for imputing

In [27]:
def impute_mean_age(train, validate, test):
    '''
    This function imputes the mean of the age column into
    observations with missing values.
    Returns transformed train, validate, and test dfs.
    '''
    # create the imputer object with mean strategy
    imputer = SimpleImputer(strategy='mean')
    
    # fit on and transform age column in train
    train['age'] = imputer.fit_transform(train[['age']])
    
    # transform age column in validate
    validate['age'] = imputer.transform(validate[['age']])
    
    # transform age column in test
    test['age'] = imputer.transform(test[['age']])
    
    return train, validate, test
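
A quick way to sanity-check the helper is to call it the same way the cell-by-cell version was run above (shown here as a sketch; prep_titanic below calls it identically):

train, validate, test = impute_mean_age(train, validate, test)
train.age.isnull().sum(), validate.age.isnull().sum(), test.age.isnull().sum()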

Create a function named prep_titanic

  • It should accept the untransformed titanic data and return the data with the transformations above applied.
In [28]:
def prep_titanic(cached=True):
    '''
    This function acquires and prepares the titanic data, reading from a local csv by default.
    Returns prepped train, validate, and test dfs.
    '''
    # use my acquire function to read data into a df from a csv file
    df = get_titanic_data(cached)
    
    # drop rows where embarked/embark town are null values
    df = df[~df.embarked.isnull()]
    
    # encode embarked using dummy columns
    titanic_dummies = pd.get_dummies(df.embarked, drop_first=True)
    
    # join dummy columns back to df
    df = pd.concat([df, titanic_dummies], axis=1)
    
    # drop the deck column
    df = df.drop(columns='deck')
    
    # split data into train, validate, test dfs
    train, validate, test = titanic_split(df)
    
    # impute mean of age into null values in age column
    train, validate, test = impute_mean_age(train, validate, test)
    
    return train, validate, test
In [29]:
train, validate, test = prep_titanic()
In [30]:
print(f'train -> {train.shape}')
print(f'validate -> {validate.shape}')
print(f'test -> {test.shape}')
train -> (497, 14)
validate -> (214, 14)
test -> (178, 14)

Test Functions in Testing Notebook

If my MVP functions work properly, and I have time, I can go back and generalize my helper functions so they can be reused in the future. Most importantly, I have functions that I built step by step, testing the code along the way and confirming that I can import and use them from my module in another notebook.
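
A minimal sketch of that testing notebook, assuming the functions above are saved in the classification_prepare module named in the Objectives:

from classification_prepare import prep_iris, prep_titanic

# iris should come back with the species dummy columns
iris = prep_iris()
print(iris.columns.tolist())

# titanic should come back as three already-imputed dataframes
train, validate, test = prep_titanic()
print(train.shape, validate.shape, test.shape)
print(train.age.isnull().sum())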