Table of Contents


Pandas DataFrames Exercises

Big Idea

The Cycle of Improvement - James Clear (Atomic Habits)

  1. Awareness - identify what you need to improve.

  2. Deliberate practice - focus your conscious effort on the specific area you want to improve.

  3. Habit - with practice, the effortful becomes automatic.

  4. Repeat - begin again.

Objectives

By the end of the lesson and exercises, you will be able to...

  • create subsets of data from a pandas DataFrame.
.loc[row_indexer, col_indexer]
.iloc[row_indexer, col_indexer]
df[bool_series]
  • create and manipulate columns in a pandas DataFrame.
.drop(columns=)
.rename(columns={'original_name': 'new_name')
.assign(new_column = some_expression_or_calculation)
  • sort the data in a pandas DataFrame.
.sort_values(by=column(s))
.sort_index(ascending=True)
  • chain methods to perform more complicated data manipulations.
df.drop().rename()

In [88]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

from pydataset import data

1. Create student grades DataFrame object

In [89]:
np.random.seed(123)

students = ['Sally', 'Jane', 'Suzie', 'Billy', 'Ada', 'John', 'Thomas',
            'Marie', 'Albert', 'Richard', 'Isaac', 'Alan']

# randomly generate scores for each student for each subject
# note that all the values need to have the same length here

math_grades = np.random.randint(low=60, high=100, size=len(students))
english_grades = np.random.randint(low=60, high=100, size=len(students))
reading_grades = np.random.randint(low=60, high=100, size=len(students))

df = pd.DataFrame({'name': students,
                   'math': math_grades,
                   'english': english_grades,
                   'reading': reading_grades})

Peek at DataFrame

In [90]:
df.head()
Out[90]:
name math english reading
0 Sally 62 85 80
1 Jane 88 79 67
2 Suzie 94 74 95
3 Billy 98 96 88
4 Ada 77 92 98
In [91]:
# Use shape attribute

df.shape
Out[91]:
(12, 4)
In [92]:
# Use info method to view both datatypes and potential missing values

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   name     12 non-null     object
 1   math     12 non-null     int64 
 2   english  12 non-null     int64 
 3   reading  12 non-null     int64 
dtypes: int64(3), object(1)
memory usage: 512.0+ bytes
In [93]:
# Use the `.describe()` method to produce descriptive statistics for columns with numeric datatypes

df.describe()
Out[93]:
math english reading
count 12.000000 12.000000 12.000000
mean 84.833333 77.666667 86.500000
std 11.134168 13.371158 9.643651
min 62.000000 62.000000 67.000000
25% 78.500000 63.750000 80.750000
50% 90.000000 77.500000 89.000000
75% 92.250000 86.750000 93.250000
max 98.000000 99.000000 98.000000

a. Create a column named passing_english that indicates whether each student has a passing grade in english.

In [94]:
# Create a boolean Series or a boolean mask that returns `True` for a passing English grade.

df.english >= 70
Out[94]:
0      True
1      True
2      True
3      True
4      True
5      True
6     False
7     False
8     False
9      True
10     True
11    False
Name: english, dtype: bool
In [95]:
# Create a new column in our DataFrame and assign our boolean Series to it.

df['passing_english'] = df.english >= 70
In [96]:
# Check dataframe for new column.

df.head()
Out[96]:
name math english reading passing_english
0 Sally 62 85 80 True
1 Jane 88 79 67 True
2 Suzie 94 74 95 True
3 Billy 98 96 88 True
4 Ada 77 92 98 True
In [97]:
# How many people are passing English? Use the `.sum()` function to add the True bool (1) values.

df['passing_english'].sum()
Out[97]:
8
In [98]:
# How many students are failing English? Use the `.sum()` function to add True values for failing.

df['passing_english'] == False
Out[98]:
0     False
1     False
2     False
3     False
4     False
5     False
6      True
7      True
8      True
9     False
10    False
11     True
Name: passing_english, dtype: bool
In [99]:
(df['passing_english'] == False).sum()
Out[99]:
4

b. Sort the english grades by the passing_english column. How are duplicates handled?

  • .sort_values returns a sorted copy of a given DataFrame unless inplace=True.

  • It looks like duplicate values are handled according to the index value, small to large or ascending.

In [100]:
# DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
# sorts all of the rows in the DF using the column passed

df.sort_values(by='passing_english')
Out[100]:
name math english reading passing_english
6 Thomas 82 64 81 False
7 Marie 93 63 90 False
8 Albert 92 62 87 False
11 Alan 92 62 72 False
0 Sally 62 85 80 True
1 Jane 88 79 67 True
2 Suzie 94 74 95 True
3 Billy 98 96 88 True
4 Ada 77 92 98 True
5 John 79 76 93 True
9 Richard 69 80 94 True
10 Isaac 92 99 93 True

c. Sort the english grades first by passing_english and then by student name.

All the students that are failing english should be first, and within the students who are failing english, they are ordered alphabetically. The same will be true for the students passing english.

Hint: You can pass a list to the .sort_values method.
In [101]:
# Now we see that Alan comes before Albert because there is a secondary sort going on alphabetically by name

df.sort_values(by=['passing_english', 'name'])
Out[101]:
name math english reading passing_english
11 Alan 92 62 72 False
8 Albert 92 62 87 False
7 Marie 93 63 90 False
6 Thomas 82 64 81 False
4 Ada 77 92 98 True
3 Billy 98 96 88 True
10 Isaac 92 99 93 True
1 Jane 88 79 67 True
5 John 79 76 93 True
9 Richard 69 80 94 True
0 Sally 62 85 80 True
2 Suzie 94 74 95 True
In [102]:
# What if I want the students passing English first but names in alpha order?

df.sort_values(by=['passing_english', 'name'], ascending=[False, True])
Out[102]:
name math english reading passing_english
4 Ada 77 92 98 True
3 Billy 98 96 88 True
10 Isaac 92 99 93 True
1 Jane 88 79 67 True
5 John 79 76 93 True
9 Richard 69 80 94 True
0 Sally 62 85 80 True
2 Suzie 94 74 95 True
11 Alan 92 62 72 False
8 Albert 92 62 87 False
7 Marie 93 63 90 False
6 Thomas 82 64 81 False

d. Sort the english grades first by passing_english, and then by the actual english grade, similar to how we did in the last step.

In [103]:
df.sort_values(by=['passing_english', 'english'])
Out[103]:
name math english reading passing_english
8 Albert 92 62 87 False
11 Alan 92 62 72 False
7 Marie 93 63 90 False
6 Thomas 82 64 81 False
2 Suzie 94 74 95 True
5 John 79 76 93 True
1 Jane 88 79 67 True
9 Richard 69 80 94 True
0 Sally 62 85 80 True
4 Ada 77 92 98 True
3 Billy 98 96 88 True
10 Isaac 92 99 93 True
In [104]:
# Reverse my sort on both columns

df.sort_values(by=['passing_english', 'english'], ascending=[False, False])
Out[104]:
name math english reading passing_english
10 Isaac 92 99 93 True
3 Billy 98 96 88 True
4 Ada 77 92 98 True
0 Sally 62 85 80 True
9 Richard 69 80 94 True
1 Jane 88 79 67 True
5 John 79 76 93 True
2 Suzie 94 74 95 True
6 Thomas 82 64 81 False
7 Marie 93 63 90 False
8 Albert 92 62 87 False
11 Alan 92 62 72 False

e. Calculate each student's overall grade and add it as a column on the DataFrame. The overall grade is the average of the math, english, and reading grades.

The .iloc attribute allows me to access a group of rows and columns by their integer location or position. Notice below that the observations returned match the integer index location passed to .iloc, and the indexing is NOT inclusive. (from Series Review Notebook)

df.iloc[row_indexer, column_indexer]

I could also solve this using .loc if I wanted to use column labels instead of index position.

df.loc[:, 'math': 'reading']
In [105]:
# Here, I'm selecting all rows and columns at index positions 1, 2, 3. Now I can just sum them.

df.iloc[:, 1:4]
Out[105]:
math english reading
0 62 85 80
1 88 79 67
2 94 74 95
3 98 96 88
4 77 92 98
5 79 76 93
6 82 64 81
7 93 63 90
8 92 62 87
9 69 80 94
10 92 99 93
11 92 62 72
In [106]:
# Since I want the total of the columns in each row, I set axis=1 for columns.

df.iloc[:, 1:4].sum(axis=1)
Out[106]:
0     227
1     234
2     263
3     282
4     267
5     248
6     227
7     246
8     241
9     243
10    284
11    226
dtype: int64
In [107]:
# Assign to variable and use the Series to create the calculated column I want on my df.

totals = df.iloc[:, 1:4].sum(axis=1)
In [108]:
# assign Series back to DataFrame as overall_grade

df['overall_average'] = round((totals / 3), 0).astype(int)
df.head()
Out[108]:
name math english reading passing_english overall_average
0 Sally 62 85 80 True 76
1 Jane 88 79 67 True 78
2 Suzie 94 74 95 True 88
3 Billy 98 96 88 True 94
4 Ada 77 92 98 True 89

2. Load the mpg dataset. Read the documentation for the dataset and use it for the following questions:

In [109]:
data('mpg', show_doc=True)
mpg

PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)

## Fuel economy data from 1999 and 2008 for 38 popular models of car

### Description

This dataset contains a subset of the fuel economy data that the EPA makes
available on http://fueleconomy.gov. It contains only models which had a new
release every year between 1999 and 2008 - this was used as a proxy for the
popularity of the car.

### Usage

    data(mpg)

### Format

A data frame with 234 rows and 11 variables

### Details

  * manufacturer. 

  * model. 

  * displ. engine displacement, in litres 

  * year. 

  * cyl. number of cylinders 

  * trans. type of transmission 

  * drv. f = front-wheel drive, r = rear wheel drive, 4 = 4wd 

  * cty. city miles per gallon 

  * hwy. highway miles per gallon 

  * fl. 

  * class. 


In [110]:
mpg = data('mpg')
In [111]:
mpg.head()
Out[111]:
manufacturer model displ year cyl trans drv cty hwy fl class
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
3 audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
4 audi a4 2.0 2008 4 auto(av) f 21 30 p compact
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact

a. How many rows and columns are there?

In [112]:
mpg.shape
Out[112]:
(234, 11)
In [113]:
print(f'There are {mpg.shape[0]} rows and {mpg.shape[1]} columns in the mpg DataFrame.')
There are 234 rows and 11 columns in the mpg DataFrame.

b. What are the data types of each column?

In [114]:
mpg.dtypes
Out[114]:
manufacturer     object
model            object
displ           float64
year              int64
cyl               int64
trans            object
drv              object
cty               int64
hwy               int64
fl               object
class            object
dtype: object

c. Summarize the dataframe with the .info() and .describe() methods.

  • .info() shows us that all of the columns have the same number of non-null values.

  • .describe() provides us with the descriptive statistics for all columns with numeric dtypes.

In [115]:
mpg.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 234 entries, 1 to 234
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   manufacturer  234 non-null    object 
 1   model         234 non-null    object 
 2   displ         234 non-null    float64
 3   year          234 non-null    int64  
 4   cyl           234 non-null    int64  
 5   trans         234 non-null    object 
 6   drv           234 non-null    object 
 7   cty           234 non-null    int64  
 8   hwy           234 non-null    int64  
 9   fl            234 non-null    object 
 10  class         234 non-null    object 
dtypes: float64(1), int64(4), object(6)
memory usage: 21.9+ KB
In [116]:
mpg.describe()
Out[116]:
displ year cyl cty hwy
count 234.000000 234.000000 234.000000 234.000000 234.000000
mean 3.471795 2003.500000 5.888889 16.858974 23.440171
std 1.291959 4.509646 1.611534 4.255946 5.954643
min 1.600000 1999.000000 4.000000 9.000000 12.000000
25% 2.400000 1999.000000 4.000000 14.000000 18.000000
50% 3.300000 2003.500000 6.000000 17.000000 24.000000
75% 4.600000 2008.000000 8.000000 19.000000 27.000000
max 7.000000 2008.000000 8.000000 35.000000 44.000000

d. Rename the cty column to city and hwy to highway using .rename() method.

  • .rename() takes in a dictionary with the key as the original name and the value as the new name.

  • If you want to change your original DataFrame to reflect your new column names, inplace=True

In [117]:
mpg.rename(columns={'cty': 'city', 'hwy': 'highway'}, inplace=True)
In [118]:
mpg.head(1)
Out[118]:
manufacturer model displ year cyl trans drv city highway fl class
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact

Another way to rename columns...

  • This is another way you can rename columns, especially if you want to change many at once.

  • I can use the .columns attribute to grab my column labels; I can go a step further and print out a list of the current columns in the DataFrame by adding the .to_list() method. It's not necessary if I'm just grabbing the column names, but I love getting the nice list.

  • Then, I make any changes I want to the names in the list and reassign them to df.columns.

In [119]:
mpg = data('mpg')
mpg.head()
Out[119]:
manufacturer model displ year cyl trans drv cty hwy fl class
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
3 audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
4 audi a4 2.0 2008 4 auto(av) f 21 30 p compact
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
In [120]:
mpg.columns.to_list()
Out[120]:
['manufacturer',
 'model',
 'displ',
 'year',
 'cyl',
 'trans',
 'drv',
 'cty',
 'hwy',
 'fl',
 'class']
In [121]:
mpg.columns = ['manufacturer', 'model', 'displ', 'year', 'cyl', 'trans', 'drv', 'city',
       'highway', 'fl', 'class']
In [122]:
mpg.head(1)
Out[122]:
manufacturer model displ year cyl trans drv city highway fl class
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact

e. Do any cars have better city mileage than highway mileage?

In [123]:
# Create a boolean Series or a boolean Mask

bool_series = mpg.city > mpg.highway
bool_series.head()
Out[123]:
1    False
2    False
3    False
4    False
5    False
dtype: bool
In [124]:
# Return a subset of the original DataFrame using the indexing operator.

mpg[bool_series]
Out[124]:
manufacturer model displ year cyl trans drv city highway fl class
In [125]:
# I can do a quick check to validate my findings above. There are no observations that meet this condition.

bool_series.sum()
Out[125]:
0

f. Create a column named mileage_difference; this column should contain the difference between highway and city mileage for each car.

  • You saw above how to create a new column in a df using bracket notation; this time I'll show you how to create a new column in a df using the .assign() method.

  • They are both valid ways to create a new column; if you want to create more than one column at once, .assign() is worth looking into. Choose your flavor...

In [126]:
# This operation returns a Series of differences in mileage for each row. Looks good.

mpg.highway - mpg.city
Out[126]:
1      11
2       8
3      11
4       9
5      10
       ..
230     9
231     8
232    10
233     8
234     9
Length: 234, dtype: int64
In [127]:
# This is how I can create a new column in my df from a calculation on two existing columns.
# I can reassign to my original df at this point, or I might want to assign to a new df name.

mpg = mpg.assign(mileage_difference = mpg.highway - mpg.city)
In [128]:
mpg.head()
Out[128]:
manufacturer model displ year cyl trans drv city highway fl class mileage_difference
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact 11
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact 8
3 audi a4 2.0 2008 4 manual(m6) f 20 31 p compact 11
4 audi a4 2.0 2008 4 auto(av) f 21 30 p compact 9
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact 10

g. Which car (or cars) has the highest mileage difference?

In [129]:
# Use the `.max()` function to find the max value in the column.

mpg.mileage_difference.max()
Out[129]:
12
In [130]:
# Create a boolean Series where `True` values identify a match with the max mileage_difference. 

bool_series = mpg.mileage_difference == mpg.mileage_difference.max()
bool_series.head()
Out[130]:
1    False
2    False
3    False
4    False
5    False
Name: mileage_difference, dtype: bool
In [131]:
# Pass my bool_series as a selector for rows in my mpg Series.

mpg[bool_series]
Out[131]:
manufacturer model displ year cyl trans drv city highway fl class mileage_difference
107 honda civic 1.8 2008 4 auto(l5) f 24 36 c subcompact 12
223 volkswagen new beetle 1.9 1999 4 auto(l4) f 29 41 d subcompact 12
In [132]:
# I can get more specific if I like...

mpg[bool_series][['manufacturer', 'model']]
Out[132]:
manufacturer model
107 honda civic
223 volkswagen new beetle

h. Which compact class car has the lowest highway mileage?

In [133]:
# Why doesn't this work??? This is an example of a situation where you have to use bracket notation

mpg.class
  File "<ipython-input-133-4e43ef5af34c>", line 3
    mpg.class
            ^
SyntaxError: invalid syntax
In [134]:
# I check and see that I have a column named `class`. What's up?
# 'class' is a reserved word, so I have to use bracket notation to access the column!

mpg.columns
Out[134]:
Index(['manufacturer', 'model', 'displ', 'year', 'cyl', 'trans', 'drv', 'city',
       'highway', 'fl', 'class', 'mileage_difference'],
      dtype='object')
In [135]:
# I'm taking a look at all of the possible values in the column here.

mpg['class'].value_counts()
Out[135]:
suv           62
compact       47
midsize       41
subcompact    35
pickup        33
minivan       11
2seater        5
Name: class, dtype: int64
In [136]:
# Create the bool Series or selector for the compact class of cars.

bool_series = mpg['class'] == 'compact'
bool_series.head()
Out[136]:
1    True
2    True
3    True
4    True
5    True
Name: class, dtype: bool
In [137]:
# Pass my selector to my original DataFrame to get a subset of compact cars.

compacts = mpg[bool_series]
compacts.head()
Out[137]:
manufacturer model displ year cyl trans drv city highway fl class mileage_difference
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact 11
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact 8
3 audi a4 2.0 2008 4 manual(m6) f 20 31 p compact 11
4 audi a4 2.0 2008 4 auto(av) f 21 30 p compact 9
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact 10
In [138]:
# Sort the DataFrame by highway mileage

compacts.sort_values(by='highway').head()
Out[138]:
manufacturer model displ year cyl trans drv city highway fl class mileage_difference
220 volkswagen jetta 2.8 1999 6 auto(l4) f 16 23 r compact 7
221 volkswagen jetta 2.8 1999 6 manual(m5) f 17 24 r compact 7
212 volkswagen gti 2.8 1999 6 manual(m5) f 17 24 r compact 7
172 subaru impreza awd 2.5 2008 4 manual(m5) 4 19 25 p compact 6
170 subaru impreza awd 2.5 2008 4 auto(s4) 4 20 25 p compact 5
In [139]:
# If you want to isolate the compact car with the lowest highway mileage

compacts.sort_values(by='highway').head(1)
Out[139]:
manufacturer model displ year cyl trans drv city highway fl class mileage_difference
220 volkswagen jetta 2.8 1999 6 auto(l4) f 16 23 r compact 7

The best highway mileage?

In [140]:
# Isolate the compact car with the best highway mileage.

compacts.sort_values(by='highway', ascending=False).head()
Out[140]:
manufacturer model displ year cyl trans drv city highway fl class mileage_difference
213 volkswagen jetta 1.9 1999 4 manual(m5) f 33 44 d compact 11
197 toyota corolla 1.8 2008 4 manual(m5) f 28 37 r compact 9
196 toyota corolla 1.8 1999 4 manual(m5) f 26 35 r compact 9
198 toyota corolla 1.8 2008 4 auto(l4) f 26 35 r compact 9
195 toyota corolla 1.8 1999 4 auto(l4) f 24 33 r compact 9
In [141]:
# isolate the row using .head() method

compacts.sort_values(by='highway', ascending=False).head(1)
Out[141]:
manufacturer model displ year cyl trans drv city highway fl class mileage_difference
213 volkswagen jetta 1.9 1999 4 manual(m5) f 33 44 d compact 11

j. Create a column named average_mileage that is the mean of the city and highway mileage.

In [142]:
# Create the calculated column

(mpg.city + mpg.highway) / 2
Out[142]:
1      23.5
2      25.0
3      25.5
4      25.5
5      21.0
       ... 
230    23.5
231    25.0
232    21.0
233    22.0
234    21.5
Length: 234, dtype: float64
In [143]:
# Assign the Series back to the original DataFrame as a new column.

mpg['average_mileage'] = (mpg.city + mpg.highway) / 2
In [144]:
mpg.head(2)
Out[144]:
manufacturer model displ year cyl trans drv city highway fl class mileage_difference average_mileage
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact 11 23.5
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact 8 25.0

k. Which Dodge car has the best average mileage? The worst?

In [145]:
# Create the boolean Series to filter for Dodge vehicles.

bool_series = mpg.manufacturer == 'dodge'
bool_series.head()
Out[145]:
1    False
2    False
3    False
4    False
5    False
Name: manufacturer, dtype: bool
In [146]:
# Use the selector to create a subset of only Dodge vehicles.

dodges = mpg[bool_series]
dodges.head()
Out[146]:
manufacturer model displ year cyl trans drv city highway fl class mileage_difference average_mileage
38 dodge caravan 2wd 2.4 1999 4 auto(l3) f 18 24 r minivan 6 21.0
39 dodge caravan 2wd 3.0 1999 6 auto(l4) f 17 24 r minivan 7 20.5
40 dodge caravan 2wd 3.3 1999 6 auto(l4) f 16 22 r minivan 6 19.0
41 dodge caravan 2wd 3.3 1999 6 auto(l4) f 16 22 r minivan 6 19.0
42 dodge caravan 2wd 3.3 2008 6 auto(l4) f 17 24 r minivan 7 20.5
In [147]:
# Isolate the Dodge with the best average mileage.

dodges.sort_values(by='average_mileage').tail(1)
Out[147]:
manufacturer model displ year cyl trans drv city highway fl class mileage_difference average_mileage
38 dodge caravan 2wd 2.4 1999 4 auto(l3) f 18 24 r minivan 6 21.0
In [148]:
# Isolate the Dodge with the worst average mileage.

dodges.sort_values(by='average_mileage').head(4)
Out[148]:
manufacturer model displ year cyl trans drv city highway fl class mileage_difference average_mileage
70 dodge ram 1500 pickup 4wd 4.7 2008 8 manual(m6) 4 9 12 e pickup 3 10.5
66 dodge ram 1500 pickup 4wd 4.7 2008 8 auto(l5) 4 9 12 e pickup 3 10.5
60 dodge durango 4wd 4.7 2008 8 auto(l5) 4 9 12 e suv 3 10.5
55 dodge dakota pickup 4wd 4.7 2008 8 auto(l5) 4 9 12 e pickup 3 10.5

In [149]:
# another way...

worst_mileage = dodges.average_mileage.min()
worst_mileage
Out[149]:
10.5
In [150]:
# Compare the average_mileage of each observation to the scalar value of worst_mileage; return ALL values that match.

dodges[dodges.average_mileage == worst_mileage]
Out[150]:
manufacturer model displ year cyl trans drv city highway fl class mileage_difference average_mileage
55 dodge dakota pickup 4wd 4.7 2008 8 auto(l5) 4 9 12 e pickup 3 10.5
60 dodge durango 4wd 4.7 2008 8 auto(l5) 4 9 12 e suv 3 10.5
66 dodge ram 1500 pickup 4wd 4.7 2008 8 auto(l5) 4 9 12 e pickup 3 10.5
70 dodge ram 1500 pickup 4wd 4.7 2008 8 manual(m6) 4 9 12 e pickup 3 10.5

3. Load the Mammals dataset. Read the documentation for it, and use the data to answer these questions:

In [151]:
data('Mammals', show_doc=True)
Mammals

PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)

## Garland(1983) Data on Running Speed of Mammals

### Description

Observations on the maximal running speed of mammal species and their body
mass.

### Usage

    data(Mammals)

### Format

A data frame with 107 observations on the following 4 variables.

weight

Body mass in Kg for "typical adult sizes"

speed

Maximal running speed (fastest sprint velocity on record)

hoppers

logical variable indicating animals that ambulate by hopping, e.g. kangaroos

specials

logical variable indicating special animals with "lifestyles in which speed
does not figure as an important factor": Hippopotamus, raccoon (Procyon),
badger (Meles), coati (Nasua), skunk (Mephitis), man (Homo), porcupine
(Erithizon), oppossum (didelphis), and sloth (Bradypus)

### Details

Used by Chappell (1989) and Koenker, Ng and Portnoy (1994) to illustrate the
fitting of piecewise linear curves.

### Source

Garland, T. (1983) The relation between maximal running speed and body mass in
terrestrial mammals, _J. Zoology_, 199, 1557-1570.

### References

Koenker, R., P. Ng and S. Portnoy, (1994) Quantile Smoothing Splines”
_Biometrika_, 81, 673-680.

Chappell, R. (1989) Fitting Bent Lines to Data, with Applications ot
Allometry, _J. Theo. Biology_, 138, 235-256.

### See Also

`rqss`

### Examples

    data(Mammals)
    attach(Mammals)
    x <- log(weight)
    y <- log(speed)
    plot(x,y, xlab="Weight in log(Kg)", ylab="Speed in log(Km/hour)",type="n")
    points(x[hoppers],y[hoppers],pch = "h", col="red")
    points(x[specials],y[specials],pch = "s", col="blue")
    others <- (!hoppers & !specials)
    points(x[others],y[others], col="black",cex = .75)
    fit <- rqss(y ~ qss(x, lambda = 1),tau = .9)
    plot(fit)


In [152]:
# Load DataFrame and save as `mammals`.

mammals = data('Mammals')
In [153]:
mammals.head()
Out[153]:
weight speed hoppers specials
1 6000.0 35.0 False False
2 4000.0 26.0 False False
3 3000.0 25.0 False False
4 1400.0 45.0 False False
5 400.0 70.0 False False

a. How many rows and columns are there?

In [154]:
mammals.shape
Out[154]:
(107, 4)

b. What are the data types?

In [155]:
mammals.dtypes
Out[155]:
weight      float64
speed       float64
hoppers        bool
specials       bool
dtype: object

c. Summarize the dataframe with .info and .describe

In [156]:
mammals.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 107 entries, 1 to 107
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   weight    107 non-null    float64
 1   speed     107 non-null    float64
 2   hoppers   107 non-null    bool   
 3   specials  107 non-null    bool   
dtypes: bool(2), float64(2)
memory usage: 2.7 KB
In [157]:
mammals.describe()
Out[157]:
weight speed
count 107.000000 107.000000
mean 278.688178 46.208411
std 839.608269 26.716778
min 0.016000 1.600000
25% 1.700000 22.500000
50% 34.000000 48.000000
75% 142.500000 65.000000
max 6000.000000 110.000000

Quick visualization of the distributioin of weight and speed values

In [158]:
plt.hist(mammals.weight, bins=20, color='orange', edgecolor='black')

plt.title('Weight of Mammals in kg')
plt.xlabel('kg')
plt.ylabel('Count')

plt.show()
In [159]:
plt.hist(mammals.speed, bins=15, color='dodgerblue', edgecolor='black')

plt.title('Full Sprint Speed of Mammals in km')
plt.xlabel('km')
plt.ylabel('Count')

plt.show()

d. What is the the weight of the fastest animal?

In [160]:
# Find the fastest speed using max(). I see there is only one observation with 110 km.

mammals.speed.value_counts().sort_index(ascending=False).head()
Out[160]:
110.0    1
105.0    1
100.0    1
97.0     2
90.0     1
Name: speed, dtype: int64
In [161]:
# This returns the index LABEL, not the integer position, and that's why I'm using .loc below. `df.loc[row_label]`

mammals.speed.idxmax()
Out[161]:
53
In [162]:
# I can isolate that observation/mammal using the `.loc` indexer and `.idxmax()` method.

mammals.loc[mammals.speed.idxmax()]
Out[162]:
weight         55
speed         110
hoppers     False
specials    False
Name: 53, dtype: object
In [163]:
# You guessed it; it's a Series! I can access any of those index labels or values.

type(mammals.loc[mammals.speed.idxmax()])
Out[163]:
pandas.core.series.Series
In [164]:
mammals.loc[mammals.speed.idxmax()].weight
Out[164]:
55.0
In [165]:
# OR -> I can create a boolan Series to filter for observations that match the max speed.

bool_series = mammals.speed == mammals.speed.max()
bool_series.head()
Out[165]:
1    False
2    False
3    False
4    False
5    False
Name: speed, dtype: bool
In [166]:
# Pass my boolean Series to the indexing operator as a selector to find observations that match the fastest speed

mammals[bool_series]
Out[166]:
weight speed hoppers specials
53 55.0 110.0 False False
In [167]:
# Isolate the weight value if you want like this...

mammals[bool_series].weight
Out[167]:
53    55.0
Name: weight, dtype: float64
In [168]:
# or like this without the index label. Again, your context will dictate the data you want to return and how you get it.

mammals.loc[mammals.speed.idxmax()].weight
Out[168]:
55.0

e. What is the overall percentage of specials?

In [169]:
# We have a boolean Series already.

mammals.specials.head()
Out[169]:
1    False
2    False
3    False
4    False
5    False
Name: specials, dtype: bool
In [170]:
# First, I'll find the number of mammals classified as specials.
# True boolean values are understood as having a value of 1, so I can sum the boolean Series.

total_specials = mammals.specials.sum()
total_specials
Out[170]:
10
In [171]:
# Find the total number of mammals in my df.

total_mammals = len(mammals)
total_mammals
Out[171]:
107
In [172]:
print(f'{round(total_specials / total_mammals * 100, 2)}% of mammals in the df are specials.')
9.35% of mammals in the df are specials.

f. How many animals are hoppers that are above the median speed? What percentage is this?

  • I interpreted this question as animals that are hoppers AND above the median speed for all mammals in our DataFrame.
In [173]:
# Remind myself of column names and values.

mammals.head(1)
Out[173]:
weight speed hoppers specials
1 6000.0 35.0 False False
In [174]:
# Find the median speed of mammals in our df.

median_speed = mammals.speed.median()
median_speed
Out[174]:
48.0
In [175]:
# Create boolean Series for mammals with speed above `median_speed` and True for hoppers

bool_series = (mammals.speed > median_speed) & (mammals.hoppers == True)
bool_series.head()
Out[175]:
1    False
2    False
3    False
4    False
5    False
dtype: bool
In [176]:
# These are our fast hoppers.

fast_hoppers = mammals[bool_series]
fast_hoppers
len(fast_hoppers)
Out[176]:
7
In [177]:
print(f'This puts fast hoppers at {round((len(fast_hoppers) / len(mammals)) * 100, 2)}% of the mammals.')
This puts fast hoppers at 6.54% of the mammals.