Pandas DataFrames Exercises

Big Idea¶

The Cycle of Improvement - James Clear (Atomic Habits)

Awareness - identify what you need to improve.
Deliberate practice - focus your conscious effort on the specific area you want to improve.
Habit - with practice, the effortful becomes automatic.
Repeat - begin again.

Objectives¶

By the end of the lesson and exercises, you will be able to...

create subsets of data from a pandas DataFrame.

.loc[row_indexer, col_indexer]
.iloc[row_indexer, col_indexer]
df[bool_series]

create and manipulate columns in a pandas DataFrame.

.drop(columns=)
.rename(columns={'original_name': 'new_name'})
.assign(new_column = some_expression_or_calculation)

sort the data in a pandas DataFrame.

.sort_values(by=column(s))
.sort_index(ascending=True)

chain methods to perform more complicated data manipulations.

df.drop().rename()
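
As a rough sketch of what chaining can look like, the snippet below strings several of the methods listed above together. The `demo` DataFrame, its values, and the `passing_math` column are made up for illustration and are not part of the exercises.

```python
import pandas as pd

# Made-up data, purely for illustration.
demo = pd.DataFrame({'name': ['Ada', 'Alan', 'Marie'],
                     'math': [90, 70, 80],
                     'notes': ['x', 'y', 'z']})

# Each method returns a new DataFrame, so the calls can be chained.
chained = (demo
           .drop(columns='notes')
           .rename(columns={'math': 'math_grade'})
           .assign(passing_math=lambda d: d.math_grade >= 75)
           .sort_values(by='math_grade', ascending=False))

print(chained)
```

Wrapping the chain in parentheses lets each step sit on its own line, and `demo` itself is left unchanged.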

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

from pydataset import data

1. Create student grades DataFrame object¶

np.random.seed(123)

students = ['Sally', 'Jane', 'Suzie', 'Billy', 'Ada', 'John', 'Thomas',
            'Marie', 'Albert', 'Richard', 'Isaac', 'Alan']

# randomly generate scores for each student for each subject
# note that all the values need to have the same length here

math_grades = np.random.randint(low=60, high=100, size=len(students))
english_grades = np.random.randint(low=60, high=100, size=len(students))
reading_grades = np.random.randint(low=60, high=100, size=len(students))

df = pd.DataFrame({'name': students,
                   'math': math_grades,
                   'english': english_grades,
                   'reading': reading_grades})

Peek at DataFrame

df.head()

# Use shape attribute

df.shape

(12, 4)

# Use info method to view both datatypes and potential missing values

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   name     12 non-null     object
 1   math     12 non-null     int64 
 2   english  12 non-null     int64 
 3   reading  12 non-null     int64 
dtypes: int64(3), object(1)
memory usage: 512.0+ bytes

# Use the `.describe()` method to produce descriptive statistics for columns with numeric datatypes

df.describe()

a. Create a column named `passing_english` that indicates whether each student has a passing grade in english.¶

# Create a boolean Series or a boolean mask that returns `True` for a passing English grade.

df.english >= 70

0      True
1      True
2      True
3      True
4      True
5      True
6     False
7     False
8     False
9      True
10     True
11    False
Name: english, dtype: bool

# Create a new column in our DataFrame and assign our boolean Series to it.

df['passing_english'] = df.english >= 70

# Check dataframe for new column.

df.head()

# How many people are passing English? Use the `.sum()` function to add the True bool (1) values.

df['passing_english'].sum()

8

# How many students are failing English? Use the `.sum()` function to add True values for failing.

df['passing_english'] == False

0     False
1     False
2     False
3     False
4     False
5     False
6      True
7      True
8      True
9     False
10    False
11     True
Name: passing_english, dtype: bool

(df['passing_english'] == False).sum()

4
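
A small sketch of an alternative to comparing with `False`: the `~` operator flips every value in a boolean Series. The `passing` Series here is made up for illustration.

```python
import pandas as pd

# Made-up boolean Series, purely for illustration.
passing = pd.Series([True, True, False, True, False])

# ~ inverts each boolean, so True marks the failing rows.
failing = ~passing
n_failing = failing.sum()

# Same count as the comparison-based version.
same_count = (passing == False).sum()
print(n_failing, same_count)
```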

b. Sort the english grades by the `passing_english` column. How are duplicates handled?¶

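.sort_values returns a sorted copy of a given DataFrame unless inplace=True.
In this output, rows with the same passing_english value appear in ascending index order. That order is not guaranteed, though: the default kind='quicksort' is not a stable sort, so pass kind='mergesort' if tied rows must keep their original order.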

# DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
# sorts all of the rows in the DF using the column passed

df.sort_values(by='passing_english')
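
To make the handling of duplicates explicit, here is a minimal sketch that requests a stable sort. The `ties` DataFrame is made up for illustration; `kind='mergesort'` is a documented option of `.sort_values`.

```python
import pandas as pd

# Made-up data with duplicate values in the sort column.
ties = pd.DataFrame({'name': ['Sally', 'Thomas', 'Jane', 'Marie'],
                     'passing': [True, False, True, False]})

# A stable sort keeps tied rows in their original relative order.
stable = ties.sort_values(by='passing', kind='mergesort')
print(stable)
```

With a stable sort, the two `False` rows keep their original order (index 1 before 3), as do the two `True` rows (index 0 before 2).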

c. Sort the english grades first by `passing_english` and then by student `name`.¶

All the students that are failing english should be first, and within the students who are failing english, they are ordered alphabetically. The same will be true for the students passing english.

Hint: You can pass a list to the .sort_values method.

# Now we see that Alan comes before Albert because there is a secondary sort going on alphabetically by name

df.sort_values(by=['passing_english', 'name'])

# What if I want the students passing English first but names in alpha order?

df.sort_values(by=['passing_english', 'name'], ascending=[False, True])

d. Sort the english grades first by `passing_english`, and then by the actual `english` grade, similar to how we did in the last step.¶

df.sort_values(by=['passing_english', 'english'])

# Reverse my sort on both columns

df.sort_values(by=['passing_english', 'english'], ascending=[False, False])

e. Calculate each student's overall grade and add it as a column on the DataFrame. The overall grade is the average of the math, english, and reading grades.¶

The .iloc indexer lets me access a group of rows and columns by integer position. The observations returned match the integer positions passed to .iloc, and a slice does NOT include its stop position, so 1:4 returns positions 1, 2, and 3. (from Series Review Notebook)

df.iloc[row_indexer, column_indexer]

I could also solve this using .loc if I wanted to use column labels instead of integer positions; unlike .iloc, a .loc slice includes both end labels.

df.loc[:, 'math': 'reading']
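
The difference between the two indexers' slicing rules can be sketched on a tiny, made-up `grades` DataFrame (two of the rows from above, rebuilt here so the snippet runs on its own):

```python
import pandas as pd

# Small stand-in DataFrame, for illustration only.
grades = pd.DataFrame({'name': ['Sally', 'Jane'],
                       'math': [62, 88],
                       'english': [85, 79],
                       'reading': [80, 67]})

# .iloc slices by position and excludes the stop position.
by_position = grades.iloc[:, 1:3]             # positions 1 and 2

# .loc slices by label and includes both end labels.
by_label = grades.loc[:, 'math':'english']    # math through english

# The same rule applies to rows.
rows_iloc = grades.iloc[0:1]                  # one row (position 0)
rows_loc = grades.loc[0:1]                    # two rows (labels 0 and 1)

print(by_position.columns.tolist(), by_label.columns.tolist())
print(len(rows_iloc), len(rows_loc))
```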

# Here, I'm selecting all rows and columns at index positions 1, 2, 3. Now I can just sum them.

df.iloc[:, 1:4]

# Since I want the total of the columns in each row, I set axis=1 for columns.

df.iloc[:, 1:4].sum(axis=1)

0     227
1     234
2     263
3     282
4     267
5     248
6     227
7     246
8     241
9     243
10    284
11    226
dtype: int64

# Assign to variable and use the Series to create the calculated column I want on my df.

totals = df.iloc[:, 1:4].sum(axis=1)

# assign Series back to DataFrame as overall_average

df['overall_average'] = round((totals / 3), 0).astype(int)
df.head()
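
One possible shortcut, sketched on a made-up two-row `grades` DataFrame: `.mean(axis=1)` computes the row-wise average directly, so the separate sum-and-divide step is not needed. The `subjects` list is a name introduced here for illustration.

```python
import pandas as pd

# Small stand-in DataFrame, for illustration only.
grades = pd.DataFrame({'name': ['Sally', 'Jane'],
                       'math': [62, 88],
                       'english': [85, 79],
                       'reading': [80, 67]})

subjects = ['math', 'english', 'reading']

# Row-wise mean across the three subject columns.
overall = grades[subjects].mean(axis=1)

# Round to the nearest whole number and store as an integer column.
grades = grades.assign(overall_average=overall.round().astype(int))
print(grades)
```

Selecting columns by name also avoids depending on their positions, which matters if columns are added or reordered later.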

2. Load the mpg dataset. Read the documentation for the dataset and use it for the following questions:¶

data('mpg', show_doc=True)

mpg

PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)

## Fuel economy data from 1999 and 2008 for 38 popular models of car

### Description

This dataset contains a subset of the fuel economy data that the EPA makes
available on http://fueleconomy.gov. It contains only models which had a new
release every year between 1999 and 2008 - this was used as a proxy for the
popularity of the car.

### Usage

    data(mpg)

### Format

A data frame with 234 rows and 11 variables

### Details

  * manufacturer. 

  * model. 

  * displ. engine displacement, in litres 

  * year. 

  * cyl. number of cylinders 

  * trans. type of transmission 

  * drv. f = front-wheel drive, r = rear wheel drive, 4 = 4wd 

  * cty. city miles per gallon 

  * hwy. highway miles per gallon 

  * fl. 

  * class.

mpg = data('mpg')

mpg.head()

a. How many rows and columns are there?¶

mpg.shape

(234, 11)

print(f'There are {mpg.shape[0]} rows and {mpg.shape[1]} columns in the mpg DataFrame.')

There are 234 rows and 11 columns in the mpg DataFrame.

b. What are the data types of each column?¶

mpg.dtypes

manufacturer     object
model            object
displ           float64
year              int64
cyl               int64
trans            object
drv              object
cty               int64
hwy               int64
fl               object
class            object
dtype: object

c. Summarize the dataframe with the `.info()` and `.describe()` methods.¶

.info() shows us that all of the columns have the same number of non-null values.
.describe() provides us with the descriptive statistics for all columns with numeric dtypes.

mpg.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 234 entries, 1 to 234
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   manufacturer  234 non-null    object 
 1   model         234 non-null    object 
 2   displ         234 non-null    float64
 3   year          234 non-null    int64  
 4   cyl           234 non-null    int64  
 5   trans         234 non-null    object 
 6   drv           234 non-null    object 
 7   cty           234 non-null    int64  
 8   hwy           234 non-null    int64  
 9   fl            234 non-null    object 
 10  class         234 non-null    object 
dtypes: float64(1), int64(4), object(6)
memory usage: 21.9+ KB

mpg.describe()

d. Rename the `cty` column to `city` and `hwy` to `highway` using `.rename()` method.¶

.rename() takes in a dictionary with the key as the original name and the value as the new name.
If you want the original DataFrame itself to reflect the new column names, pass inplace=True; otherwise, .rename() returns a renamed copy.

mpg.rename(columns={'cty': 'city', 'hwy': 'highway'}, inplace=True)

mpg.head(1)
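
As a sketch of the non-inplace alternative, the result of `.rename()` can be assigned to a name instead. The `cars` DataFrame below is a made-up stand-in for `mpg`.

```python
import pandas as pd

# Made-up stand-in for the mpg columns being renamed.
cars = pd.DataFrame({'cty': [18, 21], 'hwy': [29, 29]})

# Without inplace=True, .rename() returns a new DataFrame.
renamed = cars.rename(columns={'cty': 'city', 'hwy': 'highway'})

print(cars.columns.tolist())      # original labels are untouched
print(renamed.columns.tolist())   # new labels live on the copy
```

Reassigning (`cars = cars.rename(...)`) gives the same end result as `inplace=True` and fits more naturally into a method chain.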

Another way to rename columns...¶

This is another way you can rename columns, especially if you want to change many at once.
I can use the .columns attribute to grab my column labels; I can go a step further and print out a list of the current columns in the DataFrame by adding the .to_list() method. It's not necessary if I'm just grabbing the column names, but I love getting the nice list.
Then, I make any changes I want to the names in the list and reassign them to df.columns.

mpg = data('mpg')
mpg.head()

mpg.columns.to_list()

['manufacturer',
 'model',
 'displ',
 'year',
 'cyl',
 'trans',
 'drv',
 'cty',
 'hwy',
 'fl',
 'class']

mpg.columns = ['manufacturer', 'model', 'displ', 'year', 'cyl', 'trans', 'drv', 'city',
       'highway', 'fl', 'class']

mpg.head(1)

e. Do any cars have better city mileage than highway mileage?¶

# Create a boolean Series or a boolean Mask

bool_series = mpg.city > mpg.highway
bool_series.head()

1    False
2    False
3    False
4    False
5    False
dtype: bool

# Return a subset of the original DataFrame using the indexing operator.

mpg[bool_series]

# I can do a quick check to validate my findings above. There are no observations that meet this condition.

bool_series.sum()

0

f. Create a column named `mileage_difference`; this column should contain the difference between highway and city mileage for each car.¶

You saw above how to create a new column in a df using bracket notation; this time I'll show you how to create a new column in a df using the .assign() method.
They are both valid ways to create a new column; if you want to create more than one column at once, .assign() is worth looking into. Choose your flavor...

# This operation returns a Series of differences in mileage for each row. Looks good.

mpg.highway - mpg.city

1      11
2       8
3      11
4       9
5      10
       ..
230     9
231     8
232    10
233     8
234     9
Length: 234, dtype: int64

# This is how I can create a new column in my df from a calculation on two existing columns.
# I can reassign to my original df at this point, or I might want to assign to a new df name.

mpg = mpg.assign(mileage_difference = mpg.highway - mpg.city)

mpg.head()
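
Since `.assign()` accepts several keyword arguments, more than one column can be created in a single call. The sketch below uses a made-up `cars` DataFrame and lambdas so that each expression refers to the DataFrame being assigned to.

```python
import pandas as pd

# Made-up stand-in with the renamed mileage columns.
cars = pd.DataFrame({'city': [18, 21], 'highway': [29, 29]})

# Each keyword becomes a new column; a lambda receives the DataFrame.
cars = cars.assign(
    mileage_difference=lambda d: d.highway - d.city,
    average_mileage=lambda d: (d.city + d.highway) / 2,
)
print(cars)
```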

g. Which car (or cars) has the highest mileage difference?¶

# Use the `.max()` function to find the max value in the column.

mpg.mileage_difference.max()

12

# Create a boolean Series where `True` values identify a match with the max mileage_difference. 

bool_series = mpg.mileage_difference == mpg.mileage_difference.max()
bool_series.head()

1    False
2    False
3    False
4    False
5    False
Name: mileage_difference, dtype: bool

# Pass my bool_series as a selector for rows in my mpg DataFrame.

mpg[bool_series]

# I can get more specific if I like...

mpg[bool_series][['manufacturer', 'model']]
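
The two bracket operations can also be combined into one `.loc` call, which takes a row selector and a column selector together. This is only a sketch on a made-up three-row `cars` DataFrame.

```python
import pandas as pd

# Made-up rows, for illustration only.
cars = pd.DataFrame({'manufacturer': ['honda', 'audi', 'volkswagen'],
                     'model': ['civic', 'a4', 'new beetle'],
                     'mileage_difference': [12, 11, 12]})

# Boolean mask for rows that match the maximum difference.
is_max = cars.mileage_difference == cars.mileage_difference.max()

# Rows and columns are selected in a single step.
top = cars.loc[is_max, ['manufacturer', 'model']]
print(top)
```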

h. Which compact class car has the lowest highway mileage?¶

# Why doesn't this work??? This is an example of a situation where you have to use bracket notation

mpg.class

  File "<ipython-input-133-4e43ef5af34c>", line 3
    mpg.class
            ^
SyntaxError: invalid syntax

# I check and see that I have a column named `class`. What's up?
# 'class' is a reserved word, so I have to use bracket notation to access the column!

mpg.columns

Index(['manufacturer', 'model', 'displ', 'year', 'cyl', 'trans', 'drv', 'city',
       'highway', 'fl', 'class', 'mileage_difference'],
      dtype='object')
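
A quick way to check whether a column name collides with a Python keyword is the standard library's `keyword` module. The sketch below uses a made-up two-row `cars` DataFrame.

```python
import keyword

import pandas as pd

# Made-up rows with a column named after a Python keyword.
cars = pd.DataFrame({'class': ['compact', 'suv'],
                     'model': ['a4', 'durango 4wd']})

# 'class' is a keyword, so attribute access is a syntax error.
print(keyword.iskeyword('class'))

# Bracket notation works for any column label.
compact_rows = cars[cars['class'] == 'compact']
print(compact_rows)
```

Bracket notation is also the only option for labels that contain spaces or that clash with existing DataFrame attributes.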

# I'm taking a look at all of the possible values in the column here.

mpg['class'].value_counts()

suv           62
compact       47
midsize       41
subcompact    35
pickup        33
minivan       11
2seater        5
Name: class, dtype: int64

# Create the bool Series or selector for the compact class of cars.

bool_series = mpg['class'] == 'compact'
bool_series.head()

1    True
2    True
3    True
4    True
5    True
Name: class, dtype: bool

# Pass my selector to my original DataFrame to get a subset of compact cars.

compacts = mpg[bool_series]
compacts.head()

# Sort the DataFrame by highway mileage

compacts.sort_values(by='highway').head()

# If you want to isolate the compact car with the lowest highway mileage

compacts.sort_values(by='highway').head(1)

The best highway mileage?

# Isolate the compact car with the best highway mileage.

compacts.sort_values(by='highway', ascending=False).head()

# isolate the row using .head() method

compacts.sort_values(by='highway', ascending=False).head(1)
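
Because `.head(1)` returns a single row even when several rows tie, `.nsmallest()` and `.nlargest()` with `keep='all'` may be worth considering. The sketch below uses a made-up `compacts_demo` DataFrame with a deliberate tie.

```python
import pandas as pd

# Made-up data with a tie for the lowest highway mileage.
compacts_demo = pd.DataFrame({'model': ['a', 'b', 'c', 'd'],
                              'highway': [23, 23, 25, 29]})

# keep='all' returns every row tied for the extreme value.
lowest = compacts_demo.nsmallest(1, 'highway', keep='all')
highest = compacts_demo.nlargest(1, 'highway', keep='all')

print(lowest)
print(highest)
```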

j. Create a column named `average_mileage` that is the mean of the city and highway mileage.¶

# Create the calculated column

(mpg.city + mpg.highway) / 2

1      23.5
2      25.0
3      25.5
4      25.5
5      21.0
       ... 
230    23.5
231    25.0
232    21.0
233    22.0
234    21.5
Length: 234, dtype: float64

# Assign the Series back to the original DataFrame as a new column.

mpg['average_mileage'] = (mpg.city + mpg.highway) / 2

mpg.head(2)

k. Which Dodge car has the best average mileage? The worst?¶

# Create the boolean Series to filter for Dodge vehicles.

bool_series = mpg.manufacturer == 'dodge'
bool_series.head()

1    False
2    False
3    False
4    False
5    False
Name: manufacturer, dtype: bool

# Use the selector to create a subset of only Dodge vehicles.

dodges = mpg[bool_series]
dodges.head()

# Isolate the Dodge with the best average mileage.

dodges.sort_values(by='average_mileage').tail(1)

# Isolate the Dodges with the worst average mileage; four vehicles tie for the lowest value, so I show the first four rows.

dodges.sort_values(by='average_mileage').head(4)

# another way...

worst_mileage = dodges.average_mileage.min()
worst_mileage

10.5

# Compare the average_mileage of each observation to the scalar value of worst_mileage; return ALL values that match.

dodges[dodges.average_mileage == worst_mileage]

3. Load the Mammals dataset. Read the documentation for it, and use the data to answer these questions:¶

data('Mammals', show_doc=True)

Mammals

PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)

## Garland(1983) Data on Running Speed of Mammals

### Description

Observations on the maximal running speed of mammal species and their body
mass.

### Usage

    data(Mammals)

### Format

A data frame with 107 observations on the following 4 variables.

weight

Body mass in Kg for "typical adult sizes"

speed

Maximal running speed (fastest sprint velocity on record)

hoppers

logical variable indicating animals that ambulate by hopping, e.g. kangaroos

specials

logical variable indicating special animals with "lifestyles in which speed
does not figure as an important factor": Hippopotamus, raccoon (Procyon),
badger (Meles), coati (Nasua), skunk (Mephitis), man (Homo), porcupine
(Erithizon), oppossum (didelphis), and sloth (Bradypus)

### Details

Used by Chappell (1989) and Koenker, Ng and Portnoy (1994) to illustrate the
fitting of piecewise linear curves.

### Source

Garland, T. (1983) The relation between maximal running speed and body mass in
terrestrial mammals, _J. Zoology_, 199, 1557-1570.

### References

Koenker, R., P. Ng and S. Portnoy, (1994) Quantile Smoothing Splines”
_Biometrika_, 81, 673-680.

Chappell, R. (1989) Fitting Bent Lines to Data, with Applications ot
Allometry, _J. Theo. Biology_, 138, 235-256.

### See Also

`rqss`

### Examples

    data(Mammals)
    attach(Mammals)
    x <- log(weight)
    y <- log(speed)
    plot(x,y, xlab="Weight in log(Kg)", ylab="Speed in log(Km/hour)",type="n")
    points(x[hoppers],y[hoppers],pch = "h", col="red")
    points(x[specials],y[specials],pch = "s", col="blue")
    others <- (!hoppers & !specials)
    points(x[others],y[others], col="black",cex = .75)
    fit <- rqss(y ~ qss(x, lambda = 1),tau = .9)
    plot(fit)

# Load DataFrame and save as `mammals`.

mammals = data('Mammals')

mammals.head()

a. How many rows and columns are there?¶

mammals.shape

(107, 4)

b. What are the data types?¶

mammals.dtypes

weight      float64
speed       float64
hoppers        bool
specials       bool
dtype: object

c. Summarize the dataframe with .info and .describe¶

mammals.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 107 entries, 1 to 107
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   weight    107 non-null    float64
 1   speed     107 non-null    float64
 2   hoppers   107 non-null    bool   
 3   specials  107 non-null    bool   
dtypes: bool(2), float64(2)
memory usage: 2.7 KB

mammals.describe()

Quick visualization of the distribution of weight and speed values

plt.hist(mammals.weight, bins=20, color='orange', edgecolor='black')

plt.title('Weight of Mammals in kg')
plt.xlabel('kg')
plt.ylabel('Count')

plt.show()

plt.hist(mammals.speed, bins=15, color='dodgerblue', edgecolor='black')

plt.title('Full Sprint Speed of Mammals in km/h')
plt.xlabel('km/h')
plt.ylabel('Count')

plt.show()

d. What is the weight of the fastest animal?¶

# Count the observations at each speed and sort the speeds from fastest to slowest. Only one observation has the top speed of 110 km/h.

mammals.speed.value_counts().sort_index(ascending=False).head()

110.0    1
105.0    1
100.0    1
97.0     2
90.0     1
Name: speed, dtype: int64

# This returns the index LABEL, not the integer position, and that's why I'm using .loc below. `df.loc[row_label]`

mammals.speed.idxmax()

53

# I can isolate that observation/mammal using the `.loc` indexer and `.idxmax()` method.

mammals.loc[mammals.speed.idxmax()]

weight         55
speed         110
hoppers     False
specials    False
Name: 53, dtype: object

# You guessed it; it's a Series! I can access any of those index labels or values.

type(mammals.loc[mammals.speed.idxmax()])

pandas.core.series.Series

mammals.loc[mammals.speed.idxmax()].weight

55.0
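
The label-versus-position distinction matters here because the index starts at 1 rather than 0. A minimal sketch on a made-up `animals` DataFrame with the same kind of 1-based index:

```python
import pandas as pd

# Made-up rows with a 1-based index, like the Mammals data.
animals = pd.DataFrame({'weight': [6000.0, 55.0, 400.0],
                        'speed': [35.0, 110.0, 70.0]},
                       index=[1, 2, 3])

# idxmax returns the index LABEL of the fastest animal.
label = animals.speed.idxmax()
by_label = animals.loc[label, 'weight']

# argmax on the underlying array returns the integer POSITION.
position = animals.speed.to_numpy().argmax()
by_position = animals.iloc[position]['weight']

# Passing the label to .iloc would silently pick a different row.
wrong_row = animals.iloc[label]['weight']

print(label, position, by_label, by_position, wrong_row)
```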

# OR -> I can create a boolean Series to filter for observations that match the max speed.

bool_series = mammals.speed == mammals.speed.max()
bool_series.head()

1    False
2    False
3    False
4    False
5    False
Name: speed, dtype: bool

# Pass my boolean Series to the indexing operator as a selector to find observations that match the fastest speed

mammals[bool_series]

# Isolate the weight value if you want like this...

mammals[bool_series].weight

53    55.0
Name: weight, dtype: float64

# or like this without the index label. Again, your context will dictate the data you want to return and how you get it.

mammals.loc[mammals.speed.idxmax()].weight

55.0

e. What is the overall percentage of specials?¶

# We have a boolean Series already.

mammals.specials.head()

1    False
2    False
3    False
4    False
5    False
Name: specials, dtype: bool

# First, I'll find the number of mammals classified as specials.
# True boolean values are understood as having a value of 1, so I can sum the boolean Series.

total_specials = mammals.specials.sum()
total_specials

10

# Find the total number of mammals in my df.

total_mammals = len(mammals)
total_mammals

107

print(f'{round(total_specials / total_mammals * 100, 2)}% of mammals in the df are specials.')

9.35% of mammals in the df are specials.
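
Since `True` counts as 1, the mean of a boolean Series is the proportion of `True` values, which offers a shorter route to a percentage. The `flags` Series is made up for illustration.

```python
import pandas as pd

# Made-up boolean Series, for illustration only.
flags = pd.Series([True, False, False, False])

# The mean of booleans is the share of True values.
pct_mean = flags.mean() * 100

# Equivalent to dividing the count of True values by the length.
pct_manual = flags.sum() / len(flags) * 100

print(pct_mean, pct_manual)
```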

f. How many animals are hoppers that are above the median speed? What percentage is this?¶

I interpreted this question as animals that are hoppers AND above the median speed for all mammals in our DataFrame.

# Remind myself of column names and values.

mammals.head(1)

# Find the median speed of mammals in our df.

median_speed = mammals.speed.median()
median_speed

48.0

# Create boolean Series for mammals with speed above `median_speed` and True for hoppers

bool_series = (mammals.speed > median_speed) & (mammals.hoppers == True)
bool_series.head()

1    False
2    False
3    False
4    False
5    False
dtype: bool

# These are our fast hoppers.

fast_hoppers = mammals[bool_series]
fast_hoppers
len(fast_hoppers)

7

print(f'This puts fast hoppers at {round((len(fast_hoppers) / len(mammals)) * 100, 2)}% of the mammals.')

This puts fast hoppers at 6.54% of the mammals.
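
A compact sketch of combining two conditions, using a made-up `animals` DataFrame. Each comparison needs its own parentheses because `&` binds more tightly than `>`, and a column that is already boolean can be used directly without `== True`.

```python
import pandas as pd

# Made-up data, for illustration only.
animals = pd.DataFrame({'speed': [20.0, 50.0, 64.0, 30.0, 56.0],
                        'hoppers': [False, True, True, True, False]})

median_speed = animals.speed.median()

# Parenthesize the comparison; use the boolean column as-is.
fast_hoppers = animals[(animals.speed > median_speed) & animals.hoppers]

pct = len(fast_hoppers) / len(animals) * 100
print(len(fast_hoppers), pct)
```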

Output of df.describe():

	math	english	reading
count	12.000000	12.000000	12.000000
mean	84.833333	77.666667	86.500000
std	11.134168	13.371158	9.643651
min	62.000000	62.000000	67.000000
25%	78.500000	63.750000	80.750000
50%	90.000000	77.500000	89.000000
75%	92.250000	86.750000	93.250000
max	98.000000	99.000000	98.000000

Output of mpg.describe():

	displ	year	cyl	cty	hwy
count	234.000000	234.000000	234.000000	234.000000	234.000000
mean	3.471795	2003.500000	5.888889	16.858974	23.440171
std	1.291959	4.509646	1.611534	4.255946	5.954643
min	1.600000	1999.000000	4.000000	9.000000	12.000000
25%	2.400000	1999.000000	4.000000	14.000000	18.000000
50%	3.300000	2003.500000	6.000000	17.000000	24.000000
75%	4.600000	2008.000000	8.000000	19.000000	27.000000
max	7.000000	2008.000000	8.000000	35.000000	44.000000

Output of mpg[bool_series][['manufacturer', 'model']]:

	manufacturer	model
107	honda	civic
223	volkswagen	new beetle

Output of mammals.describe():

	weight	speed
count	107.000000	107.000000
mean	278.688178	46.208411
std	839.608269	26.716778
min	0.016000	1.600000
25%	1.700000	22.500000
50%	34.000000	48.000000
75%	142.500000	65.000000
max	6000.000000	110.000000

Output of df.head() after adding passing_english:

	name	math	english	reading	passing_english
0	Sally	62	85	80	True
1	Jane	88	79	67	True
2	Suzie	94	74	95	True
3	Billy	98	96	88	True
4	Ada	77	92	98	True

Output of df.sort_values(by='passing_english'):

	name	math	english	reading	passing_english
6	Thomas	82	64	81	False
7	Marie	93	63	90	False
8	Albert	92	62	87	False
11	Alan	92	62	72	False
0	Sally	62	85	80	True
1	Jane	88	79	67	True
2	Suzie	94	74	95	True
3	Billy	98	96	88	True
4	Ada	77	92	98	True
5	John	79	76	93	True
9	Richard	69	80	94	True
10	Isaac	92	99	93	True

Output of df.loc[:, 'math': 'reading'] (df.iloc[:, 1:4] returns the same columns):

	math	english	reading
0	62	85	80
1	88	79	67
2	94	74	95
3	98	96	88
4	77	92	98
5	79	76	93
6	82	64	81
7	93	63	90
8	92	62	87
9	69	80	94
10	92	99	93
11	92	62	72

Output of mpg.head():

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
1	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
2	audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
3	audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact
4	audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact
5	audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact

Output of compacts.sort_values(by='highway').head():

	manufacturer	model	displ	year	cyl	trans	drv	city	highway	fl	class	mileage_difference
220	volkswagen	jetta	2.8	1999	6	auto(l4)	f	16	23	r	compact	7
221	volkswagen	jetta	2.8	1999	6	manual(m5)	f	17	24	r	compact	7
212	volkswagen	gti	2.8	1999	6	manual(m5)	f	17	24	r	compact	7
172	subaru	impreza awd	2.5	2008	4	manual(m5)	4	19	25	p	compact	6
170	subaru	impreza awd	2.5	2008	4	auto(s4)	4	20	25	p	compact	5

Output of dodges.head():

	manufacturer	model	displ	year	cyl	trans	drv	city	highway	fl	class	mileage_difference	average_mileage
38	dodge	caravan 2wd	2.4	1999	4	auto(l3)	f	18	24	r	minivan	6	21.0
39	dodge	caravan 2wd	3.0	1999	6	auto(l4)	f	17	24	r	minivan	7	20.5
40	dodge	caravan 2wd	3.3	1999	6	auto(l4)	f	16	22	r	minivan	6	19.0
41	dodge	caravan 2wd	3.3	1999	6	auto(l4)	f	16	22	r	minivan	6	19.0
42	dodge	caravan 2wd	3.3	2008	6	auto(l4)	f	17	24	r	minivan	7	20.5

Output of dodges.sort_values(by='average_mileage').head(4):

	manufacturer	model	displ	year	cyl	trans	drv	city	highway	fl	class	mileage_difference	average_mileage
70	dodge	ram 1500 pickup 4wd	4.7	2008	8	manual(m6)	4	9	12	e	pickup	3	10.5
66	dodge	ram 1500 pickup 4wd	4.7	2008	8	auto(l5)	4	9	12	e	pickup	3	10.5
60	dodge	durango 4wd	4.7	2008	8	auto(l5)	4	9	12	e	suv	3	10.5
55	dodge	dakota pickup 4wd	4.7	2008	8	auto(l5)	4	9	12	e	pickup	3	10.5

Output of mammals.head():

	weight	speed	hoppers	specials
1	6000.0	35.0	False	False
2	4000.0	26.0	False	False
3	3000.0	25.0	False	False
4	1400.0	45.0	False	False
5	400.0	70.0	False	False

