import pandas as pd
import numpy as np
import os
import requests
import warnings
warnings.filterwarnings("ignore")
I'm going to investigate the documentation provided by the API and explore a couple of responses before I dig into the exercises.
# I can make a request to the url below and use the `.json()` method on my results to return a dictionary object.
base_url = 'https://python.zach.lol'
type(requests.get(base_url).json())
# I have two choices of paths I can add to my base url, '/api/v1' and '/documentation'.
requests.get(base_url).json()
# I'll create a doc_url to request help with using this api below.
doc_url = base_url + '/documentation'
# I have two keys in the dictionary returned from my request.
requests.get(doc_url).json().keys()
# I can print the value for the status key.
print(requests.get(doc_url).json()['status'])
# I can print the value for the payload key.
print(requests.get(doc_url).json()['payload'])
This tells me that I can use 3 different endpoints to access data by adding stores, items, or sales to my base_url + /api/v1/, like below:
'https://python.zach.lol/api/v1/items'
'https://python.zach.lol/api/v1/stores'
'https://python.zach.lol/api/v1/sales'
There is also a page parameter that I can add to each of these endpoints to navigate through multiple pages of results.
'?page=n'
For example:
'https://python.zach.lol/api/v1/items?page=1'
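As a quick sketch of what that looks like (just checking the page parameter before the exercises), I can request page 1 of items by its full URL and confirm the page number reported in the payload:
# Request page 1 of items and confirm which page the payload says it is.
requests.get('https://python.zach.lol/api/v1/items?page=1').json()['payload']['page']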
# I will create my api url.
api_url = base_url + '/api/v1/'
Using the code from the lesson as a guide, create a dataframe named items that has all of the data for items.
# This submits the request for the first page of results and stores the results in response.
# My request was successful.
response = requests.get(api_url + 'items')
response.ok
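For context, `.ok` is just shorthand for a status code below 400; the code itself is also available if I want to see it:
# 200 means the request succeeded.
response.status_code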
# Use the `.json()` method on my response to get a dictionary object; I'll store it in the `data` variable.
data = response.json()
print(type(data))
data
# List the keys in my dictionary object; I see payload and status.
data.keys()
I can see above that 'payload' is also a dictionary object; I also see that the first key, items, has a value that is a list of dictionaries. I can check out all of the key:value pairs in payload to see what is of use to me.
# Look at the keys in the payload dictionary.
data['payload'].keys()
# I see that the `items` list holds 20 dictionaries (items).
len(data['payload']['items'])
# I'll check out just the first 2 dictionaries (items) in the list.
data['payload']['items'][:2]
# Look at the values of the other keys in the 'payload' dictionary.
print(f"The current page of the results from my request is {data['payload']['page']}.")
print(f"The next page of the results from my request is {data['payload']['next_page']}.")
print(f"The total number of pages in the results from my request is {data['payload']['max_page']}.")
print(f"The previous page in the results from my request is {data['payload']['previous_page']}.")
# I create a list variable to hold the list of the 20 items from page one.
items = data['payload']['items']
print(len(items))
type(items)
# 'next_page' returns the path and page param for the second page of results.
data['payload']['next_page']
# Submit a request for the next page and store it in the `response` variable.
response = requests.get(base_url + data['payload']['next_page'])
# Use the `.json()` method to return a dictionary object like I did above for page 1.
data = response.json()
# Add items from the second page to our list using `.extend()`
items.extend(data['payload']['items'])
# The `items` list now contains 40 items (dictionaries).
len(items)
# Our next page is page 3 of 3 for items.
data['payload']['next_page']
# Grab the next page in the same way and add items to my `items` list.
# I see there are only 10 items on this last page.
response = requests.get(base_url + data['payload']['next_page'])
data = response.json()
len(data['payload']['items'])
# Add the last 10 items to my `items` list, which now contains a total of 50 items.
items.extend(data['payload']['items'])
len(items)
There is no next page, so data['payload']['next_page'] returns None. This could come in handy when we write our function later to automate the above process.
data['payload']['next_page'] == None
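That None is the stopping condition I'll rely on below; the loop in my function will look roughly like this sketch (running it here does nothing, since we're already on the last page):
# Keep following `next_page` until the API returns None.
while data['payload']['next_page'] is not None:
    data = requests.get(base_url + data['payload']['next_page']).json()
    items.extend(data['payload']['items'])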
# Use our items, our list of dictionaries, to create a DataFrame
items_df = pd.DataFrame(items)
print(f'The items_df has the shape {items_df.shape}.\n')
items_df.head(2)
# I want to see how many pages of stores I have to request.
api_url = base_url + '/api/v1/'
response = requests.get(api_url + 'stores')
data = response.json()
data['payload']['max_page']
# This time I want to grab stores instead of items.
data['payload'].keys()
# Again, I have a list of dictionaries; I can convert this into a pandas DataFrame now.
stores = data['payload']['stores']
stores_df = pd.DataFrame(stores)
print(f"My stores_df has the shape {stores_df.shape}")
stores_df.head()
Extract the data for sales. Your code should continue fetching data from the next page until all of the data is extracted.
api_url = base_url + '/api/v1/'
response = requests.get(api_url + 'sales')
data = response.json()
data['payload']['max_page']
The function below requests the data from the API for whichever path name I pass in ('items', 'stores', or 'sales'), returns it as a dataframe, and saves it to a csv file for future use.
def get_df(name):
    """
    This function takes in the string
    'items', 'stores', or 'sales' and
    returns a df containing all pages and
    creates a .csv file for future use.
    """
    base_url = 'https://python.zach.lol'
    api_url = base_url + '/api/v1/'
    response = requests.get(api_url + name)
    data = response.json()
    # create list from 1st page
    my_list = data['payload'][name]
    # loop through the pages and add to list
    while data['payload']['next_page'] is not None:
        response = requests.get(base_url + data['payload']['next_page'])
        data = response.json()
        my_list.extend(data['payload'][name])
    # Create DataFrame from list
    df = pd.DataFrame(my_list)
    # Write DataFrame to csv file for future use
    df.to_csv(name + '.csv')
    return df
items_df = get_df('items')
print(items_df.shape)
items_df.head()
stores_df = get_df('stores')
print(stores_df.shape)
stores_df.head()
sales_df = get_df('sales')
print(sales_df.shape)
sales_df.head()
# I can see all of my dataframes above, so I know how to join and what to drop.
df = pd.merge(sales_df, stores_df, left_on='store', right_on='store_id').drop(columns={'store'})
df.head(2)
df = pd.merge(df, items_df, left_on='item', right_on='item_id').drop(columns={'item'})
df.head(2)
df.shape
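As a quick sanity check (assuming every sale references a valid store and item, so the inner merges drop nothing), the merged row count should match sales_df:
# True if no sales rows were lost in the merges.
df.shape[0] == sales_df.shape[0]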
There is another way that I can approach pagination of APIs using the params parameter with the .get() method. The documentation for the API informed me that "All endpoints accept a page parameter that can be used to navigate through the results."
Above we used the value of data['payload']['next_page'] to provide the path and query parameter, '/api/v1/items?page=n', that we concatenated to our base_url, https://python.zach.lol, to access each page.
Below, I will instead pass a dictionary to params to 'turn the pages', so to speak. This is just a different way to access the data and may come in handy when you work with different APIs. If it's TMI right now, skip it; the above method works fine for this API.
The general form is requests.get(url, params={key: value}, args), where args stands for the other optional keyword arguments that the .get() method accepts.
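As a quick illustration (the page value here is arbitrary), requests encodes the params dictionary into the query string for me:
# requests builds the final URL, including the ?page=2 query string.
r = requests.get('https://python.zach.lol/api/v1/items', params={'page': 2})
r.url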
# Create endpoints for use below.
items_url = 'https://python.zach.lol/api/v1/items'
stores_url = 'https://python.zach.lol/api/v1/stores'
sales_url = 'https://python.zach.lol/api/v1/sales'
# Create an empty list named `results`.
results = []
# Loop through the pages of my endpoint, stopping early if a response comes back empty.
for i in range(3):
    response = requests.get(items_url, params={"page": i + 1})
    # We have reached the end of the results if the response length is 0.
    if len(response.json()) == 0:
        break
    else:
        # Convert my response to a dictionary and store as variable `data`.
        data = response.json()
        # Add the list of dictionaries to my list
        results.extend(data['payload']['items'])
print(results[:2])
len(results)
def get_df_params(name):
    """
    This function takes in the string
    'items', 'stores', or 'sales' and
    returns a df containing all pages and
    creates a .csv file for future use.
    """
    # Create an empty list named `results`.
    results = []
    # Create api_url variable
    api_url = 'https://python.zach.lol/api/v1/'
    # Request the first page to grab its data and find out how many pages there are.
    response = requests.get(api_url + name, params={"page": 1})
    data = response.json()
    results.extend(data['payload'][name])
    max_page = data['payload']['max_page']
    # Loop through the remaining page parameters, adding each page's list of dictionaries.
    for page in range(2, max_page + 1):
        response = requests.get(api_url + name, params={"page": page})
        data = response.json()
        results.extend(data['payload'][name])
    # Create DataFrame from list
    df = pd.DataFrame(results)
    # Write DataFrame to csv file for future use
    df.to_csv(name + '.csv')
    return df
get_df_params('items').head()
# This helper function returns the same data as my other function.
get_df_params('items').shape
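As a quick sanity check (a sketch, assuming both helpers hit the same endpoints), the two DataFrames should be identical:
# Both helpers build a DataFrame from the same list of item dictionaries.
get_df('items').equals(get_df_params('items'))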
get_store_data() Function
Create a function that checks for a csv file, and if one doesn't exist it creates one.
The function should also create one large df using all three dfs.
Create this function using either of our helper functions above; your choice.
def get_store_data():
    """
    This function checks for csv files
    for items, sales, stores, and big_df;
    if any are missing, it creates them.
    It returns one big_df of merged dfs.
    """
    # check for csv files or create them
    if os.path.isfile('items.csv'):
        items_df = pd.read_csv('items.csv', index_col=0)
    else:
        items_df = get_df('items')
    if os.path.isfile('stores.csv'):
        stores_df = pd.read_csv('stores.csv', index_col=0)
    else:
        stores_df = get_df('stores')
    if os.path.isfile('sales.csv'):
        sales_df = pd.read_csv('sales.csv', index_col=0)
    else:
        sales_df = get_df('sales')
    if os.path.isfile('big_df.csv'):
        df = pd.read_csv('big_df.csv', index_col=0)
        return df
    else:
        # merge all of the DataFrames into one
        df = pd.merge(sales_df, stores_df, left_on='store', right_on='store_id').drop(columns={'store'})
        df = pd.merge(df, items_df, left_on='item', right_on='item_id').drop(columns={'item'})
        # write the merged df with all of the data to a csv for future use
        df.to_csv('big_df.csv')
        return df
df = get_store_data()
df.head(2)
df.shape
df.info()
url = 'https://raw.githubusercontent.com/jenfly/opsd/master/opsd_germany_daily.csv'
df = pd.read_csv(url)
df.head()
df.info()
opsd_germany_daily() Function
def opsd_germany_daily():
    """
    This function uses or creates the
    opsd_germany_daily csv and returns a df.
    """
    if os.path.isfile('opsd_germany_daily.csv'):
        df = pd.read_csv('opsd_germany_daily.csv', index_col=0)
    else:
        url = 'https://raw.githubusercontent.com/jenfly/opsd/master/opsd_germany_daily.csv'
        df = pd.read_csv(url)
        df.to_csv('opsd_germany_daily.csv')
    return df
gdf = opsd_germany_daily()
gdf.head(2)
gdf.info()
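One caveat about the cached csv (an assumption about how you'll use the data later): the Date column comes back as plain strings, so you may want to convert it to a datetime index after loading.
# Convert Date to datetime and set it as the index (assumes the csv has a 'Date' column).
gdf['Date'] = pd.to_datetime(gdf['Date'])
gdf = gdf.set_index('Date').sort_index()
gdf.info()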