In [2]:
import pandas as pd
import numpy as np

from requests import get
import re
from bs4 import BeautifulSoup
import os

Codeup Blogs

Goals: Write a function to scrape the urls from the main Codeup blog page, and write a function that returns the title and text for each blog page.

Grab Title from Page

Here I use the .find() method on my soup with the <h1> tag. As always, there is no one way to accomplish our task, so I'm demonstrating one way to scrape the headline, not THE way to scrape the headline.
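To illustrate that .find() is only one option, here's a minimal sketch (against a made-up HTML snippet, not the live page) that grabs the same headline with .find() and with the CSS-selector method .select_one():

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the blog page's HTML (hypothetical, for illustration).
html = "<html><body><h1>Codeup Blog Title</h1><p>Post body.</p></body></html>"
soup = BeautifulSoup(html, 'html.parser')

# Two equivalent ways to grab the headline text:
title_find = soup.find('h1').text          # element-name search
title_select = soup.select_one('h1').text  # CSS-selector search

print(title_find)    # Codeup Blog Title
print(title_select)  # Codeup Blog Title
```

Both return the same string here; .select_one shines when the target needs a more specific CSS selector than a bare element name.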

In [3]:
url = 'https://codeup.com/codeups-data-science-career-accelerator-is-here/'
headers = {'User-Agent': 'Codeup Data Science'} 
    
response = get(url, headers=headers)
response.ok
Out[3]:
True
In [4]:
# Here's our long string of HTML; we'll use response.text to make our soup object.

print(type(response.text))
<class 'str'>
In [5]:
# Create our Soup object by passing our HTML string and choice of parser.

soup = BeautifulSoup(response.text, 'html.parser')

# Now we have our BeautifulSoup object and can use its built-in methods and attributes.

print(type(soup))
<class 'bs4.BeautifulSoup'>
In [6]:
# The h1 element holds my title.

title = soup.find('h1').text
title
Out[6]:
'Codeup’s Data Science Career Accelerator is Here!'

Grab Text from Page

In [7]:
content = soup.find('div', class_="jupiterx-post-content").text
print(content)
The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in Glassdoor’s #1 Best Job in America.
Data Science is a method of providing actionable intelligence from data. The data revolution has hit San Antonio, resulting in an explosion in Data Scientist positions across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen UTSA invest $70 M for a Cybersecurity Center and School of Data Science. We built a program to specifically meet the growing demands of this industry.
Our program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rackspace, along with input from dozens of practitioners and hiring partners. Students will work with real data sets, realistic problems, and the entire data science pipeline from collection to deployment. They will receive professional development training in resume writing, interviewing, and continuing education to prepare for a smooth transition to the workforce.
We focus on applied data science for immediate impact and ROI in a business, which is how we can back it all up with a 6 month tuition refund guarantee – just like our existing Web Dev program. We’re focusing on Data Science with Python, SQL, and ML, covered in 14 modules: 1) Fundamentals; 2) Applied statistics; 3) SQL; 4) Python; 5) Supervised machine learning – regression; 6) Supervised machine learning – classification; 7) Unsupervised machine learning – clustering; 8) Time series analysis; 9) Anomaly detection; 10) Natural language processing; 11) Distributed machine learning; 12) Advanced topics (deep learning, NoSQL, cloud deployment, etc.); 13) Storytelling with data; and 14) Domain expertise development.
Applications are now open for Codeup’s first Data Science cohort, which will start class on February 4, 2019. Hurry – there are only 25 seats available! To further our mission of cultivating inclusive growth, scholarships will be available to women, minorities, LGBTQIA+ individuals, veterans, first responders, and people relocating to San Antonio.
If you want to learn about joining our program or hiring our graduates, email datascience@codeup.com!


In [8]:
print(type(content))
<class 'str'>

Build Blog Function

In [9]:
# Create a helper function that requests and parses HTML returning a soup object.

def make_soup(url):
    '''
    This helper function takes in a url, requests the page, and
    parses the HTML, returning a soup object.
    '''
    headers = {'User-Agent': 'Codeup Data Science'} 
    response = get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup
In [10]:
def get_blog_articles(urls, cached=False):
    '''
    This function takes in a list of Codeup blog urls and a cached parameter
    that defaults to False. When cached == False, it scrapes the title and
    text for each url, creates a list of dictionaries holding the title and
    text for each blog, converts the list to a df, writes the df to a json
    file, and returns the df.
    When cached == True, it instead returns a df read in from the json file.
    '''
    if cached:
        df = pd.read_json('big_blogs.json')
        
    # cached == False completes a fresh scrape for df     
    else:

        # Create an empty list to hold dictionaries
        articles = []

        # Loop through each url in our list of urls
        for url in urls:

            # Make request and soup object using helper
            soup = make_soup(url)

            # Save the title of each blog in variable title
            title = soup.find('h1').text

            # Save the text in each blog to variable text
            content = soup.find('div', class_="jupiterx-post-content").text

            # Create a dictionary holding the title and content for each blog
            article = {'title': title, 'content': content}

            # Add each dictionary to the articles list of dictionaries
            articles.append(article)
            
        # convert our list of dictionaries to a df
        df = pd.DataFrame(articles)

        # Write df to a json file for faster access
        df.to_json('big_blogs.json')
    
    return df

Test Function

In [11]:
# Here cached == False, so the function will do a fresh scrape of the urls and write data to a json file.

urls = ['https://codeup.com/codeups-data-science-career-accelerator-is-here/',
        'https://codeup.com/data-science-myths/',
        'https://codeup.com/data-science-vs-data-analytics-whats-the-difference/',
        'https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/',
        'https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/']

blogs = get_blog_articles(urls=urls, cached=False)
blogs
Out[11]:
title content
0 Codeup’s Data Science Career Accelerator is Here! The rumors are true! The time has arrived. Cod...
1 Data Science Myths By Dimitri Antoniou and Maggie Giust\nData Sci...
2 Data Science VS Data Analytics: What’s The Dif... By Dimitri Antoniou\nA week ago, Codeup launch...
3 10 Tips to Crush It at the SA Tech Job Fair SA Tech Job Fair\nThe third bi-annual San Anto...
4 Competitor Bootcamps Are Closing. Is the Model... Competitor Bootcamps Are Closing. Is the Model...

Bonus URL Scrape

In [12]:
# I'm going to hit Codeup's main blog page to scrape the urls and use my new function.

url = 'https://codeup.com/resources/#blog'
soup = make_soup(url)
In [13]:
# I'm filtering my soup to return a list of all the anchor elements with this class from my HTML. (view the first 2)

urls_list = soup.find_all('a', class_='jet-listing-dynamic-link__link')
urls_list[:2]
Out[13]:
[<a class="jet-listing-dynamic-link__link" href="https://codeup.com/introducing-salary-refund-guarantee/"><span class="jet-listing-dynamic-link__label">Introducing Our Salary Refund Guarantee</span></a>,
 <a class="jet-listing-dynamic-link__link" href="https://codeup.com/introducing-salary-refund-guarantee/"><span class="jet-listing-dynamic-link__label">Read More</span></a>]
In [14]:
# Extract the href attribute value from each anchor element in my list; we scraped 40 urls (two per article).

# I'm using a set comprehension to return only unique urls because there are two links for each article.
urls = {link.get('href') for link in urls_list}

# I'm converting my set to a list of urls.
urls = list(urls)

print(f'There are {len(urls)} unique links in our urls list.')
print()
urls 
There are 20 unique links in our urls list.

Out[14]:
['https://codeup.com/education-is-an-investment/',
 'https://codeup.com/what-is-machine-learning/',
 'https://codeup.com/math-in-data-science/',
 'https://codeup.com/how-were-celebrating-world-mental-health-day-from-home/',
 'https://codeup.com/codeups-application-process/',
 'https://codeup.com/codeup-alumni-make-water/',
 'https://codeup.com/codeup-wins-civtech-datathon/',
 'https://codeup.com/codeup-inc-5000/',
 'https://codeup.com/codeup-in-houston/',
 'https://codeup.com/build-your-career-in-tech/',
 'https://codeup.com/covid-19-data-challenge/',
 'https://codeup.com/what-to-expect-at-codeup/',
 'https://codeup.com/succeed-in-a-coding-bootcamp/',
 'https://codeup.com/introducing-salary-refund-guarantee/',
 'https://codeup.com/transition-into-data-science/',
 'https://codeup.com/new-scholarship/',
 'https://codeup.com/what-data-science-career-is-for-you/',
 'https://codeup.com/journey-into-web-development/',
 'https://codeup.com/from-slacker-to-data-scientist/',
 'https://codeup.com/what-is-python/']
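One caveat with the set comprehension: sets discard the order in which the urls appeared on the page. If order matters, dict.fromkeys gives an equally short, order-preserving dedup — a sketch with made-up urls:

```python
# Duplicate links as they might appear on the page (two per article).
links = ['https://codeup.com/a/', 'https://codeup.com/a/',
         'https://codeup.com/b/', 'https://codeup.com/b/']

# dict keys are unique and (in Python 3.7+) preserve insertion order.
unique_urls = list(dict.fromkeys(links))
print(unique_urls)  # ['https://codeup.com/a/', 'https://codeup.com/b/']
```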

Bonus URL Function

In [15]:
def get_all_urls():
    '''
    This function scrapes all of the Codeup blog urls from
    the main Codeup blog page and returns a list of urls.
    '''
    # The base url for the main Codeup blog page
    url = 'https://codeup.com/resources/#blog' 
    
    # Make request and soup object using helper
    soup = make_soup(url)
    
    # Create a list of the anchor elements that hold the urls.
    urls_list = soup.find_all('a', class_='jet-listing-dynamic-link__link')
    
    # I'm using a set comprehension to return only unique urls because the list contains duplicate urls.
    urls = {link.get('href') for link in urls_list}

    # I'm converting my set to a list of urls.
    urls = list(urls)
        
    return urls
In [16]:
# Now I can combine my get_blog_articles function with my new get_all_urls function.
# cached == False does a fresh scrape.

big_blogs = get_blog_articles(urls=get_all_urls(), cached=False)
In [17]:
big_blogs.head(10)
Out[17]:
title content
0 Your Education is an Investment You have many options regarding educational ro...
1 What is Machine Learning? There’s a lot we can learn about machines, and...
2 What are the Math and Stats Principles You Nee... Coming into our Data Science program, you will...
3 How We’re Celebrating World Mental Health Day ... World Mental Health Day is on October 10th. Al...
4 What is Codeup’s Application Process? Curious about Codeup’s application process? Wo...
5 How Codeup Alumni are Helping to Make Water Imagine having a kit mailed to you with all th...
6 Codeup Grads Win CivTech Datathon Many Codeup alumni enjoy competing in hackatho...
7 Codeup on Inc. 5000 Fastest Growing Private Co... We’re excited to announce a huge Codeup achiev...
8 Codeup Launches Houston! Houston, we have a problem: there aren’t enoug...
9 Build Your Career in Tech: Advice from Alumni! Bryan Walsh, Codeup Web Development alum, and ...
In [18]:
big_blogs.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    20 non-null     object
 1   content  20 non-null     object
dtypes: object(2)
memory usage: 448.0+ bytes
In [19]:
# cached == True reads in a df from `big_blogs.json`.

big_blogs = get_blog_articles(urls=get_all_urls(), cached=True)
big_blogs.head()
Out[19]:
title content
0 Your Education is an Investment You have many options regarding educational ro...
1 What is Machine Learning? There’s a lot we can learn about machines, and...
2 What are the Math and Stats Principles You Nee... Coming into our Data Science program, you will...
3 How We’re Celebrating World Mental Health Day ... World Mental Health Day is on October 10th. Al...
4 What is Codeup’s Application Process? Curious about Codeup’s application process? Wo...

Inshorts News Articles

Goal: Write a function that scrapes the news articles for the following topics:

  • Business
  • Sports
  • Technology
  • Entertainment
In [20]:
# Make the soup object using my function.

url = 'https://inshorts.com/en/read/entertainment'
soup = make_soup(url)

Scrape News Cards from Main Page

In [21]:
# Scrape a ResultSet of all the news cards on the page and inspect the elements on the first card.

cards = soup.find_all('div', class_='news-card')

print(f'There are {len(cards)} news cards on this page.')
print()
cards[0]
There are 25 news cards on this page.

Out[21]:
<div class="news-card z-depth-1" itemscope="" itemtype="http://schema.org/NewsArticle">
<span content="" itemid="https://inshorts.com/en/news/53yrold-actor-asif-basra-found-hanging-at-a-private-complex-in-himachal-1605179736432" itemprop="mainEntityOfPage" itemscope="" itemtype="https://schema.org/WebPage"></span>
<span itemprop="author" itemscope="itemscope" itemtype="https://schema.org/Person">
<span content="Pragya Swastik" itemprop="name"></span>
</span>
<span content="53-yr-old actor Asif Basra found hanging at a private complex in Himachal" itemprop="description"></span>
<span itemprop="image" itemscope="" itemtype="https://schema.org/ImageObject">
<meta content="https://static.inshorts.com/inshorts/images/v1/variants/jpg/m/2020/11_nov/12_thu/img_1605179566538_924.jpg?" itemprop="url"/>
<meta content="864" itemprop="width"/>
<meta content="483" itemprop="height"/>
</span>
<span itemprop="publisher" itemscope="itemscope" itemtype="https://schema.org/Organization">
<span content="https://inshorts.com/" itemprop="url"></span>
<span content="Inshorts" itemprop="name"></span>
<span itemprop="logo" itemscope="" itemtype="https://schema.org/ImageObject">
<span content="https://assets.inshorts.com/inshorts/images/v1/variants/jpg/m/2018/11_nov/21_wed/img_1542823931298_497.jpg" itemprop="url"></span>
<meta content="400" itemprop="width"/>
<meta content="60" itemprop="height"/>
</span>
</span>
<div class="news-card-image" style="background-image: url('https://static.inshorts.com/inshorts/images/v1/variants/jpg/m/2020/11_nov/12_thu/img_1605179566538_924.jpg?')">
</div>
<div class="news-card-title news-right-box">
<a class="clickable" href="/en/news/53yrold-actor-asif-basra-found-hanging-at-a-private-complex-in-himachal-1605179736432" onclick="ga('send', {'hitType': 'event', 'eventCategory': 'TitleOfNews', 'eventAction': 'clicked', 'eventLabel': '53-yr-old%20actor%20Asif%20Basra%20found%20hanging%20at%20a%20private%20complex%20in%20Himachal)' });" style="color:#44444d!important">
<span itemprop="headline">53-yr-old actor Asif Basra found hanging at a private complex in Himachal</span>
</a>
<div class="news-card-author-time news-card-author-time-in-title">
<a href="/prev/en/news/53yrold-actor-asif-basra-found-hanging-at-a-private-complex-in-himachal-1605179736432"><span class="short">short</span></a> by <span class="author">Pragya Swastik</span> / 
      <span class="time" content="2020-11-12T11:15:36.000Z" itemprop="datePublished">04:45 pm</span> on <span clas="date">12 Nov 2020,Thursday</span>
</div>
</div>
<div class="news-card-content news-right-box">
<div itemprop="articleBody">Actor Asif Basra was found hanging in a private complex in Himachal Pradesh's Dharamshala, ANI reported on Thursday quoting Kangra Police official Vimukt Ranjan. "Forensic team is at the spot and police is investigating the matter," the police official said. The 53-year-old actor acted in several Bollywood films including 'Kai Po Che!', 'Black Friday' and 'Hichki'.</div>
<div class="news-card-author-time news-card-author-time-in-content">
<a href="/prev/en/news/53yrold-actor-asif-basra-found-hanging-at-a-private-complex-in-himachal-1605179736432"><span class="short">short</span></a> by <span class="author">Pragya Swastik</span> / 
      <span class="time" content="2020-11-12T11:15:36.000Z" itemprop="dateModified">04:45 pm</span> on <span class="date">12 Nov</span>
</div>
</div>
<div class="news-card-footer news-right-box">
<div class="read-more">read more at <a class="source" href="https://www.timesnownews.com/amp/india/article/actor-asif-basra-found-dead-at-private-complex-in-dharamshala/681115?utm_campaign=fullarticle&amp;utm_medium=referral&amp;utm_source=inshorts " onclick="ga('send', {'hitType': 'event', 'eventCategory': 'ReadMore', 'eventAction': 'clicked', 'eventLabel': 'Times%20Now' });" target="_blank">Times Now</a></div>
</div>
</div>
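Beyond the title, author, and body, each card carries a machine-readable publish time in the content attribute of its span.time element (e.g. '2020-11-12T11:15:36.000Z' above). If you wanted that as a proper datetime, a stdlib-only sketch (the trailing 'Z' must be rewritten as an explicit UTC offset, since fromisoformat only accepts 'Z' from Python 3.11 on):

```python
from datetime import datetime, timezone

# The datePublished value exactly as it appears in the card above.
raw = '2020-11-12T11:15:36.000Z'

# Swap the 'Z' suffix for an explicit UTC offset so fromisoformat
# parses it on Python 3.7+.
published = datetime.fromisoformat(raw.replace('Z', '+00:00'))

print(published)  # 2020-11-12 11:15:36+00:00
```

Scraping it per card would then just be another card.find('span', class_='time')['content'] in the loop.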

Scrape the Title from Each News Card

In [22]:
# Create a list of titles using the span element and itemprop attribute with text method.

titles = [card.find('span', itemprop='headline').text for card in cards]
titles[:5]
Out[22]:
['53-yr-old actor Asif Basra found hanging at a private complex in Himachal',
 'Balika Vadhu actress Avika Gor confirms relationship with ex-Roadies contestant',
 'I cast Asif Basra in Jab We Met as I wanted an intelligent actor: Imtiaz Ali',
 "American singer sings 'Om Jai Jagdish Hare' for Diwali, releases video",
 "Kangana's brother Aksht gets married in Udaipur, actress shares 1st pics"]

Scrape Author from News Cards

In [23]:
# Create a list of authors using the span element and class attribute with text method.

authors = [card.find('span', class_='author').text for card in cards]
authors[:5]
Out[23]:
['Pragya Swastik', 'Daisy Mowke', 'Anmol Sharma', 'Daisy Mowke', 'Daisy Mowke']

Scrape Text from News Cards

In [24]:
# Create a list of content strings using the div element and itemprop attribute with text method.

content = [card.find('div', itemprop='articleBody').text for card in cards]
content[:5]
Out[24]:
['Actor Asif Basra was found hanging in a private complex in Himachal Pradesh\'s Dharamshala, ANI reported on Thursday quoting Kangra Police official Vimukt Ranjan. "Forensic team is at the spot and police is investigating the matter," the police official said. The 53-year-old actor acted in several Bollywood films including \'Kai Po Che!\', \'Black Friday\' and \'Hichki\'.',
 '\'Balika Vadhu\' fame actress Avika Gor took to Instagram on Wednesday to confirm her relationship with Milind Chandwani, who participated in Roadies Real Heroes 2019. "This kind human is mine. And I\'m his forever," she wrote while sharing a collage. "We all deserve a partner that understands us, believes in us, inspires us...& truly cares for us," Avika added.',
 'Following the demise of Asif Basra, director Imtiaz Ali said, "The news of his demise is very shocking and disturbing." He added that he cast Asif in \'Jab We Met\' as he wanted an intelligent actor who could convey what he was trying to say. Imtiaz further said, "Asif was an engaging actor. I will miss him in the movies."',
 'American singer Mary Millben on Wednesday released a rendition of \'Om Jai Jagdish Hare\' as her Diwali greetings to people across the globe, particularly in India and the US, celebrating the festival of lights. "\'Om Jai Jagdish Hare\', a beautiful Hindi hymn...is a song of worship and celebration. This hymn continues to...stir my passion for Indian culture," said Millben.',
 'Actress Kangana Ranaut took to social media to share pictures from her brother Aksht\'s wedding in Udaipur. "Welcome to our family Ritu," she wrote. "Dear friends, bless my brother Aksht and his new bride Ritu, hope they find great companionship in this new phase of their lives," the actress wrote in another tweet.']
In [25]:
# Create an empty list, articles, to hold the dictionaries for each article.
articles = []

# Loop through each news card on the page and get what we want
for card in cards:
    title = card.find('span', itemprop='headline').text
    author = card.find('span', class_='author').text
    content = card.find('div', itemprop='articleBody').text
    
    # Create a dictionary, article, for each news card
    article = {'title': title, 'author': author, 'content': content}
    
    # Add the dictionary, article, to our list of dictionaries, articles.
    articles.append(article)
In [26]:
# Here we see our list contains 25 dictionaries, one for each news card.

print(len(articles))
articles[0]
25
Out[26]:
{'title': '53-yr-old actor Asif Basra found hanging at a private complex in Himachal',
 'author': 'Pragya Swastik',
 'content': 'Actor Asif Basra was found hanging in a private complex in Himachal Pradesh\'s Dharamshala, ANI reported on Thursday quoting Kangra Police official Vimukt Ranjan. "Forensic team is at the spot and police is investigating the matter," the police official said. The 53-year-old actor acted in several Bollywood films including \'Kai Po Che!\', \'Black Friday\' and \'Hichki\'.'}

Build Article Function

In [27]:
def get_news_articles(cached=False):
    '''
    This function with default cached == False does a fresh scrape of the Inshorts
    pages for the topics business, sports, technology, and entertainment, and
    writes the returned df to a json file.
    cached == True returns a df read in from the json file.
    '''
    # option to read in a json file instead of scraping for df
    if cached:
        df = pd.read_json('articles.json')
        
    # cached == False completes a fresh scrape for df    
    else:
    
        # Set base_url that will be used in get request
        base_url = 'https://inshorts.com/en/read/'
        
        # List of topics to scrape
        topics = ['business', 'sports', 'technology', 'entertainment']
        
        # Create an empty list, articles, to hold our dictionaries
        articles = []

        for topic in topics:
            
            # Create url with topic endpoint
            topic_url = base_url + topic
            
            # Make request and soup object using helper
            soup = make_soup(topic_url)

            # Scrape a ResultSet of all the news cards on the page
            cards = soup.find_all('div', class_='news-card')

            # Loop through each news card on the page and get what we want
            for card in cards:
                title = card.find('span', itemprop='headline').text
                author = card.find('span', class_='author').text
                content = card.find('div', itemprop='articleBody').text

                # Create a dictionary, article, for each news card
                article = {'topic': topic,
                           'title': title,
                           'author': author,
                           'content': content}

                # Add the dictionary, article, to our list of dictionaries, articles.
                articles.append(article)
            
        # Create a DataFrame from list of dictionaries
        df = pd.DataFrame(articles)
        
        # Write df to json file for future use
        df.to_json('articles.json')
    
    return df
In [28]:
# Test our function with cached == False to do a fresh scrape and create the `articles.json` file.

df = get_news_articles(cached=False)
df.head()
Out[28]:
topic title author content
0 business Father said 'if you want to blow money up, fin... Pragya Swastik Serum Institute of India's (SII) CEO Adar Poon...
1 business Ambani's RIL to invest up to $50 mn in Bill Ga... Pragya Swastik India's richest man Mukesh Ambani's RIL has an...
2 business Special Indian version of PUBG Mobile to be la... Krishna Veera Vanamali South Korea's PUBG Corporation on Thursday ann...
3 business China's Alibaba records $74 billion sales in 1... Pragya Swastik Chinese billionaire Jack Ma's Alibaba on Thurs...
4 business Retail inflation in October rises to 7.61%, hi... Pragya Swastik The country's retail inflation, measured by th...
In [29]:
df.topic.value_counts()
Out[29]:
sports           25
entertainment    25
technology       24
business         24
Name: topic, dtype: int64
In [30]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98 entries, 0 to 97
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   topic    98 non-null     object
 1   title    98 non-null     object
 2   author   98 non-null     object
 3   content  98 non-null     object
dtypes: object(4)
memory usage: 3.2+ KB
In [31]:
# Test our function with cached == True to read in the df from `articles.json`.

df = get_news_articles(cached=True)
df.head()
Out[31]:
topic title author content
0 business Father said 'if you want to blow money up, fin... Pragya Swastik Serum Institute of India's (SII) CEO Adar Poon...
1 business Ambani's RIL to invest up to $50 mn in Bill Ga... Pragya Swastik India's richest man Mukesh Ambani's RIL has an...
2 business Special Indian version of PUBG Mobile to be la... Krishna Veera Vanamali South Korea's PUBG Corporation on Thursday ann...
3 business China's Alibaba records $74 billion sales in 1... Pragya Swastik Chinese billionaire Jack Ma's Alibaba on Thurs...
4 business Retail inflation in October rises to 7.61%, hi... Pragya Swastik The country's retail inflation, measured by th...
In [32]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 98 entries, 0 to 97
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   topic    98 non-null     object
 1   title    98 non-null     object
 2   author   98 non-null     object
 3   content  98 non-null     object
dtypes: object(4)
memory usage: 3.8+ KB