Web scraping is an automated method for extracting large amounts of data from websites. Because the data on a website is typically unstructured, it is extracted and then stored in a structured format.
Beautiful Soup is a Python library for parsing HTML and XML files and extracting data from them, and it is widely used for scraping websites. It is not part of the Python standard library, so you need to install it first.
Why is Web Scraping Used?
Below are some common reasons for using web scraping:
- Price comparison: Data about products is collected from various online shopping websites so that their prices can be compared.
- Research and development: Large sets of data (statistics, general information, weather, etc.) are collected from websites and analyzed for surveys and research.
- Social media: Social media data is collected to identify trending topics.
- Contact information: Contact details such as names and email addresses are collected to send bulk promotional emails to customers.
- Job listings: Data about job openings and interviews is collected from different websites and aggregated in one place so that users can easily access it.
- Movie reviews: Ratings and popularity of movies are computed from the information collected about them.
Is Web Scraping Legal?
Some websites allow web scraping while others deny it. You can find out whether a website permits scraping by checking its "robots.txt" file, which can be reached by appending "/robots.txt" to the site's root URL.
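For example, the robots.txt address of a site can be derived from any of its page URLs; a minimal sketch using only Python's standard library (the IMDb URL here is just an illustration):

```python
from urllib.parse import urljoin

def robots_url(page_url):
    # An absolute path in urljoin replaces the page path at the site's root
    return urljoin(page_url, "/robots.txt")

print(robots_url("https://www.imdb.com/search/title/"))
# https://www.imdb.com/robots.txt
```

The resulting URL could then be fetched (e.g. with requests) to read the site's crawling rules.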
Overview
We'll be creating a program that scrapes a movie-rating webpage by parsing the HTML and collecting the information into a dataset. The following steps will be performed using Python and Beautiful Soup.
- Import the required libraries.
- Retrieve and convert the HTML.
- Find and extract the data elements.
- Create and display the data frame.
- Convert the data frame to a CSV file.
Web Scraping Review
This review gives us some basic information about web scraping. The program we will be creating will scrape the IMDB Top 100 Movies webpage (shown below), which contains the information we need. The site can be accessed via this link: IMDb “Top 1000” (Sorted by IMDb Rating Descending) – IMDb.
By parsing the HTML, the webpage can be scraped and the information needed for the dataset extracted.
To scrape data from this webpage, right-click anywhere on it; the dialog box below will appear.
From the bottom of the list, select the Inspect option, which displays the HTML of the webpage as shown below:
The image above shows the webpage on the left and the HTML of the whole page on the right. To see the HTML of a particular element, click the arrow icon in the upper left-hand corner of the HTML pane and then select the title of the first movie on the webpage, resulting in the image below:
You will see the highlighted HTML line displaying the movie title Jai Bhim:
<a href="/title/tt15097216/?ref_=adv_li_tt">Jai Bhim</a>
The <a> is the anchor tag in HTML, denoting a link.
Above the movie title, the line <h3 class="lister-item-header"> denotes the parent element of the movie title.
Therefore, to find, extract, and capture all the movie titles on the webpage, the following steps are used:
- Create an object that defines the list of all the HTML lines for the specific parent and their associated children.
- Create an object that consists of a list of all the specific children (movie titles) found within their parents.
The code for finding and extracting movie titles is as follows:
```python
movies = soup.findAll('h3', class_='lister-item-header')
titles = [movie.find('a').text for movie in movies]
```

The output will look like this:

['Jai Bhim', 'The Shawshank Redemption', 'Soorarai Pottru', 'The Godfather', 'RRR', 'The Dark Knight', …
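The same parent-then-child pattern can be tried offline on a small, hypothetical HTML snippet shaped like one entry of the list, with no network access needed:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mirroring the structure of one list entry
html = '''
<div class="lister-item-content">
  <h3 class="lister-item-header">
    <a href="/title/tt15097216/?ref_=adv_li_tt">Jai Bhim</a>
  </h3>
</div>
'''

soup = BeautifulSoup(html, "html.parser")
movies = soup.find_all("h3", class_="lister-item-header")  # parent elements
titles = [movie.find("a").text for movie in movies]        # child <a> text
print(titles)  # ['Jai Bhim']
```

(`find_all` is the modern name for `findAll`; both work in Beautiful Soup 4.)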
Program
Objective: Extract the data elements on a webpage and collect the data into a dataset.
Import the Required Packages and Libraries
```python
# Install the packages
!pip install requests beautifulsoup4 pandas

# Import the libraries
import requests
import bs4
import pandas as pd
import csv
```
Libraries
- requests: sends HTTP requests and returns a Response object containing all the response data, including the HTML.
- beautifulsoup4 (bs4): extracts data from HTML files and converts it into a BeautifulSoup object, which represents the HTML as nested data.
- pandas: used for data analysis and manipulation.
Retrieving and Converting the HTML
Create an object that holds the URL of the website (in simple words, its address) and send a GET request for that URL's HTML. The server sends back the HTML data; all you have to do is retrieve it and convert it into a Beautiful Soup object.
```python
url = 'https://www.imdb.com/search/title/?count=100&groups=top_1000&sort=user_rating'

def get_page_contents(url):
    page = requests.get(url, headers={"Accept-Language": "en-US"})
    return bs4.BeautifulSoup(page.text, "html.parser")

soup = get_page_contents(url)
```
Beautiful Soup parses the HTML content of the web page and offers numerous functions for extracting data from it. To learn more, see the Beautiful Soup documentation.
For this project, the IMDB Top 100 Movies webpage is used. You can find it via this link: IMDb “Top 1000” (Sorted by IMDb Rating Descending) – IMDb.
From the IMDB Top 100 Movies webpage, the following data elements will be extracted:
- Movie Title
- Release Year
- Audience Rating
- Runtime
- Genre
- IMDB Rating
- Number of Votes
- Box Office Earnings
Search and Extract the Data Elements
Start by creating a list of all distinct movies and their corresponding HTML. The findAll method generates a list of the HTML captured within each 'div' tag with the class 'lister-item-content'.
```python
movies = soup.findAll('div', class_='lister-item-content')
```
The next step is to extract each data element: iterate through all the movies, find the HTML within the specified tag and class, and store the result in a list. The votes and earnings elements share the same tag, so an attribute filter on the name attribute is used to locate them, with the second match taken for earnings.
```python
titles = [movie.find('a').text for movie in movies]
release = [movie.find('span', class_='lister-item-year text-muted unbold').text for movie in movies]
# Not every movie has a certificate, so fall back to None when it is missing
audience_rating = [movie.find('span', class_='certificate').text
                   if movie.find('span', class_='certificate') else None
                   for movie in movies]
runtime = [movie.find('span', class_='runtime').text for movie in movies]
genre = [movie.find('span', class_='genre').text.strip() for movie in movies]
imdb_rating = [movie.find('div', class_='inline-block ratings-imdb-rating').text.strip() for movie in movies]
# Votes and earnings both sit in <span name="nv"> tags: the first match is the
# vote count and the second (when present) is the box office earnings
votes = [movie.find_all('span', attrs={'name': 'nv'})[0].text for movie in movies]
earnings = [spans[1].text if len(spans) > 1 else None
            for spans in (movie.find_all('span', attrs={'name': 'nv'}) for movie in movies)]
```
Design and Display the Data Frame
Create a new data frame from the names and data elements that were extracted.
```python
movies_dict = {'Title': titles, 'Release': release, 'Audience Rating': audience_rating,
               'Runtime': runtime, 'Genre': genre, 'IMDB Rating': imdb_rating,
               'Votes': votes, 'Box Office Earnings': earnings}
movies = pd.DataFrame(movies_dict)
movies.head(10)
```
Conversion of the Data Frame to a CSV File
Finally, the data frame created in the previous step is converted to a CSV file.
```python
csv_data = movies.to_csv()
print('\nCSV String Values:\n', csv_data)
```
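to_csv can also write directly to disk when given a file path. A minimal self-contained sketch with stand-in data (the real movies frame comes from the scrape above):

```python
import pandas as pd

# Stand-in data frame with two hypothetical rows
movies = pd.DataFrame({'Title': ['Jai Bhim', 'The Godfather'],
                       'IMDB Rating': ['8.9', '9.2']})

# index=False omits pandas' row-index column from the file
movies.to_csv('movies.csv', index=False)

with open('movies.csv') as f:
    print(f.read())
```

Without index=False, the first column of the file would be the unnamed row index (0, 1, …), which is usually unwanted in an exported dataset.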
Web scraping becomes necessary when a website does not provide an API, and it can be used for countless purposes. Whether your business is new or growing, web scraping can help it grow with web data.