What is Web Scraping?
Web scraping is the process of extracting structured data from websites in an automated way. It is also known as web data extraction. Web scraping is used in many cases, such as price monitoring, price intelligence, news monitoring, lead generation, and market research, among many others.
Scraping targets the data present on a website. The specific information to be extracted from the HTML page is known as the dataset. Once the dataset is identified, the extraction logic is applied. An HTML page consists of a root html tag, a head tag, and a body tag, and the dataset usually sits inside the body tag, wrapped in other HTML tags such as p and div that carry attributes like id and class. The logic behind web scraping depends on these elements. For example:
<body>
<p> Hello world </p>
</body>
In the above example, the dataset is “Hello world”, and to scrape it from the HTML page, the logic would be to get the text inside the p tag.
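To make that logic concrete, here is a minimal sketch that pulls the text out of the p tag. It uses PHP's built-in DOM extension rather than Goutte, purely so that it runs without installing anything. The variable names are my own, and the sample markup is the snippet above wrapped in a full html document.

```php
<?php

// The sample page from the example above, wrapped in a complete document.
$html = '<html><head></head><body><p> Hello world </p></body></html>';

// Parse the markup into a DOM tree.
$document = new DOMDocument();
$document->loadHTML($html);

// Select every <p> element and read the text of the first one.
$paragraphs = $document->getElementsByTagName('p');
$dataset = trim($paragraphs->item(0)->textContent);

echo $dataset, PHP_EOL; // prints "Hello world"
```

The same idea carries over to any scraper: locate the element that holds the dataset, then read its text.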
Now that the concept of web scraping and how the data is filtered is clear, we can implement it. To perform web scraping in Laravel, we will use a package called laravel-goutte. The package implements a simple ServiceProvider that makes a singleton instance of the Goutte client easily accessible through a Facade in Laravel.
Goutte is a PHP web scraping library used to implement the scraping logic. It acts as a thin wrapper around Symfony components such as BrowserKit, CssSelector, and DomCrawler. For HTTP requests, the library uses the Guzzle HTTP component.
- Install the Laravel package using the following command:
composer require weidner/goutte
- Register the service provider in config/app.php and provide an alias for its Facade.
- After registering, let’s start scraping.
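The registration step above might look roughly like the excerpt below. The class names are the ones I recall from the laravel-goutte README, so treat them as an assumption and check them against the version you install. The "..." comments stand for the entries already present in your own config/app.php.

```php
<?php

// config/app.php (excerpt, not a complete file)
return [
    'providers' => [
        // ... existing providers ...
        Weidner\Goutte\GoutteServiceProvider::class,
    ],

    'aliases' => [
        // ... existing aliases ...
        'Goutte' => Weidner\Goutte\GoutteFacade::class,
    ],
];
```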
Create a controller, here called GoutteController (you can name it whatever you like). Inside it, create a Goutte client and a Guzzle client with a timeout of 60 seconds.
The $goutteClient->request method requests the desired HTML page and returns a DomCrawler Crawler object, on which you can apply any filtering logic that matches the dataset you have decided to get.
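A controller along those lines could be sketched as follows. This is an illustration under assumptions, not the exact code used for this article: the method name scrape, the URL https://example.com/search?q=laravel, and the h3 selector are all hypothetical placeholders. It also assumes a Goutte release that still sits on top of Guzzle and exposes setClient (the 3.x line, as far as I know); later releases moved to Symfony's HTTP client, so the client setup would differ there.

```php
<?php

namespace App\Http\Controllers;

use Goutte\Client;
use GuzzleHttp\Client as GuzzleClient;

class GoutteController extends Controller
{
    // Hypothetical action name; wire it to any route you like.
    public function scrape()
    {
        // Goutte client backed by a Guzzle client with a 60 second timeout.
        $goutteClient = new Client();
        $guzzleClient = new GuzzleClient(['timeout' => 60]);
        $goutteClient->setClient($guzzleClient); // assumes Goutte 3.x

        // Placeholder URL; replace it with the page you want to scrape.
        $crawler = $goutteClient->request('GET', 'https://example.com/search?q=laravel');

        // Placeholder selector; adjust it to the tags, ids, and classes
        // that wrap your dataset on the target page.
        $crawler->filter('h3')->each(function ($node) {
            dump($node->text());
        });
    }
}
```

The filter call takes a CSS selector, so the p, div, id, and class elements discussed earlier are exactly what you use to narrow the page down to the dataset.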
The script will dump the header name of the search result. The logic is simple, yet powerful. Below is the image of the page on which scraping will be done:
Below is the output displayed:
The links below cover the package and the libraries in more depth:
laravel-goutte: https://github.com/dweidner/laravel-goutte
Goutte: https://github.com/FriendsOfPHP/Goutte
Symfony DomCrawler Component: https://symfony.com/doc/current/components/dom_crawler.html#component-dom-crawler-dumping
This article covered what web scraping is and which package you can use to integrate your web scraping logic into Laravel.