May 15, 2024
Web scraping is increasingly used by companies because of its significant business implications. By enabling in-depth competitor analysis or mass outreach, this strategic practice has easily found its place in a world where data reigns supreme.
Are you interested in this technique of mass information collection and processing? Do you want to know how it works and how to practice web scraping yourself?
We explain everything in this article!
The word scraping comes from the verb to scrape (really?). And that's more or less what web scraping (or scraping in general) does: scrape data from a third-party source to retrieve specific information.
With the word web in front of it, scraping means collecting data from internet sources and using the results for various purposes, often commercial. For example, we can imagine a web scraping tool that retrieves the products of an e-commerce site's competitors so the site can adjust its prices accordingly.
Web scraping is generally used on large amounts of data, too large to be processed by humans. This brings us closer to the concept of Big Data.
To go a little further, and before diving into the pure technique, we need to understand what web scraping consists of. We can break this technique down into three stages: fetching, parsing, and exploitation.
Fetching is a well-known term among web developers. Fetch is indeed the common keyword for a server call. It's exactly the same principle here: the fetching stage involves calling URLs (via GET requests) to retrieve raw content.
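As a minimal sketch of this stage, the helper below issues a GET request and returns the raw body as text. The name fetchText, its options, and the injectable fetchImpl parameter are assumptions made for illustration; they are not part of any library.

```javascript
// Hypothetical helper: GET a URL and return the raw response body as text.
// `fetchImpl` is injectable so the function can be tried without network access.
async function fetchText(url, { timeoutMs = 30000, fetchImpl = fetch } = {}) {
  const res = await fetchImpl(url, {
    method: 'GET',
    // Abort the request if it takes longer than `timeoutMs`
    signal: AbortSignal.timeout(timeoutMs),
  });
  // A 4xx or 5xx response does not reject on its own, so check it explicitly
  if (!res.ok) {
    throw new Error(`GET ${url} failed with status ${res.status}`);
  }
  return res.text();
}
```

Checking res.ok matters because fetch only rejects on network failures, not on HTTP error statuses.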
Once this raw content is retrieved, the next stage is parsing. The term is less widely known, but still familiar to developers: to parse means to analyze or decompose. Here, we analyze and sort the retrieved data to extract what interests us.
Finally, once this data is parsed, the exploitation stage begins. Here, everything depends on the scraping project. In general, the retrieved data is stored in a database for future use (as in the price comparison example given earlier). Sometimes, however, the data is used "in real time", for example to send emails to a scraped list of email addresses.
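The three stages can be sketched as a small pipeline. This is only an illustrative skeleton: the scrape function and the fetchPage, parsePage, and exploit parameters are hypothetical names, and each project would supply its own implementations.

```javascript
// Hypothetical three-stage pipeline. Each stage is passed in as a function,
// so the same skeleton works whatever the source or the storage is.
async function scrape(urls, { fetchPage, parsePage, exploit }) {
  const results = [];
  for (const url of urls) {
    const raw = await fetchPage(url);   // 1. fetching: retrieve the raw content
    const items = parsePage(raw);       // 2. parsing: keep only what interests us
    for (const item of items) {
      await exploit(item);              // 3. exploitation: store it, email it, ...
      results.push(item);
    }
  }
  return results;
}
```

Keeping the stages separate makes it possible to swap the storage or the parser without touching the rest.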
We have already given a quick example of the use of web scraping: competitor analysis for an online store.
But the use cases for this practice are numerous. Here are some others:
Now that we have seen what web scraping is and how it works, it is time to get practical!
First of all, and as with any IT project, it is important to assess the needs for scraping.
What are the types of sources? Are they easily accessible? Does fetching or parsing require a lot of resources? Where and how to store the parsed data, if needed?
These are all questions you need to answer before starting any development.
Then it is time to choose the technologies to use. Here, two main factors need to be considered:
Once the tech stack is chosen, it needs to be set up. Deploying servers or databases, for example, is something to take care of before development starts.
Now that everything is ready, we can start developing the web scraping tool. This includes the three stages already mentioned: fetching, parsing, and exploitation.
Each project, due to its specific needs and technologies, imposes its own rules. The scraping algorithms and the way they are built depend on the product being created.
To understand the technical functioning of web scraping, the best way is to analyze a concrete example.
Here, the use case will be the following: retrieving LinkedIn job offers based on certain criteria to create a listing with the important summarized information. This information will then be added to a database.
We will go through the three stages of scraping: fetching, parsing, and exploitation.
The technology chosen for this example is Node.js, with the native fetch API (available as a global since Node 18) for server calls, and the Cheerio library for parsing.
Fetching consists of making calls to specific URLs to retrieve raw data. We will keep it simple and show only one call here. Generally, these fetches are executed in a loop with different parameters each time.
In this example, we will search LinkedIn for jobs in France containing the keyword "développeur" (French for "developer").
// Public LinkedIn job search URL
const LINKEDIN_FETCH_URL = 'https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?f_TPR=r604800&geoId=105015875&keywords=d%C3%A9veloppeur&location=France&refresh=true&start=0';

// Call to the URL, then retrieving the data as raw text.
// Native fetch has no `timeout` option, so the 30-second limit is set with an abort signal.
const response = await fetch(LINKEDIN_FETCH_URL, { signal: AbortSignal.timeout(30000) }).then(res => res.text());
Here, in just two lines of code, we retrieve all the HTML content returned by the request as text.
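To run this call in a loop with different parameters, the URL can be built from its parts. The sketch below reuses the query parameters of the URL above; treating start as a zero-based result offset, and the page size of 25, are assumptions for illustration rather than documented LinkedIn behavior. The function names are hypothetical.

```javascript
// Endpoint and filters copied from the URL used in this article
const SEARCH_ENDPOINT = 'https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search';

// Build the search URL for one page of results.
// Assumption: `start` is a zero-based offset into the result list.
function buildSearchUrl(start, keywords = 'développeur', location = 'France') {
  const params = new URLSearchParams({
    f_TPR: 'r604800',
    geoId: '105015875',
    keywords, // URLSearchParams percent-encodes the accent for us
    location,
    refresh: 'true',
    start: String(start),
  });
  return `${SEARCH_ENDPOINT}?${params}`;
}

// Hypothetical page count and page size, chosen only for illustration
function buildPageUrls(pageCount = 3, pageSize = 25) {
  return Array.from({ length: pageCount }, (_, page) => buildSearchUrl(page * pageSize));
}
```

Calling buildSearchUrl(0) reproduces the URL used above, and each URL returned by buildPageUrls can then be fetched in turn.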
Parsing is usually a bit more complex than fetching, so we will use a library to help us: Cheerio. It will allow us to parse the retrieved content while keeping the code as clean as possible:
// Cheerio's load function (this import normally sits at the top of the file)
import { load } from 'cheerio';

// Loading the text content using Cheerio's load function
const $ = load(response);
const jobs = $('.base-card');

// If LinkedIn jobs are retrieved, we loop through them to get the elements we are interested in
if (jobs.length) {
  jobs.each((index, element) => {
    const job = $(element);
    // Job title: trimmed, quotes removed, '-' when missing
    const jobTitle = job.find('h3.base-search-card__title').text().trim().replace(/['"]+/g, '') || '-';
    const companyName = job.find('h4.base-search-card__subtitle').text().trim() || '-';
    const link = job.find('a.base-card__full-link').attr('href') || '-';
  });
}
As mentioned, the code here is slightly more complex. We use various functions provided by Cheerio such as load or find to navigate the HTML DOM more easily and extract specific data, such as the job title.
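The cleanup applied to each extracted value (trimming, removing quotes, falling back to '-') can be factored into small helpers. Both functions below are hypothetical and only sketch the idea; normalizeLink additionally assumes that the query string of a job link is not needed to identify the offer.

```javascript
// Hypothetical helper, close to the inline cleanup in the parsing code:
// remove quote characters, trim whitespace, fall back to '-' when empty.
function cleanField(value, fallback = '-') {
  const cleaned = String(value ?? '').replace(/['"]+/g, '').trim();
  return cleaned || fallback;
}

// Hypothetical helper: drop the query string and fragment from a link so the
// same offer always yields the same URL. Returns '-' for non-absolute URLs.
function normalizeLink(href) {
  try {
    const url = new URL(href);
    url.search = '';
    url.hash = '';
    return url.toString();
  } catch {
    return '-';
  }
}
```

With these in place, each line of the loop reduces to a selector lookup followed by one helper call.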
Once these elements are retrieved (in this example only the job title, the company, and the link to the LinkedIn job post), we can exploit this data. Here, still inside the loop, we will simply simulate creating an object that will be stored in a database:
const dbJob = {jobTitle, companyName, link};
this._saveJobToDB(dbJob);
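The body of _saveJobToDB is not shown here, since it depends on the database in use. As a rough sketch of what the storage step might do, the hypothetical JobStore class below keeps jobs in memory and uses the link as a unique key, so that a job scraped twice is updated rather than duplicated. A real project would replace the Map with database queries.

```javascript
// Hypothetical in-memory stand-in for a database table, keyed by job link
class JobStore {
  constructor() {
    this.jobs = new Map();
  }

  // Insert or update a job; returns true when the job was not known before
  save(job) {
    const isNew = !this.jobs.has(job.link);
    this.jobs.set(job.link, { ...job, lastSeen: new Date().toISOString() });
    return isNew;
  }

  // All stored jobs, in insertion order
  all() {
    return [...this.jobs.values()];
  }
}
```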
And that's it! LinkedIn job offers have been fetched, parsed, and exploited! We have just seen a simplified yet complete example of web scraping. All you have to do is adapt and evolve these pieces of code according to your needs.
In summary, web scraping is the practice of retrieving online data. Increasingly used, this technique, which is divided into three stages (fetching, parsing, and exploitation), has a strong business and strategic component.
Technically, although some projects get complex, a basic web scraper is relatively simple to set up by following the steps detailed in this article.
Written by Alexandre Grisey