
What is web scraping? Definition and uses


Companies increasingly rely on web scraping because of its business value. By enabling in-depth competitor analysis and large-scale outreach, this practice has easily found its place in a world where data reigns supreme.

Are you interested in this technique of mass information collection and processing? Do you want to know how it works and how to practice web scraping yourself?

We explain everything in this article!

Web scraping: the definition

The word scraping comes from the verb to scrape (really?). And that's more or less what web scraping (or scraping in general) does: scrape data from a third-party source to retrieve specific information.

Web scraping, then, collects data from internet sources and exploits the results for various purposes, often commercial. For example, a web scraping tool could retrieve the products sold by the competitors of an e-commerce site so that the site can adjust its prices accordingly.

Web scraping is generally used on large amounts of data, too large to be processed by humans. This brings us closer to the concept of Big Data.

The stages of web scraping

To go a little further, and before diving into the pure technique, we need to understand what web scraping consists of. We can break this technique down into three stages: fetching, parsing, and exploitation.

Fetching

Fetching is a well-known term among web developers: fetch is the common name for a call to a server. The principle is the same here: the fetching stage calls URLs (via GET requests) to retrieve raw content.

Parsing

Once the raw content is retrieved, the next stage is parsing. The term is less widely known but still familiar to developers: to parse means to analyze or decompose. Here, we analyze and sort the retrieved data to extract what interests us.

Exploitation

Finally, once the data is parsed, the exploitation stage begins. Here, everything depends on the scraping project. In general, the retrieved data is stored in a database for future use (as in the price comparison example given earlier). Sometimes, however, the data is used "in real time," for example to send emails to a scraped list of email addresses.

Use cases

We have already given a quick example of the use of web scraping: the competition analysis of an online store.

But this practice has many other use cases. Here are a few:

  • Lead generation for outreach;
  • Price monitoring (typically for real estate);
  • SEO research (Search Engine Optimization) for competitor ranking based on keywords;
  • Fetching candidates, or their CVs, for recruitment purposes.

How to do web scraping?

Now that we have seen what web scraping is and how it works, it is time to get practical!

Assessing the needs

First of all, and as with any IT project, it is important to assess the needs for scraping.

What types of sources are involved? Are they easily accessible? Does fetching or parsing require a lot of resources? Where and how should the parsed data be stored, if at all?

These are all questions you need to answer before starting any development.

Choosing the right technologies

Then it is time to choose the technologies to use. Here, two main factors need to be considered:

  • The needs, mentioned above, may require specific technologies;
  • Your knowledge and preferences. Web scraping is possible with almost all languages, so the choice of the tech stack is primarily subjective.

Once the tech stack is chosen, it needs to be set up. Servers and databases, for example, should be deployed before development starts.

Developing the fetching, parsing, and exploitation algorithms

Now that everything is ready, we can start developing the web scraping tool. This includes the three stages already mentioned: fetching, parsing, and exploitation.

Each project, due to its specific needs and technologies, imposes its own rules. These algorithms and how they will be constructed depend on the product to be created.
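As a rough illustration of how the three stages fit together, here is a minimal sketch of a generic pipeline. The names runScraper, fetchPage, parsePage, and exploit are hypothetical and do not come from any library; each project would plug in its own functions.

```javascript
// Generic three-stage pipeline. The stage functions are injected, so each
// project can supply its own fetching, parsing, and exploitation logic.
// All names here (runScraper, fetchPage, parsePage, exploit) are hypothetical.
async function runScraper(urls, { fetchPage, parsePage, exploit }) {
  let total = 0;
  for (const url of urls) {
    const raw = await fetchPage(url); // 1. fetching: URL -> raw content
    const items = parsePage(raw);     // 2. parsing: raw content -> records
    for (const item of items) {
      await exploit(item);            // 3. exploitation: store or use the record
      total++;
    }
  }
  return total;
}

// Usage with stub stages (no network access, nothing is really stored)
const exampleSaved = [];
runScraper(['https://example.com/a', 'https://example.com/b'], {
  fetchPage: async (url) => `<p>${url}</p>`,
  parsePage: (raw) => [{ html: raw }],
  exploit: async (item) => { exampleSaved.push(item); },
}).then((count) => console.log(`${count} items handled`));
```

Keeping the stages as separate functions makes it possible to try each one on its own, for instance by feeding the parser a saved HTML file instead of a live response.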

Web scraping example: retrieving LinkedIn job offers with Node

To understand the technical functioning of web scraping, the best way is to analyze a concrete example.

Here, the use case will be the following: retrieving LinkedIn job offers based on certain criteria to create a listing with the important summarized information. This information will then be added to a database.

We will go through the three stages of scraping: fetching, parsing, and exploitation.

This example uses Node, with the native fetch function for server calls and the Cheerio library for parsing.

Fetching

Fetching consists of making calls to specific URLs to retrieve raw data. We will keep it simple and show only one call here. Generally, these fetches are executed in a loop with different parameters each time.

In this example, we will search LinkedIn for jobs in France containing the keyword "développeur" (French for "developer").

// Public URL of LinkedIn
const LINKEDIN_FETCH_URL = 'https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?f_TPR=r604800&geoId=105015875&keywords=d%C3%A9veloppeur&location=France&refresh=true&start=0';

// Call to the URL, then retrieving the data as raw text.
// Native fetch has no `timeout` option, so AbortSignal.timeout() aborts the call after 30 seconds.
const response = await fetch(LINKEDIN_FETCH_URL, { signal: AbortSignal.timeout(30000) }).then(res => res.text());

Here, in just two lines of code, we retrieve all the HTML content returned by the request as text.
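To illustrate the loop mentioned earlier, here is a minimal sketch that builds one URL per result page and fetches them one after another. It assumes that the start query parameter is a result offset and that a page holds 25 results; neither is confirmed here, and buildPageUrls and fetchAllPages are hypothetical helper names.

```javascript
// Hypothetical helper: builds one search URL per result page.
// Assumption: `start` is a result offset and each page holds `pageSize` results.
function buildPageUrls(baseUrl, pageCount, pageSize = 25) {
  const urls = [];
  for (let page = 0; page < pageCount; page++) {
    const url = new URL(baseUrl);
    url.searchParams.set('start', String(page * pageSize));
    urls.push(url.toString());
  }
  return urls;
}

// Fetches each page in sequence, pausing after every call.
// `fetchImpl` is injectable so the loop can be tried without network access.
async function fetchAllPages(urls, { delayMs = 1000, fetchImpl = fetch } = {}) {
  const pages = [];
  for (const url of urls) {
    const res = await fetchImpl(url, { signal: AbortSignal.timeout(30000) });
    if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
    pages.push(await res.text());
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return pages;
}

console.log(buildPageUrls('https://example.com/search?q=dev&start=0', 2));
```

Fetching sequentially with a pause keeps the request rate low; a project with different constraints might fetch several pages in parallel instead.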

Parsing

Parsing is usually a bit more complex than fetching, so we will use a library to help us: Cheerio. It will allow us to parse the retrieved content while keeping the code as clean as possible:

// Cheerio must be installed first (npm install cheerio)
import { load } from 'cheerio';

// Loading the text content using Cheerio's load function
const $ = load(response);
const jobs = $('.base-card');

// If LinkedIn jobs are retrieved, we loop through them to get the elements we are interested in
if (jobs.length > 0) {
    jobs.each((index, element) => {
        // Wrap the raw DOM element once so that Cheerio's methods are available on it
        const job = $(element);
        // Job title: trimmed, with quote characters removed, or '-' when nothing is found
        const jobTitle = job.find('h3.base-search-card__title').text().trim().replace(/['"]+/g, '') || '-';
        const companyName = job.find('h4.base-search-card__subtitle').text().trim() || '-';
        const link = job.find('a.base-card__full-link').attr('href') || '-';
    });
}

As mentioned, the code here is slightly more complex. We use various functions provided by Cheerio such as load or find to navigate the HTML DOM more easily and extract specific data, such as the job title.
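The chained calls applied to the job title (trim, replace, then a fallback) can be hard to read at first glance. The sketch below isolates them in a hypothetical helper named cleanText, which is not part of Cheerio, so the effect on a few sample strings is easier to see.

```javascript
// Hypothetical helper following the same steps as the job title extraction:
// trim whitespace, strip quote characters, fall back to '-' when nothing is left.
function cleanText(raw, fallback = '-') {
  const cleaned = String(raw ?? '').trim().replace(/['"]+/g, '');
  return cleaned || fallback;
}

console.log(cleanText('  "Senior" Developer \n')); // Senior Developer
console.log(cleanText('   '));                      // -
```

The fallback value avoids storing empty strings when a selector matches nothing, which can happen if the page markup changes.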

Exploitation

Once these elements are retrieved (in this example only the job title, the company, and the link to the LinkedIn job post), we can exploit this data. Here, we will simply simulate creating an object that will be stored in a database:

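// Inside the loop above: build the object to store from the extracted fields
const dbJob = {jobTitle, companyName, link};
// _saveJobToDB is assumed to be a method of the surrounding class (not shown here) that writes to the database
this._saveJobToDB(dbJob);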
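The _saveJobToDB method itself is not shown in this article. As a stand-in, here is a minimal sketch that keeps the jobs in memory, using the link as a key so that a job scraped twice is stored only once. The JobStore class and the scrapedAt field are assumptions made for illustration; a real project would call its database client instead.

```javascript
// Hypothetical in-memory stand-in for a database table, keyed by job link.
// A real project would replace the Map with calls to its database client.
class JobStore {
  constructor() {
    this.jobs = new Map();
  }

  // Inserts the job, or overwrites the existing entry with the same link.
  // Returns true when the job was new, false when it was an update.
  _saveJobToDB(job) {
    const isNew = !this.jobs.has(job.link);
    this.jobs.set(job.link, { ...job, scrapedAt: new Date().toISOString() });
    return isNew;
  }
}

const exampleStore = new JobStore();
const exampleJob = { jobTitle: 'Developer', companyName: 'Acme', link: 'https://example.com/jobs/1' };
exampleStore._saveJobToDB(exampleJob);
exampleStore._saveJobToDB(exampleJob); // same link: the entry is updated, not duplicated
console.log(exampleStore.jobs.size); // 1
```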

And that's it! LinkedIn job offers have been fetched, parsed, and exploited! We have just seen a simplified yet complete example of web scraping. All you have to do now is adapt and extend these pieces of code according to your needs.

Conclusion

In summary, web scraping is the practice of retrieving online data. This increasingly used technique is divided into three stages (fetching, parsing, and exploitation) and has a strong business and strategic component.

Although it can become technically complex, web scraping is relatively simple to set up by following the steps detailed in this article.


Written by Alexandre Grisey
