Building a Simple Web Scraper in Python

Web scraping is one of the most practical skills you can add to your Python toolkit. With just two libraries and a basic understanding of HTML, you can pull real data from websites and put it to work in your own projects.

Every time you visit a website, your browser sends an HTTP request and receives an HTML document in return. Web scraping does the same thing programmatically — you request the page, receive the HTML, and then parse out exactly the data you want. Python makes this surprisingly straightforward, and the two libraries you will rely on most are requests for fetching pages and BeautifulSoup for navigating the HTML that comes back.

What Web Scraping Actually Does

When you load a webpage in a browser, a lot happens behind the scenes: DNS lookups, TCP handshakes, TLS negotiation, and then the actual HTTP exchange. For scraping purposes, you only care about the last part. Your script sends a GET request to a URL, the server sends back an HTML response, and you parse that response to extract the data you need.

HTML is a nested tree of elements — headings, paragraphs, links, tables, lists — each tagged with identifiers that describe what they are. A scraper navigates that tree and plucks out specific nodes by tag name, CSS class, ID attribute, or position in the document. The data could be anything: product prices, article headlines, weather readings, job listings, sports scores. If it renders in a browser, Python can generally read it.

Note

Some websites render content dynamically using JavaScript after the initial page load. In those cases, requests alone will not capture the fully rendered HTML. That scenario calls for tools like Playwright or Selenium, which control a real browser. This article focuses on static HTML pages, which cover the majority of beginner scraping projects.

Setting Up Your Environment

You need two third-party libraries. Neither ships with Python by default, so install them with pip:

pip install requests beautifulsoup4

requests handles the HTTP layer. beautifulsoup4 is the parsing library, but you will import it in your code as bs4. You will also need to specify an HTML parser — Python's built-in html.parser works fine and requires no additional installation.

Once installed, confirm everything works by opening a Python shell and running:

import requests
from bs4 import BeautifulSoup
print("Ready.")

If you see Ready. without errors, your environment is set up correctly.

Fetching a Page with requests

The requests library reduces an HTTP GET request to a single function call. Pass it a URL and it returns a Response object containing everything the server sent back.

import requests

url = "https://books.toscrape.com/"
response = requests.get(url)

print(response.status_code)  # 200 means success
print(response.text[:500])   # First 500 characters of the HTML

The status_code attribute tells you whether the request succeeded. A 200 means everything went as expected. A 404 means the page was not found. A 403 means the server refused to serve you, often because it detected automated traffic. Before doing anything with the response, it is good practice to check that the status code is actually 200.

import requests

url = "https://books.toscrape.com/"
response = requests.get(url)

if response.status_code == 200:
    print("Page fetched successfully.")
else:
    print(f"Request failed with status: {response.status_code}")
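An equivalent and slightly more idiomatic pattern is requests' raise_for_status() method, which raises an HTTPError for any 4xx or 5xx response. The sketch below builds a Response object by hand so it runs without touching the network; in real code the object comes back from requests.get():

```python
import requests

# Hand-built Response so the example needs no network call;
# in practice this object is what requests.get() returns.
response = requests.Response()
response.status_code = 404

try:
    response.raise_for_status()  # raises HTTPError for 4xx/5xx codes
    print("Page fetched successfully.")
except requests.exceptions.HTTPError as err:
    print(f"Request failed: {err}")
```

Using raise_for_status() keeps the happy path unindented and pushes error handling into one except clause, which scales better once a script makes many requests.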

Pro Tip

Many servers block requests that do not include a User-Agent header, because the default requests user agent string is obviously automated. Pass a headers dictionary that mimics a browser to avoid this: headers = {"User-Agent": "Mozilla/5.0"} and then requests.get(url, headers=headers).
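You can confirm the header actually makes it onto the outgoing request without contacting any server by inspecting a prepared request; the URL here is just a placeholder:

```python
import requests

url = "https://example.com/"
headers = {"User-Agent": "Mozilla/5.0"}

# Prepare the request locally instead of sending it, so we can
# inspect exactly what would go over the wire.
prepared = requests.Request("GET", url, headers=headers).prepare()
print(prepared.headers["User-Agent"])  # Mozilla/5.0
```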

Parsing HTML with BeautifulSoup

Once you have the page's HTML as a string, you pass it to BeautifulSoup along with the name of your parser. BeautifulSoup builds an in-memory tree of the document that you can navigate and search.

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "html.parser")

From there, you have several ways to find elements. The two most common are find(), which returns the first matching element, and find_all(), which returns a list of all matching elements.

Finding by tag name

# Get the first h1 element on the page
heading = soup.find("h1")
print(heading.text)

Finding by CSS class

# Get all elements with class "product_pod"
products = soup.find_all("article", class_="product_pod")
print(f"Found {len(products)} products.")

Accessing attributes

# Get the href attribute from a link
link = soup.find("a")
print(link["href"])
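One caveat worth knowing: the square-bracket syntax raises a KeyError if the attribute is missing. The .get() method returns None instead, which is safer when you are not certain every matched element carries the attribute. A self-contained sketch on made-up HTML:

```python
from bs4 import BeautifulSoup

# Two anchors: one with an href, one without.
html = '<a href="/books/1">First</a> <a>Second</a>'
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")
print(links[0]["href"])      # /books/1
print(links[1].get("href"))  # None -- no KeyError raised
```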

BeautifulSoup also supports CSS selector syntax through the select() method, which some developers find more intuitive if they already know CSS:

# Equivalent to find_all("article", class_="product_pod")
products = soup.select("article.product_pod")
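The lookup styles behave consistently, which is easy to verify on a small hand-written document (the HTML below is made up for illustration):

```python
from bs4 import BeautifulSoup

html = """
<article class="product_pod"><h3>Book One</h3></article>
<article class="product_pod"><h3>Book Two</h3></article>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("article", class_="product_pod")     # first match only
all_kw = soup.find_all("article", class_="product_pod")
all_css = soup.select("article.product_pod")           # CSS selector form

print(first.find("h3").text)      # Book One
print(len(all_kw), len(all_css))  # 2 2
```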

A Complete Scraper Example

The site books.toscrape.com is a sandbox built specifically for practicing web scraping. It has no terms of service restrictions on automated access and its HTML structure is clean and predictable. The following scraper collects the title and price of every book on the first page.

import requests
from bs4 import BeautifulSoup

def scrape_books():
    url = "https://books.toscrape.com/"
    headers = {"User-Agent": "Mozilla/5.0"}

    response = requests.get(url, headers=headers)

    if response.status_code != 200:
        print(f"Failed to fetch page: {response.status_code}")
        return []

    soup = BeautifulSoup(response.text, "html.parser")

    books = []
    for article in soup.find_all("article", class_="product_pod"):
        title = article.find("h3").find("a")["title"]
        price = article.find("p", class_="price_color").text.strip()
        books.append({"title": title, "price": price})

    return books


if __name__ == "__main__":
    results = scrape_books()
    for book in results:
        print(f"{book['title']} — {book['price']}")

Breaking this down: the scraper fetches the page, checks for a successful response, then iterates over every article element with the class product_pod. For each article, it finds the title by navigating to the h3 tag and reading the title attribute of the link inside it (the visible link text is often truncated, but the title attribute holds the full name). It then grabs the price from the paragraph with class price_color.

Running this script should print 20 book titles and their prices, one per line.
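Note that the scraped prices arrive as strings like "£51.77", so any arithmetic on them needs a conversion first. A small helper, written defensively in case the currency symbol arrives with encoding artifacts, might look like this:

```python
def parse_price(price_text):
    """Strip currency symbols and return the price as a float.

    Keeps only digits and the decimal point, so it tolerates
    stray encoding artifacts around the currency symbol.
    """
    cleaned = "".join(ch for ch in price_text if ch.isdigit() or ch == ".")
    return float(cleaned)

print(parse_price("£51.77"))  # 51.77
```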

Saving results to a CSV file

Printing results to a terminal is useful for testing, but you will usually want to store them. Python's built-in csv module makes it easy to write your scraped data to a file:

import csv

def save_to_csv(books, filename="books.csv"):
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "price"])
        writer.writeheader()
        writer.writerows(books)
    print(f"Saved {len(books)} records to {filename}.")

Call save_to_csv(results) after scrape_books() and you will have a spreadsheet-ready file waiting for you.
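A quick way to confirm the file is well formed is to read it straight back with csv.DictReader. This sketch writes two sample records to a temporary file and verifies the round trip:

```python
import csv
import os
import tempfile

# Two sample records standing in for real scraped results.
books = [
    {"title": "A Light in the Attic", "price": "£51.77"},
    {"title": "Tipping the Velvet", "price": "£53.74"},
]

path = os.path.join(tempfile.mkdtemp(), "books.csv")
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(books)

# Read the file back to confirm nothing was lost in the round trip.
with open(path, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

print(len(rows), rows[0]["title"])  # 2 A Light in the Attic
```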

Scraping Responsibly

Web scraping sits in a legally and ethically nuanced space. A few principles keep you on the right side of both:

  • Read the terms of service. Many sites explicitly prohibit automated access. If a site's terms forbid scraping, do not scrape it, regardless of whether you could technically get away with it.
  • Check robots.txt. The file at yourtargetsite.com/robots.txt specifies which paths automated agents are allowed or disallowed from accessing. Respecting it is standard practice.
  • Add delays between requests. Hammering a server with rapid-fire requests can degrade service for real users and may get your IP address blocked. Use time.sleep() to introduce a pause of one to two seconds between requests.
  • Do not scrape personal data. Collecting personally identifiable information without consent raises serious legal exposure under laws like GDPR and CCPA.
  • Prefer official APIs. If a site offers a public API for the data you need, use it. APIs are intentionally designed for programmatic access and are far more stable than scraping HTML, which can break whenever the site redesigns its layout.
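Python's standard library can parse robots.txt for you via urllib.robotparser. The rules below are fed in directly as strings so the example runs offline; against a live site you would call set_url() and read() instead:

```python
from urllib.robotparser import RobotFileParser

# Feed rules in directly so no network access is needed; a real
# scraper would use parser.set_url(".../robots.txt") then parser.read().
parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(parser.can_fetch("*", "https://example.com/catalogue/"))    # True
print(parser.can_fetch("*", "https://example.com/private/data"))  # False
```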

Warning

Courts in the United States and Europe have reached different conclusions on the legality of web scraping depending on the circumstances. Academic or personal research projects generally face lower risk than commercial applications, but this is not legal advice. When in doubt about a specific use case, consult an attorney familiar with internet law.

Key Takeaways

  1. requests fetches, BeautifulSoup parses. Keep these two roles clear in your mind. The HTTP layer and the parsing layer are separate concerns, and separating them in your code makes both easier to debug.
  2. Always check the status code. Never assume a request succeeded. A response with a non-200 status code will not contain the page you asked for, and processing it anyway will produce confusing errors downstream.
  3. Use a sandbox site to practice. Sites like books.toscrape.com and quotes.toscrape.com exist precisely for learners. Start there before pointing your scraper at production websites.
  4. Add delays and respect robots.txt. Responsible scraping is not just ethical — it also keeps your scraper working longer by avoiding IP bans and server-side blocking.
  5. Static HTML is just the beginning. Once you are comfortable with requests and BeautifulSoup, the natural next steps are handling pagination, managing sessions and cookies, and moving to Playwright for JavaScript-heavy sites.
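As an illustration of the pagination step mentioned above, here is one way to structure it: the fetching function is passed in as a parameter, which keeps the crawl loop testable without network access. The page-N.html URL pattern mirrors how books.toscrape.com numbers its catalogue pages, but verify it against the live site before relying on it:

```python
import time

def crawl_pages(fetch, base_url, max_pages=50, delay=1.0):
    """Fetch numbered catalogue pages until one is missing.

    `fetch` is any callable that takes a URL and returns the page's
    HTML, or None when the page does not exist. Passing it in makes
    the loop easy to test with a fake fetcher.
    """
    pages = []
    for n in range(1, max_pages + 1):
        html = fetch(f"{base_url}catalogue/page-{n}.html")
        if html is None:
            break  # ran off the end of the catalogue
        pages.append(html)
        time.sleep(delay)  # be polite between requests
    return pages

# A fake fetcher standing in for requests.get, for demonstration:
fake_site = {
    f"https://example.com/catalogue/page-{n}.html": f"<html>page {n}</html>"
    for n in (1, 2, 3)
}
pages = crawl_pages(fake_site.get, "https://example.com/", delay=0)
print(len(pages))  # 3
```

Injecting the fetcher like this also makes it trivial to swap in a version that adds headers, retries, or caching later, without touching the crawl logic.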

Web scraping rewards curiosity. The best way to get comfortable with it is to pick a dataset you actually want — book titles, weather data, stock tickers, job postings — and write a scraper to collect it. The concepts covered here will carry you through the vast majority of beginner and intermediate scraping projects.
