
Scraping with Python: A Complete Beginner's Guide with Practical Example

  • Writer: JP
  • Apr 1
  • 5 min read

Updated: Apr 2



Learn how to collect data from the web using Python scraping, step-by-step, with BeautifulSoup and requests.


If you've ever wondered how some websites manage to automatically gather data from the internet, the answer probably involves a technique called web scraping. And guess which language is the favorite for this task? Exactly: Python!


In this post, we’ll explore:


  • What scraping is;

  • Use cases;

  • Pros and cons;

  • A full practical example with code explanation.


  1. What is Python Scraping?


Scraping is the process of automatically collecting information from websites. In the case of scraping with Python, we use specific libraries to simulate browsing, capture page content, and turn it into usable data.


  2. Use Cases of Scraping with Python


  • Price monitoring on e-commerce websites;

  • Data collection for market analysis;

  • Extraction of information from news websites;

  • Content aggregators (like promotion search engines);

  • Automatic database updates using public data;

  • Data extraction for use in ETLs (Learn how to create an ETL with Python here).


  3. Advantages of Scraping with Python


  • Python has several powerful libraries for scraping;

  • Simple and readable code — ideal for beginners;

  • Automates repetitive tasks and enables large-scale data collection;

  • Easy integration with data science libraries like pandas and matplotlib.


  4. Disadvantages


  • Sites with protection (like Cloudflare or captchas) make scraping difficult;

  • Changes in website structure can break your code;

  • Legality: not all websites allow scraping (check their robots.txt);

  • Your IP may be blocked due to excessive requests (a simple delay between requests, as sketched below, helps with this).
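
On that last point, a simple courtesy pattern is to pause between requests. A minimal sketch, assuming you want to fetch a few catalogue pages from the same demo site:

import time
import requests

# A few pages from the same site (Books to Scrape paginates its catalogue)
urls = [f"http://books.toscrape.com/catalogue/page-{n}.html" for n in range(1, 4)]

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    time.sleep(1)  # wait a second between requests to go easy on the server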


  5. Complete Practical Example of Scraping with Python


We'll use the requests and BeautifulSoup libraries to extract data from the Books to Scrape website, which was created specifically for scraping practice.


Install the libraries

pip install requests beautifulsoup4

Now, Let’s Develop the Code
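
Below is a minimal sketch of the script, pieced together from the exact lines we dissect in the next section. Running it prints all 20 books on the first page; the sample result below shows the first five.

import requests
from bs4 import BeautifulSoup

url = "http://books.toscrape.com/"
response = requests.get(url)

if response.status_code == 200:
    # Parse the raw HTML into a navigable BeautifulSoup object
    soup = BeautifulSoup(response.text, 'html.parser')

    # Each book on the page lives in an <article class="product_pod">
    books = soup.find_all('article', class_='product_pod')

    for book in books:
        # The <h3> holds an <a> tag whose "title" attribute is the full title
        title = book.find('h3').a['title']
        price = book.find('p', class_='price_color').text
        availability = book.find('p', class_='availability').text.strip()

        print(f"Title: {title}")
        print(f"Price: {price}")
        print(f"Availability: {availability}")
        print("---")
else:
    print("Error accessing the page")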



Execution Result


Title: A Light in the Attic
Price: £51.77
Availability: In stock
---
Title: Tipping the Velvet
Price: £53.74
Availability: In stock
---
Title: Soumission
Price: £50.10
Availability: In stock
---
Title: Sharp Objects
Price: £47.82
Availability: In stock
---
Title: Sapiens: A Brief History of Humankind
Price: £54.23
Availability: In stock
---

  6. Understanding the Source Code


  6.1. Understanding the Function – requests.get(url)


What does this line do?

response = requests.get(url)

It sends an HTTP GET request to the given URL — in other words, it accesses the website as if it were a browser asking for the page content.


If the URL is:

url = "http://books.toscrape.com/"

Then requests.get(url) will do the same thing as typing that address into your browser and hitting "Enter".


What is requests?


requests is a super popular library in Python for handling HTTP requests (GET, POST, PUT, DELETE, etc.). It's like the "post office" for your code: you send a letter (the request) and wait for the reply (the site's content).


What’s inside response?


The response object contains several important pieces of information from the page’s reply. Some that we commonly use:


  • response.status_code → shows the HTTP status code (200, 404, 500...);

    • 200 = Success ✨

    • 404 = Page not found ❌

  • response.text → the full HTML of the page (as a string);

  • response.content → same as text, but in bytes (useful for images, PDFs, etc.);

  • response.headers → the HTTP headers sent by the server (you can see things like content type, encoding, etc.).
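
A quick sketch of inspecting these fields (the exact header values are whatever the server happens to send):

import requests

response = requests.get("http://books.toscrape.com/")

print(response.status_code)                  # e.g. 200
print(response.headers.get("Content-Type"))  # e.g. text/html
print(response.text[:200])                   # first 200 characters of the HTML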


Pro Tip:


Always check the status_code before proceeding with scraping, like this:


if response.status_code == 200:
    # All good, carry on with the scraping
    ...
else:
    print("Error accessing the page")

This way, your code won’t break if the website is down or the URL path has changed.


 

  6.2. Understanding the Function – BeautifulSoup()


What does this line do?
soup = BeautifulSoup(response.text, 'html.parser')

BeautifulSoup is an HTML parsing tool (in other words, it helps "understand" HTML). It converts that huge block of text returned by the website into a navigable object — allowing you to search for tags, attributes, classes, text… all in a very simple way.


  • response.text: this is the HTML of the page, returned by the requests.get() request.

  • 'html.parser': this is the parser used — the engine that will interpret the HTML.


There are other parsers like 'lxml' or 'html5lib', but 'html.parser' comes built-in with Python and works well in most cases.
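
If you do want to swap parsers, it's just the second argument. A small sketch, noting that 'lxml' and 'html5lib' are third-party packages you'd have to install first:

import requests
from bs4 import BeautifulSoup

response = requests.get("http://books.toscrape.com/")

soup = BeautifulSoup(response.text, 'html.parser')  # built-in, no extra install
# Alternatives, assuming you've run: pip install lxml html5lib
soup = BeautifulSoup(response.text, 'lxml')         # usually the fastest
soup = BeautifulSoup(response.text, 'html5lib')     # most tolerant of broken HTML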

What does the soup variable become?

It becomes a BeautifulSoup object. This object represents the entire structure of the page, and you can then use methods like:


  • .find() → gets the first element that matches what you're looking for.

  • .find_all() → gets all elements that match the filter.

  • .select() → searches using CSS selectors (like .class, #id, tag tag).

  • .text or .get_text() → extracts only the text inside the element, without HTML tags.


🔍 Visual Example:
html = "<html><body><h1>Hi!</h1></body></html>"
soup = BeautifulSoup(html, 'html.parser')

print(soup.h1)           # <h1>Hi!</h1>
print(soup.h1.text)      # Hi!

In a scraping context:
response = requests.get("http://books.toscrape.com/")
soup = BeautifulSoup(response.text, 'html.parser')

# Now you can search for any tag:
title_tag = soup.find('title')
print(title_tag.text)  # Prints the page title
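
And a tiny self-contained example exercising all four methods on a made-up snippet of HTML:

from bs4 import BeautifulSoup

html = """
<div id="shelf">
  <p class="book">First</p>
  <p class="book">Second</p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

print(soup.find('p').text)                     # "First" (first match only)
print(len(soup.find_all('p', class_='book')))  # 2 (every match)
print(soup.select('#shelf .book')[1].text)     # "Second" (CSS selectors)
print(soup.find('div').get_text())             # all the text, tags stripped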
 

  6.3. Understanding the Function – soup.find_all()


What does this line do?
books = soup.find_all('article', class_='product_pod')

It retrieves all the HTML elements that represent books on the page, using the HTML tag article and the CSS class product_pod as the basis. On the Books to Scrape website, each book is displayed in a structure like this:


<article class="product_pod">
	<h3><a title="Book Title"></a></h3>
	<p class="price_color">£51.77</p>
	<p class="instock availability">In stock</p>
</article>

So, this line is essentially saying:

"Hey, BeautifulSoup, get me all the article elements with the class product_pod, and return them in a list called books."

What kind of data does it return?

books will be a list of BeautifulSoup objects, each one representing a book. Then, we can loop through this list using a for loop and extract the details of each individual book (like the title, price, and availability).


[
  <article class="product_pod">...</article>,
  <article class="product_pod">...</article>,
  ...
  (20 items in total)
]
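
That loop, sketched end to end (assuming soup was built from the Books to Scrape homepage as before):

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("http://books.toscrape.com/").text, 'html.parser')
books = soup.find_all('article', class_='product_pod')

# Each item in the list is itself a BeautifulSoup object we can search inside
for book in books:
    title = book.find('h3').a['title']
    price = book.find('p', class_='price_color').text
    print(f"{title}: {price}")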
 

  6.4. Understanding the Function – book.find()


What does this line do?
price = book.find('p', class_='price_color').text

The .find() method is used to search for the first HTML element that matches the filter you provide.

The basic structure is:

element = soup_object.find(tag_name, optional_attributes)

In our case:

book.find('p', class_='price_color')

Means:

"Look inside the book for the first <p> tag that has the class price_color."

🔍Examples using .find():


Getting the price:
price = book.find('p', class_='price_color').text
# Result: "£51.77"

Getting the title:
title = book.find('h3').a['title']
# The <h3> contains an <a> tag with a "title" attribute

 

Conclusion: Is It Worth Using Scraping with Python?


Absolutely! Web scraping with Python is an incredibly useful skill for anyone working with data, automation, or simply looking to optimize repetitive tasks. With just a few lines of code and libraries like requests and BeautifulSoup, you can extract valuable information from the web quickly and efficiently.

Plus, Python is accessible, has a massive community, and tons of tutorials and resources — so you're never alone on this journey.


However, it’s important to keep in mind:


  • Not all websites allow scraping — always respect the robots.txt file and the site's terms of use;

  • Changes to the HTML structure can break your code — so keep your scraper updated;

  • More complex websites (with JavaScript, logins, etc.) may require more advanced tools like Selenium or Scrapy.


If you're just getting started, this post was only your first step. From here, you can level up by saving your data to spreadsheets or databases with pandas, integrating it with dashboards, or even building more complex automation bots.
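
For instance, a minimal sketch of that pandas step, assuming you collected each book into a list of dicts while scraping:

import pandas as pd

# Assume this list was filled inside the scraping loop shown earlier
data = [
    {"title": "A Light in the Attic", "price": "£51.77", "availability": "In stock"},
    {"title": "Tipping the Velvet", "price": "£53.74", "availability": "In stock"},
]

df = pd.DataFrame(data)
df.to_csv("books.csv", index=False)  # ready to open in any spreadsheet tool
print(df.head())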


See y'all soon!

