Web Scraping with Beautiful Soup — Siblings, CSS Selectors, and Node Manipulation

John Au-Yeung
3 min read · Jan 31, 2021

Photo by Giovanni Moschini on Unsplash

We can get data from web pages with Beautiful Soup.

It lets us parse the DOM and extract the data we want.

In this article, we’ll look at how to scrape HTML documents with Beautiful Soup.

find_all_previous() and find_previous()

We can get all the matching nodes that come before a given node in the document with the find_all_previous method.

For example, if we have:

from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
first_link = soup.a
print(first_link.find_all_previous('p'))

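Then we see the p elements that start before the first link, nearest first. The enclosing paragraph is included because its opening tag comes before the link: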

[<p class="story">Once upon a time there were three little sisters; and their names…
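
The find_previous method works the same way, but it returns only the nearest match instead of a list. Here is a minimal sketch, assuming a shortened version of the document above (the trimmed HTML and the variable names are illustrative, not from the original example):

```python
from bs4 import BeautifulSoup

# Shortened, illustrative version of the document used above
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a> and
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>.</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
first_link = soup.a

# find_all_previous returns every match that starts earlier in the document,
# nearest first, so the enclosing <p class="story"> comes before <p class="title">.
previous_paragraphs = first_link.find_all_previous('p')
print([p['class'] for p in previous_paragraphs])  # [['story'], ['title']]

# find_previous returns only the nearest match.
nearest_title = first_link.find_previous('title')
print(nearest_title.string)  # The Dormouse's story

# When nothing matches, find_previous returns None rather than an empty list.
print(first_link.find_previous('table'))  # None
```

Since find_previous can return None, it is worth checking the result before reading attributes from it.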
