Member-only story

Web Scraping with Beautiful Soup — Searching Nodes

John Au-Yeung

3 min readJan 31, 2021

We can get data from web pages with Beautiful Soup.

It lets us parse the DOM and extract the data we want.

In this article, we’ll look at how to scrape HTML documents with Beautiful Soup.

Searching Strings with Regex

We can search strings with regex.

For example, we can write:

from bs4 import BeautifulSoup
import re
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all(string=re.compile("Dormouse")))

We call re.compile to create our regex.

Also, we can search for strings with a function:

from bs4 import BeautifulSoup
import re
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p…

Web Scraping with Beautiful Soup — Searching Nodes

Searching Strings with Regex

Written by John Au-Yeung

No responses yet