
Web Scraping with Beautiful Soup — Encoding

John Au-Yeung
Jan 31, 2021



We can get data from web pages with Beautiful Soup.

It lets us parse the DOM and extract the data we want.

In this article, we’ll look at how to control the way Beautiful Soup writes out HTML documents, starting with output formatters.

Output Formatters

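Beautiful Soup lets us choose how special characters are written when a parsed document is turned back into a string. We do this by passing a formatter.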

For example, we can write:

from bs4 import BeautifulSoup
french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
soup = BeautifulSoup(french, 'html.parser')
print(soup.prettify(formatter="html"))

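to set the formatter when we call prettify. With formatter="html", Beautiful Soup converts Unicode characters such as é to HTML entities where it can, so the output keeps &eacute;.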
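To make the effect of the formatter visible, here is a minimal sketch that renders the same string with the default "minimal" formatter and with "html". The variable names minimal_out and html_out are only illustrative.

```python
from bs4 import BeautifulSoup

french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
soup = BeautifulSoup(french, "html.parser")

# Default formatter ("minimal"): escapes only what is needed for valid markup,
# so the accented character is written as a plain Unicode character.
minimal_out = soup.prettify(formatter="minimal")

# "html" formatter: also converts Unicode characters to named HTML entities.
html_out = soup.prettify(formatter="html")

print(minimal_out)
print(html_out)
```

In both cases the angle brackets inside the text stay escaped as &lt; and &gt;. Only the handling of é differs.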

We can also use the html5 formatter, which works like html but omits the closing slash in void tags such as br.

For example, we can write:

from bs4 import BeautifulSoup
br = BeautifulSoup("<br>", 'html.parser').br
print(br.prettify(formatter="html"))
print(br.prettify(formatter="html5"))

From the first print, we see:

<br/>

And from the second print, we see:

<br>
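The same difference applies to every void tag in a fragment, not just br. The sketch below assumes a small made-up fragment (the a.png file name is hypothetical) and uses decode, which returns the markup as a string without the extra indentation that prettify adds.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment containing two void tags: <br> and <img>
fragment = '<p>Line one<br>Line two<img src="a.png"></p>'
p = BeautifulSoup(fragment, "html.parser").p

# "html" closes void tags XML-style, with a trailing slash
html_out = p.decode(formatter="html")

# "html5" leaves void tags without the trailing slash
html5_out = p.decode(formatter="html5")

print(html_out)
print(html5_out)
```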

We can also set the formatter to None, which tells Beautiful Soup not to modify strings at all on output:

from bs4 import BeautifulSoup
link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>', 'html.parser')…
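The listing above is cut off in this copy, so what follows is a separate minimal sketch, not the rest of the author's code. It reuses the same link markup and compares the default formatter with formatter=None. The names default_out and raw_out are illustrative.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/?foo=val1&bar=val2">A link</a>'
link_soup = BeautifulSoup(markup, "html.parser")

# Default formatter: the bare ampersand in the URL is escaped as &amp;
default_out = link_soup.a.decode()

# formatter=None: strings are written out unmodified, so the ampersand stays bare
raw_out = link_soup.a.decode(formatter=None)

print(default_out)
print(raw_out)
```

Skipping the substitution step is faster, but the result may not be valid HTML, as the unescaped ampersand here shows.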
