web scraping

Web Scrapping, also known as screen scraping, data mining,or web harvesting is not new and has existed for a long time now. This article focuses on how to use python to request information from a web server, how to perform basic handling of the server’s response, and how to begin interacting with a website in an automated fashion. To be honest, web scraping is an excellent industry to enter if you want a high return on a small initial investment.

Prerequisites

Python 3 installed

What is Web Scraping?

Web scraping is the process of collecting and parsing raw data from the Web

## Why Web Scraping? web scrapers can go places that traditional search engines cannot

Introduction to BeautifulSoup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.

BeautifulSoup helps format and organize the messy web by fixing bad HTML and presenting us with easily traversable Python objects representing XML structures.

Installing BeautifulSoup

We will be using the BeautifulSoup 4 library (also known as BS4) throughout this article. Check on the complete installation here For linux users:

$ sudo apt-get install python-bs4

To install using the pip package

$ pip install beautifulsoup4

To check for installation, open for the python terminal and import beautifulsoup.

If the import completes without error, Voila🎉🎉, you have beautifulsoup installed successfully.

Note 🔖: If you install any python library without a virtual environment, you are installing it globally. It is recommended to install python libraries via virtual enviroment to avoid potential conflicts between installed libraries, also it helps to easily bundle projects with associated libraries when working with multiple python projects.

To do this, install virtual environment

 pip3 install virtualenv

Create a virtualenvironment Linux/mac

python3 -m venv env

Windows

py -m venv env

This creates a new environment called env, which you must activate to use: Linux/MacOs

source env/bin/activate

Windows

.\env\Scripts\activate

After you have activated the environment, you can now install your library, in our case beautifulsoup

(env)emma$ pip install beautifulsoup4
(env)emma$ python
> from bs4 import BeautifulSoup
>

You can leave the environment with the deactivate command, after which you can no longer access any libraries that were installed inside the virtual environment:

(env)emma$ deactivate
emma$ python
> from bs4 import BeautifulSoup
Traceback (most recent call last):
File "<stdin>", line 1, in <module

Running BeautifulSoup

The beautifulsoup object is the most commonly used object in the beautifulsoup library, so we have to import it when running BeautifulSoup in our project.

scrapy.py

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html.parser')
print(bs.h1)

Output

Explanation

We are importing the urlopen function and calling html.read() in order to get the HTML content of the page.
When you create a BeautifulSoup object, two arguments are passed in:

bs = BeautifulSoup(html.read(), 'html.parser')

The first is the HTML text the object is based on, and the second specifies the parser that you want BeautifulSoup to use in order to create that object. NB: html.parser is a parser that is included with Python 3 and requires no extra installa‐ tions in order to use.

Tip💡: lxml is another example of a popular parser that has some advantage over the html parser.

bs = BeautifulSoup(html.read(), 'lxml')

It is generally better at parsing “messy” or malformed HTML code.
It is forgiving and fixes problems like unclosed tags, tags that are improperly nested, and missing head or body tags.
It is also some‐what faster than html.parser , although speed is not necessarily an advantage in web scraping, given that the speed of the network itself will almost always be your largest bottleneck. However, you have to install lxml first o use it. It depends on third-party C libraries to function which can cause problems for portability and ease of use, compared to html.parser.

Another example of parser include html5lib

bs = BeautifulSoup(html.read(), 'html5lib')

Conecting Reliably and Handling Exceptions

It is a good practice to always provide a way of handling errors and exception while doing we scraping. You can handle this exception in the following way:

from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

try:
html = urlopen('https://pythonscrapingthisurldoesnotexist.com')
except HTTPError as e:
print(e)
except URLError as e:
print('The server could not be found!')
else:
print('It Worked!')

• If the page is not found on the server (or there was an error in retrieving it), urlopen will throw an HTTPError, the program now prints the error, and does not execute the rest of the program under the else statement • If the server is not found at all, the site is down, or the URL is mistyped), urlopen will throw an URLError and prints the error.

Conclusion

I hope you enjoyed the article. Next i will be handling "Advanced HTML Parsing", where we will be looking at parsing complicated HTML pages in order to extract only the information we are looking for.

Thanks for reading, i would appreciate any feedback or comment.

Let's connect: Find me on twitter and Linkedin

Introduction to Web Scrapping in Python with BeautifulSoup

Prerequisites

What is Web Scraping?

Introduction to BeautifulSoup

Installing BeautifulSoup

Running BeautifulSoup

Conecting Reliably and Handling Exceptions

Conclusion

Comments

More from this blog

Data Fallacies: I6 common Data Fallacies and how to Avoid them

Building Basic RESTful API with Flask-RESTful

Guide to Becoming a Self-taught Software Engineer: Best tools and practices

FAILED INTERVIEW EXPERIENCE - Cracking Amazon without Luck

Command Palette

Prerequisites

What is Web Scraping?

Introduction to BeautifulSoup

Installing BeautifulSoup

Running BeautifulSoup

Conecting Reliably and Handling Exceptions

Conclusion

Comments

More from this blog