Scrape web content using Python BeautifulSoup

Hello to all,

Today I am sharing the Python BeautifulSoup package, which helps to scrape HTML content from the web.
You can install BeautifulSoup with pip using the following command.

$ sudo pip install beautifulsoup4

The following simple program scrapes all the links from an index.html file.


<title>Simple BeautifulSoup</title>
<a href="">Click</a>

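from bs4 import BeautifulSoup

# Name the parser explicitly and close the file when done.
with open("./index.html") as html_file:
    soup = BeautifulSoup(html_file, "html.parser")
# Print each anchor's href, falling back to "/" when it is missing.
for anchor in soup.find_all("a"):
    print(anchor.get("href", "/"))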

Save both files in the same folder, then run the script with the python command. The program opens the index.html file, and BeautifulSoup finds all the a (anchor) tags in it. For each anchor tag it gets the href attribute and prints its value.
The output will look like this.

$ python
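
To see what find_all and get return without needing a file on disk, here is a small sketch that parses an HTML string directly. The markup and the variable names (html, hrefs, with_href) are made up for this illustration and are not part of the program above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration only.
html = """
<title>Simple BeautifulSoup</title>
<a href="/about">About</a>
<a href="https://example.com/">Example</a>
<a name="top">No href here</a>
"""

# "html.parser" is the parser that ships with Python.
soup = BeautifulSoup(html, "html.parser")

# get() returns the attribute value, or the fallback when it is missing.
hrefs = [anchor.get("href", "/") for anchor in soup.find_all("a")]
print(hrefs)  # ['/about', 'https://example.com/', '/']

# Passing href=True keeps only the anchors that have an href attribute.
with_href = [anchor["href"] for anchor in soup.find_all("a", href=True)]
print(with_href)  # ['/about', 'https://example.com/']
```

The third anchor has no href, so get falls back to "/" in the first list and the href=True filter drops it from the second.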

You can also scrape content from the web. The following is my code, which gets all the links from a web page.

from urllib.request import urlopen

from bs4 import BeautifulSoup

url = ""
page = urlopen(url)
soup = BeautifulSoup(page, "html.parser")
links = soup.find_all("a")

for link in links:
    href = link.get("href")
    if href is None:
        continue  # skip anchors that have no href attribute
    address = "" + href
    print(address)


  • url = "" -> This line stores the URL of the page to scrape.
  • page = urlopen(url) -> urlopen (from urllib.request, the Python 3 replacement for urllib2) opens the URL and stores the response in the page variable.
  • soup = BeautifulSoup(page, "html.parser") -> BeautifulSoup parses the content of the web page with Python's built-in HTML parser and stores the result in the soup variable.
  • links = soup.find_all("a") -> The soup.find_all function collects all the "a" (anchor) tags into the links variable.
  • for link in links: -> This for loop visits each anchor tag in links in turn and binds it to the link variable.
  • href = link.get("href") -> The link.get("href") call returns the value of the anchor tag's href attribute, or None if the tag has none. The if check skips those tags.
  • address = "" + href -> This prefixes the href with a string and stores the result in the address variable.
  • print(address) -> Finally, this prints the address.
  • Similarly, we can scrape almost anything from the web using Python BeautifulSoup.
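
Prefixing every href with a fixed string only works when all the links are written the same way. Real pages usually mix relative and absolute links, so one possible alternative is urljoin from the standard library's urllib.parse module (Python 3). This is only a sketch: the base URL and the href values below are made up for illustration.

```python
from urllib.parse import urljoin

# Hypothetical base URL and href values, invented for illustration.
base_url = "https://example.com/blog/"
hrefs = ["/about", "post.html", "https://other.example/page", None]

# urljoin resolves relative links against the base and leaves
# absolute links untouched; missing hrefs (None) are skipped.
addresses = [urljoin(base_url, href) for href in hrefs if href]

for address in addresses:
    print(address)
# https://example.com/about
# https://example.com/blog/post.html
# https://other.example/page
```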

    If you have any questions about BeautifulSoup, please leave a comment.

    With Regards,
    S. Praveen

