Scrape web content using Python BeautifulSoup

Hello to all,

In this post I am sharing the Python BeautifulSoup package, which helps you scrape HTML content from the web.
You can install BeautifulSoup with pip using the following command.

$ sudo pip install beautifulsoup4


The following simple program scrapes all links from an index.html file.

index.html

<html>
<head>
<title>Simple BeautifulSoup</title>
</head>
<body>
<a href="http://google.com">Click</a>
</body>
</html>



test.py

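from bs4 import BeautifulSoup

# Parse the local file with Python's built-in HTML parser
with open("./index.html") as f:
    soup = BeautifulSoup(f, "html.parser")

# Print the href of every anchor tag ('/' if an anchor has no href)
for anchor in soup.find_all('a'):
    print(anchor.get('href', '/'))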


Save both files in a single folder, then run the command "python test.py". BeautifulSoup opens index.html and finds all a (anchor) tags in it. For each anchor tag it gets the href attribute and prints its value.
The output will look like this.

$ python test.py
http://google.com
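
The same steps also work on HTML held in a string, which is handy for trying things out without a file. The following is only a small sketch: the sample markup (including the second anchor without an href) is made up for illustration, and "html.parser" is Python's built-in parser, chosen here as an assumption so no extra package is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used here instead of index.html.
html = """
<html>
<body>
<a href="http://google.com">Click</a>
<a name="top">No href here</a>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect (href, link text) pairs; '/' is the fallback when href is missing.
results = [(a.get("href", "/"), a.get_text(strip=True)) for a in soup.find_all("a")]

for href, text in results:
    print(href, text)
```

The second anchor has no href attribute, so get() falls back to the default value '/' instead of returning None.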

You can also scrape content from the web. The following code gets all links from a website.


flipkart_href.py

from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup

url = "http://flipkart.com"
page = urlopen(url)
soup = BeautifulSoup(page, "html.parser")
links = soup.find_all("a")

for link in links:
    href = link.get('href')
    if href is None:
        continue  # skip anchors that have no href attribute
    address = urljoin(url, href)  # resolve relative links against the site URL
    print(address)



Explanation->

  • url = "http://flipkart.com" -> This line stores the URL.
  • page = urlopen(url) -> urlopen (from urllib.request, the Python 3 replacement for urllib2) opens the URL and stores the response in the page variable.
  • soup = BeautifulSoup(page, "html.parser") -> BeautifulSoup parses the content of the web page with Python's built-in HTML parser and stores the result in the soup variable.
  • links = soup.find_all("a") -> soup.find_all collects all "a" (anchor) tags into the links variable.
  • for link in links: -> This for loop assigns each item of links to the link variable in turn.
  • href = link.get('href') -> link.get('href') reads the href attribute of the anchor tag; it returns None when the attribute is missing, so such anchors are skipped.
  • address = urljoin(url, href) -> urljoin combines the site URL with the href, so relative links become full addresses, and stores the result in the address variable.
  • print(address) -> Finally, this prints the address.
  • Similarly, we can scrape other content from the web using Python's BeautifulSoup.
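
One detail worth illustrating: the href values in anchor tags are not always relative paths, so joining them to the site address by plain string concatenation can produce invalid addresses. The sketch below uses urljoin from the standard library to resolve each href against the site URL. The href values are hypothetical examples, not links taken from the real site.

```python
from urllib.parse import urljoin

base = "http://flipkart.com"

# Hypothetical href values of the kinds commonly found in anchor tags.
hrefs = ["/offers", "help/contact", "http://google.com", "//example.com/page"]

# urljoin keeps absolute URLs as they are and resolves the others against base.
resolved = [urljoin(base, href) for href in hrefs]

for address in resolved:
    print(address)
```

With plain concatenation, an absolute href such as "http://google.com" would end up as "flipkart.comhttp://google.com", whereas urljoin leaves it untouched.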

    If you have any questions about BeautifulSoup, please leave a comment.

    With Regards,
    S. Praveen
