Hello to all,
Today I am sharing the Python BeautifulSoup package, which helps you scrape HTML content from the web.
You can install BeautifulSoup with pip using the following command.
$ sudo pip install beautifulsoup4
The following simple program scrapes all the links from an index.html file.
index.html
<html>
  <head>
    <title>Simple BeautifulSoup</title>
  </head>
  <body>
    <a href="http://google.com">Click</a>
  </body>
</html>
test.py
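from bs4 import BeautifulSoup

# Open the local file and parse it; naming the parser avoids a bs4 warning
with open("./index.html") as html_file:
    soup = BeautifulSoup(html_file, "html.parser")

# Print the href of every anchor tag, or '/' when an anchor has no href
for anchor in soup.find_all('a'):
    print(anchor.get('href', '/'))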
Save both files in the same folder, then run the command "python test.py". BeautifulSoup opens the index.html file and finds all the a (anchor) tags in it. For each anchor it reads the href attribute and prints its value. The second argument to get(), '/', is the default that is printed when an anchor has no href.
The output will look like this.
$ python test.py
http://google.com
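The same steps can also be tried without any files by passing the HTML to BeautifulSoup as a string. This is only a minimal sketch: the helper name extract_links and the second anchor (which has no href) are made up for illustration, to show when the '/' default is used.

```python
from bs4 import BeautifulSoup

# Sample markup (illustrative): one anchor with an href, one without
SAMPLE_HTML = """
<html>
  <head><title>Simple BeautifulSoup</title></head>
  <body>
    <a href="http://google.com">Click</a>
    <a name="top">No href here</a>
  </body>
</html>
"""


def extract_links(html, default="/"):
    """Return the href of every <a> tag, or `default` when href is missing."""
    soup = BeautifulSoup(html, "html.parser")
    return [anchor.get("href", default) for anchor in soup.find_all("a")]


links = extract_links(SAMPLE_HTML)
print(links)  # ['http://google.com', '/']
```

Returning a list instead of printing inside the loop makes the result easy to reuse or check later.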
You can also scrape content directly from the web. The following is my code, which gets all the links from a website.
flipkart_href.py
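from urllib.request import urlopen  # Python 3; Python 2 used urllib2
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "http://flipkart.com"
page = urlopen(url)
soup = BeautifulSoup(page, "html.parser")

# href=True skips anchors that have no href attribute
for link in soup.find_all("a", href=True):
    # urljoin turns relative links into full addresses based on url
    print(urljoin(url, link["href"]))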
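Explanation: the script downloads the page at url, passes the response to BeautifulSoup for parsing, and collects every a (anchor) tag with find_all. For each anchor it reads the href attribute, builds a full address from it, and prints the result.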
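One detail to watch when building full addresses: some href values are relative (such as "/offers") and some are already complete URLs, so simply putting the site name in front of every href can give wrong addresses. Here is a small sketch of one way to handle both cases with urljoin from the Python 3 standard library. The sample href values are made up for illustration and are not real Flipkart links.

```python
from urllib.parse import urljoin

base = "http://flipkart.com"

# Hypothetical href values, as they might appear in anchor tags
hrefs = ["/offers", "http://google.com"]

# urljoin resolves relative links against base and leaves absolute URLs alone
addresses = [urljoin(base, href) for href in hrefs]

for address in addresses:
    print(address)
# http://flipkart.com/offers
# http://google.com
```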
In the same way, you can scrape other parts of a web page with Python BeautifulSoup.
If you have any questions about BeautifulSoup, please leave a comment.
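For example, here is a short sketch that reads the page title and the text of some paragraphs instead of links. The HTML, the "price" class name, and the values are invented for this example.

```python
from bs4 import BeautifulSoup

# Made-up page content, used only to demonstrate the calls
html = """
<html>
  <head><title>Simple BeautifulSoup</title></head>
  <body>
    <h1>Offers</h1>
    <p class="price">100</p>
    <p class="price">250</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# soup.title is the <title> tag; .string is the text inside it
title = soup.title.string

# class_ (with the underscore) filters tags by their CSS class
prices = [p.get_text() for p in soup.find_all("p", class_="price")]

print(title)   # Simple BeautifulSoup
print(prices)  # ['100', '250']
```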
With Regards,
S. Praveen