Hi readers,
This is the second part of our complete Beautiful Soup tutorial series, and in this part I am going to show some advanced practical examples.
Introduction
As I mentioned in my previous tutorials, web scraping is a very interesting topic, and one of the main reasons is the possibilities it opens up: with web scraping, we can create various kinds of desktop apps that help us manage our online activities. In simple words, web scraping is a technique for collecting data from all over the internet. With its help, a user can extract and collect large amounts of data from various online resources. So here I am going to provide a complete tutorial that will help you understand the uses of the beautifulsoup module; in today's post, I am going to show you some more advanced usages of it. But first, if you are a new visitor to our blog, I suggest you take a look at our index, because there you will find various interesting things written in Python. For the first part of the beautifulsoup tutorial, click here.
Practical Example On IPython Notebook
Input : [1] #
# Author:
#
# surajsinghbisht054@gmail.com
# https://www.bitforestinfo.com
#
# Here, I am Using
# 1. Python 2.7 (Python Version)
# 2. BeautifulSoup 4 (bs4 Version)
# 3. IPython Notebook (Code Editor)
# 4. Ubuntu (Operating System)
#
# So, Let's Start With Practical Examples
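Before following along, you may want to confirm your setup matches the one above. A quick sketch for checking the interpreter and bs4 versions (bs4 exposes its version string as `bs4.__version__`):

```python
import sys
import bs4

# Print the interpreter and BeautifulSoup versions to compare with the setup above
print(sys.version)
print(bs4.__version__)
```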
Input : [2] # Example 1.
#
# Here, We Are Trying To Extract All Links From A Webpage
# So, let's start
#
# Import Module
import bs4
import urllib2, sys

if len(sys.argv) == 1:
    print "[*] Please Provide Domain Name:\n Usage: python link_bs4.py www.examplesite.com\n"
    sys.exit(0)

def parse_url(url):
    try:
        html = urllib2.urlopen(url).read()             # Read the HTML source
    except Exception as e:
        print "[Error] ", e
        sys.exit(0)
    parse = bs4.BeautifulSoup(html, 'html.parser')     # Feed data to bs4
    for i in parse.findAll('a'):                       # Search for anchor tags
        if 'href' in i.attrs.keys():                   # Check for the href key
            link = i.attrs['href']
            print link
    return

parse_url("https://www.bitforestinfo.com")             # Enter your site address
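If you are on Python 3 rather than Python 2.7, note that urllib2 no longer exists; its functionality lives in urllib.request. A minimal sketch of the same link extraction, run here against an inline HTML snippet (hypothetical content) so it works without a network connection:

```python
import bs4

# Inline HTML snippet standing in for a downloaded page (hypothetical content)
html = """
<html><body>
<a href="https://www.bitforestinfo.com">Home</a>
<a>anchor without an href</a>
<a href="/2017/01/some-post.html">Post</a>
</body></html>
"""

# On Python 3, a live page would instead be fetched with:
#   from urllib.request import urlopen
#   html = urlopen("https://www.bitforestinfo.com").read()

parse = bs4.BeautifulSoup(html, 'html.parser')
# Collect href values, skipping anchors that have no href attribute
links = [a.attrs['href'] for a in parse.find_all('a') if 'href' in a.attrs]
for link in links:
    print(link)
```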
Input : [3] # Example 2.
# Here, In This Example
# We will try to scrape data from the who.is website
# so, let's start
#
# Import module
from bs4 import BeautifulSoup
import urllib2, sys

# Who.is Url
url = "http://who.is/whois/"
# Website Name
website = "www.stackoverflow.com"
# Please Wait Message
print "[*] Please Wait.... Connecting To Who.is Server.."
# Download And Read Html Data
htmldata = urllib2.urlopen(url + website).read()
class_name = "rawWhois"  # Class name for extraction

# BeautifulSoup Constructor
try:  # Check whether lxml is installed
    import lxml
    # If installed, use it
    parse = BeautifulSoup(htmldata, 'lxml')
    print "[*] Using lxml Module For Fast Extraction"
except ImportError:
    # If lxml is not installed, fall back to the built-in parser
    parse = BeautifulSoup(htmldata, "html.parser")
    print "[*] Using built-in Html Parser [Slow Extraction. Please Wait ....]"

try:
    container = parse.findAll("div", {'class': class_name})  # Extract divs by class
    sections = container[1:]                  # Skip the first block
    for section in sections:                  # Iterate over the remaining blocks
        extract = section.findAll('div')      # Search for div tags
        heading = extract[0].text             # Extract the heading text
        print '\n[ ', heading, ' ]'           # Heading
        for i in extract[1].findAll('div'):   # Find all nested div tags
            fortab = '\t|'                    # Indentation prefix
            for j in i.findAll('div'):
                fortab = fortab + '----'
                line = j.text.replace('\n', ' ')
                print fortab, '>', line
except Exception as e:
    print "[ Error ] ", e
    print "[ Last Update : 1 Jan 2017 ]"
    print "[ Script Is Not Updated ]"
    print "[ Sorry! ]"
Output : [3] [*] Please Wait.... Connecting To Who.is Server..
[*] Using lxml Module For Fast Extraction
[ Registrant Contact Information: ]
|---- > Name
|-------- > Sysadmin Team
|---- > Organization
|-------- > Stack Exchange, Inc.
|---- > Address
|-------- > 110 William St , Floor 28
|---- > City
|-------- > New York
|---- > State / Province
|-------- > NY
|---- > Postal Code
|-------- > 10038
|---- > Country
|-------- > US
|---- > Phone
|-------- > +1.2122328280
|---- > Email
|-------- >
[ Administrative Contact Information: ]
|---- > Name
|-------- > Sysadmin Team
|---- > Organization
|-------- > Stack Exchange, Inc.
|---- > Address
|-------- > 110 William St , Floor 28
|---- > City
|-------- > New York
|---- > State / Province
|-------- > NY
|---- > Postal Code
|-------- > 10038
|---- > Country
|-------- > US
|---- > Phone
|-------- > +1.2122328280
|---- > Email
|-------- >
[ Technical Contact Information: ]
|---- > Name
|-------- > Sysadmin Team
|---- > Organization
|-------- > Stack Exchange, Inc.
|---- > Address
|-------- > 110 William St , Floor 28
|---- > City
|-------- > New York
|---- > State / Province
|-------- > NY
|---- > Postal Code
|-------- > 10038
|---- > Country
|-------- > US
|---- > Phone
|-------- > +1.2122328280
|---- > Email
|-------- >
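The nested findAll calls in Example 2 can get hard to follow. bs4 also supports CSS selectors through select(), which often reads more cleanly. A small sketch on a hypothetical fragment shaped like the who.is markup above (the rawWhois class name comes from Example 2, but the fragment itself is made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the who.is markup scraped above
html = """
<div class="rawWhois"><div>Summary block (skipped)</div></div>
<div class="rawWhois">
  <div>Registrant Contact Information:</div>
  <div>
    <div>Name</div>
    <div>Sysadmin Team</div>
  </div>
</div>
"""

parse = BeautifulSoup(html, 'html.parser')
sections = parse.select('div.rawWhois')[1:]          # CSS selector; skip the first block
for section in sections:
    divs = section.find_all('div', recursive=False)  # direct children only
    heading = divs[0].text                           # first child div holds the heading
    print('[', heading, ']')
```

Using recursive=False restricts find_all to direct children, so the heading div is not confused with the nested data divs inside the section.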
For the first part,
click here.
For the introduction part,
click here.
I hope you enjoyed this tutorial.
For any query or suggestion,
comment below.
Have a nice day.