Python Beautiful Soup Module - Tutorial - Part 2

Posted by Suraj Singh on February 15, 2017 · 16 mins read
Hi readers,



This is the second part of our complete Beautiful Soup tutorial series, and in this part I am going to show some advanced practical examples.


Introduction


As I mentioned in my previous tutorials, web scraping is a very interesting topic, and one of the main reasons is the possibilities it opens up: with web scraping we can create various kinds of desktop apps that help us manage our online activities. In simple words, web scraping is a technique for collecting data from across the internet. With its help, a user can extract and collect large amounts of data from various online resources. So, here I am providing a complete tutorial that will help you understand the uses of the beautifulsoup module. In today's post, I am going to show you some more advanced usages of the module. Let's start! But first, if you are a new visitor to our blog, I suggest you take a look at our index, because there you will find various kinds of interesting stuff written in Python. For the first part of the beautifulsoup module tutorial, click here.


Practical Examples On IPython Notebook



Input : [1]  

#
# Author:
#
# surajsinghbisht054@gmail.com
# https://www.bitforestinfo.com
#
# Here, I am Using
# 1. Python 2.7 (Python Version)
# 2. BeautifulSoup 4 (bs4 Version)
# 3. Ipython Notebook (Code Editor)
# 4. Ubuntu (Operating System)
#
# So, Let's Start With Practical Examples
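Before running anything, it can help to confirm that your environment matches the setup above. Here is a minimal check of my own (not part of the original notebook); it assumes bs4 is already installed:

# Sanity check (my addition): confirm the interpreter and bs4 versions
import sys
import bs4

print sys.version       # expecting a 2.7.x interpreter
print bs4.__version__   # expecting a 4.x release of Beautiful Soup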

Input : [2]  

# Example 1.
#
# Here, We Are Trying To Extract All Links From A Webpage
# So, let's start
#
# Import Modules
import bs4
import urllib2, sys

if len(sys.argv) == 1:
    print "[*] Please Provide A Domain Name:\n Usage: python link_bs4.py http://www.examplesite.com\n"
    sys.exit(0)

def parse_url(url):
    try:
        html = urllib2.urlopen(url).read()  # Read the HTML source
    except Exception as e:
        print "[Error] ", e
        sys.exit(0)
    parse = bs4.BeautifulSoup(html, 'html.parser')  # Feed the data to bs4

    for i in parse.findAll('a'):        # Search for anchor (link) tags
        if 'href' in i.attrs.keys():    # Check for an href attribute
            link = i.attrs['href']
            print link
    return

parse_url(sys.argv[1])  # Site address taken from the command line
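One caveat with Example 1: href values are often relative paths rather than full URLs. Here is a small sketch of my own, using the standard urlparse module, that resolves them against the page address; the sample links are made up for illustration:

# Sketch (my addition): turning relative href values into absolute URLs
import urlparse

base = "https://www.bitforestinfo.com"
for link in ["/2017/02/page.html", "https://example.com/about"]:
    print urlparse.urljoin(base, link)  # relative paths resolve to full URLs

In the script above, you could wrap each printed link in urlparse.urljoin(url, link) to get the same effect.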


Input : [3]  

# Example 2.
# Here, In This Example,
# We Will Try To Scrape Data From The who.is Website
# So, let's start
#
# Import modules
from bs4 import BeautifulSoup
import urllib2, sys

# Who.is URL
url = "http://who.is/whois/"

# Website Name
website = "www.stackoverflow.com"

# Please Wait Message
print "[*] Please Wait.... Connecting To The who.is Server.."

# Download And Read The HTML Data
htmldata = urllib2.urlopen(url + website).read()

class_name = "rawWhois"  # Class name used for extraction

# BeautifulSoup Constructor
try:
    # If lxml is installed, use it
    import lxml
    parse = BeautifulSoup(htmldata, 'lxml')
    print "[*] Using lxml Module For Fast Extraction"
except ImportError:
    # Otherwise, fall back to the built-in parser
    parse = BeautifulSoup(htmldata, "html.parser")
    print "[*] Using Built-in HTML Parser [Slow Extraction. Please Wait ....]"

try:
    container = parse.findAll("div", {'class': class_name})  # Extract divs by class

    sections = container[1:]  # Skip the first match

    for section in sections:              # Iterate over the remaining sections
        extract = section.findAll('div')  # Search for div tags

        heading = extract[0].text         # Extract the heading text
        print '\n[ ', heading, ' ]'       # Print the heading

        for i in extract[1].findAll('div'):  # Walk the nested div tags
            fortab = '\t|'                   # Indentation prefix for printed values
            for j in i.findAll('div'):
                fortab = fortab + '----'
                line = j.text.replace('\n', ' ')
                print fortab, '>', line

except Exception as e:
    print "[ Error ] ", e
    print "[ Last Update : 1 Jan 2017 ]"
    print "[ Script Is Not Updated ]"
    print "[ Sorry! ]"

Output : [3]  

[*] Please Wait.... Connecting To Who.is Server..
[*] Using lxml Module For Fast Extraction

[ Registrant Contact Information: ]
|---- > Name
|-------- > Sysadmin Team
|---- > Organization
|-------- > Stack Exchange, Inc.
|---- > Address
|-------- > 110 William St , Floor 28
|---- > City
|-------- > New York
|---- > State / Province
|-------- > NY
|---- > Postal Code
|-------- > 10038
|---- > Country
|-------- > US
|---- > Phone
|-------- > +1.2122328280
|---- > Email
|-------- >

[ Administrative Contact Information: ]
|---- > Name
|-------- > Sysadmin Team
|---- > Organization
|-------- > Stack Exchange, Inc.
|---- > Address
|-------- > 110 William St , Floor 28
|---- > City
|-------- > New York
|---- > State / Province
|-------- > NY
|---- > Postal Code
|-------- > 10038
|---- > Country
|-------- > US
|---- > Phone
|-------- > +1.2122328280
|---- > Email
|-------- >

[ Technical Contact Information: ]
|---- > Name
|-------- > Sysadmin Team
|---- > Organization
|-------- > Stack Exchange, Inc.
|---- > Address
|-------- > 110 William St , Floor 28
|---- > City
|-------- > New York
|---- > State / Province
|-------- > NY
|---- > Postal Code
|-------- > 10038
|---- > Country
|-------- > US
|---- > Phone
|-------- > +1.2122328280
|---- > Email
|-------- >
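A side note on the findAll call in Example 2: bs4 gives you a couple of equivalent ways to match on a CSS class. This short illustration is my own and uses toy markup instead of the real who.is page:

# Sketch (my addition): two equivalent class lookups in bs4
from bs4 import BeautifulSoup

html = '<div class="rawWhois"><div>Heading</div></div>'  # toy markup
parse = BeautifulSoup(html, "html.parser")

print parse.select("div.rawWhois")             # CSS selector form
print parse.findAll("div", class_="rawWhois")  # keyword-argument form

Both calls return the same list of matching tags as findAll("div", {'class': class_name}).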



For the first part, click here.
For the introduction part, click here.

I hope you enjoyed this tutorial.
For any query or suggestion,
comment below.

Have a nice day.
