Python Beautiful Soup Module - Tutorial - Part 2

Namaste Friends,



This is the second part of our complete Beautiful Soup tutorial series, and in this part, I am going to show some advanced practical examples.


Introduction


As I already told you in my previous tutorials, web scraping is a very interesting topic, and one of the main reasons is the possibilities it opens up: with web scraping, we can create various types of desktop apps that help us manage our online activities. In simple words, web scraping is a technique for collecting data from all over the internet. With the help of web scraping, a user can extract and collect large amounts of data from various online resources. So, here I am providing a complete tutorial to help you understand the usage of the beautifulsoup module. In today's topic, I am going to show you some more advanced usages of the beautifulsoup module. So, let's start. But first, if you are a new visitor to our blog, I definitely suggest you take a look at our index, because there you will find various types of interesting stuff written in Python. For the first part of the beautifulsoup module tutorial, click here.
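
Before the advanced examples, here is a minimal warm-up sketch of the basic bs4 workflow (parse, search, extract); the HTML string in it is made up purely for illustration:

# A made-up snippet of HTML, just to show the basic parse -> search -> extract workflow
from bs4 import BeautifulSoup

html = "<html><body><h1>Hello</h1><a href='/about'>About</a></body></html>"

soup = BeautifulSoup(html, 'html.parser')  # Feed the data to bs4
print soup.h1.text                         # prints: Hello
print soup.a['href']                       # prints: /about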


Practical Examples In IPython Notebook



Input : [1]  

#
# Author:
#       SSB
#       surajsinghbisht054@gmail.com
#       https://bitforestinfo.blogspot.com
#
# Here, I am Using
# 1. Python 2.7         (Python Version)
# 2. BeautifulSoup 4    (bs4 Version)
# 3. Ipython Notebook   (Code Editor)
# 4. Ubuntu             (Operating System)
#
# So, Let's Start With Practical Examples

Input : [2]  

# Example 1.
# 
# Here, We Are Trying To Extract All Links from Webpage
# So, let's start
#
# Import Modules
import bs4
import urllib2, sys

# If a domain is given on the command line, use it;
# otherwise fall back to the hardcoded address (as in this notebook)
if len(sys.argv) == 2:
    url = sys.argv[1]
    if not url.startswith('http'):
        url = 'http://' + url                        # urllib2 needs a url scheme
else:
    url = "https://bitforestinfo.blogspot.com"       # Enter Your Site Address

def parse_url(url):
    try:
        html = urllib2.urlopen(url).read()           # Reading Html Codes
    except Exception as e:
        print "[Error] ", e
        sys.exit(0)
    parse = bs4.BeautifulSoup(html, 'html.parser')   # Feed Data To bs4

    for i in parse.findAll('a'):                     # Searching For Anchor Tags
        if 'href' in i.attrs:                        # Searching For The href Key
            link = i.attrs['href']
            print link
    return

parse_url(url)


Input : [3]  

# Example 2.
# Here, In This Example 
# We will try to scrape data from the who.is website
# so, let's start
#
# Import module
from bs4 import BeautifulSoup
import urllib2, sys

# Who.is Url
url="http://who.is/whois/"

# Website Name
website="www.stackoverflow.com"

# Please Wait Message
print "[*] Please Wait.... Connecting To Who.is Server.."

# Download And Read Html Data
htmldata=urllib2.urlopen(url+website).read()

class_name="rawWhois"  # Class Name For Extraction

# BeautifulSoup Constructor
try: # Check if lxml is installed
    import lxml
    # if installed, then use it
    parse=BeautifulSoup(htmldata,'lxml')
    print "[*] Using lxml Module For Fast Extraction"
except ImportError:
    # if lxml is not installed, fall back to the built-in parser
    parse=BeautifulSoup(htmldata, "html.parser")
    print "[*] Using built-in Html Parser [Slow Extraction. Please Wait ....]"

    
try:
    container=parse.findAll("div",{'class':class_name}) # Extracting Class
    
    sections=container[1:]                              # Remove First Value
    
    for section in sections:                            # iter all values
        
        extract=section.findAll('div')                  # Search for div tag
        
        heading=extract[0].text                         # Extract Text
        
        print '\n[ ',heading,' ]'                       # Heading
        
        for i in extract[1].findAll('div'):             # Find All div Tag
            
            fortab='\t|'                                # print values
            
            for j in i.findAll('div'):
                
                fortab=fortab+'----'
                
                line=j.text.replace('\n', ' ')
                
                print fortab,'>', line
                
except Exception as e:
    print "[ Error ] ", e
    
    print "[ Last Update : 1 Jan 2017 ]"
    
    print "[ Script Is Not Updated ]"
    
    print "[ Sorry! ]"

Output : [3]  

[*] Please Wait.... Connecting To Who.is Server..
[*] Using lxml Module For Fast Extraction

[  Registrant Contact Information:  ]
 |---- > Name
 |-------- > Sysadmin Team
 |---- > Organization
 |-------- > Stack Exchange, Inc.
 |---- > Address
 |-------- > 110 William St , Floor 28
 |---- > City
 |-------- > New York
 |---- > State / Province
 |-------- > NY
 |---- > Postal Code
 |-------- > 10038
 |---- > Country
 |-------- > US
 |---- > Phone
 |-------- > +1.2122328280
 |---- > Email
 |-------- > 

[  Administrative Contact Information:  ]
 |---- > Name
 |-------- > Sysadmin Team
 |---- > Organization
 |-------- > Stack Exchange, Inc.
 |---- > Address
 |-------- > 110 William St , Floor 28
 |---- > City
 |-------- > New York
 |---- > State / Province
 |-------- > NY
 |---- > Postal Code
 |-------- > 10038
 |---- > Country
 |-------- > US
 |---- > Phone
 |-------- > +1.2122328280
 |---- > Email
 |-------- > 

[  Technical Contact Information:  ]
 |---- > Name
 |-------- > Sysadmin Team
 |---- > Organization
 |-------- > Stack Exchange, Inc.
 |---- > Address
 |-------- > 110 William St , Floor 28
 |---- > City
 |-------- > New York
 |---- > State / Province
 |-------- > NY
 |---- > Postal Code
 |-------- > 10038
 |---- > Country
 |-------- > US
 |---- > Phone
 |-------- > +1.2122328280
 |---- > Email
 |-------- > 
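
As a side note, the findAll("div", {'class': class_name}) call in Example 2 can also be written with a CSS selector via select(), which bs4 supports. Here is a minimal sketch; the HTML string is a made-up stand-in for the real who.is markup:

from bs4 import BeautifulSoup

# Made-up stand-in for the who.is page structure
html = """
<div class="rawWhois"><div>Page Header</div></div>
<div class="rawWhois"><div>Registrant Contact Information:</div></div>
"""

parse = BeautifulSoup(html, 'html.parser')

# Equivalent to parse.findAll("div", {'class': "rawWhois"})
for section in parse.select("div.rawWhois"):
    print section.div.text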



For the first part, click here.
For the introduction part, click here.

I hope you enjoyed this tutorial.
For any query or suggestion,
comment below.

Have a nice day.



