Python Beautiful Soup Module - Tutorial - Part 1

Namaste friends,



This is the first part of our complete Beautiful Soup tutorial series, and in this part, I am going to show you how to use the Beautiful Soup module with some simple practical examples.


Introduction


Web scraping is a very interesting topic. In simple words, web scraping is a technique for collecting data from all over the internet. With the help of web scraping, a user can extract and collect large amounts of data from various online resources. So here I am going to provide a complete tutorial that will help you understand the usage of the BeautifulSoup module. Basically, in today's topic, you will see how to extract various types of tags from raw HTML code, how to find all of a site's links in raw HTML code, and how to find different types of content in HTML.

So, let's start. But first, if you are a new visitor to our blog, then I definitely suggest you take a look at our Index, because there you will find various types of interesting stuff written in Python. For the introduction part of the BeautifulSoup module, click here.


Practical Example On IPython Notebook


Input : [1]  


#
# Author:
#       SSB
#       surajsinghbisht054@gmail.com
#       https://bitforestinfo.blogspot.com
#
#  In This Tutorial,
#  I am going to illustrate the features of
#  the python beautifulsoup module
# 
# Here, I am Using
# 1. Python 2.7         (Python Version)
# 2. BeautifulSoup 4    (bs4 Version)
# 3. Ipython Notebook   (Code Editor)
# 4. Ubuntu             (Operating System)
#
# So, Let's Start



Input : [2]  

# Quick Start
# Here, For This Example, I am Using My Own Small Html Snippet.
html_data = """
<html>
<head><title>BitForestInfo</title></head>
<body>
<p class="title_body" id="stater"><div>This Is Title Body</div></p>
<p class="header_body"><div><h1>This Is Header Body</h1></div></p>
<p class="content_body"><table>This Is Content Body</table></p>
<p class="footer_body"><div>This Is Footer Body</div></p>
"""


Input : [3]  

# First Import Module
from bs4 import BeautifulSoup


Input : [4]  

# now feed data to beautifulsoup
# here, for extraction, i am using python's built-in html.parser.
# But if you want speed, then i suggest you install the
# Python lxml Module
# and, after installing lxml, change "html.parser" into "lxml"
soup = BeautifulSoup(html_data, 'html.parser')
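If you install lxml (it is not part of the standard library, so assume `pip install lxml` first), the only change needed is the parser name. A minimal sketch of both choices:

```python
from bs4 import BeautifulSoup

snippet = "<html><head><title>BitForestInfo</title></head></html>"

# Built-in parser: always available, no extra install needed.
soup = BeautifulSoup(snippet, "html.parser")
print(soup.title.string)  # BitForestInfo

# Faster alternative (assumes the lxml package is installed):
# soup = BeautifulSoup(snippet, "lxml")
```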


Input : [5]  

# Printing the Fed Data in a Well-Formatted Way.
print soup.prettify()


Output : [5]  

<html>
 <head>
  <title>
   BitForestInfo
  </title>
 </head>
 <body>
  <p class="title_body" id="stater">
   <div>
    This Is Title Body
   </div>
  </p>
  <p class="header_body">
   <div>
    <h1>
     This Is Header Body
    </h1>
   </div>
  </p>
  <p class="content_body">
   <table>
    This Is Content Body
   </table>
  </p>
  <p class="footer_body">
   <div>
    This Is Footer Body
   </div>
  </p>
 </body>
</html>

Input : [6]  

# Let's See some simple ways to navigate that data structure
# For Title
print soup.title

Output : [6]  

<title>BitForestInfo</title>

Input : [7]  

# For Title String
print soup.title.string

Output : [7]  

BitForestInfo

Input : [8]  

# For Extracting the Contents As a List
print soup.title.contents

Output : [8]  

[u'BitForestInfo']

Input : [9]  

# For Title Parent
print soup.title.parent

Output : [9]  

<head><title>BitForestInfo</title></head>

Input : [10]  

# For the Title's Parent Tag Name
print soup.title.parent.name

Output : [10]  

head

Input : [11]  

# For Body
print soup.body

Output : [11]  

<body>
<p class="title_body" id="stater"><div>This Is Title Body</div></p>
<p class="header_body"><div><h1>This Is Header Body</h1></div></p>
<p class="content_body"><table>This Is Content Body</table></p>
<p class="footer_body"><div>This Is Footer Body</div></p>
</body>

Input : [12]  

# For the First <p> Tag's String in <body>
print soup.body.p.string

Output : [12]  

This Is Title Body

Input : [13]  

# For tag attributes
print soup.p.attrs

Output : [13]  

{u'class': [u'title_body'], u'id': u'stater'}

Input : [14]  

# or directly
print soup.p['class']

Output : [14]  

[u'title_body']

Input : [15]  

# Find Tags With Attrs
print soup.find(id="stater") # Using Id But Without Tag Name
print soup.find(class_="title_body") # Using Class Name

Output : [15]  

<p class="title_body" id="stater"><div>This Is Title Body</div></p>
<p class="title_body" id="stater"><div>This Is Title Body</div></p>

Input : [16]  

print soup.find("",{'id':"stater"}) # Using Id  But Without Tag Name

Output : [16]  

<p class="title_body" id="stater"><div>This Is Title Body</div></p>

Input : [17]  

print soup.find("", {"class":"title_body"}) # Using Class But Without Tag Name

Output : [17]  

<p class="title_body" id="stater"><div>This Is Title Body</div></p>

Input : [18]  

print soup.find("", {"class":"title_body",'id':"stater"}) # Using Class and id both But Without Tag Name

Output : [18]  

<p class="title_body" id="stater"><div>This Is Title Body</div></p>

Input : [19]  

print soup.find("p")  # Using Tag Name

Output : [19]  

<p class="title_body" id="stater"><div>This Is Title Body</div></p>

Input : [20]  

print soup.find("p", {"class":"title_body",'id':"stater"}) # Using Class and id both With Tag Name

Output : [20]  

<p class="title_body" id="stater"><div>This Is Title Body</div></p>

Input : [21]  

print soup.findAll("p")  # Using findAll For All Tags

Output : [21]  

[<p class="title_body" id="stater"><div>This Is Title Body</div></p>, <p class="header_body"><div><h1>This Is Header Body</h1></div></p>, <p class="content_body"><table>This Is Content Body</table></p>, <p class="footer_body"><div>This Is Footer Body</div></p>]

Input : [22]  

# Let's Try To Extract All <p> Tags' Strings
for link in soup.find_all('p'):
    print link.string

Output : [22]  

This Is Title Body
This Is Header Body
This Is Content Body
This Is Footer Body
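One caveat worth noting here (not shown in the cell above): `tag.string` only works when a tag has exactly one child. With mixed content it returns None, so `get_text()` is the safer choice. A small sketch:

```python
from bs4 import BeautifulSoup

# A <p> with two children: a text node and a <b> tag.
p = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser").p
print(p.string)      # None - the tag has more than one child
print(p.get_text())  # Hello world
```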

Input : [23]  

# Get All Text Of Html Code
print soup.text

Output : [23]  

BitForestInfo

This Is Title Body
This Is Header Body
This Is Content Body
This Is Footer Body

Input : [24]  

# Get All Text As Stripped Strings, One Per Line
for line in soup.stripped_strings:
    print line

Output : [24]  

BitForestInfo
This Is Title Body
This Is Header Body
This Is Content Body
This Is Footer Body

Input : [25]  

# Now, Let's Look At Some Tag Types
example_soup = BeautifulSoup('<b class="firstclass">Try to learn</b>', "html.parser")
tag = example_soup.b
print type(tag)

Output : [25]  

<class 'bs4.element.Tag'>

Input : [26]  

tag = example_soup.b.string
print type(tag)

Output : [26]  

<class 'bs4.element.NavigableString'>

Input : [27]  

example_soup = BeautifulSoup('<b class="secondclass"><!--This Is Comment--></b>', "html.parser")
tag = example_soup.b.string
print type(tag)

Output : [27]  

<class 'bs4.element.Comment'>

Input : [28]  

print example_soup.prettify()

Output : [28]  

<b class="secondclass">
 <!--This Is Comment-->
</b>

Input : [29]  

# Now, Let's Check Some Real Examples
import urllib2

# Download Html Codes
data = urllib2.urlopen("https://bitforestinfo.blogspot.com").read()
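Note that urllib2 exists only in Python 2; in Python 3 the same call lives in urllib.request. A hedged sketch that runs on either version (network access assumed, so the download is kept behind a main guard):

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen         # Python 2

def fetch(url):
    """Download and return the raw HTML bytes of a page."""
    return urlopen(url).read()

if __name__ == "__main__":
    print(len(fetch("https://bitforestinfo.blogspot.com")))
```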

Input : [30]  

print len(data)

Output : [30]  

121328

Input : [31]  

# Create a BeautifulSoup Object
blog = BeautifulSoup(data, "html.parser")


Input : [32]  

# Now, Extract Title
print blog.title.string

Output : [32]  

Bitforestinfo

Input : [33]  

# Find Links
print blog.find('a')
# And Then Its Attr
print blog.find('a')['href']

Output : [33]  

<a href="https://bitforestinfo.blogspot.in/" style="display: block">
<img alt="Bitforestinfo" height="133px; " id="Header1_headerimg" src="//2.bp.blogspot.com/-0-iqsHblJI8/WG5eFDrd_nI/AAAAAAAAAos/yxe5BVZJ8BEhGVTpCiuhk3XUhJyWKTIZACK4B/s1600/logo.png" style="display: block" width="200px; ">
</img></a>
https://bitforestinfo.blogspot.in/

Input : [34]  

# Find all h1,h2,h3  tags
print blog.findAll({'h1','h2','h3'})

Output : [34]  

[<h2>Index</h2>, <h2>Pages</h2>, <h2 class="date-header"><span>Thursday, 9 February 2017</span></h2>, <h3 class="post-title entry-title" itemprop="name">\n<a href="https://bitforestinfo.blogspot.in/2017/02/how-to-use-python-beautiful-soup-module.html">How to Use Python Beautiful Soup Module - Complete Beautiful Soup Tutorial - Introduction</a>\n</h3>, <h2 class="title">Request</h2>, <h1>Back to Top &gt;&gt;&gt; </h1>, <h2 class="title">Search This Blog</h2>, <h2 class="title">Translate</h2>, <h2 class="title">Follow by Email</h2>, <h2 class="title">Subscribe To</h2>, <h2 class="title">Google+ Badge</h2>, <h2 class="title">Join Facebook</h2>, <h2 class="title">On Facebook</h2>, <h2 class="title">Share</h2>, <h2>What Other Peoples Reading</h2>, <h2>Labels</h2>, <h2 class="title">Followers</h2>, <h2 class="title">Google+ Followers</h2>, <h2>Contributors</h2>, <h2 class="title">Footer</h2>, <h2>Links</h2>, <h2>About</h2>, <h3>My Self Suraj Singh</h3>, <h2>Web Tools</h2>, <h2 class="title">Subscribe</h2>, <h2 class="title">request</h2>, <h2 class="title">Follow by Email</h2>, <h2 class="title">Google+ Followers</h2>, <h2 class="title">Contact Us</h2>, <h2>Blog Archive</h2>, <h2 class="title">Please Provide Your Suggestions Through Comments</h2>]
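The set above works because find_all accepts any iterable of tag names; the form shown in the Beautiful Soup documentation is a list. A quick sketch with a toy document:

```python
from bs4 import BeautifulSoup

doc = "<h1>A</h1><p>skip me</p><h2>B</h2><h3>C</h3>"
soup = BeautifulSoup(doc, "html.parser")

# Matching tags come back in document order.
print([t.name for t in soup.find_all(["h1", "h2", "h3"])])  # ['h1', 'h2', 'h3']
```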

Input : [35]  

# For Extracting String 
for i in blog.findAll({'h1','h2','h3'}):
    print i.get_text()

Output : [35]  

Index
Pages
Thursday, 9 February 2017

How to Use Python Beautiful Soup Module - Complete Beautiful Soup Tutorial - Introduction

Request
Back to Top >>> 
Search This Blog
Translate
Follow by Email
Subscribe To
Google+ Badge
Join Facebook
On Facebook
Share
What Other Peoples Reading
Labels
Followers
Google+ Followers
Contributors
Footer
Links
About
My Self Suraj Singh
Web Tools
Subscribe
request
Follow by Email
Google+ Followers
Contact Us
Blog Archive
Please Provide Your Suggestions Through Comments

Input : [36]  

# For Extracting All Image Source Links 
for i in blog.findAll({'img'}):
    print i['src']

Output : [36]  

//2.bp.blogspot.com/-0-iqsHblJI8/WG5eFDrd_nI/AAAAAAAAAos/yxe5BVZJ8BEhGVTpCiuhk3XUhJyWKTIZACK4B/s1600/logo.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://2.bp.blogspot.com/-g79HDRM---E/WJyyCctqLwI/AAAAAAAAAvU/Vvq-Hw6M-3w0lrSDNNXdaUpS9mTcj9wlACLcB/s320/python.png
https://2.bp.blogspot.com/-6mcTMlpcGPg/WJyyCY9j7aI/AAAAAAAAAvQ/T9pJhZolgsUpabr1bbUJvdpOiefAHyV1ACLcB/s320/Python-Programming-Language.png
https://resources.blogblog.com/img/icon18_email.gif
https://resources.blogblog.com/img/icon18_edit_allbkg.gif
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://img2.blogblog.com/img/widgets/arrow_dropdown.gif
https://img1.blogblog.com/img/icon_feed12.png
https://img1.blogblog.com/img/widgets/subscribe-netvibes.png
https://img1.blogblog.com/img/widgets/subscribe-yahoo.png
https://img1.blogblog.com/img/icon_feed12.png
https://img2.blogblog.com/img/widgets/arrow_dropdown.gif
https://img1.blogblog.com/img/icon_feed12.png
https://img2.blogblog.com/img/widgets/arrow_dropdown.gif
https://img1.blogblog.com/img/icon_feed12.png
https://img1.blogblog.com/img/widgets/subscribe-netvibes.png
https://img1.blogblog.com/img/widgets/subscribe-yahoo.png
https://img1.blogblog.com/img/icon_feed12.png
https://img2.blogblog.com/img/widgets/arrow_dropdown.gif
https://img1.blogblog.com/img/icon_feed12.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://2.bp.blogspot.com/-obs52hAZL-E/WGD8_x9caLI/AAAAAAAAAhI/FgreOO7QK3Mkk8XRRNwo0cdPzLCOJNZXwCPcB/w72-h72-p-k-nu/pyt.png
https://4.bp.blogspot.com/-qn2ByjJxCs4/V8fmcLjnjJI/AAAAAAAAANI/glw5-c4ZOLoyo9qU1udd5XytBywa2gAWACLcB/w72-h72-p-k-nu/Screenshot%2Bfrom%2B2016-09-01%2B09-39-26.png
https://2.bp.blogspot.com/-obs52hAZL-E/WGD8_x9caLI/AAAAAAAAAhI/FgreOO7QK3Mkk8XRRNwo0cdPzLCOJNZXwCPcB/w72-h72-p-k-nu/pyt.png
https://3.bp.blogspot.com/-qT9FCTTPqu0/WHtLLx724RI/AAAAAAAAAtk/Y4aNQQako7k9FvS8QLmG6mGPjxFaIVH9gCEw/w72-h72-p-k-nu/home.png
https://2.bp.blogspot.com/-3qk81CD2Dyc/WG3tIeZhwDI/AAAAAAAAAn0/JLvVHbPUHC8o7sgvLdHxpDSA5VUNO4ZewCEw/w72-h72-p-k-nu/Screenshot%2Bfrom%2B2017-01-05%2B12-15-15.png
https://4.bp.blogspot.com/-VZUmAvbYz1A/WG-4ckwTqaI/AAAAAAAAAp4/SEgfINpX8I8kAWmc7fKGFf8y1CPlw0wqwCLcB/w72-h72-p-k-nu/test1.png
https://2.bp.blogspot.com/-obs52hAZL-E/WGD8_x9caLI/AAAAAAAAAhI/FgreOO7QK3Mkk8XRRNwo0cdPzLCOJNZXwCPcB/w72-h72-p-k-nu/pyt.png
https://3.bp.blogspot.com/-P9CY0WHbSLA/VpjgNwWfQ0I/AAAAAAAAABs/sWoXG9iBXts_NDLLMfxF6HzR7jXDPjI3ACPcB/w72-h72-p-k-nu/CISE-Exam-Image.jpg
https://2.bp.blogspot.com/-obs52hAZL-E/WGD8_x9caLI/AAAAAAAAAhI/FgreOO7QK3Mkk8XRRNwo0cdPzLCOJNZXwCPcB/w72-h72-p-k-nu/pyt.png
https://4.bp.blogspot.com/-rF0mXSzWcKY/Vz36AVhDgNI/AAAAAAAAADY/ZKegtQi0fEArmMbbNE_baX9dDanPJoJ_gCPcB/w72-h72-p-k-nu/download.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
//2.bp.blogspot.com/-0-iqsHblJI8/WG5eFDrd_nI/AAAAAAAAAos/yxe5BVZJ8BEhGVTpCiuhk3XUhJyWKTIZACK4B/s1600/logo.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://img2.blogblog.com/img/widgets/arrow_dropdown.gif
https://img1.blogblog.com/img/icon_feed12.png
https://img1.blogblog.com/img/widgets/subscribe-netvibes.png
https://img1.blogblog.com/img/widgets/subscribe-yahoo.png
https://img1.blogblog.com/img/icon_feed12.png
https://img2.blogblog.com/img/widgets/arrow_dropdown.gif
https://img1.blogblog.com/img/icon_feed12.png
https://img2.blogblog.com/img/widgets/arrow_dropdown.gif
https://img1.blogblog.com/img/icon_feed12.png
https://img1.blogblog.com/img/widgets/subscribe-netvibes.png
https://img1.blogblog.com/img/widgets/subscribe-yahoo.png
https://img1.blogblog.com/img/icon_feed12.png
https://img2.blogblog.com/img/widgets/arrow_dropdown.gif
https://img1.blogblog.com/img/icon_feed12.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
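A caveat for the `i['src']` lookup above: if an `<img>` tag has no src attribute, subscripting raises a KeyError, while `Tag.get()` quietly returns None. A minimal sketch:

```python
from bs4 import BeautifulSoup

doc = '<img src="a.png"><img alt="no source here">'
soup = BeautifulSoup(doc, "html.parser")
for img in soup.find_all("img"):
    print(img.get("src"))  # "a.png" for the first tag, None for the second
```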

Input : [37]  

# For Extracting a Table's Children
for i in blog.body.table.children:
    print i

Output : [37]  

<tr>
<td class="reactions-label-cell" nowrap="nowrap" valign="top" width="1%">
<span class="reactions-label">
Reactions:</span> </td>
<td><iframe allowtransparency="true" class="reactions-iframe" frameborder="0" name="reactions" scrolling="no" src="https://www.blogger.com/blog-post-reactions.g?options=%5Bfunny,+interesting,+cool%5D&amp;textColor=%23959595#https://bitforestinfo.blogspot.com/2017/02/how-to-use-python-beautiful-soup-module.html"></iframe></td>
</tr>





For the next part, click here.

I hope you enjoyed this tutorial.
For any query or suggestion,
comment below.

Have a nice day.


