How to Use Python Beautiful Soup Module - Complete Beautiful Soup Tutorial - Part 1 | python web scraping | python example

Namaste friends,



                   This is the first part of our complete Beautiful Soup tutorial series. In this part, I am going to show you how to use the Beautiful Soup module with practical examples.

Hmm, let's talk about today's topic.
In today's topic you will see how to extract tags from HTML code, how to find links in HTML code,
and how to find different parts of an HTML document easily with Beautiful Soup.

So, let's start. But first, if you are a new visitor to our blog, then I suggest you check our index, because there you can find many interesting things written in Python, or for the introduction part click here.


In [1]:
#
# Author:
#       SSB
#       surajsinghbisht054@gmail.com
#       https://bitforestinfo.blogspot.com
#
#  In This Tutorial,
#  I Am Going To Illustrate The Features Of
#  The Python BeautifulSoup Module
# 
# Here, I am Using
# 1. Python 2.7         (Python Version)
# 2. BeautifulSoup 4    (bs4 Version)
# 3. Ipython Notebook   (Code Editor)
# 4. Ubuntu             (Operating System)
#
# So, Let's Start
In [26]:
# Quick Start
# Here, For This Example, I Am Using My Own Small HTML Snippet.
html_data = """
<html>
<head><title>BitForestInfo</title></head>
<body>
<p class="title_body" id="stater"><div>This Is Title Body</div></p>
<p class="header_body"><div><h1>This Is Header Body</h1></div></p>
<p class="content_body"><table>This Is Content Body</table></p>
<p class="footer_body"><div>This Is Footer Body</div></p>
</body>
</html>
"""
In [27]:
# First Import Module
from bs4 import BeautifulSoup
In [28]:
# Now, feed the data to BeautifulSoup.
# Here, for parsing, I am using Python's built-in html.parser.
# But if you want speed, then I suggest you install the
# Python lxml module
# and, after installing lxml, change "html.parser" to "lxml".
soup = BeautifulSoup(html_data, 'html.parser')
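By the way, if you want your script to use lxml when it is available but still work without it, a small helper can try lxml first and fall back to html.parser. A minimal sketch (the `make_soup` name and the demo markup are just mine for this example):

```python
from bs4 import BeautifulSoup

def make_soup(markup):
    # Prefer the faster lxml parser when it is installed,
    # otherwise fall back to Python's built-in html.parser.
    try:
        import lxml  # noqa: F401
        return BeautifulSoup(markup, 'lxml')
    except ImportError:
        return BeautifulSoup(markup, 'html.parser')

soup = make_soup('<p class="demo">hello</p>')
print(soup.p.string)
```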
In [29]:
# Printing The Parsed Data In A Well-Formatted Way.
print soup.prettify()
<html>
 <head>
  <title>
   BitForestInfo
  </title>
 </head>
 <body>
  <p class="title_body" id="stater">
   <div>
    This Is Title Body
   </div>
  </p>
  <p class="header_body">
   <div>
    <h1>
     This Is Header Body
    </h1>
   </div>
  </p>
  <p class="content_body">
   <table>
    This Is Content Body
   </table>
  </p>
  <p class="footer_body">
   <div>
    This Is Footer Body
   </div>
  </p>
 </body>
</html>
In [30]:
# Let's See Some Simple Ways To Navigate This Data Structure
# For Title
print soup.title
<title>BitForestInfo</title>
In [31]:
# For Title String
print soup.title.string
BitForestInfo
In [32]:
# For Extracting The Contents As A List
print soup.title.contents
[u'BitForestInfo']
In [33]:
# For Title Parent
print soup.title.parent
<head><title>BitForestInfo</title></head>
In [34]:
# For Title  Parent Name
print soup.title.parent.name
head
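And `.parent` can be chained all the way up: the `.parents` generator walks from a tag to the document root. A quick sketch, using a trimmed copy of the same `html_data` snippet:

```python
from bs4 import BeautifulSoup

html_data = "<html><head><title>BitForestInfo</title></head><body></body></html>"
soup = BeautifulSoup(html_data, 'html.parser')

# Walk upwards from <title> and collect each ancestor's name.
# The BeautifulSoup object itself is named '[document]'.
names = [parent.name for parent in soup.title.parents]
print(names)
```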
In [35]:
# For Body
print soup.body
<body>
<p class="title_body" id="stater"><div>This Is Title Body</div></p>
<p class="header_body"><div><h1>This Is Header Body</h1></div></p>
<p class="content_body"><table>This Is Content Body</table></p>
<p class="footer_body"><div>This Is Footer Body</div></p>
</body>
In [36]:
# For The First <p> Tag's String Inside Body
print soup.body.p.string
This Is Title Body
In [37]:
# For Tag Attributes
print soup.p.attrs
{u'class': [u'title_body'], u'id': u'stater'}
In [38]:
# or directly
print soup.p['class']
[u'title_body']
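One thing to note: `tag['missing_attr']` raises a KeyError. If you are not sure an attribute exists, `Tag.get()` works like `dict.get()` and returns None (or a default) instead. A small sketch with the same kind of tag:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title_body" id="stater">Hi</p>', 'html.parser')

print(soup.p.get('id'))         # 'stater'
print(soup.p.get('href'))       # None instead of a KeyError
print(soup.p.get('href', '#'))  # a default value
```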
In [98]:
# Find Tags With Attrs
print soup.find(id="stater") # Using Id But Without Tag Name
print soup.find(class_="title_body") # Using Class Name
<p class="title_body" id="stater"><div>This Is Title Body</div></p>
<p class="title_body" id="stater"><div>This Is Title Body</div></p>
In [53]:
print soup.find("",{'id':"stater"}) # Using Id  But Without Tag Name
<p class="title_body" id="stater"><div>This Is Title Body</div></p>
In [54]:
print soup.find("", {"class":"title_body"}) # Using Class But Without Tag Name
<p class="title_body" id="stater"><div>This Is Title Body</div></p>
In [55]:
print soup.find("", {"class":"title_body",'id':"stater"}) # Using Class and id both But Without Tag Name
<p class="title_body" id="stater"><div>This Is Title Body</div></p>
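If you already know CSS, `select()` and `select_one()` accept CSS selectors and can express the same class-and-id filter in one string. A minimal sketch (the `<b>Hello</b>` child is made up for this demo):

```python
from bs4 import BeautifulSoup

html_data = '<p class="title_body" id="stater"><b>Hello</b></p>'
soup = BeautifulSoup(html_data, 'html.parser')

# select() takes a CSS selector and returns a list of matches;
# select_one() returns only the first match.
matches = soup.select('p.title_body#stater')
print(len(matches))
print(soup.select_one('#stater b').string)
```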
In [56]:
print soup.find("p")  # Using Tag Name
<p class="title_body" id="stater"><div>This Is Title Body</div></p>
In [57]:
print soup.find("p", {"class":"title_body",'id':"stater"}) # Using Class and id both With Tag Name
<p class="title_body" id="stater"><div>This Is Title Body</div></p>
In [58]:
print soup.findAll("p")  # Using findAll For All Tags
[<p class="title_body" id="stater"><div>This Is Title Body</div></p>, <p class="header_body"><div><h1>This Is Header Body</h1></div></p>, <p class="content_body"><table>This Is Content Body</table></p>, <p class="footer_body"><div>This Is Footer Body</div></p>]
In [62]:
# Let's Try To Extract All <p> Tag Strings
for link in soup.find_all('p'):
    print link.string
This Is Title Body
This Is Header Body
This Is Content Body
This Is Footer Body
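A caution about `.string`: it only works when a tag has a single nested string. If a tag mixes text with other tags, `.string` is defined to be None, and `get_text()` is the safer choice. A tiny sketch with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><b>Bold</b> and plain</p>', 'html.parser')

print(soup.p.string)      # None: <p> has more than one child
print(soup.p.get_text())  # 'Bold and plain'
```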
In [64]:
# Get All Text Of Html Code
print soup.text

BitForestInfo

This Is Title Body
This Is Header Body
This Is Content Body
This Is Footer Body

In [69]:
# Get All Text, Line By Line, With Whitespace Stripped
for line in soup.stripped_strings:
    print line
BitForestInfo
This Is Title Body
This Is Header Body
This Is Content Body
This Is Footer Body
In [76]:
# Now, Let's Check Some Object Types
example_soup = BeautifulSoup('<b class="firstclass">Try to learn</b>', "html.parser")
tag = example_soup.b
print type(tag)
<class 'bs4.element.Tag'>
In [77]:
tag = example_soup.b.string
print type(tag)
<class 'bs4.element.NavigableString'>
In [79]:
example_soup = BeautifulSoup('<b class="secondclass"><!--This Is Comment--></b>', "html.parser")
tag = example_soup.b.string
print type(tag)
<class 'bs4.element.Comment'>
In [80]:
print example_soup.prettify()
<b class="secondclass">
 <!--This Is Comment-->
</b>
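Since `Comment` is a subclass of `NavigableString`, you can use `isinstance` to separate comments from ordinary text when walking a tag's contents. A minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

soup = BeautifulSoup('<b><!--This Is Comment-->visible</b>', 'html.parser')

# Separate comment nodes from ordinary text nodes.
comments = [c for c in soup.b.contents if isinstance(c, Comment)]
texts = [c for c in soup.b.contents if not isinstance(c, Comment)]
print(comments)
print(texts)
```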
In [82]:
# Now, Let's Check Some Real Examples
import urllib2

# Download Html Codes
data = urllib2.urlopen("https://bitforestinfo.blogspot.com").read()
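A side note on versions: in Python 3, `urllib2` was merged into `urllib.request`. Here is a sketch of the equivalent import, using an offline `data:` URL in place of the real blog so the demo runs without network access:

```python
from urllib.request import urlopen  # Python 2 equivalent: urllib2.urlopen
from bs4 import BeautifulSoup

# A data: URL stands in for a real page address in this offline demo.
data = urlopen("data:text/html,<title>BitForestInfo</title>").read()
soup = BeautifulSoup(data, "html.parser")
print(soup.title.string)
```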
In [83]:
print len(data)
121328
In [84]:
# Create BeautifulSoup Object
blog = BeautifulSoup(data, "html.parser")
In [99]:
# Now, Extract Title
print blog.title.string
Bitforestinfo
In [94]:
# Find Links
print blog.find('a')
# And Then Its Attr
print blog.find('a')['href']
<a href="https://bitforestinfo.blogspot.in/" style="display: block">
<img alt="Bitforestinfo" height="133px; " id="Header1_headerimg" src="//2.bp.blogspot.com/-0-iqsHblJI8/WG5eFDrd_nI/AAAAAAAAAos/yxe5BVZJ8BEhGVTpCiuhk3XUhJyWKTIZACK4B/s1600/logo.png" style="display: block" width="200px; ">
</img></a>
https://bitforestinfo.blogspot.in/
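The same pattern scales to every link on the page. Here is a sketch against a small made-up snippet (so it does not depend on the live blog), skipping any `<a>` tag that has no href attribute:

```python
from bs4 import BeautifulSoup

html_data = """
<a href="https://bitforestinfo.blogspot.in/">Home</a>
<a href="/2017/02/post.html">Post</a>
<a name="anchor-without-href">Anchor</a>
"""
soup = BeautifulSoup(html_data, 'html.parser')

# get('href') returns None for tags without that attribute,
# so the filter drops the bare anchor.
links = [a.get('href') for a in soup.find_all('a') if a.get('href')]
print(links)
```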
In [95]:
# Find all h1,h2,h3  tags
print blog.findAll({'h1','h2','h3'})
[<h2>Index</h2>, <h2>Pages</h2>, <h2 class="date-header"><span>Thursday, 9 February 2017</span></h2>, <h3 class="post-title entry-title" itemprop="name">\n<a href="https://bitforestinfo.blogspot.in/2017/02/how-to-use-python-beautiful-soup-module.html">How to Use Python Beautiful Soup Module - Complete Beautiful Soup Tutorial - Introduction</a>\n</h3>, <h2 class="title">Request</h2>, <h1>Back to Top &gt;&gt;&gt; </h1>, <h2 class="title">Search This Blog</h2>, <h2 class="title">Translate</h2>, <h2 class="title">Follow by Email</h2>, <h2 class="title">Subscribe To</h2>, <h2 class="title">Google+ Badge</h2>, <h2 class="title">Join Facebook</h2>, <h2 class="title">On Facebook</h2>, <h2 class="title">Share</h2>, <h2>What Other Peoples Reading</h2>, <h2>Labels</h2>, <h2 class="title">Followers</h2>, <h2 class="title">Google+ Followers</h2>, <h2>Contributors</h2>, <h2 class="title">Footer</h2>, <h2>Links</h2>, <h2>About</h2>, <h3>My Self Suraj Singh</h3>, <h2>Web Tools</h2>, <h2 class="title">Subscribe</h2>, <h2 class="title">request</h2>, <h2 class="title">Follow by Email</h2>, <h2 class="title">Google+ Followers</h2>, <h2 class="title">Contact Us</h2>, <h2>Blog Archive</h2>, <h2 class="title">Please Provide Your Suggestions Through Comments</h2>]
In [97]:
# For Extracting String 
for i in blog.findAll({'h1','h2','h3'}):
    print i.get_text()
Index
Pages
Thursday, 9 February 2017

How to Use Python Beautiful Soup Module - Complete Beautiful Soup Tutorial - Introduction

Request
Back to Top >>> 
Search This Blog
Translate
Follow by Email
Subscribe To
Google+ Badge
Join Facebook
On Facebook
Share
What Other Peoples Reading
Labels
Followers
Google+ Followers
Contributors
Footer
Links
About
My Self Suraj Singh
Web Tools
Subscribe
request
Follow by Email
Google+ Followers
Contact Us
Blog Archive
Please Provide Your Suggestions Through Comments
In [108]:
# For Extracting All Image Source Links 
for i in blog.findAll({'img'}):
    print i['src']
//2.bp.blogspot.com/-0-iqsHblJI8/WG5eFDrd_nI/AAAAAAAAAos/yxe5BVZJ8BEhGVTpCiuhk3XUhJyWKTIZACK4B/s1600/logo.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://2.bp.blogspot.com/-g79HDRM---E/WJyyCctqLwI/AAAAAAAAAvU/Vvq-Hw6M-3w0lrSDNNXdaUpS9mTcj9wlACLcB/s320/python.png
https://2.bp.blogspot.com/-6mcTMlpcGPg/WJyyCY9j7aI/AAAAAAAAAvQ/T9pJhZolgsUpabr1bbUJvdpOiefAHyV1ACLcB/s320/Python-Programming-Language.png
https://resources.blogblog.com/img/icon18_email.gif
https://resources.blogblog.com/img/icon18_edit_allbkg.gif
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://img2.blogblog.com/img/widgets/arrow_dropdown.gif
https://img1.blogblog.com/img/icon_feed12.png
https://img1.blogblog.com/img/widgets/subscribe-netvibes.png
https://img1.blogblog.com/img/widgets/subscribe-yahoo.png
https://img1.blogblog.com/img/icon_feed12.png
https://img2.blogblog.com/img/widgets/arrow_dropdown.gif
https://img1.blogblog.com/img/icon_feed12.png
https://img2.blogblog.com/img/widgets/arrow_dropdown.gif
https://img1.blogblog.com/img/icon_feed12.png
https://img1.blogblog.com/img/widgets/subscribe-netvibes.png
https://img1.blogblog.com/img/widgets/subscribe-yahoo.png
https://img1.blogblog.com/img/icon_feed12.png
https://img2.blogblog.com/img/widgets/arrow_dropdown.gif
https://img1.blogblog.com/img/icon_feed12.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://2.bp.blogspot.com/-obs52hAZL-E/WGD8_x9caLI/AAAAAAAAAhI/FgreOO7QK3Mkk8XRRNwo0cdPzLCOJNZXwCPcB/w72-h72-p-k-nu/pyt.png
https://4.bp.blogspot.com/-qn2ByjJxCs4/V8fmcLjnjJI/AAAAAAAAANI/glw5-c4ZOLoyo9qU1udd5XytBywa2gAWACLcB/w72-h72-p-k-nu/Screenshot%2Bfrom%2B2016-09-01%2B09-39-26.png
https://2.bp.blogspot.com/-obs52hAZL-E/WGD8_x9caLI/AAAAAAAAAhI/FgreOO7QK3Mkk8XRRNwo0cdPzLCOJNZXwCPcB/w72-h72-p-k-nu/pyt.png
https://3.bp.blogspot.com/-qT9FCTTPqu0/WHtLLx724RI/AAAAAAAAAtk/Y4aNQQako7k9FvS8QLmG6mGPjxFaIVH9gCEw/w72-h72-p-k-nu/home.png
https://2.bp.blogspot.com/-3qk81CD2Dyc/WG3tIeZhwDI/AAAAAAAAAn0/JLvVHbPUHC8o7sgvLdHxpDSA5VUNO4ZewCEw/w72-h72-p-k-nu/Screenshot%2Bfrom%2B2017-01-05%2B12-15-15.png
https://4.bp.blogspot.com/-VZUmAvbYz1A/WG-4ckwTqaI/AAAAAAAAAp4/SEgfINpX8I8kAWmc7fKGFf8y1CPlw0wqwCLcB/w72-h72-p-k-nu/test1.png
https://2.bp.blogspot.com/-obs52hAZL-E/WGD8_x9caLI/AAAAAAAAAhI/FgreOO7QK3Mkk8XRRNwo0cdPzLCOJNZXwCPcB/w72-h72-p-k-nu/pyt.png
https://3.bp.blogspot.com/-P9CY0WHbSLA/VpjgNwWfQ0I/AAAAAAAAABs/sWoXG9iBXts_NDLLMfxF6HzR7jXDPjI3ACPcB/w72-h72-p-k-nu/CISE-Exam-Image.jpg
https://2.bp.blogspot.com/-obs52hAZL-E/WGD8_x9caLI/AAAAAAAAAhI/FgreOO7QK3Mkk8XRRNwo0cdPzLCOJNZXwCPcB/w72-h72-p-k-nu/pyt.png
https://4.bp.blogspot.com/-rF0mXSzWcKY/Vz36AVhDgNI/AAAAAAAAADY/ZKegtQi0fEArmMbbNE_baX9dDanPJoJ_gCPcB/w72-h72-p-k-nu/download.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
//2.bp.blogspot.com/-0-iqsHblJI8/WG5eFDrd_nI/AAAAAAAAAos/yxe5BVZJ8BEhGVTpCiuhk3XUhJyWKTIZACK4B/s1600/logo.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://img2.blogblog.com/img/widgets/arrow_dropdown.gif
https://img1.blogblog.com/img/icon_feed12.png
https://img1.blogblog.com/img/widgets/subscribe-netvibes.png
https://img1.blogblog.com/img/widgets/subscribe-yahoo.png
https://img1.blogblog.com/img/icon_feed12.png
https://img2.blogblog.com/img/widgets/arrow_dropdown.gif
https://img1.blogblog.com/img/icon_feed12.png
https://img2.blogblog.com/img/widgets/arrow_dropdown.gif
https://img1.blogblog.com/img/icon_feed12.png
https://img1.blogblog.com/img/widgets/subscribe-netvibes.png
https://img1.blogblog.com/img/widgets/subscribe-yahoo.png
https://img1.blogblog.com/img/icon_feed12.png
https://img2.blogblog.com/img/widgets/arrow_dropdown.gif
https://img1.blogblog.com/img/icon_feed12.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
https://resources.blogblog.com/img/icon18_wrench_allbkg.png
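Notice that some of these src values are protocol-relative (they start with //). To turn them into full URLs, `urljoin` resolves each one against the page address. In Python 2 it lives in the `urlparse` module; in Python 3 it is `urllib.parse.urljoin`. A minimal sketch:

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

base = "https://bitforestinfo.blogspot.com/"

# Resolve protocol-relative and site-relative src values against the page URL.
print(urljoin(base, "//2.bp.blogspot.com/s1600/logo.png"))
print(urljoin(base, "/img/icon.png"))
```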
In [114]:
# For Extracting Table Children
for i in blog.body.table.children:
    print i
<tr>
<td class="reactions-label-cell" nowrap="nowrap" valign="top" width="1%">
<span class="reactions-label">
Reactions:</span> </td>
<td><iframe allowtransparency="true" class="reactions-iframe" frameborder="0" name="reactions" scrolling="no" src="https://www.blogger.com/blog-post-reactions.g?options=%5Bfunny,+interesting,+cool%5D&amp;textColor=%23959595#https://bitforestinfo.blogspot.com/2017/02/how-to-use-python-beautiful-soup-module.html"></iframe></td>
</tr>
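Keep in mind that `.children` yields only direct children, while `.descendants` walks the whole subtree. A small sketch with made-up markup (text nodes have no `.name`, so the filter keeps only tags):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<table><tr><td>cell</td></tr></table>', 'html.parser')

# .children: direct children only; .descendants: everything below.
children = [c.name for c in soup.table.children]
descendants = [d.name for d in soup.table.descendants if d.name]
print(children)
print(descendants)
```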



I think, friends, this is enough for today.

For the next tutorial, click here.

Thanks for reading.

For more updates, visit our blog regularly,
subscribe to our blog,
follow us, and share it.
For any type of suggestion, help, or question,
contact me:
S.S.B
surajsinghbisht054@gmail.com
or comment below
