How To Scrape HTML Forms Using The Python Mechanize Module (Complete Mechanize Tutorial) - | python web scraping | python example - part 13

Hello friends,

This is the 13th part of our web scraping tutorials. In this tutorial, I am going to show you how to use the Python mechanize module. In other words, today's tutorial is about how to deal with HTML forms such as login forms, details forms, etc.

Today's tutorial should be very interesting, because I am going to show you how to build web scrapers on your own. That means you don't need to depend on other people to build a scraper for you.




Now, let's talk about today's topic.

In today's tutorial, we will cover:

1. Some material from previous tutorials
2. Form handling
3. Session handling
4. Automation
5. Proxy handling
 etc.

So don't skip any line or any content.
Read carefully and try to understand these examples, because I tried my best to make them easy to understand and easy to remember.
For future updates, follow us.
But first, if you are a new visitor, check our index or read the 12th part.

So, let's start.




mechanize_manual slides
In [32]:
#!/usr/bin/python
# I collected this material from many sites across the internet, plus some personal experience.
# Here, I am using:
#   Ubuntu
#   Python 2.7
#   IPython notebook
#   the latest version of mechanize
#
# Installation
# To install mechanize, open a terminal and type:
#       $ python -m pip install mechanize
#
# So, let's start
#
import mechanize
In [7]:
# For more detail, see:
#
#   http://joesourcecode.com/Documentation/mechanize0.2.5/
#   http://wwwsearch.sourceforge.net/mechanize
#   http://wwwsearch.sourceforge.net/mechanize/doc.html
#
# Let's start our tutorial
#
# Create the cookie jar
cj = mechanize.CookieJar()

# Or You Can Also Use 
# import cookielib
# cj=cookielib.LWPCookieJar()

# Create Browser Object
br = mechanize.Browser()

# Connect Cookie Jar
br.set_cookiejar(cj)

# Always set a User-Agent, because it lets your bot present itself as a regular browser.
# Set User-Agent
br.addheaders = [('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0 Iceweasel/31.8.0')]

# Some more User-Agent strings. You can use any one from this list.
user_agents = [
    'Mozilla/5.0 (Amiga; U; AmigaOS 1.3; en; rv:1.8.1.19) Gecko/20081204 SeaMonkey/1.1.14',
    'Mozilla/5.0 (AmigaOS; U; AmigaOS 1.3; en-US; rv:1.8.1.21) Gecko/20090303 SeaMonkey/1.1.15',
    'Mozilla/5.0 (AmigaOS; U; AmigaOS 1.3; en; rv:1.8.1.19) Gecko/20081204 SeaMonkey/1.1.14',
    'Mozilla/5.0 (Android 2.2; Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.19.4 (KHTML, like Gecko) Version/5.0.3 Safari/533.19.4',
    'Mozilla/5.0 (BeOS; U; BeOS BeBox; fr; rv:1.9) Gecko/2008052906 BonEcho/2.0',
]

#  ------------------------------------------------------------------
#  ============[ Some Useful Browser options ]=======================
#  ------------------------------------------------------------------

# Set whether to treat HTML http-equiv headers like HTTP headers.
br.set_handle_equiv(True) 

# Handle gzip transfer encoding.
br.set_handle_gzip(True)

# Set whether to handle HTTP 30x redirections.
br.set_handle_redirect(True)

# Set whether to add Referer header to each request.
br.set_handle_referer(True)

# Set whether to observe rules from robots.txt.
br.set_handle_robots(False)

# Set whether to handle HTTP Refresh headers.
br.set_handle_refresh(True)

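# Make all controls of the selected form writable.
# The browser passes this call through to the currently selected form,
# so it only works after br.select_form(); call it there, not here.
# br.set_all_readonly(False)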

# Open any website. Here, I am opening my own blog.
response = br.open("http://bitforestinfo.blogspot.com/")

# Output (a warning printed by mechanize):
/usr/local/lib/python2.7/dist-packages/ipykernel/__main__.py:30: UserWarning: gzip transfer encoding is experimental!
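The User-Agent strings listed in the setup cell can also be rotated instead of hard-coding one. The following is only a minimal sketch: `USER_AGENTS` and `random_user_agent_header` are hypothetical names made up for this example (they are not part of mechanize), and the strings are copied from the cell above.

```python
import random

# Hypothetical pool of User-Agent strings, copied from the setup cell.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0 Iceweasel/31.8.0",
    "Mozilla/5.0 (Amiga; U; AmigaOS 1.3; en; rv:1.8.1.19) Gecko/20081204 SeaMonkey/1.1.14",
    "Mozilla/5.0 (BeOS; U; BeOS BeBox; fr; rv:1.9) Gecko/2008052906 BonEcho/2.0",
]


def random_user_agent_header(agents=USER_AGENTS):
    """Return a header list in the same shape that br.addheaders uses."""
    return [("User-Agent", random.choice(agents))]


headers = random_user_agent_header()
print(headers)

# With the browser from the setup cell you would then write:
#   br.addheaders = random_user_agent_header()
```

Because the helper returns a list with a single `('User-Agent', value)` tuple, it can be assigned to `br.addheaders` exactly like the hard-coded list.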
In [2]:
# Get Html Page Title 
print br.title()

# For Current Url 
print response.geturl()

# get Html Source
print response.read()

# Or Try This Also
print br.response().read()
In [3]:
# Show the response headers
print response.info()

# or Directly
print br.response().info()
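Cookies that the server sends with a response end up in the cookie jar connected to the browser. If you want a session to survive between runs, one option is the `LWPCookieJar` mentioned in the setup cell, which can save itself to a file. The sketch below runs offline and makes several assumptions: the cookie is built by hand (a real server would set it), and the file path, cookie name, value and domain are all hypothetical.

```python
import os
import tempfile

try:
    from http.cookiejar import Cookie, LWPCookieJar  # Python 3
except ImportError:
    from cookielib import Cookie, LWPCookieJar  # Python 2


def make_cookie(name, value, domain):
    """Build a simple session cookie by hand (normally the server sets it)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


# Hypothetical location for the cookie file.
cookie_file = os.path.join(tempfile.mkdtemp(), "cookies.txt")

# First run: this jar plays the role of the one passed to br.set_cookiejar().
jar = LWPCookieJar(cookie_file)
jar.set_cookie(make_cookie("sessionid", "abc123", "example.com"))
jar.save(ignore_discard=True, ignore_expires=True)

# Later run: load the saved cookies, then pass the jar to br.set_cookiejar().
restored = LWPCookieJar(cookie_file)
restored.load(ignore_discard=True, ignore_expires=True)
names = [cookie.name for cookie in restored]
print(names)
```

`ignore_discard=True` matters here because session cookies are marked as discardable and would otherwise be skipped when saving and loading.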
In [18]:
# Let's try some other things.
# Here, I will try to search for something related to Python.
# But first, check how many forms are available.
# Try this:
for available_form in br.forms():
    # Form
    print available_form

    # Form attributes (helpful in selecting a form)
    print available_form.attrs

# Let's Check The Output
<GET http://bitforestinfo.blogspot.in/search application/x-www-form-urlencoded
  <TextControl(q=)>
  <SubmitControl(<None>=Search) (readonly)>>
{'action': 'http://bitforestinfo.blogspot.in/search', 'class': 'gsc-search-box'}
<POST https://feedburner.google.com/fb/a/mailverify application/x-www-form-urlencoded
  <TextControl(email=)>
  <SubmitControl(<None>=Submit) (readonly)>
  <HiddenControl(uri=BitForest) (readonly)>
  <HiddenControl(loc=en_US) (readonly)>>
{'action': 'https://feedburner.google.com/fb/a/mailverify', 'onsubmit': 'window.open("https://feedburner.google.com/fb/a/mailverify?uri=BitForest", "popupwindow", "scrollbars=yes,width=550,height=520"); return true', 'method': 'post', 'target': 'popupwindow'}
<POST https://feedburner.google.com/fb/a/mailverify application/x-www-form-urlencoded
  <TextControl(email=)>
  <SubmitControl(<None>=Submit) (readonly)>
  <HiddenControl(uri=BitForest) (readonly)>
  <HiddenControl(loc=en_US) (readonly)>>
{'action': 'https://feedburner.google.com/fb/a/mailverify', 'onsubmit': 'window.open("https://feedburner.google.com/fb/a/mailverify?uri=BitForest", "popupwindow", "scrollbars=yes,width=550,height=520"); return true', 'method': 'post', 'target': 'popupwindow'}
<contact-form GET http://bitforestinfo.blogspot.in/ application/x-www-form-urlencoded
  <IgnoreControl(<None>=<None>)>>
{'name': 'contact-form'}
In [19]:
# Select the first form 
br.select_form(nr=0)   # Easy Method
In [27]:
# wait.. wait 
# More Examples For Form selection
# br.select_form("form1")         # only works when form has a name
# br.form = list(br.forms())[0]   # use when form is unnamed
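Selecting by position (`nr=0`) breaks as soon as the page adds or reorders forms. `select_form` also accepts a `predicate` argument, a function that receives each form and returns True for the one you want. The sketch below is only an illustration: `form_has_class` and `FakeForm` are hypothetical names, and `FakeForm` is a stand-in object so the predicate can be tried without opening a page. The attribute values are taken from the form listing printed earlier.

```python
def form_has_class(css_class):
    """Return a predicate matching forms whose class attribute equals css_class."""
    def predicate(form):
        return form.attrs.get("class") == css_class
    return predicate


class FakeForm(object):
    """Stand-in for a mechanize form: only the .attrs dictionary is imitated."""
    def __init__(self, attrs):
        self.attrs = attrs


forms = [
    FakeForm({"action": "https://feedburner.google.com/fb/a/mailverify", "method": "post"}),
    FakeForm({"action": "http://bitforestinfo.blogspot.in/search", "class": "gsc-search-box"}),
]

is_search_form = form_has_class("gsc-search-box")
matches = [form.attrs["action"] for form in forms if is_search_form(form)]
print(matches)

# Against a real page:
#   br.select_form(predicate=form_has_class("gsc-search-box"))
```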
In [26]:
# Listing the controls of the selected form
for control in br.form.controls:
    print control        # the control itself: type, name and current value
    print control.attrs  # control attributes

# Let's Check The Output
<TextControl(q=)>
{'autocomplete': 'off', 'name': 'q', 'title': 'search', 'type': 'text', 'class': 'gsc-input', 'value': '', 'size': '10'}
<SubmitControl(<None>=Search) (readonly)>
{'type': 'submit', 'class': 'gsc-search-button', 'value': 'Search', 'title': 'search'}
In [30]:
# Let's search
br.form['q']='python'   # Value For Selected Input

# Clicking submit Button
br.submit()

# or 

# br.submit(name='Button_Name', label='button_label')

#print br.title()
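For a GET form like this search box, submitting amounts to requesting the form's action URL with the filled-in fields as the query string. The sketch below rebuilds that URL by hand with the standard library, so you can see what is sent. It is a simplification and not mechanize's own code: `build_get_url` is a hypothetical helper, and it assumes the action URL has no query string of its own.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2


def build_get_url(action, fields):
    """Join an action URL (without a query string) and URL-encoded form fields."""
    return action + "?" + urlencode(fields)


# Only named controls are sent; the unnamed Search button contributes nothing.
url = build_get_url("http://bitforestinfo.blogspot.in/search", [("q", "python")])
print(url)  # http://bitforestinfo.blogspot.in/search?q=python
```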
In [ ]:
# You can also do this with controls.
# Here, "controls" means things like radio buttons, list boxes and many more.

# Find a control directly (replace the placeholder with a real control name)
control = br.form.find_control("control_name")

# Check if it's a SelectControl
if control.type == "select":
    print control.attrs

    # Assign a value: list controls take a list of item names
    br[control.name] = ["Item_Name"]

    # or try this
    control.value = ["Any_Value_here"]

# Check the value
print control

# Check if it's a TextControl
if control.type == "text":
    control.value = "enter your text here or value"

# Some more settings, one by one
control.readonly = False
control.disabled = True

# Or change every submit control at once
for control in br.form.controls:
    if control.type == "submit":
        control.disabled = True

# Clicking the submit button
br.submit()

print br.title()
In [ ]:
# Wait, there are more features too,
# so let's go through them quickly.
#
# Downloading files
# br.retrieve() returns (filename, headers); [0] is the path of the saved file
downloaded_file = br.retrieve('Enter_File_downloading_url_address_here')[0]
open_downloaded_file = open(downloaded_file)
# or
print downloaded_file
In [ ]:
import re

# If you need to click on linked text, try this.
# First, search for the link by its text
br.find_link(text='Your_Linked_Text_Here')

# Then build a request for that link
req = br.click_link(text='Your_Linked_Text_Here')

# Open the requested link
br.open(req)

# Already explained above
print br.response().read()
print br.geturl()

# Back
br.back()

# Or you can also try this:
# follow the first link whose URL matches a pattern
word = None

for link in br.links():
    # URLs are usually lowercase, so match case-insensitively
    linkMatch = re.compile('GitHub', re.IGNORECASE).search(link.url)

    if linkMatch:
        word = br.follow_link(link)
        break

# word stays None when no link matched
if word is not None:
    content = word.get_data()  # get the page content
    print content
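Matching links by URL is easy to try offline before pointing it at a real page. The sketch below uses a hypothetical list of URLs in place of `[link.url for link in br.links()]`, and `first_matching_url` is a made-up helper name. It compiles the pattern with `re.IGNORECASE`, because hosts such as github.com normally appear in lowercase in a URL.

```python
import re


def first_matching_url(urls, pattern):
    """Return the first URL that matches pattern (case-insensitive), or None."""
    matcher = re.compile(pattern, re.IGNORECASE)
    for url in urls:
        if matcher.search(url):
            return url
    return None


# Hypothetical URLs standing in for [link.url for link in br.links()].
urls = [
    "http://bitforestinfo.blogspot.com/p/index.html",
    "https://github.com/example/project",
]

found = first_matching_url(urls, "GitHub")
missing = first_matching_url(urls, "gitlab")
print(found)    # https://github.com/example/project
print(missing)  # None
```

Returning None when nothing matches lets the caller decide what to do, rather than failing later on a missing response.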
In [ ]:
# If you want to use a proxy:
# Proxy
# br.set_proxies({"http": "myproxy.example.com:0054"})
# Proxy password
# br.add_proxy_password("joe", "password")
# Proxy with user name and password in one step
# br.set_proxies({"http": "suraj:password@proxy.blogspot.com:0054"})
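The proxy strings in the commented lines above follow one pattern: `host:port`, optionally prefixed with `user:password@`. If you switch proxies often, a small helper can build them for you. This is just a sketch: `proxy_url` is a hypothetical name, and the host, port and credentials are placeholders.

```python
def proxy_url(host, port, user=None, password=None):
    """Return 'host:port', prefixed with 'user:password@' when credentials are given."""
    address = "%s:%d" % (host, port)
    if user is not None and password is not None:
        return "%s:%s@%s" % (user, password, address)
    return address


# Placeholder host, port and credentials.
plain = {"http": proxy_url("myproxy.example.com", 3128)}
with_auth = {"http": proxy_url("myproxy.example.com", 3128, "joe", "password")}
print(plain)
print(with_auth)

# With the browser from the setup cell:
#   br.set_proxies(plain)
#   br.set_proxies(with_auth)
```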


For more updates, visit our blog regularly, subscribe to our blog, follow us and share it.
For any type of suggestion, help or question, contact me:
S.S.B
surajsinghbisht054@gmail.com

or comment below.
