Hello readers, this is the sixth part of our web scraping tutorial series. In this tutorial, I am going to show you how to create a Google search result scraper using Python. If you are a new visitor, first check our index, or for the fifth part, Click Here.
So, let's talk about today's topic.
Readers, here we will try to create a Python script that returns Google search results in list form. In today's topic, we will learn how to create an opener with urllib2.build_opener that can handle
1. HTTP redirects
2. Headers (like User-Agent)
3. Cookies
and so on. To make this script easier to understand and more instructive, I am using only built-in modules, because with built-in modules we can understand the internal functions more easily and more deeply.
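To make the three pieces above concrete, here is a minimal sketch of such an opener. It is written for Python 3, where urllib2 became urllib.request and cookielib became http.cookiejar, so it is an equivalent of the Python 2 approach used in this tutorial rather than the tutorial's own script. The helper names make_opener and build_search_url are hypothetical, chosen only for illustration, and the sketch does not send any request.

```python
# Python 3 sketch (assumption: urllib.request / http.cookiejar replace
# the Python 2 modules urllib2 / cookielib used in this tutorial).
import http.cookiejar
import urllib.parse
import urllib.request

# Example User-Agent string; any browser-like value could be used here.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0"


def make_opener(user_agent=USER_AGENT):
    """Return (opener, cookie_jar). Hypothetical helper for illustration."""
    # Cookies set by the server are stored in this jar and sent back later.
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar),  # cookie handling
        urllib.request.HTTPRedirectHandler(),            # follow redirects
    )
    # Headers listed here are added to every request made by the opener.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, cookie_jar


def build_search_url(query, base="https://www.google.com/search?"):
    """Append the URL-encoded query to the base search URL."""
    return base + urllib.parse.urlencode({"q": query})


if __name__ == "__main__":
    opener, jar = make_opener()
    print(build_search_url("BitForestInfo"))
    # prints: https://www.google.com/search?q=BitForestInfo
```

Calling opener.open(url) on the returned opener would then perform the actual request with the cookie jar, redirect handler, and User-Agent header attached.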
So, let's start.
Here I am sharing my demo code. If you want a better example, you can modify this code yourself or download the script from my GitHub repository (link given at the end of the code).
1. google_search.py
# Import Modules
import urllib2
import urllib
import cookielib
import re

# Google Url
google = 'https://www.google.com/search?'

# Search Query
Query = "BitForestInfo"

# Set User Agent
header = [('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0 Iceweasel/31.8.0')]

# Create Cookie Handler
cj = cookielib.CookieJar()

# Create Url Handler
url_opener = urllib2.build_opener(
    urllib2.HTTPCookieProcessor(cj),   # Connect Cookie Jar
    urllib2.HTTPRedirectHandler())     # Add Redirect Handler

# Connect Header With Opener
url_opener.addheaders = header

# Encode Query With Url
query = google + urllib.urlencode({'q': Query})

# Now Open Google Search Page
html = url_opener.open(query)

# Collect Html Code
codes = html.read()

# Compile Pattern
pattern = re.compile('<h3(.*?)</h3')

# List For Collecting Results
collect_result = []

# Find Matches
for i in pattern.findall(codes):
    match = re.search('href=.(.*)..(onmousedown).+(>)([^><]+)(<)', i)
    if match is None:
        continue  # Skip Headings That Do Not Contain A Result Link
    result = match.groups()
    collect_result.append((result[0], result[3]))  # Store (url, title)
    print result[0], result[3]
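The matching step can be tried offline, without contacting Google, by running the same idea against a small HTML string. The sketch below is written for Python 3 and uses simplified patterns and a made-up SAMPLE_HTML snippet. Both are assumptions for illustration and do not reflect Google's actual markup, which changes over time. The function name extract_results is also hypothetical.

```python
import re

# Simplified patterns (assumption: each result title is an <a> inside an <h3>).
H3_PATTERN = re.compile(r"<h3[^>]*>(.*?)</h3>", re.DOTALL)
LINK_PATTERN = re.compile(r'<a[^>]*href="([^"]+)"[^>]*>(.*?)</a>', re.DOTALL)
TAG_PATTERN = re.compile(r"<[^>]+>")


def extract_results(html):
    """Return a list of (url, title) tuples found inside <h3> headings."""
    results = []
    for block in H3_PATTERN.findall(html):
        match = LINK_PATTERN.search(block)
        if match is None:
            continue  # heading without a link, skip it
        url, raw_title = match.groups()
        # Remove inline tags such as <b> from the visible title text.
        title = TAG_PATTERN.sub("", raw_title).strip()
        results.append((url, title))
    return results


# Made-up markup for demonstration only.
SAMPLE_HTML = """
<h3 class="r"><a href="http://example.com/one" onmousedown="x()">First <b>result</b></a></h3>
<h3 class="r">No link here</h3>
<h3 class="r"><a href="http://example.com/two">Second result</a></h3>
"""

if __name__ == "__main__":
    for url, title in extract_results(SAMPLE_HTML):
        print(url, title)
```

Regular expressions are fragile against real-world HTML, so a pattern like this usually needs adjusting whenever the page layout changes.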
Warning: I am creating this tutorial only for practice and educational purposes. I will not take any responsibility for any illegal activities. To download the raw script, Click Here.
For any suggestions or help, contact me:
Suraj
surajsinghbisht054@gmail.com