Hello readers! Today's post is the fifth part of our web scraping tutorial series, and in this tutorial I am going to show you how to create a complete website crawler using Python. If you are a new visitor, first check our index, or for the fourth part, click here.
With this crawler, we can easily:
1. extract links into a text file,
2. extract image links and download all the images,
3. extract links from a sitemap and save them in a text file.
So, here, readers, I am dividing the crawler's functions into 3 scripts:
1. Gethtml.py (for downloading Html codes)
2. Getlink.py (for extracting links)
3. main.py (for controlling the main function and both scripts)
Here, the Gethtml.py script downloads all the Html codes of a website, the Getlink.py script extracts links from the Html data provided by Gethtml.py, and main.py controls both scripts.
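To make the data flow between them concrete, here is a minimal sketch (the url is just a placeholder; the full scripts follow below):

# How the modules hand data to each other (placeholder url)
import Gethtml
import Getlink

html = Gethtml.main("http://www.example.com")  # Gethtml.py downloads the Html codes
if html:
    for link in Getlink.main(html):            # Getlink.py extracts links from that Html
        print link                             # main.py decides what to save or download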
Easy, readers!
So, let's see the practical code:
1. Gethtml.py
#!/usr/bin/python
# -*- coding: utf-8 -*-
#
# Suraj
# surajsinghbisht054@gmail.com
# www.bitforestinfo.com
#
import urllib2
import os

# Function For Downloading Html
def main(url):
    try:
        print "[*] Downloading Html Codes ... ",
        page = urllib2.urlopen(url).read()
        print " Done!"
    except Exception as e:
        print "[Error Found] ", e
        page = None
    return page

# Function For Downloading Image
def image_download(link):
    Saved_in = "WebsitePicturesDownloaded"
    if not os.path.isdir(Saved_in):
        print "[+] Creating Directory... ", Saved_in,
        os.mkdir(Saved_in)
        print " Done"
    img = open(os.path.join(Saved_in, os.path.basename(link)), 'wb')
    data = urllib2.urlopen(link)
    print "[+] Picture Saved As ", os.path.join(Saved_in, os.path.basename(link))
    img.write(data.read())
    img.close()
    return
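You can also test this script on its own before wiring it into main.py; here is a minimal sketch, assuming a placeholder url and a hypothetical image link:

# Quick standalone test of Gethtml.py (placeholder url)
import Gethtml

page = Gethtml.main("http://www.example.com")
if page:
    print "[*] Got {} bytes of Html".format(len(page))

# Hypothetical image url, used only to demonstrate image_download
Gethtml.image_download("http://www.example.com/logo.png")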
2. Getlink.py
#!/usr/bin/python
# -*- coding: utf-8 -*-
#
# Suraj
# surajsinghbisht054@gmail.com
# www.bitforestinfo.com
#
import re

# Function For Extracting Html Link
def main(html_data):
    # Filtering Url links
    print "[*] Extracting Html Links ..."
    pattern = re.compile('(<a .*?>)')
    a_tag_captured = pattern.findall(html_data)
    for i in a_tag_captured:
        href_raw = i[str(i).find('href'):]
        href = href_raw[:href_raw.find(' ')]
        yield href[6:-1]
    print " Done"
    return

# Function For Extracting Sitemap
def main_sitemap(urls):
    print "[*] Extracting Sitemap ....",
    pattern = re.compile('<loc>(.*?)</loc>')
    data = pattern.findall(urls)
    print "Done!"
    return data

# Function For Extracting Image Link
def main_img(html_data):
    print "[*] Extracting Image Links ....",
    link = []
    pattern = re.compile('<img .*?>')
    for i in pattern.findall(html_data):
        i = i[i.find('src'):-2]
        img = i.split(' ')[0]
        if 'http' in img[4:10]:
            link.append(img[5:-1])
    print ' Done'
    return link
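Before wiring Getlink.py into main.py, you can check how these regex-based extractors behave with a quick test against an inline Html snippet; a minimal sketch (the snippet and urls are made up for illustration):

# Quick test of Getlink.py against an inline Html snippet (made-up urls)
import Getlink

sample = '<a href="http://example.com/page1" class="nav"><img src="http://example.com/pic.png" /></a>'
for href in Getlink.main(sample):
    print "[link]", href            # -> http://example.com/page1
print Getlink.main_img(sample)      # -> ['http://example.com/pic.png']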
3. main.py
#!/usr/bin/python
# -*- coding: utf-8 -*-
#
# Suraj
# surajsinghbisht054@gmail.com
# www.bitforestinfo.com
#
import sys
import Gethtml
import Getlink

# Print usage and exit if url and option arguments are missing
if len(sys.argv) != 3:
    print "\n[*] Usage : python {} http://www.examplesite.com option_number\n".format(sys.argv[0])
    print "\t[-] Option [-]\n\n1. Current Page Links and Images \n2. Current Page Links \n3. Current Page Image \n4. Website Images \n5. Website Links"
    sys.exit(0)

# Command-line Arguments
url = sys.argv[1]
option = int(sys.argv[2])

# Current Page Links and Images
if option == 1:
    print "[*] Step (1/2)"
    html = Gethtml.main(url)
    for i in Getlink.main_img(html):
        Gethtml.image_download(i)
    print "[*] Step (2/2)"
    save_in = 'links.txt'
    data = Getlink.main(html)
    fileobj = open(save_in, 'w')
    fileobj.write(''.join(i + '\n' for i in data))
    fileobj.close()
    print "[*] Url Saved In : ", save_in
    print "[*] Done!"

# For Current Page All Links
elif option == 2:
    save_in = 'links.txt'
    html = Gethtml.main(url)
    data = Getlink.main(html)
    fileobj = open(save_in, 'w')
    fileobj.write(''.join(i + '\n' for i in data))
    fileobj.close()
    print "[*] Url Saved In : ", save_in
    print "[*] Done!"

# For Current Page All Images
elif option == 3:
    html = Gethtml.main(url)
    for i in Getlink.main_img(html):
        Gethtml.image_download(i)

# For All Website Images (crawl every page listed in the sitemap)
elif option == 4:
    print "[i] Example : www.examplesite.com/sitemap?page=1"
    sitemap = raw_input("[+] Enter Sitemap Url : ")
    html = Gethtml.main(sitemap)
    for page_url in Getlink.main_sitemap(html):
        page_html = Gethtml.main(page_url)
        if not page_html:
            continue
        for img_link in Getlink.main_img(page_html):
            Gethtml.image_download(img_link)

# For All Website Links
elif option == 5:
    save_in = 'links.txt'
    print "[i] Example : www.examplesite.com/sitemap?page=1"
    sitemap = raw_input("[+] Enter Sitemap Url : ")
    html = Gethtml.main(sitemap)
    data = Getlink.main_sitemap(html)
    fileobj = open(save_in, 'w')
    fileobj.write(''.join(i + '\n' for i in data))
    fileobj.close()
    print "[*] Url Saved In : ", save_in
    print "[*] Done!"

else:
    print " [*] Unknown Option : {}".format(str(option))
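And that is all three scripts. To use the crawler, pass the target website url and an option number on the command line; for example (the url is just a placeholder):

python main.py http://www.example.com 1

With option 1, the script first downloads every image it finds on the page into the WebsitePicturesDownloaded folder, and then saves all extracted links into links.txt.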
So, readers,
next time we will create some more interesting web scraping scripts.
Believe me, this journey is going to be very interesting, because in future tutorials
you will see even more interesting scripts and solutions.
For more updates, visit us regularly,
subscribe to our blog,
follow us, and share this post.
For any type of suggestion or help,
contact me:
Suraj
surajsinghbisht054@gmail.com