How To Create a Complete Website Crawler Using Python | Python Web Scraping | Python Example - Part 5

Posted by Suraj Singh on January 23, 2017 · 16 mins read
Hello readers,


Today's post is the fifth part of our web scraping tutorial series, and in this tutorial I am going to show you how to create a complete website crawler using Python. If you are a new visitor, first check our index, or click here for the fourth part.


With this crawler, we can easily:

1. extract links into a text file,
2. extract image links and download all the images,
3. extract links from a sitemap and save them in a text file.


So, readers, I have divided the script's functions across three files:

1.  Gethtml.py  (For Downloading HTML Code)
2.  Getlink.py  (For Extracting Links)
3.  main.py     (For Controlling the Main Function and Both Scripts)


Here, the Gethtml.py script downloads a website's HTML code, the Getlink.py script extracts links from the HTML data that Gethtml provides, and main.py controls both scripts. All three files must sit in the same folder so that main.py can import the other two.


Easy, readers!

So, let's look at the actual code:


1. Gethtml.py




#!/usr/bin/python
# -*- coding: utf-8 -*-
#
#
# Suraj
# surajsinghbisht054@gmail.com
# www.bitforestinfo.com
#
import urllib2
import os

# Function For Downloading Html
def main(url):
    try:
        print "[*] Downloading Html Codes ... ",
        page = urllib2.urlopen(url).read()
        print " Done!"
    except Exception as e:
        print "[Error Found] ", e
        page = None
    return page

# Function For Downloading Image
def image_download(link):
    #print link
    Saved_in = "WebsitePicturesDownloaded"
    if not os.path.isdir(Saved_in):
        print "[+] Creating Directory... ", Saved_in,
        os.mkdir(Saved_in)
        print " Done"
    img = open(os.path.join(Saved_in, os.path.basename(link)), 'wb')
    data = urllib2.urlopen(link)
    print "[+] Picture Saved As ", os.path.join(Saved_in, os.path.basename(link))
    img.write(data.read())
    img.close()
    return
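
One note before moving on: urllib2 exists only in Python 2. If you are running Python 3, roughly the same download function can be written with urllib.request instead. The sketch below is my rough equivalent, not part of the original script:

#!/usr/bin/python3
# Rough Python 3 equivalent of Gethtml.main, using urllib.request
# (a sketch for readers on Python 3; this tutorial itself targets Python 2)
import urllib.request

def main(url):
    try:
        print("[*] Downloading Html Codes ... ", end="")
        # urlopen().read() returns bytes in Python 3, so decode to text
        page = urllib.request.urlopen(url).read().decode("utf-8", "replace")
        print(" Done!")
    except Exception as e:
        print("[Error Found] ", e)
        page = None
    return page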


2. Getlink.py




#!/usr/bin/python
# -*- coding: utf-8 -*-
#
#
# Suraj
# surajsinghbisht054@gmail.com
# www.bitforestinfo.com
#
import re

# Function For Extracting Html Links
def main(html_data):
    # Filtering Url Links
    print "[*] Extracting Html Links ..."
    pattern = re.compile('(<a .*?>)')
    a_tag_captured = pattern.findall(html_data)
    for i in a_tag_captured:
        # Slice out the href="..." attribute from the anchor tag
        href_raw = i[str(i).find('href'):]
        href = href_raw[:href_raw.find(' ')]
        yield href[6:-1]
    print " Done"
    return

# Function For Extracting Sitemap Links
def main_sitemap(urls):
    print "[*] Extracting Sitemap ....",
    pattern = re.compile('<loc>(.*?)</loc>')
    data = pattern.findall(urls)
    print "Done!"
    return data

# Function For Extracting Image Links
def main_img(html_data):
    print "[*] Extracting Image Links ....",
    link = []
    pattern = re.compile('<img .*?>')
    for i in pattern.findall(html_data):
        # Slice out the src attribute and keep only absolute http(s) links
        i = i[i.find('src'):-2]
        img = i.split(' ')[0]
        if 'http' in img[4:10]:
            link.append(img[5:-1])
    print ' Done'
    return link
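
If you want to see what these extractors actually return before wiring everything together, you can feed them a small hand-written HTML snippet from the Python shell. The sample_html string below is invented purely for demonstration:

# Quick demonstration of the Getlink extractors
# (sample_html is a made-up snippet, only for illustration)
import Getlink

sample_html = ('<a href="http://example.com/page" id="x">link</a> '
               '<img src="http://example.com/pic.png" alt="pic">')

for href in Getlink.main(sample_html):
    print href                          # -> http://example.com/page

print Getlink.main_img(sample_html)     # -> ['http://example.com/pic.png']

Keep in mind that regex-based HTML parsing like this is quick but fragile; for anything serious, a real parser (for example, the standard library's HTMLParser) is the safer choice.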



3. main.py




#!/usr/bin/python
# -*- coding: utf-8 -*-
#
#
# Suraj
# surajsinghbisht054@gmail.com
# www.bitforestinfo.com
#
import sys
import Gethtml
import Getlink

# Show usage and exit unless exactly a url and an option number were given
if len(sys.argv) != 3:
    print "\n[*] Usage : python {} http://www.examplesite.com option_number\n".format(sys.argv[0])
    print "\t[-] Option [-]\n\n1. Current Page Links and Images \n2. Current Page Links \n3. Current Page Images \n4. Website Images \n5. Website Links"
    sys.exit(0)

# Command Line Arguments
url = sys.argv[1]
option = int(sys.argv[2])

# Current Page Links and Images
if option == 1:
    print "[*] Step (1/2)"
    html = Gethtml.main(url)
    for i in Getlink.main_img(html):
        Gethtml.image_download(i)
    print "[*] Step (2/2)"
    save_in = 'links.txt'
    # Reuse the page downloaded in step 1 instead of fetching it again
    data = Getlink.main(html)
    fileobj = open(save_in, 'w')
    fileobj.write(''.join(i + '\n' for i in data))
    fileobj.close()
    print "[*] Url Saved In : ", save_in
    print "[*] Done!"

# For Current Page All Links
elif option == 2:
    save_in = 'links.txt'
    html = Gethtml.main(url)
    data = Getlink.main(html)
    fileobj = open(save_in, 'w')
    fileobj.write(''.join(i + '\n' for i in data))
    fileobj.close()
    print "[*] Url Saved In : ", save_in
    print "[*] Done!"

# For Current Page All Images
elif option == 3:
    html = Gethtml.main(url)
    for i in Getlink.main_img(html):
        Gethtml.image_download(i)

# For All Website Images
elif option == 4:
    print "[+] Sitemap Url Example : www.examplesite.com/sitemap?page=1"
    sitemap = raw_input("[+] Enter Sitemap Url : ")
    html = Gethtml.main(sitemap)
    for i in Getlink.main_sitemap(html):
        Gethtml.image_download(i)

# For All Website Links
elif option == 5:
    save_in = 'links.txt'
    print "[+] Sitemap Url Example : www.examplesite.com/sitemap?page=1"
    sitemap = raw_input("[+] Enter Sitemap Url : ")
    html = Gethtml.main(sitemap)
    data = Getlink.main_sitemap(html)
    fileobj = open(save_in, 'w')
    fileobj.write(''.join(i + '\n' for i in data))
    fileobj.close()
    print "[*] Url Saved In : ", save_in

else:
    print " [*] Unknown Option : {}".format(str(option))

To download the raw script, click here.


So, readers,

        Next time, we will create some more interesting web scraping scripts. And believe me, this journey is going to be very interesting, because future tutorials will bring even more powerful scripts and solutions.

For more updates, visit us regularly.


And subscribe to our blog, follow us, and share this post.


For any suggestions or help, contact me:
Suraj
surajsinghbisht054@gmail.com