How To Create a Complete Website Crawler Using Python | python web scraping | python example - part 5

Hello Friends,


Today, this is the fifth part of our web scraping tutorial series, and in this tutorial I am going to show you how to create a complete website crawler using Python. If you are a new visitor, first check our index, or for the fourth part, click here.


With this crawler, we can easily:

1. extract links into a text file,
2. extract image links and download all images,
3. extract links from a sitemap and save them in a text file.


So, friends,

I am dividing the script's functions into 3 parts:

1.  Gethtml.py (for downloading HTML code)
2.  Getlink.py (for extracting links)
3.  main.py    (for controlling the main function and both scripts)


Here, the Gethtml.py script downloads all the HTML code of a website, the Getlink.py script extracts links from the HTML data that Gethtml provides, and main.py controls both scripts, as the rough sketch below shows.
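
Here is a simplified sketch of that pipeline, for orientation only; the full scripts follow below:

import Gethtml
import Getlink

html = Gethtml.main("http://www.examplesite.com")  # 1. download the page's HTML
links = Getlink.main(html)                         # 2. extract links from that HTML
for link in links:                                 # 3. main.py saves them or downloads images
    print link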


Easy, friends!


So, let's see the practical code:


1. Gethtml.py




#!/usr/bin/python
# -*- coding: utf-8 -*-
#
# Written By:
#       S.S.B
#       surajsinghbisht054@gmail.com
#       bitforestinfo.blogspot.com
#   
import urllib2
import os

# Function For Downloading Html
def main(url):
    try:
        print "[*] Downloading Html Codes ... ",
        page = urllib2.urlopen(url).read()
        print " Done!"
    except Exception as e:
        print "[Error Found] ",e
        page=None
    return page

# Function For Downloading Image
def image_download(link):
    Saved_in = "WebsitePicturesDownloaded"
    if not os.path.isdir(Saved_in):
        print "[+] Creating Directory... ",Saved_in,
        os.mkdir(Saved_in)
        print " Done"
    try:
        # Download the image data first, so a failed request
        # does not leave an empty file behind or crash the crawl
        data = urllib2.urlopen(link).read()
    except Exception as e:
        print "[Error Found] ",e
        return
    path = os.path.join(Saved_in, os.path.basename(link))
    img = open(path,'wb')
    img.write(data)
    img.close()
    print "[+] Picture Saved As ",path
    return
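
Note: these scripts target Python 2 (urllib2 and the print statement). On Python 3, urllib2 was merged into urllib.request, so a minimal sketch of the same download function would look like this:

#!/usr/bin/python3
# Minimal Python 3 sketch of Gethtml.main (urllib2 became urllib.request in Python 3)
import urllib.request

def main(url):
    try:
        print("[*] Downloading Html Codes ...", end=" ")
        page = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
        print("Done!")
    except Exception as e:
        print("[Error Found]", e)
        page = None
    return page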


2. Getlink.py




#!/usr/bin/python
# -*- coding: utf-8 -*-
#
# Written By:
#       S.S.B
#       surajsinghbisht054@gmail.com
#       bitforestinfo.blogspot.com
#   
import re

# Function For Extracting Html Links
def main(html_data):
    # Filtering Url links
    print "[*] Extracting Html Links ..."
    pattern = re.compile('(<a .*?>)')
    a_tag_captured = pattern.findall(html_data)
    for i in a_tag_captured:
        # Slice from 'href' up to the next space, e.g. href="http://..."
        href_raw = i[i.find('href'):]
        href = href_raw[:href_raw.find(' ')]
        # Strip the leading 'href="' and the trailing quote
        yield href[6:-1]
    print " Done"
    return

# Function For Extracting Sitemap
def main_sitemap(urls):
    print "[*] Extracting Sitemap ....",
    pattern = re.compile('<loc>(.*?)</loc>')
    data=pattern.findall(urls)
    print "Done!"
    return data

# Function For Extracting Image Links
def main_img(html_data):
    print "[*] Extracting Image Links ....",
    link=[]
    pattern = re.compile('<img .*?>')
    for i in pattern.findall(html_data):
        # Slice from 'src' and drop the tag's closing '/>'
        i = i[i.find('src'):-2]
        img = i.split(' ')[0]
        # Keep only absolute links, e.g. src="http://..."
        if 'http' in img[4:10]:
            # Strip the leading 'src="' and the trailing quote
            link.append(img[5:-1])
    print ' Done'
    return link
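
To see what these regex-based extractors return, here is a quick demo on made-up sample HTML (the URLs are placeholders for illustration):

# Quick demo of the Getlink extractors on made-up sample HTML
import Getlink

sample = '<a href="http://example.com/page" id="x">link</a> <img src="http://example.com/pic.jpg" />'
print list(Getlink.main(sample))       # ['http://example.com/page']
print Getlink.main_img(sample)         # ['http://example.com/pic.jpg']

sitemap = '<urlset><url><loc>http://example.com/a</loc></url></urlset>'
print Getlink.main_sitemap(sitemap)    # ['http://example.com/a']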
    


3. main.py




#!/usr/bin/python
# -*- coding: utf-8 -*-
#
# Written By:
#       S.S.B
#       surajsinghbisht054@gmail.com
#       bitforestinfo.blogspot.com
#   
import sys
import Gethtml
import Getlink

# Show usage unless exactly a url and an option number are given
if len(sys.argv) != 3:
    print "\n[*] Usage : python {} http://www.examplesite.com  option_number\n".format(sys.argv[0])
    print "\t[-] Option [-]\n\n1. Current Page Links and Images \n2. Current Page Links \n3. Current Page Image \n4. Website Images \n5. Website Links"
    sys.exit(0)

# Option Argument
option=int(sys.argv[2])
url=sys.argv[1]

# Current Page Links and Images
if option==1:
    print "[*] Step (1/2)"
    html=Gethtml.main(url)
    for i in Getlink.main_img(html):
        Gethtml.image_download(i)
    print "[*] Step (2/2)"
    save_in='links.txt'
    # Reuse the already downloaded html instead of fetching it again
    data=Getlink.main(html)
    fileobj=open(save_in,'w')
    fileobj.write(''.join(i+'\n' for i in data))
    fileobj.close()
    print "[*] Url Saved In : ",save_in
    print "[*] Done!"

# For Current Page All Links
elif option==2:
    save_in='links.txt'
    html=Gethtml.main(url)
    data=Getlink.main(html)
    fileobj=open(save_in,'w')
    fileobj.write(''.join(i+'\n' for i in data))
    fileobj.close()
    print "[*] Url Saved In : ",save_in
    print "[*] Done!"
    
# For Current Page All Images
elif option==3:
    html=Gethtml.main(url)
    for i in Getlink.main_img(html):
        Gethtml.image_download(i)
    
# For All Website Images
elif option==4:
    print "[+] Sitemap Url Example : www.examplesite.com/sitemap?page=1"
    sitemap = raw_input("[+] Enter Sitemap Url : ")
    html=Gethtml.main(sitemap)
    # Visit every page listed in the sitemap and download its images
    for i in Getlink.main_sitemap(html):
        page=Gethtml.main(i)
        for img in Getlink.main_img(page):
            Gethtml.image_download(img)
    
# For All Website Links
elif option==5:
    save_in='links.txt'
    print "[+] Sitemap Url Example : www.examplesite.com/sitemap?page=1"
    sitemap = raw_input("[+] Enter Sitemap Url : ")
    html=Gethtml.main(sitemap)
    data=Getlink.main_sitemap(html)
    fileobj=open(save_in,'w')
    fileobj.write(''.join(i+'\n' for i in data))
    fileobj.close()
    print "[*] Url Saved In : ",save_in
    
else:
    print " [*] Unknown Option : {}".format(str(option))

To download the raw script, click here.


So, friends,

Next time, we will create some more interesting web scraping scripts. And believe me, this journey is going to be very interesting, because in future tutorials you will see even more interesting scripts and solutions.

For more updates, visit us regularly.


And subscribe to our blog, follow us, and share it.


For any kind of suggestion or help,


Contact me:
S.S.B
surajsinghbisht054@gmail.com
