How To Create a Website Crawler for Email Harvesting Using Python & the urllib2 Module - Web Scraping - Part 12

Hello Friends,


This is the 12th part of our web scraping tutorial series, and in this tutorial, I am going to show you how to create a Python script for crawling a website and harvesting email addresses. For this purpose, we will use Python's built-in modules only. But first, if you are a new visitor, check our index, or For the 11th Part Click Here.


Now, Let's Talk About Today's Topic.

Friends,

In this tutorial, I am going to show you how to create a Python website crawler that saves the HTML of every linked webpage into a temporary file, and then a second Python script that finds all email addresses in that temporary file.
So, here we will create these 2 scripts:

1. crawler.py
2. email_filter.py


So, Let's Start.

Here, I am sharing my demo code, but if you want a better example, you can modify these scripts yourself or download them from my GitHub repository (link given at the end of this post).

1. crawler.py




#!/usr/bin/python
# ---------------- READ ME ---------------------------------------------
# This Script Is Created For Practice And Educational Purposes Only
# This Script Is Created For https://bitforestinfo.blogspot.in
# This Script is Written By
__author__='''

######################################################
                By S.S.B Group                          
######################################################

    Suraj Singh
    Admin
    S.S.B Group
    surajsinghbisht054@gmail.com
    https://bitforestinfo.blogspot.in/

    Note: We Feel Proud To Be Indian
######################################################
'''
# Import Modules
import urllib2
import re
from urlparse import urljoin   # Resolve Relative Links Against The Base Url

# Configuration
html_dump = "tempdump"  # Temp File Name

# Function For Extracting Html Links
def link(html_data):
    # Filtering Url Links: Capture The href Value Of Every <a> Tag Directly
    print "[*] Extracting Html Links ..."
    pattern = re.compile(r'<a\s[^>]*href=["\']([^"\']+)["\']')
    for href in pattern.findall(html_data):
        yield href

# Function For Downloading Html
def main(url):
    try:
        print "[*] Downloading Html Codes ... ",
        header={'User-Agent':'Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0 Iceweasel/31.8.0'}
        req=urllib2.Request(url, headers=header)
        page = urllib2.urlopen(req).read()
    except Exception:
        page = ''    # Return An Empty String On Error So Nothing Bogus Is Written
    return page

base_url = 'https://bitforestinfo.blogspot.com'   # Enter Your Website Address Here

temp = open(html_dump, 'a')             # Open Temp File
for i in link(main(base_url)):
    temp.write(main(urljoin(base_url, i)))   # Resolve Relative Links, Then Write Page Data
temp.close()                            # Closing File
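
Note: the urllib2 module only exists in Python 2. If you are on Python 3, the same idea works with urllib.request instead; below is a minimal sketch of crawler.py for Python 3 (the fetch and links helper names are my own, and it assumes the same tempdump file name):

#!/usr/bin/python3
# Python 3 Sketch Of crawler.py (Assumption: urllib.request Replaces urllib2)
import re
import urllib.request
from urllib.parse import urljoin

HTML_DUMP = "tempdump"    # Same Temp File Name As The Python 2 Script
HEADERS = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0'}

def fetch(url):
    # Download One Page; Return An Empty String On Any Error
    try:
        req = urllib.request.Request(url, headers=HEADERS)
        return urllib.request.urlopen(req).read().decode('utf-8', 'ignore')
    except Exception:
        return ''

def links(html):
    # Yield Every href Value Found Inside An <a ...> Tag
    for href in re.findall(r'<a\s[^>]*href=["\']([^"\']+)["\']', html):
        yield href

if __name__ == '__main__':
    base = 'https://bitforestinfo.blogspot.com'   # Put Your Website Address Here
    with open(HTML_DUMP, 'a') as temp:
        for url in links(fetch(base)):
            temp.write(fetch(urljoin(base, url)))   # Resolve Relative Links First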


2. email_filter.py



#!/usr/bin/python
# ---------------- READ ME ---------------------------------------------
# This Script Is Created For Practice And Educational Purposes Only
# This Script Is Created For https://bitforestinfo.blogspot.in
# This Script is Written By
__author__='''

######################################################
                By S.S.B Group                          
######################################################

    Suraj Singh
    Admin
    S.S.B Group
    surajsinghbisht054@gmail.com
    https://bitforestinfo.blogspot.in/

    Note: We Feel Proud To Be Indian
######################################################
'''
# Import Module
import re

# Configuration
html_dump = "tempdump"    # Temp File Name
temp_file = open(html_dump, 'r') # Open File
data = temp_file.read()    # Read Data
temp_file.close()     # Close File

# Function For Extracting Email Addresses
def email(data):
    # Filtering Email Addresses
    pattern = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')
    captured = pattern.findall(data)
    return captured

email_list = []     # List For Collecting Unique Emails

# Collecting Emails (Skip Duplicates)
for i in email(data):
    if i not in email_list:
        email_list.append(i)

# Print Collected Emails
print email_list
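
Again, this script is written for Python 2 (print statement). On Python 3, the same filter can be shortened by de-duplicating with a set; here is a minimal sketch, assuming the same tempdump file (the EMAIL_RE name and the pattern are my own rough choice):

#!/usr/bin/python3
# Python 3 Sketch Of email_filter.py (Assumption: A Set Replaces The De-Duplication Loop)
import re

with open("tempdump", "r", errors="ignore") as f:
    data = f.read()

# Rough Email Pattern: Local Part, '@', Then Dot-Separated Domain Labels
EMAIL_RE = re.compile(r'[\w.-]+@[\w-]+(?:\.[\w-]+)+')

emails = sorted(set(EMAIL_RE.findall(data)))
print(emails)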

Note: Save Both Scripts In The Same Directory.

Usage:
    first  : python crawler.py
    second : python email_filter.py

Warning: I am creating this tutorial for practice and educational purposes only. I will not take any responsibility for any illegal activities.


For downloading the raw script, Click Here.


For more updates, visit our blog regularly,
and subscribe.


Follow us and share this post.
For any type of suggestion or help,
Contact me:
S.S.B
surajsinghbisht054@gmail.com
