Hello readers, this is the 12th part of our web scraping tutorial series. In this tutorial, I am going to show you how to create Python scripts for website crawling and email harvesting. For this purpose, we will use Python's built-in modules only. But first, if you are a new visitor, check our index, or for the 11th part, Click Here.
Now, let's talk about today's topic.
In this tutorial, I am going to show you how to create a Python website crawler that saves all the HTML data of web pages in a temporary file, and then create another Python script that finds all the email addresses in that temporary file.
So, here we will create these two scripts:
1. crawler.py
2. email_filter.py
So, let's start.
Here, I am sharing my demo code. If you want a better example, you can modify the code yourself or download the script from my GitHub repository (link given at the end of the code).
1. crawler.py
#!/usr/bin/python
# ---------------- READ ME ---------------------------------------------
# This Script is Created Only For Practise And Educational Purpose Only
# This Script Is Created For https://www.bitforestinfo.com
# This Script is Written By
__author__ = '''
######################################################
                        By
######################################################

Suraj Singh
surajsinghbisht054@gmail.com
https://www.bitforestinfo.com/

######################################################
'''

# Import Module
import urllib2
import re

# Configuration
html_dump = "tempdump"  # Temp File Name

# Function For Extracting Html Links
def link(html_data):
    # Filter <a ...> tags out of the raw HTML and yield their href values
    print "[*] Extracting Html Links ..."
    pattern = re.compile('(<a .*?>)')
    a_tag_captured = pattern.findall(html_data)
    for i in a_tag_captured:
        href_raw = i[i.find('href'):]          # cut the tag from 'href' onwards
        href = href_raw[:href_raw.find(' ')]   # keep only the href="..." part
        yield href[6:-1]                       # strip the leading href=" and the trailing quote
    return

# Function For Downloading Html
def main(url):
    try:
        print "[*] Downloading Html Codes ... ",
        header = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0 Iceweasel/31.8.0'}
        req = urllib2.Request(url, headers=header)
        page = urllib2.urlopen(req).read()
    except Exception:
        page = ''   # on any failure, write nothing for this page
    return page

temp = open(html_dump, 'a')  # Open Temp File

# Enter your website address in place of https://www.bitforestinfo.com
for i in link(main('https://www.bitforestinfo.com')):
    temp.write(main(i))  # Write Data On File

temp.close()  # Closing File
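The script above is written for Python 2 (urllib2 and print statements). If you are running Python 3, urllib2 lives in urllib.request instead; here is a minimal sketch of the same download-and-crawl idea under that assumption (this port is mine, not part of the original repository):

#!/usr/bin/python3
# A rough Python 3 port of crawler.py (a sketch, not the original script)
import re
import urllib.request

html_dump = "tempdump"  # Temp File Name

def link(html_data):
    # Yield the href value of every <a ...> tag found in the HTML
    for tag in re.findall(r'<a .*?>', html_data):
        match = re.search(r'href="(.*?)"', tag)
        if match:
            yield match.group(1)

def main(url):
    # Download a page; return an empty string on any failure
    try:
        header = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0'}
        req = urllib.request.Request(url, headers=header)
        return urllib.request.urlopen(req).read().decode('utf-8', 'ignore')
    except Exception:
        return ''

with open(html_dump, 'a') as temp:
    for href in link(main('https://www.bitforestinfo.com')):
        temp.write(main(href))  # relative links simply fail and write nothing

Keep in mind that parsing HTML with regular expressions only works on simple, well-formed pages; for anything serious, Python's built-in html.parser module is the safer choice.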
2. email_filter.py
#!/usr/bin/python
# ---------------- READ ME ---------------------------------------------
# This Script is Created Only For Practise And Educational Purpose Only
# This Script Is Created For https://www.bitforestinfo.com
# This Script is Written By
__author__ = '''
######################################################
                        By
######################################################

Suraj Singh
surajsinghbisht054@gmail.com
https://www.bitforestinfo.com/

######################################################
'''

# Import Module
import re

# Configuration
html_dump = "tempdump"            # Temp File Name
temp_file = open(html_dump, 'r')  # Open File
data = temp_file.read()           # Read Data
temp_file.close()                 # Close File

# Function For Extracting Email Addresses
def email(data):
    # Filter anything that looks like an email address out of the dump
    pattern = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}', re.MULTILINE)
    captured = pattern.findall(data)
    return captured

email_list = []  # List For Collecting Emails

# Collecting Emails (skipping duplicates)
for i in email(data):
    if i not in email_list:
        email_list.append(i)

# Print Collected Emails
print email_list
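If you want to check what the email pattern actually matches before running it on a real dump, you can test it on a small made-up string first (the addresses below are placeholders, not harvested data):

# Quick test of the email pattern used in email_filter.py
import re

pattern = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')
sample = 'Write to <a href="mailto:admin@example.com">admin@example.com</a> or sales@example.co.uk'
print(pattern.findall(sample))
# prints: ['admin@example.com', 'admin@example.com', 'sales@example.co.uk']

As you can see, the same address can appear more than once in a page, which is why the script filters out duplicates before printing.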
Note: Save both scripts in the same directory and run them from there; crawler.py writes the tempdump file to the current working directory, and email_filter.py reads it from the same place.
Usage:
First: python crawler.py
Second: python email_filter.py
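If you prefer running both steps with a single command, a tiny wrapper script works too (a hypothetical helper, not part of the original repository):

# run_all.py - hypothetical helper that runs both scripts in order
import subprocess

subprocess.call(['python', 'crawler.py'])        # step 1: build the tempdump file
subprocess.call(['python', 'email_filter.py'])   # step 2: extract emails from it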
Warning: I created this tutorial for practice and educational purposes only. I take no responsibility for any illegal activities.
To download the raw script, Click Here.
For more updates, visit us regularly.
Subscribe to our blog, follow us, and share it.
For any suggestions or help,
contact me:
Suraj
surajsinghbisht054@gmail.com