Hello readers,
Welcome to the second part of my web scraping tutorials. In this tutorial, I am going to show you how to create a simple HTML link extractor using the re module.
For the first part, click here.
Actually, guys, this script is really easy to understand, so I don't want to waste my time or yours on unnecessary words and paragraphs. To understand it, just read all the comments carefully. You will get it easily; nothing hard here.
In this script, I am using urllib2 (Python 2) to download the HTML data and then regular expressions to extract the links.
If you are a new visitor, don't forget to check our blog index.
So, let's start:
First, this code downloads the page's HTML:
    # Import modules
    import urllib2
    import sys
    import re

    # urllib2 needs the scheme (http://) in the URL
    if len(sys.argv)==1:
        print "[*] Please Provide URL:\n Usages: python link_re.py http://www.examplesite.com\n"
        sys.exit(0)

    # Retrieve HTML data from URL
    def get_html(url):
        try:
            page = urllib2.urlopen(url).read()
        except Exception as e:
            print "[Error Found] ",e
            page=None
        return page
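The download step above uses urllib2, which exists only in Python 2. If you are on Python 3, here is a minimal sketch of the same idea using urllib.request. The normalize_url helper, the 10 second timeout, and the UTF-8 decoding are my assumptions and are not part of the original script. The helper is there because urlopen needs a scheme such as http://, so a bare www.examplesite.com fails.

```python
import urllib.request


def normalize_url(url):
    """Prepend http:// when the argument has no scheme (assumed default)."""
    if "://" not in url:
        return "http://" + url
    return url


def get_html(url):
    """Return the page body as text, or None if the download fails."""
    try:
        # timeout=10 is an assumption; the original script sets no timeout
        with urllib.request.urlopen(normalize_url(url), timeout=10) as response:
            raw = response.read()
    except Exception as e:  # same catch-all as the original script
        print("[Error Found]", e)
        return None
    # Assumption: decode as UTF-8 and replace any undecodable bytes
    return raw.decode("utf-8", errors="replace")
```

In Python 3, read() returns bytes, so decoding here lets the later regular expression step work on an ordinary string.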
Next, this code extracts the links from the HTML:
    html_data=get_html(sys.argv[1])

    # Condition
    if html_data:
        pattern = re.compile('(<a .*?>)')    # First, find all <a > tags
        a_tag_captured = pattern.findall(html_data)
        for i in a_tag_captured:             # Second, find the href attribute in each tag
            href=re.search('href=.*', i[1:-1])
            if href:                         # If an href attribute is found
                print href.group().split(' ')[0]    # Print the href attribute
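The loop above prints the whole href=... attribute, quotes included. If you only want the link itself, a capture group can pull out the value. The following is a rough Python 3 sketch of that variation, not the original method: the extract_links name and both patterns are my own assumptions, and regular expressions will still miss unusual HTML.

```python
import re

# Match an opening <a ...> tag (case-insensitive), same two-step idea as above
A_TAG = re.compile(r"<a\s[^>]*>", re.IGNORECASE)

# Match href=... with a double-quoted, single-quoted, or unquoted value.
# The lookbehind skips attributes such as data-href.
HREF = re.compile(
    r"""(?<![\w-])href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""",
    re.IGNORECASE,
)


def extract_links(html):
    """Return the href values of all <a> tags found in the html string."""
    links = []
    for tag in A_TAG.findall(html):
        match = HREF.search(tag)
        if match:
            # Exactly one of the three groups holds the value
            links.append(next(g for g in match.groups() if g is not None))
    return links


sample = (
    '<a href="https://example.com/a">A</a> '
    "<A class=\"x\" HREF='/b'>B</a> "
    '<a name="top">no link</a> '
    "<a href=c.html>C</a>"
)
print(extract_links(sample))
```

The sample string is made up for illustration. The tag without an href is skipped, and the other three values come back without the href= prefix or the quotes.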
Finally, our script is ready to use:
    #!/usr/bin/python
    # ---------------- READ ME ---------------------------------------------
    # This Script is Created For Practice And Educational Purposes Only
    # This Script Is Created For https://www.bitforestinfo.com
    # This Script is Written By
    __author__='''

    ######################################################
                    By
    ######################################################

        Suraj Singh
        surajsinghbisht054@gmail.com
        https://www.bitforestinfo.com/

    ######################################################
    '''
    # Import modules
    import urllib2
    import sys
    import re

    # urllib2 needs the scheme (http://) in the URL
    if len(sys.argv)==1:
        print "[*] Please Provide URL:\n Usages: python link_re.py http://www.examplesite.com\n"
        sys.exit(0)

    # Retrieve HTML data from URL
    def get_html(url):
        try:
            page = urllib2.urlopen(url).read()
        except Exception as e:
            print "[Error Found] ",e
            page=None
        return page

    html_data=get_html(sys.argv[1])

    # Condition
    if html_data:
        pattern = re.compile('(<a .*?>)')    # First, find all <a > tags
        a_tag_captured = pattern.findall(html_data)
        for i in a_tag_captured:             # Second, find the href attribute in each tag
            href=re.search('href=.*', i[1:-1])
            if href:                         # If an href attribute is found
                print href.group().split(' ')[0]    # Print the href attribute
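The finished script prints links exactly as they appear in the page, so relative links such as /about stay relative and repeated links are printed more than once. If that bothers you, here is a small Python 3 sketch of one way to clean up the list with urljoin from the standard library. The absolutize function and the example URLs are hypothetical and not part of the original script.

```python
from urllib.parse import urljoin


def absolutize(base_url, links):
    """Resolve links against base_url and drop duplicates, keeping order."""
    seen = set()
    result = []
    for link in links:
        full = urljoin(base_url, link)  # absolute links pass through unchanged
        if full not in seen:
            seen.add(full)
            result.append(full)
    return result


# Hypothetical input: hrefs as they might be extracted from a page
found = ["/about", "post.html", "https://other.example/x", "/about"]
for url in absolutize("http://www.examplesite.com/blog/", found):
    print(url)
```

The base URL should be the address of the page you downloaded, because relative links are resolved against it.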