How to Extract HTML Links Using Python and the re Module (Python Regex Usage) | Python Web Scraping | Python Example - Part 2

Hello Friends,


Today, this is the second part of my web scraping tutorials. In this tutorial, I am going to show you how to create a simple HTML link extractor using the re module.

For the first part, click here.

In this script, I am using urllib2 (Python 2) to download the HTML data and then a regular expression to extract the links.

Hmm, I think some of you have noticed that my English is not so good. But friends, I am trying my best. So if you find anywhere that I have written something incorrectly, please feel free to correct me. I believe that talent is more important than language.

And if you are a new visitor, don't forget to check our blog index.

So, let's start:

In this code, I am downloading the HTML code of the page:


# Import modules
import urllib2
import sys
import re

# Require a URL on the command line (urllib2 needs the scheme, e.g. http://)
if len(sys.argv) == 1:
    print "[*] Please provide a URL:\n Usage: python link_re.py http://www.examplesite.com\n"
    sys.exit(0)

# Retrieve HTML data from a URL; return None if the download fails
def get_html(url):
    try:
        page = urllib2.urlopen(url).read()
    except Exception as e:
        print "[Error Found] ", e
        page = None
    return page
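
The download step above is written for Python 2. If you are on Python 3, urllib2 is not available, so here is a minimal sketch of the same idea using urllib.request. The timeout value and the UTF-8 decoding are my assumptions and are not part of the original script.

```python
import urllib.request
import urllib.error


def get_html(url):
    """Return the page at `url` as text, or None if it cannot be downloaded."""
    try:
        # Assumption: a 10 second timeout is reasonable for a small script.
        with urllib.request.urlopen(url, timeout=10) as response:
            raw = response.read()
    except (urllib.error.URLError, ValueError) as e:
        # ValueError is raised when the URL has no scheme (e.g. "www.examplesite.com").
        print("[Error Found]", e)
        return None
    # Assumption: the page is UTF-8 encoded; undecodable bytes are replaced.
    return raw.decode("utf-8", errors="replace")
```

As in the original function, a failed download gives None, so the caller can check the result before searching it for links.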

Next, in this code, I am extracting the links from the HTML code:


html_data = get_html(sys.argv[1])

# Continue only if the page was downloaded
if html_data:
    pattern = re.compile('(<a .*?>)')   # First, find every opening <a ...> tag
    a_tag_captured = pattern.findall(html_data)
    for i in a_tag_captured:            # Second, look for the href attribute in each tag
        href = re.search('href=.*', i[1:-1])
        if href:                        # If the tag has an href attribute
            print href.group().split(' ')[0]  # Print href=... up to the first space
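
The code above prints each link together with the attribute name and quotes, for example href="https://example.com/one". If you want only the URL itself, the same two-step idea can be sketched with a capture group, as below. This is only a sketch in Python 3 syntax: SAMPLE_HTML, A_TAG, HREF_VALUE and extract_links are hypothetical names of mine, and the pattern assumes the href value is wrapped in quotes.

```python
import re

# Hypothetical sample markup, used here instead of a downloaded page.
SAMPLE_HTML = '''
<p>Intro</p>
<a href="https://example.com/one">One</a>
<a class="nav" href='/two'>Two</a>
<a name="anchor-only">No link</a>
'''

# Step 1: match every opening <a ...> tag.
A_TAG = re.compile(r'<a\s[^>]*>', re.IGNORECASE)
# Step 2: capture the quoted value of the href attribute (assumes quotes are present).
HREF_VALUE = re.compile(r'href\s*=\s*["\']([^"\']*)["\']', re.IGNORECASE)


def extract_links(html):
    """Return the href values of all <a> tags found in `html`."""
    links = []
    for tag in A_TAG.findall(html):
        match = HREF_VALUE.search(tag)
        if match:  # Skip <a> tags that have no href attribute
            links.append(match.group(1))
    return links


print(extract_links(SAMPLE_HTML))
```

Regular expressions are fine for simple pages like this sample, but they can miss links in unusual markup, such as unquoted attribute values.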


Finally, our script is ready to use. Here is the complete script:



#!/usr/bin/python
# ---------------- READ ME ---------------------------------------------
# This Script is Created Only For Practise And Educational Purpose Only
# This Script Is Created For https://bitforestinfo.blogspot.in
# This Script is Written By
__author__='''

######################################################
                By S.S.B Group                          
######################################################

    Suraj Singh
    Admin
    S.S.B Group
    surajsinghbisht054@gmail.com
    https://bitforestinfo.blogspot.in/

    Note: We Feel Proud To Be Indian
######################################################
'''
# Import modules
import urllib2
import sys
import re

# Require a URL on the command line (urllib2 needs the scheme, e.g. http://)
if len(sys.argv) == 1:
    print "[*] Please provide a URL:\n Usage: python link_re.py http://www.examplesite.com\n"
    sys.exit(0)

# Retrieve HTML data from a URL; return None if the download fails
def get_html(url):
    try:
        page = urllib2.urlopen(url).read()
    except Exception as e:
        print "[Error Found] ", e
        page = None
    return page

html_data = get_html(sys.argv[1])

# Continue only if the page was downloaded
if html_data:
    pattern = re.compile('(<a .*?>)')   # First, find every opening <a ...> tag
    a_tag_captured = pattern.findall(html_data)
    for i in a_tag_captured:            # Second, look for the href attribute in each tag
        href = re.search('href=.*', i[1:-1])
        if href:                        # If the tag has an href attribute
            print href.group().split(' ')[0]  # Print href=... up to the first space




To download the raw script, click here.

In our next tutorial, we will learn how to write the script above using Python's HTMLParser module. And believe me, this journey is going to be very interesting, because in future tutorials you will see even more interesting scripts and solutions.



For more updates, visit our blog regularly. Subscribe to our blog, follow us, and share it.
For any suggestions or help, contact me:
S.S.B
surajsinghbisht054@gmail.com
