
Write Webpage Link Extractor - Python

Hello Friends,


Welcome back! Today's post is the second part of my web scraping tutorial series, and in this tutorial I am going to show you how to create a simple HTML link extractor using the re module.

For the first part, click here.


Actually, guys, this script is really easy to understand, so I don't want to waste your time with unnecessary paragraphs. Just read all the comments carefully and you will get it easily; there is nothing hard here.

In this script, I am using urllib2 to download the HTML data and then a regular expression to extract the links.
If you are a new visitor, don't forget to check our blog index.

So, let's start.
First, this code downloads the page's HTML:

# Import Modules
import urllib2
import sys
import re

# Require a URL argument
if len(sys.argv) == 1:
    print "[*] Please Provide A URL:\n Usage: python link_re.py http://www.examplesite.com\n"
    sys.exit(0)

# Retrieve HTML Data From URL
def get_html(url):
    try:
        page = urllib2.urlopen(url).read()
    except Exception as e:
        print "[Error Found] ", e
        page = None
    return page
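If you want to check that get_html works on its own, you can call it directly; the URL below is just a placeholder:

# Quick sanity check for get_html (placeholder URL)
html = get_html('http://www.example.com')
if html:
    print html[:200]    # show the first 200 characters of the downloaded page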

Next, this code extracts the links from the downloaded HTML:


html_data = get_html(sys.argv[1])

# Only parse if the download succeeded
if html_data:
    pattern = re.compile('(<a .*?>)')             # First, find all <a ...> tags
    a_tag_captured = pattern.findall(html_data)
    for i in a_tag_captured:                      # Second, find the href attribute in each tag
        href = re.search('href=.*', i[1:-1])      # i[1:-1] strips the surrounding < and >
        if href:                                  # If an href attribute was found
            print href.group().split(' ')[0]      # Print only the href="..." portion
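To see what these two regular expressions actually capture, here is a tiny standalone sketch; the HTML snippet is made up purely for illustration:

import re

# Made-up sample HTML, just to show what the patterns match
sample = '<p><a href="http://www.example.com/page1" class="nav">One</a> <a href="/page2">Two</a></p>'

for tag in re.findall('(<a .*?>)', sample):     # matches '<a href="..." class="nav">' and '<a href="/page2">'
    href = re.search('href=.*', tag[1:-1])      # search inside the tag text, without < and >
    if href:
        print href.group().split(' ')[0]        # prints href="http://www.example.com/page1" and href="/page2"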


Finally, here is the complete script, ready to use:

#!/usr/bin/python
# ---------------- READ ME ---------------------------------------------
# This Script Is Created For Practice And Educational Purposes Only
# This Script Is Created For https://bitforestinfo.blogspot.in
# This Script Is Written By
__author__='''

######################################################
                By S.S.B Group                          
######################################################

    Suraj Singh
    Admin
    S.S.B Group
    surajsinghbisht054@gmail.com
    https://bitforestinfo.blogspot.in/

    Note: We Feel Proud To Be Indian
######################################################
'''
# Import Modules
import urllib2
import sys
import re

# Require a URL argument
if len(sys.argv) == 1:
    print "[*] Please Provide A URL:\n Usage: python link_re.py http://www.examplesite.com\n"
    sys.exit(0)

# Retrieve HTML Data From URL
def get_html(url):
    try:
        page = urllib2.urlopen(url).read()
    except Exception as e:
        print "[Error Found] ", e
        page = None
    return page

html_data = get_html(sys.argv[1])

# Only parse if the download succeeded
if html_data:
    pattern = re.compile('(<a .*?>)')             # First, find all <a ...> tags
    a_tag_captured = pattern.findall(html_data)
    for i in a_tag_captured:                      # Second, find the href attribute in each tag
        href = re.search('href=.*', i[1:-1])      # i[1:-1] strips the surrounding < and >
        if href:                                  # If an href attribute was found
            print href.group().split(' ')[0]      # Print only the href="..." portion
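To run the script, pass a full URL (including the http:// scheme) on the command line; it will print one href=... entry for every link it finds. The site below is just a placeholder:

python link_re.py http://www.example.com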



To download the raw script, click here.
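One last note: urllib2 only exists on Python 2. If you want to try the same idea on Python 3, the download helper would look roughly like this (a minimal sketch, assuming urllib.request; the rest of the script only needs its print statements converted to print() calls):

# Python 3 rough equivalent of get_html (urllib2 was folded into urllib.request)
import urllib.request

def get_html(url):
    try:
        # read() returns bytes in Python 3, so decode before running regexes on it
        page = urllib.request.urlopen(url).read().decode('utf-8', errors='replace')
    except Exception as e:
        print("[Error Found]", e)
        page = None
    return page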
