Extract Links from a Webpage Using the BeautifulSoup Module

Namaste Friends,




Today, in this tutorial, I am going to show you how to create a simple HTML link extractor using the Beautiful Soup Python module.





Introduction



Python's Beautiful Soup module is a very powerful module that gives us various facilities for handling the raw HTML of web pages, and link extraction is one of them. During web scraping, we often have to write programs that extract the links from a particular webpage for further processing. So, in this post, I am going to share with you the simplest way to extract HTML links from webpages. If you are a new visitor, don't forget to check our blog index.

Requirements


  • Python 2.x (the script uses urllib2 and the print statement, which Python 3.x does not have)
  • BeautifulSoup 4 or higher
  • Basic knowledge of Python and the bs4 module


How it's going to work

                   
Basically, friends, after getting the URL of a webpage from the user, we download the complete source code of that page with the help of the urllib2 module. Then we feed all of that HTML to a Beautiful Soup object so that it can parse it. After that, we use Beautiful Soup's findAll function (also spelled find_all) to pull the link tags directly out of the source. The findAll function gives us rich filtering options for picking exact results out of the loaded HTML. In simple words, you don't need to worry about the syntax, tags, and other problems of HTML; you just use Beautiful Soup, and it makes your programming a piece of cake. So, take a chill pill and try to understand the very easy code provided below.
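
To see the kind of filtering that findAll offers without downloading anything, here is a small sketch. The sample HTML string is made up for illustration, and passing href=True is just one possible filter: it keeps only the 'a' tags that actually carry an href attribute. The "html.parser" argument selects the parser that ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, used here instead of a real download.
SAMPLE_HTML = """
<html><body>
  <a href="https://example.com/">Home</a>
  <a name="top">Anchor without href</a>
  <a href="/about.html">About</a>
  <p>No link here</p>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# href=True keeps only <a> tags that have an href attribute,
# so the anchor that only has a name attribute is skipped.
links = [tag["href"] for tag in soup.find_all("a", href=True)]

for link in links:
    print(link)
```

Note that the second link comes back exactly as written in the page (/about.html), so relative links stay relative.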

hmm, let's start:

1. Website Link Extractor Written In Python 


#!/usr/bin/python
# ---------------- READ ME ---------------------------------------------
# This Script is Created Only For Practise And Educational Purpose Only
# This Script Is Created For https://bitforestinfo.blogspot.in
# This Script is Written By
__author__='''

######################################################
                By S.S.B Group                          
######################################################

    Suraj Singh
    Admin
    S.S.B Group
    surajsinghbisht054@gmail.com
    https://bitforestinfo.blogspot.in/

    Note: We Feel Proud To Be Indian
######################################################
'''
# Import Modules
import bs4
import urllib2, sys

if len(sys.argv)==1:
    print "[*] Please Provide A URL:\n Usage: python link_bs4.py http://www.examplesite.com\n"  # urlopen needs the http:// part
    sys.exit(1)

def parse_url(url):
    try:
        html=urllib2.urlopen(url).read() # Download the raw HTML of the page
    except Exception as e:
        print "[Error] ",e
        sys.exit(1)                  # Exit with a non-zero status on failure
    parse=bs4.BeautifulSoup(html, "html.parser")    # Feed the HTML to bs4
    for i in parse.find_all('a'):  # Find every 'a' tag (findAll is the older alias)
        if 'href' in i.attrs:      # Skip 'a' tags that have no href attribute
            link=i.attrs['href']
            print link
    return

parse_url(sys.argv[1])

Explanation


  • Lines 22 and 23 import the modules (bs4, urllib2, and sys).
  • Line 31 downloads the raw HTML of the webpage.
  • Line 35 feeds the raw HTML to a Beautiful Soup object.
  • Line 36 uses the findAll function (also spelled find_all) to search for every 'a' tag.
  • Lines 37 and 38 read the link itself from the href attribute of each 'a' tag.
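
The script above is written for Python 2, since it relies on urllib2 and the print statement. As a rough sketch, and not the original author's code, a Python 3 version might look like the following. It assumes urllib.request from the standard library as the replacement for urllib2, and the function names extract_links, fetch_html, and main are made up for this illustration.

```python
import sys
import urllib.request

from bs4 import BeautifulSoup


def extract_links(html):
    """Return the href value of every <a> tag found in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag["href"] for tag in soup.find_all("a", href=True)]


def fetch_html(url):
    """Download the raw HTML of a page; the URL must start with http:// or https://."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def main(argv):
    """Print every link on the page named in argv[1]; return an exit status."""
    if len(argv) != 2:
        print("Usage: python link_bs4.py http://www.examplesite.com")
        return 1
    try:
        html = fetch_html(argv[1])
    except Exception as error:  # bad URLs, network failures, and so on
        print("[Error]", error)
        return 1
    for link in extract_links(html):
        print(link)
    return 0
```

To use it as a command-line script, call sys.exit(main(sys.argv)) at the bottom of the file under an if __name__ == "__main__": guard. Keeping the parsing in extract_links, separate from the download, also makes it easy to try the parser on a plain HTML string.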



To download the raw script, click here.

Check the next tutorial to build the above program without the BeautifulSoup module.
