Hi readers,
In this tutorial, I am going to show you how to create a simple HTML link extractor using the Beautiful Soup Python module.
Introduction
Python's Beautiful Soup module is a very powerful module that gives us various facilities for handling the raw HTML of web pages, and link extraction is one of them. During web scraping, we often have to write programs that extract the links from a particular web page for further processing. So, in this post, I am going to share with you the simplest way to extract HTML links from web pages. If you are a new visitor, don't forget to check our blog index.
Requirements
- Python 2.x or 3.x
- BeautifulSoup 4 or higher
- Basic knowledge of Python and the bs4 module
How it's going to work
Basically, readers, after getting the URL of a web page from the user, we will download the complete source of that page with the help of the urllib module (urllib2 on Python 2). Then we feed all of that HTML to a Beautiful Soup object so that it can parse it. After that, we use Beautiful Soup's find_all function (spelled findAll in older code) to pull the link tags directly out of the source. The find_all function offers rich filtering options for picking exactly the results you want from the loaded HTML. In simple words, you don't need to worry about the syntax, tags, and other problems of HTML; you just need to use Beautiful Soup, and it makes the programming a piece of cake. So, take a chill pill and try to understand the very easy code provided below.
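The steps above can be tried without any network access by feeding a small, made-up HTML string to Beautiful Soup. This is only a minimal sketch: the sample markup and the `extract_links` helper name are assumptions for illustration and are not part of the script in this post. Passing `href=True` is one of the filters `find_all` accepts; it keeps only the `a` tags that actually carry an `href` attribute.

```python
import bs4

# Hypothetical sample page, used here instead of a real download
SAMPLE_HTML = """
<html><body>
  <a href="https://www.example.com/">Home</a>
  <a name="top">Anchor without href</a>
  <a href="/about.html">About</a>
</body></html>
"""


def extract_links(html):
    """Return the href value of every <a> tag that has one."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    # href=True filters out <a> tags that have no href attribute
    return [tag["href"] for tag in soup.find_all("a", href=True)]


if __name__ == "__main__":
    for link in extract_links(SAMPLE_HTML):
        print(link)
```

The second tag in the sample has no `href`, so it is skipped, which is the same job the `'href' in i.attrs` check does in the full script.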
Let's start:
1. Website Link Extractor Written In Python
#!/usr/bin/python
# ---------------- READ ME ---------------------------------------------
# This script is created for practice and educational purposes only
# This script is created for https://www.bitforestinfo.com
# This script is written by
__author__ = '''
######################################################
                By
######################################################

Suraj Singh
surajsinghbisht054@gmail.com
https://www.bitforestinfo.com/

######################################################
'''
# Import modules (urlopen lives in urllib.request on Python 3, urllib2 on Python 2)
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2
import bs4
import sys

if len(sys.argv) == 1:
    print("[*] Please provide a URL:\n Usage: python link_bs4.py http://www.examplesite.com\n")
    sys.exit(1)

def parse_url(url):
    try:
        html = urlopen(url).read()  # Read the raw HTML (the URL needs its http:// or https:// prefix)
    except Exception as e:
        print("[Error] %s" % e)
        sys.exit(1)
    parse = bs4.BeautifulSoup(html, "html.parser")  # Feed the data to bs4
    for i in parse.find_all('a'):  # Search for link tags
        if 'href' in i.attrs:  # Keep only the tags that have an href key
            link = i.attrs['href']
            print(link)


parse_url(sys.argv[1])
Explanation
- Lines 22 and 23 import the required modules.
- Line 31 downloads the raw HTML of the web page.
- Line 35 feeds the raw HTML to a Beautiful Soup object.
- Line 36 uses the find_all function (older code spells it findAll) to search for 'a' tags.
- Lines 37 and 38 read the actual link from the 'href' attribute of each 'a' tag.
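One thing to keep in mind about the last step: an `href` value is returned exactly as it is written in the page, so it may be a relative path such as `/about.html`. If the links are going to be used for further processing, they can be resolved against the page URL with the standard library's `urljoin`. The sketch below is an optional extra, not part of the original script; the `absolutize` helper name, the page URL, and the sample links are all assumptions for illustration.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2


def absolutize(base_url, hrefs):
    """Resolve each href against the URL of the page it was found on."""
    return [urljoin(base_url, href) for href in hrefs]


if __name__ == "__main__":
    page = "https://www.example.com/blog/index.html"  # hypothetical page URL
    found = ["/about.html", "post1.html", "https://other.example.org/"]
    for url in absolutize(page, found):
        print(url)
```

Links that are already absolute pass through unchanged, so the helper can be applied to every extracted link without checking it first.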
Download the raw script
Click Here
Check the next tutorial to make the above program without the BeautifulSoup module.