
Using Selenium To Interact With And Beautiful Soup To Parse Data From A Dynamic Web Page

Combining Selenium And Beautiful Soup To Scrape A Dynamic Web Page:

– Beautiful Soup is a Python library for pulling data out of HTML and XML sources.

– It supports multiple parsers, including Python's built-in HTML parser, the lxml parser, and the html5lib parser. The lxml parser is commonly used to parse the HTML page source obtained from Selenium.

– Selenium alone can extract everything from a dynamic web page, but it is slow. In practice, a faster approach is to use Selenium only to interact with the dynamic page (navigate to a specific page and perform actions such as searching, scrolling, or running JS scripts), then take the page source and hand it to Beautiful Soup for parsing.
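The handoff pattern above can be sketched without a browser by parsing a small static HTML snippet; in the real flow, the `html` string below would come from `driver.page_source` after Selenium has rendered the page (the class names here are made up for illustration):

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source; the tags and class names
# are illustrative, not taken from any real site.
html = """
<div class="product"><span class="title">SSD 970 EVO</span>
<a href="/dp/B1">detail</a></div>
<div class="product"><span class="title">SSD 860 QVO</span>
<a href="/dp/B2">detail</a></div>
"""

# Parse with the lxml parser, as the article recommends
soup = BeautifulSoup(html, "lxml")
for div in soup.find_all("div", class_="product"):
    title = div.find("span", class_="title").string
    link = div.find("a")["href"]
    print(title, link)
```

The same `find_all`/`find` calls work unchanged whether the HTML comes from a string, a file, or Selenium's `page_source`.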

– See the guide on installing and using Selenium to interact with a dynamic website at:

https://tech2fun.net/automate-navigating-interacting-web-page-using-selenium-python-example/

– Install Beautiful Soup and the lxml module using pip:

# pip install beautifulsoup4

# pip install lxml

*Simple example combining Selenium and Beautiful Soup to parse data:

  Browse to https://www.amazon.com/, go to the Computers & Accessories page, and search the keyword "Samsung SSD" to list Samsung SSD products. Use Beautiful Soup to parse the product title and detail link of each product and print them on screen (they can also be stored in a database for later analysis if needed).

GitHub link to download the source of this example:

https://github.com/vominhtri1991/Selenium_Example02.git

– Initialize the Selenium Chrome web driver and go to the Amazon main page

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--ignore-certificate-errors")
options.add_argument("--incognito")
driver = webdriver.Chrome(options=options)
driver.get("https://www.amazon.com/")

– Locate the menu element and click it to go to a specific page (Computers & Accessories), then search "Samsung SSD" to list products

from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

ele = driver.find_element(By.XPATH, "//a[@aria-label='Computers & Accessories']")
ele.click()
ele = driver.find_element(By.ID, "twotabsearchtextbox")
ele.send_keys("Samsung SSD")
ele.send_keys(Keys.RETURN)

– Load the page source from the Selenium driver into a Beautiful Soup object and use the find_all method to locate tags and parse the useful data

from bs4 import BeautifulSoup

source = driver.page_source
soup = BeautifulSoup(source, "lxml")
product_list_div = soup.find_all("div", class_="a-section a-spacing-none")
for a_product_div in product_list_div:
    product_title_tag = a_product_div.find_all("span", class_="a-size-medium a-color-base a-text-normal")
    product_link = a_product_div.find_all("a", class_="a-link-normal a-text-normal")
    product_title = ""
    for i in product_title_tag:
        product_title = i.string

– Print the product titles and detail links to the screen

    if product_title != "":
        print(product_title)
        for i in product_link:
            print("Link Product: https://www.amazon.com" + i["href"])
        print("-" * 60)
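The example above only prints the results, but as noted earlier they can be stored for later analysis. A minimal option, sketched here with the standard library (the sample records and the `products.csv` filename are illustrative), is writing each title/link pair to a CSV file:

```python
import csv

# Illustrative records in the same shape the parsing loop produces.
products = [
    {"title": "Samsung 970 EVO 1TB", "link": "https://www.amazon.com/dp/B1"},
    {"title": "Samsung 860 QVO 2TB", "link": "https://www.amazon.com/dp/B2"},
]

# Write one row per product with a header line.
with open("products.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "link"])
    writer.writeheader()
    writer.writerows(products)
```

To collect rows during scraping, append a `{"title": ..., "link": ...}` dict inside the parsing loop instead of (or in addition to) printing, then write them all at the end.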
