Last Updated: April 16, 2017
· filosottile

Fetch dynamic web pages with Selenium

Scraping is fun, but when a page loads its content via AJAX it gets tedious fast, with all the JavaScript reverse engineering that involves.

Selenium is a cool toolkit to drive the browser from your favorite programming language. Born for testing, it's perfect for scraping.

So, when I hit a dynamic page, this is what I do:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium import webdriver

# Start the WebDriver and load the page
# (URL is a placeholder: set it to the address of the page you want to fetch)
wd = webdriver.Firefox()
wd.get(URL)

# Wait up to 10 seconds for the dynamically loaded elements to show up
# ("pricerow" is a class name from the page being scraped; use one from yours)
WebDriverWait(wd, 10).until(
    EC.visibility_of_element_located((By.CLASS_NAME, "pricerow")))

# And grab the page HTML source
html_page = wd.page_source
wd.quit()

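# Now you can use html_page as you like
# (naming the parser explicitly avoids BeautifulSoup's "no parser specified" warning)
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_page, "html.parser")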
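One caveat with the snippet above: if the wait times out, the exception skips wd.quit() and the browser stays open. Here is a minimal sketch of one way to guard against that. The names quitting, fetch_page_source and wait_for_prices are hypothetical helpers of mine, not part of Selenium, and the sketch assumes only that the driver has get(), quit() and page_source, as in the code above.

```python
from contextlib import contextmanager


@contextmanager
def quitting(driver):
    """Yield the driver and always call driver.quit() on the way out."""
    try:
        yield driver
    finally:
        driver.quit()


def fetch_page_source(driver, url, wait_for_content):
    """Load url, run wait_for_content(driver), then return the rendered HTML.

    The browser is closed even if loading or waiting raises.
    """
    with quitting(driver):
        driver.get(url)
        wait_for_content(driver)
        return driver.page_source


# Hypothetical usage with the names from the snippet above (needs Selenium
# and Firefox, so it is left commented out):
#
# def wait_for_prices(wd):
#     WebDriverWait(wd, 10).until(
#         EC.visibility_of_element_located((By.CLASS_NAME, "pricerow")))
#
# html_page = fetch_page_source(webdriver.Firefox(), URL, wait_for_prices)
```

Passing the wait in as a function keeps the helper independent of which element you are waiting for, so the same few lines work for any page.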

You could even do the scraping with Selenium, but I load the HTML into BeautifulSoup because:

  • I'm a BeautifulSoup junkie
  • The Selenium API is pragmatic, a bit too much so, and not Pythonic at all. Yeah, you pass a tuple to visibility_of_element_located...
  • The Selenium docs are... um, enterprisey
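About that tuple: it is a locator, a (strategy, value) pair, and By.CLASS_NAME is just the string "class name". The sketch below is my simplified illustration of how a wait and a condition fit together. It is not Selenium's actual source: wait_until, element_located, NotFound and FakeDriver are hypothetical stand-ins, and the condition only checks that the element exists, whereas the real visibility_of_element_located also checks that it is displayed.

```python
import time


class NotFound(Exception):
    """Stand-in for Selenium's NoSuchElementException."""


def element_located(locator):
    """Build a condition from a (strategy, value) locator tuple."""
    def condition(driver):
        # The tuple is unpacked into the two arguments find_element expects.
        return driver.find_element(*locator)
    return condition


def wait_until(driver, condition, timeout=10.0, poll=0.5):
    """Call condition(driver) until it returns something truthy or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition(driver)
            if result:
                return result
        except NotFound:
            pass  # element not there yet, try again
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        time.sleep(poll)


class FakeDriver:
    """Hypothetical driver whose element appears on the third lookup."""

    def __init__(self):
        self.lookups = 0

    def find_element(self, by, value):
        self.lookups += 1
        if self.lookups < 3:
            raise NotFound(value)
        return f"<{by}={value}>"


driver = FakeDriver()
found = wait_until(driver, element_located(("class name", "pricerow")),
                   timeout=2.0, poll=0.01)
print(found)  # <class name=pricerow>
```

So the class name does not have to exist when the page first loads: the condition is simply retried until the script-generated element appears or the timeout expires.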

2 Responses


:) What do you mean by "enterprisey"?

over 1 year ago ·

If it is a dynamic web crawl, then how can you pass the class name as "pricerow"?

EC.visibility_of_element_located((By.CLASS_NAME, "pricerow")))

over 1 year ago ·