Fetch dynamic web pages with Selenium
Scraping is fun, but when a page loads its content via AJAX, reverse engineering all that JavaScript quickly gets tedious.
Selenium is a cool toolkit to drive the browser from your favorite programming language. Born for testing, it's perfect for scraping.
So, when I hit a dynamic page, this is what I do:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium import webdriver
# Start the WebDriver and load the page
wd = webdriver.Firefox()
wd.get(URL)  # URL is the address of the page you want to scrape
# Wait for the dynamically loaded elements to show up
WebDriverWait(wd, 10).until(
    EC.visibility_of_element_located((By.CLASS_NAME, "pricerow")))
# And grab the page HTML source
html_page = wd.page_source
wd.quit()
# Now you can use html_page as you like
from bs4 import BeautifulSoup
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html_page, "html.parser")
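If the wait step above looks like magic, it is essentially a polling loop: call a condition over and over until it returns something truthy or the timeout runs out. Here is a minimal sketch of that idea in plain Python. It is not Selenium's actual code; `wait_until`, `WaitTimeout` and `price_rows_loaded` are hypothetical names made up for illustration, and the half-second default interval is only meant to resemble Selenium's default polling frequency.

```python
import time


class WaitTimeout(Exception):
    """Hypothetical stand-in for Selenium's TimeoutException."""


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            # The condition's own return value is handed back to the caller
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Simulate an element that only "shows up" on the third poll
attempts = []


def price_rows_loaded():
    attempts.append(1)
    return "pricerow" if len(attempts) >= 3 else None


found = wait_until(price_rows_loaded, timeout=2.0, poll_interval=0.01)
```

With a real driver, the condition would be something like an expected condition applied to `wd`, which is roughly the job `WebDriverWait(wd, 10).until(...)` does for you.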
You could even do the scraping with Selenium, but I load the HTML into BeautifulSoup because:
- I'm a BeautifulSoup junkie
- Selenium API is pragmatic, a bit too much, and not Pythonic at all. Yeah, you pass a tuple to visibility_of_element_located...
- Selenium docs are... umh, enterprisey
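To give an idea of what the BeautifulSoup side can look like once you have the page source, here is a small sketch. The markup is invented for the example (the real page's structure is not shown here), and the assumption that each "pricerow" element is a table row with `td` cells is mine.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for wd.page_source
html_page = """
<table>
  <tr class="pricerow"><td>Widget</td><td>9.99</td></tr>
  <tr class="pricerow"><td>Gadget</td><td>24.50</td></tr>
</table>
"""

soup = BeautifulSoup(html_page, "html.parser")

# One list of cell texts per element carrying the "pricerow" class
rows = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in soup.find_all(class_="pricerow")
]
```

Here `rows` ends up as a list of lists of strings, one inner list per matching row, ready to be cleaned up or written to a file.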
Written by Filippo Valsorda
2 Responses
:) what do you mean "enterprisey"
over 1 year ago
If it is a dynamic web crawl, then how can you pass the class name as "pricerow"? -
EC.visibility_of_element_located((By.CLASS_NAME, "pricerow")))
over 1 year ago