Scraping AJAX web pages (Part 5.5)
This post is very similar to the previous one (Part 5), where we scraped a web page with PhantomJS from the command line and sent the output to stdout.
This time we use PhantomJS again, but we drive it from a Python script through Selenium. The generated HTML source ends up in a Python variable. Here is the source:
#!/usr/bin/env python3
# encoding: utf-8

"""
required packages:
* selenium

optional packages:
* bs4
* lxml
"""

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
# from bs4 import BeautifulSoup

url = "http://simile.mit.edu/crowbar/test.html"

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
    "(KHTML, like Gecko) Chrome/15.0.87"
)
driver = webdriver.PhantomJS(desired_capabilities=dcap)
driver.get(url)

html = driver.page_source
print(html)

# soup = BeautifulSoup(driver.page_source, "lxml")    # page_source fetches the page after rendering is complete
# driver.save_screenshot('screen.png')    # save a screenshot to disk

driver.quit()
The script sets the user agent, which is optional but recommended, since some sites serve different content to unrecognized browsers. The rendered source is captured in the html variable. The last two commented-out lines would also work: you could feed the source to BeautifulSoup and extract just the parts of the HTML you need, and if you uncomment the save_screenshot line, the script saves a screenshot of the rendered page to disk.
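For instance, a minimal sketch of the BeautifulSoup step might look like this. The html variable comes from the script above; the tags it pulls out (title, div) are only illustrative assumptions, since what you extract depends on the page you scrape:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")

# hypothetical extractions -- adjust the tags to the actual page structure
print(soup.title.get_text())              # text of the <title> tag
for div in soup.find_all("div"):          # every <div> in the rendered page
    print(div.get_text(strip=True))

This works because page_source returns the DOM after PhantomJS has executed the page's JavaScript, so BeautifulSoup sees the AJAX-generated content too.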