Serverless Chrome Puppeteer
Say you want to build a scraper, automate manual testing, or generate custom social cards for your website. What do you do?
You could spin up a Docker container, set up headless Chrome, add Puppeteer, write a script to run it all, add a server to create an API, and ...
Or you can set up Serverless Chrome with AWS Lambda. Write a bit of code, hit deploy, and get a Chrome browser running on demand.
That's what this chapter is about 🤘
You'll learn how to:
- configure Chrome Puppeteer on AWS
- build a basic scraper
- take website screenshots
- run it on-demand
We build a scraper that goes to google.com, types in a phrase, and returns the first page of results. Then we reuse the same code to return a screenshot.
You can see the full code on GitHub.
Serverless Chrome
Chrome's engine ships as the open source Chromium browser. Other browsers use it and add their own UI and custom features.
You can use the engine for browser automation – scraping, testing, screenshots, etc. When you need to render a website, Chromium is your friend.
Using Chromium yourself means you need to:
- download a Chrome binary
- set up an environment that makes it happy
- run in headless mode
- configure processes that talk to each other via complex sockets
Others have solved this problem for you.
Rather than figure it out yourself, I recommend using chrome-aws-lambda. It's the most up-to-date package for running Serverless Chrome.
Here's what you need for a Serverless Chrome setup:
- install dependencies
```sh
$ yarn add chrome-aws-lambda@3.1.1 puppeteer@3.1.0 @types/puppeteer puppeteer-core@3.1.0
```
This installs everything you need to both run and interact with Chrome. ✌️
Check the chrome-aws-lambda README for the latest version of Chrome Puppeteer you can use. Make sure the two versions match.
- configure serverless.yml
```yaml
# serverless.yml
service: serverless-chrome-example

provider:
  name: aws
  runtime: nodejs12.x
  stage: dev

package:
  exclude:
    - node_modules/puppeteer/.local-chromium/**
```
Configure a new service, make it run on AWS, use the latest Node.
The `package` part is important. It tells Serverless not to package the Chromium binary with your code. AWS rejects builds of that size.
You are now ready to start running Chrome ✌️
Chrome Puppeteer 101
Chrome Puppeteer is a set of tools to interact with Chrome programmatically.
> Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium.
Write code that interacts with a website like a person would. Anything a person can do on the web, you can do with Puppeteer.
Core syntax feels like jQuery, but the objects are different than what you're used to. I've found it's best not to worry about the details.
Here's how you click on a link:
```js
const page = await browser.newPage() // open a "tab"
await page.goto("https://example.com") // navigate to a URL
const div = await page.$("div#some_content") // use page.$ to grab a div
const link = await div.$("a.target_link") // search inside it for a link
await link.click() // click the link
```
Always open a new page for every new browser context.
Navigate to your URL, then use jQuery-like selectors to interact with the page. You can feed selectors into `click()` and other methods, or use the `page.$` syntax to search around.
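Here's a quick sketch of a few more interactions you might reach for. The page and selectors are made up, but every method is standard Puppeteer:

```typescript
// pretend to be a person filling out a search form on a made-up page
await page.type("input[name=q]", "serverless chrome") // type into a field
await page.keyboard.press("Enter") // press a key
await page.waitForSelector(".results") // wait for something to render
await page.click(".results a") // click the first matching link
```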
Build a scraper
Web scraping is fiddly but sounds simple in theory:
- load website
- find content
- read content
- return content in new format
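In Puppeteer terms, the naive version of those four steps looks something like this. A minimal sketch that assumes you already have a `browser` instance; the URL and selectors are made up:

```typescript
// made-up example: pull article titles off a hypothetical blog
const page = await browser.newPage()
await page.goto("https://example.com/blog") // 1. load website
await page.waitForSelector("article h2") // 2. find content
const titles = await page.$$eval("article h2", (nodes) =>
  nodes.map((node) => node.textContent)
) // 3. read content
const body = JSON.stringify(titles) // 4. return content in new format
```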
But that doesn't generalize. Each website is different.
You adapt the core technique to each website you scrape, and there's no telling when the HTML might change.
You might even find websites that actively fight against scraping. Block bots, limit access speed, obfuscate HTML, ...
Please play nice and don't unleash thousands of parallel requests onto unsuspecting websites.
You can watch me work on this project on YouTube, if you prefer video.
And you can try the final result here: https://4tydwq78d9.execute-api.us-east-1.amazonaws.com/dev/scraper
1. more dependencies
Start with the `serverless.yml` and dependencies from earlier (chrome-aws-lambda and puppeteer).
Add `aws-lambda`:

```sh
$ yarn add aws-lambda @types/aws-lambda
```
This installs the code you need to interact with the AWS Lambda environment.
2. add a scraper function
Define a new scraper function in `serverless.yml`:
```yaml
# serverless.yml
functions:
  scraper:
    handler: dist/scraper.handler
    memorySize: 2536
    timeout: 30
    events:
      - http:
          path: scraper
          method: GET
          cors: true
```
We're saying the code lives in the `handler` method exported from `scraper`. We ask for lots of memory and a long timeout. Chrome is resource intensive and our code makes web requests, which might take a while.

All this fires from a GET request on `/scraper`.
3. getChrome()
The `getChrome` method instantiates a new browser context. I like to put this in a `util` file.
```typescript
// src/util.ts
import chrome from "chrome-aws-lambda"

export async function getChrome() {
  let browser = null

  try {
    browser = await chrome.puppeteer.launch({
      args: chrome.args,
      defaultViewport: {
        width: 1920,
        height: 1080,
        isMobile: true,
        deviceScaleFactor: 2,
      },
      executablePath: await chrome.executablePath,
      headless: chrome.headless,
      ignoreHTTPSErrors: true,
    })
  } catch (err) {
    console.error("Error launching chrome")
    console.error(err)
  }

  return browser
}
```
We launch a Chrome Puppeteer instance with default config and specify our own screen size.
The `isMobile` setting tricks many websites into loading faster. The `deviceScaleFactor: 2` helps create better screenshots. It's like using a retina screen.

Adding `ignoreHTTPSErrors` makes the process more robust.
If the browser fails to launch, we log debugging info.
4. a shared createHandler()
We're building 2 pieces of code that share a lot of logic – scraping and screenshots. Both need a browser, deal with errors, and parse URL queries.
We build a common `createHandler()` method that deals with boilerplate and calls the important function when ready.
```typescript
// src/util.ts
import { APIGatewayEvent } from "aws-lambda"
import { Browser } from "puppeteer"

// both scraper and screenshot have the same basic handler
// they just call a different method to do things
export const createHandler = (
  workFunction: (browser: Browser, search: string) => Promise<APIResponse>
) => async (event: APIGatewayEvent): Promise<APIResponse> => {
  const search =
    event.queryStringParameters && event.queryStringParameters.search

  if (!search) {
    return {
      statusCode: 400,
      body: "Please provide a ?search= parameter",
    }
  }

  const browser = await getChrome()

  if (!browser) {
    return {
      statusCode: 500,
      body: "Error launching Chrome",
    }
  }

  try {
    // call the function that does the real work
    const response = await workFunction(browser, search)

    return response
  } catch (err) {
    console.log(err)

    return {
      statusCode: 500,
      body: "Error scraping Google",
    }
  }
}
```
We read the `?search=` param, open a browser, and verify everything's set up.

Then we call the passed-in `workFunction`, which returns a response. If that fails, we return a 500 error.
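To make the shape concrete, here's a sketch of a work function that plugs into `createHandler`. The function name and URL are made up, not part of the project:

```typescript
// sketch: any async function with this signature can be wrapped by createHandler
// (assumes the same imports as util.ts above)
async function myWorkFunction(browser: Browser, search: string) {
  const page = await browser.newPage()
  await page.goto(`https://example.com/?q=${encodeURIComponent(search)}`)

  // do the real work here, then return an API Gateway-style response
  return {
    statusCode: 200,
    body: `you searched for ${search}`,
  }
}

export const handler = createHandler(myWorkFunction)
```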
5. scrapeGoogle()
We're ready to scrape Google search results.
```typescript
// src/scraper.ts
import { Browser } from "puppeteer"
import { createHandler } from "./util"

async function scrapeGoogle(browser: Browser, search: string) {
  const page = await browser.newPage()
  await page.goto("https://google.com", {
    waitUntil: ["domcontentloaded", "networkidle2"],
  })

  // this part is specific to the page you're scraping
  await page.type("input[type=text]", search)
  const [response] = await Promise.all([
    page.waitForNavigation(),
    page.click("input[type=submit]"),
  ])

  if (!response.ok()) {
    throw "Couldn't get response"
  }

  await page.goto(response.url())

  // this part is very specific to the page you're scraping
  const searchResults = await page.$$(".rc")

  let links = await Promise.all(
    searchResults.map(async (result) => {
      return {
        url: await result.$eval("a", (node) => node.getAttribute("href")),
        title: await result.$eval("h3", (node) => node.innerHTML),
        description: await result.$eval("span.st", (node) => node.innerHTML),
      }
    })
  )

  return {
    statusCode: 200,
    body: JSON.stringify(links),
  }
}

export const handler = createHandler(scrapeGoogle)
```
Lots going on here. Let's go piece by piece.
```typescript
const page = await browser.newPage()
await page.goto("https://google.com", {
  waitUntil: ["domcontentloaded", "networkidle2"],
})
```
Open a new page, navigate to google.com, wait for everything to load. I recommend waiting for `networkidle2`, which means all asynchronous requests have finished.
Useful when dealing with complex webapps.
```typescript
// this part is specific to the page you're scraping
await page.type("input[type=text]", search)
const [response] = await Promise.all([
  page.waitForNavigation(),
  page.click("input[type=submit]"),
])

if (!response.ok()) {
  throw "Couldn't get response"
}

await page.goto(response.url())
```
To scrape Google, we type a search into the input field, then hit submit and wait for the page to load.
This part is different for every website.
```typescript
// this part is very specific to the page you're scraping
const searchResults = await page.$$(".rc")

let links = await Promise.all(
  searchResults.map(async (result) => {
    return {
      url: await result.$eval("a", (node) => node.getAttribute("href")),
      title: await result.$eval("h3", (node) => node.innerHTML),
      description: await result.$eval("span.st", (node) => node.innerHTML),
    }
  })
)

return {
  statusCode: 200,
  body: JSON.stringify(links),
}
```
When the results page loads, we:
- look for every `.rc` DOM element – best identifier of search results I could find
- iterate through results
- get the info we want from each
You can use the `$eval` trick to parse DOM nodes with the same API you'd use in a browser. It executes your method on the nodes it finds and returns the result.
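As a standalone illustration (the selectors here are made up), `$eval` runs your callback on the first matching node and `$$eval` runs it on every match:

```typescript
// read the text of the first <h1> on the page
const heading = await page.$eval("h1", (node) => node.textContent)

// collect the href of every link on the page
const hrefs = await page.$$eval("a", (nodes) =>
  nodes.map((node) => node.getAttribute("href"))
)
```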
6. hit deploy and try it out
You now have a bona fide web scraper. It wakes up on demand, runs Chrome, and turns Google search results into easy-to-use JSON.
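After `serverless deploy` prints your endpoint, you can call it with a `?search=` query from any frontend (or a modern Node runtime with `fetch`). A sketch with a placeholder URL:

```typescript
// call the deployed scraper; swap in the URL serverless deploy gave you
const res = await fetch(
  "https://<api-id>.execute-api.us-east-1.amazonaws.com/dev/scraper?search=serverless"
)
const links: { url: string; title: string; description: string }[] =
  await res.json()

console.log(links[0].title)
```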
> this was fun, got a lambda that spits out JSON of the first page of google results
>
> Here's #javascript for example
>
> couldn't quite get the screenshot version to work yet pic.twitter.com/JzpMXVJqiU
>
> — Swizec Teller published ServerlessHandbook.dev (@Swizec) July 12, 2020
We left out project configuration boilerplate. You can find those details in other chapters or see example code on GitHub.
Take screenshots
Taking screenshots is similar to scraping. Instead of parsing the page, you call `.screenshot()` and get an image.
Our example returns that image directly. In a real project, you'll want to store the image on S3 and return its URL. Lambda isn't a great fit for large files.
1. tell API Gateway to serve binary
First, we tell API Gateway that it's okay to serve binary data.
I don't recommend this in production unless you have a great reason. Like a dynamic image that changes every time.
```yaml
# serverless.yml
provider:
  name: aws
  runtime: nodejs12.x
  stage: dev
  apiGateway:
    binaryMediaTypes:
      - "*/*"
```
You can limit `binaryMediaTypes` to the specific types you intend to use. `*/*` is easier.
2. add a new function
Next, we define a new Lambda function:
```yaml
# serverless.yml
functions:
  screenshot:
    handler: dist/screenshot.handler
    memorySize: 2536
    timeout: 30
    events:
      - http:
          path: screenshot
          method: GET
          cors: true
```
Same as before, different name. Needs lots of memory and a long timeout.
3. screenshotGoogle()
We're using similar machinery to before.
```typescript
// src/screenshot.ts
import fs from "fs"
import { Browser } from "puppeteer"
import { createHandler } from "./util"

async function screenshotGoogle(browser: Browser, search: string) {
  const page = await browser.newPage()
  await page.goto("https://google.com", {
    waitUntil: ["domcontentloaded", "networkidle2"],
  })

  // this part is specific to the page you're screenshotting
  await page.type("input[type=text]", search)
  const [response] = await Promise.all([
    page.waitForNavigation(),
    page.click("input[type=submit]"),
  ])

  if (!response.ok()) {
    throw "Couldn't get response"
  }

  await page.goto(response.url())

  // this part is specific to the page you're screenshotting
  const element = await page.$("#main")

  if (!element) {
    throw "Couldn't find results div"
  }

  const boundingBox = await element.boundingBox()
  const imagePath = `/tmp/screenshot-${new Date().getTime()}.png`

  if (!boundingBox) {
    throw "Couldn't measure size of results div"
  }

  await page.screenshot({
    path: imagePath,
    clip: boundingBox,
  })

  const data = fs.readFileSync(imagePath).toString("base64")

  return {
    statusCode: 200,
    headers: {
      "Content-Type": "image/png",
    },
    body: data,
    isBase64Encoded: true,
  }
}

export const handler = createHandler(screenshotGoogle)
```
Same code up to when we load the results page. Type a query, hit submit, wait for reload.
Then we do something different – measure the size of our results div.
```typescript
// this part is specific to the page you're screenshotting
const element = await page.$("#main")

if (!element) {
  throw "Couldn't find results div"
}

const boundingBox = await element.boundingBox()
const imagePath = `/tmp/screenshot-${new Date().getTime()}.png`

if (!boundingBox) {
  throw "Couldn't measure size of results div"
}
```
We look for the results element and grab its `boundingBox()`. That tells us the `x, y` coordinates and the `width, height` size for a more focused screenshot.
We set up an `imagePath` in `/tmp`. We can write to a file on Lambda's hard drive, but it will not stay there. When our lambda turns off, the file is gone.
```typescript
await page.screenshot({
  path: imagePath,
  clip: boundingBox,
})
```
Take a screenshot with `page.screenshot()`. It saves to a file.
```typescript
const data = fs.readFileSync(imagePath).toString("base64")

return {
  statusCode: 200,
  headers: {
    "Content-Type": "image/png",
  },
  body: data,
  isBase64Encoded: true,
}
```
Read the file into a Base64-encoded string and return a response. It must contain a content type – `image/png` in our case – and tell API Gateway that it's Base64-encoded.
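If you'd rather skip the `/tmp` file, Puppeteer can hand you the image directly: `page.screenshot()` returns a Buffer by default and accepts an `encoding: "base64"` option. A minimal alternative sketch:

```typescript
// alternative: keep the screenshot in memory instead of writing to /tmp
const base64 = await page.screenshot({
  clip: boundingBox,
  encoding: "base64",
})
```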
This is where you'd upload to S3 in production.
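Here's a hedged sketch of what that upload could look like with the AWS SDK. The bucket name is a placeholder, and your Lambda would need `s3:PutObject` permissions configured in `serverless.yml`:

```typescript
// sketch: upload the screenshot to S3 and return its URL instead of raw bytes
import { S3 } from "aws-sdk"

const s3 = new S3()

async function uploadScreenshot(data: string) {
  const Key = `screenshots/${Date.now()}.png`

  await s3
    .putObject({
      Bucket: "your-screenshot-bucket", // placeholder bucket name
      Key,
      Body: Buffer.from(data, "base64"),
      ContentType: "image/png",
    })
    .promise()

  return {
    statusCode: 200,
    body: `https://your-screenshot-bucket.s3.amazonaws.com/${Key}`,
  }
}
```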
You can try mine here: https://4tydwq78d9.execute-api.us-east-1.amazonaws.com/dev/screenshot
How to use this
The most common use cases for Chrome Puppeteer are:
- Running automated tests
- Scraping websites cheaply
- Generating dynamic HTML-to-PNG images
- Generating PDFs
The last two are great because you can build a small website that renders a social card for your content and use this machinery to turn it into an image.
Same for PDFs – build a dynamic website, print-to-PDF with Chrome. It's easier than generating PDFs by hand.
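A hedged sketch of the PDF version, reusing the same Chrome setup. `page.pdf()` is a real Puppeteer method; the options shown are just one reasonable configuration:

```typescript
// sketch: render any URL to a PDF with the same browser setup
// (assumes the Browser type import from earlier)
async function renderPdf(browser: Browser, url: string) {
  const page = await browser.newPage()
  await page.goto(url, { waitUntil: ["domcontentloaded", "networkidle2"] })

  // page.pdf() returns a Buffer with the printed document
  const pdf = await page.pdf({
    format: "A4",
    printBackground: true,
  })

  return {
    statusCode: 200,
    headers: { "Content-Type": "application/pdf" },
    body: pdf.toString("base64"),
    isBase64Encoded: true,
  }
}
```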
Have fun 😊
Next chapter we look at handling secrets.