Serverless Chrome puppeteer

Say you want to build a scraper, automate manual testing, or generate custom social cards for your website. What do you do?

You could spin up a docker container, set up headless Chrome, add Puppeteer, write a script to run it all, add a server to create an API, and ...

Or you can set up Serverless Chrome with AWS Lambda. Write a bit of code, hit deploy, and get a Chrome browser running on demand.

That's what this chapter is about 🤘

You'll learn how to:

  • configure Chrome Puppeteer on AWS
  • build a basic scraper
  • take website screenshots
  • run it on-demand

We'll build a scraper that goes to google.com, types in a phrase, and returns the first page of results. Then we'll reuse the same code to return a screenshot.

You can see the full code on GitHub.

Serverless Chrome

Chrome's engine ships as the open source Chromium browser. Other browsers use it and add their own UI and custom features.

You can use the engine for browser automation – scraping, testing, screenshots, etc. When you need to render a website, Chromium is your friend.

Running it yourself means you have to:

  • download a Chrome binary
  • set up an environment that makes it happy
  • run it in headless mode
  • configure processes that talk to each other via complex sockets

Others have solved this problem for you.

Rather than figure it out yourself, I recommend using chrome-aws-lambda. It's the most up-to-date package for running Serverless Chrome.

Here's what you need for a Serverless Chrome setup:

  1. install dependencies
$ yarn add chrome-aws-lambda@3.1.1 puppeteer@3.1.0 @types/puppeteer puppeteer-core@3.1.0

This installs everything you need to both run and interact with Chrome. ✌️

Check chrome-aws-lambda/README for the latest version of Chrome Puppeteer you can use. Make sure they match.

  2. configure serverless.yml
# serverless.yml
service: serverless-chrome-example

provider:
  name: aws
  runtime: nodejs12.x
  stage: dev

package:
  exclude:
    - node_modules/puppeteer/.local-chromium/**

Configure a new service, make it run on AWS, and use a recent Node runtime.

The package part is important. It tells Serverless not to package the chromium binary with your code. AWS rejects builds of that size.

You are now ready to start running Chrome ✌️

Chrome Puppeteer 101

Chrome Puppeteer is a set of tools to interact with Chrome programmatically.

Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium.

Write code that interacts with a website like a person would. Anything a person can do on the web, you can do with Puppeteer.

Core syntax feels like jQuery, but the objects are different from what you're used to. I've found it's best not to worry about the details.

Here's how you click on a link:

const page = await browser.newPage() // open a "tab"
await page.goto("https://example.com") // navigate to a URL
const div = await page.$("div#some_content") // grab a div
const link = await div.$("a.target_link") // find a link inside it
await link.click() // click it

Always open a new page for every new browser context.

Navigate to your URL then use jQuery-like selectors to interact with the page. You can feed selectors into click() and other methods, or use the page.$ syntax to search around.
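Here's a quick sketch of the calls you'll reach for most often. The selectors (#search, button[type=submit], h1.title) are made up for illustration – swap in whatever your target page uses:

// a sketch of common Puppeteer calls – the selectors are hypothetical
const page = await browser.newPage()
await page.goto("https://example.com")

// feed selectors straight into methods
await page.type("#search", "serverless chrome")
await page.click("button[type=submit]")
await page.waitForSelector("h1.title")

// or run code against a node inside the page and get the result back
const title = await page.$eval("h1.title", (node) => node.textContent)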

Build a scraper

Web scraping is fiddly but sounds simple in theory:

  • load website
  • find content
  • read content
  • return content in new format

But that doesn't generalize. Each website is different.

You adapt the core technique to each website you scrape, and there's no telling when the HTML might change.

You might even find websites that actively fight against scraping. Block bots, limit access speed, obfuscate HTML, ...

Please play nice and don't unleash thousands of parallel requests onto unsuspecting websites.

You can watch me work on this project on YouTube, if you prefer video.

And you can try the final result here: https://4tydwq78d9.execute-api.us-east-1.amazonaws.com/dev/scraper

1. more dependencies

Start with the serverless.yml and dependencies from earlier (chrome-aws-lambda and puppeteer).

Add aws-lambda:

$ yarn add aws-lambda @types/aws-lambda

Installs the code you need to interact with the AWS Lambda environment.

2. add a scraper function

Define a new scraper function in serverless.yml

# serverless.yml
functions:
  scraper:
    handler: dist/scraper.handler
    memorySize: 2536
    timeout: 30
    events:
      - http:
          path: scraper
          method: GET
          cors: true

We're saying the code lives in the handler method exported from dist/scraper. We ask for lots of memory and a long timeout. Chrome is resource intensive and our code makes web requests, which might take a while.

All this fires from a GET request on /scraper.

3. getChrome()

The getChrome method instantiates a new browser context. I like to put this in a util file.

// src/util.ts
import chrome from "chrome-aws-lambda"

export async function getChrome() {
  let browser = null

  try {
    browser = await chrome.puppeteer.launch({
      args: chrome.args,
      defaultViewport: {
        width: 1920,
        height: 1080,
        isMobile: true,
        deviceScaleFactor: 2,
      },
      executablePath: await chrome.executablePath,
      headless: chrome.headless,
      ignoreHTTPSErrors: true,
    })
  } catch (err) {
    console.error("Error launching chrome")
    console.error(err)
  }

  return browser
}

We launch a Chrome Puppeteer instance with default config and specify our own screen size.

The isMobile setting tricks many websites into loading faster. The deviceScaleFactor: 2 helps create better screenshots. It's like using a retina screen.

Adding ignoreHTTPSErrors makes the process more robust.

If the browser fails to launch, we log debugging info.

4. a shared createHandler()

We're building 2 pieces of code that share a lot of logic – scraping and screenshots. Both need a browser, deal with errors, and parse URL queries.

We build a common createHandler() method that deals with boilerplate and calls the important function when ready.

// src/util.ts
import { APIGatewayEvent } from "aws-lambda"
import { Browser } from "puppeteer"

// both scraper and screenshot have the same basic handler
// they just call a different method to do things
export const createHandler = (
  workFunction: (browser: Browser, search: string) => Promise<APIResponse>
) => async (event: APIGatewayEvent): Promise<APIResponse> => {
  const search =
    event.queryStringParameters && event.queryStringParameters.search

  if (!search) {
    return {
      statusCode: 400,
      body: "Please provide a ?search= parameter",
    }
  }

  const browser = await getChrome()

  if (!browser) {
    return {
      statusCode: 500,
      body: "Error launching Chrome",
    }
  }

  try {
    // call the function that does the real work
    const response = await workFunction(browser, search)

    return response
  } catch (err) {
    console.log(err)

    return {
      statusCode: 500,
      body: "Error scraping Google",
    }
  }
}

We read the ?search= param, open a browser, and verify everything's set up.

Then we call the passed-in workFunction, which returns a response. If that fails, we return a 500 error.

5. scrapeGoogle()

We're ready to scrape Google search results.

// src/scraper.ts
import { Browser } from "puppeteer"
import { createHandler } from "./util"

async function scrapeGoogle(browser: Browser, search: string) {
  const page = await browser.newPage()

  await page.goto("https://google.com", {
    waitUntil: ["domcontentloaded", "networkidle2"],
  })

  // this part is specific to the page you're scraping
  await page.type("input[type=text]", search)

  const [response] = await Promise.all([
    page.waitForNavigation(),
    page.click("input[type=submit]"),
  ])

  if (!response.ok()) {
    throw "Couldn't get response"
  }

  await page.goto(response.url())

  // this part is very specific to the page you're scraping
  const searchResults = await page.$$(".rc")

  let links = await Promise.all(
    searchResults.map(async (result) => {
      return {
        url: await result.$eval("a", (node) => node.getAttribute("href")),
        title: await result.$eval("h3", (node) => node.innerHTML),
        description: await result.$eval("span.st", (node) => node.innerHTML),
      }
    })
  )

  return {
    statusCode: 200,
    body: JSON.stringify(links),
  }
}

export const handler = createHandler(scrapeGoogle)

Lots going on here. Let's go piece by piece.

const page = await browser.newPage()

await page.goto("https://google.com", {
  waitUntil: ["domcontentloaded", "networkidle2"],
})

Open a new page, navigate to google.com, wait for everything to load. I recommend waiting for networkidle2, which fires once the network has been almost idle for 500ms – a good sign that asynchronous requests have finished.

Useful when dealing with complex webapps.

// this part is specific to the page you're scraping
await page.type("input[type=text]", search)

const [response] = await Promise.all([
  page.waitForNavigation(),
  page.click("input[type=submit]"),
])

if (!response.ok()) {
  throw "Couldn't get response"
}

await page.goto(response.url())

To scrape google, we type a search into the input field, then hit submit and wait for the page to load.

This part is different for every website.

// this part is very specific to the page you're scraping
const searchResults = await page.$$(".rc")

let links = await Promise.all(
  searchResults.map(async (result) => {
    return {
      url: await result.$eval("a", (node) => node.getAttribute("href")),
      title: await result.$eval("h3", (node) => node.innerHTML),
      description: await result.$eval("span.st", (node) => node.innerHTML),
    }
  })
)

return {
  statusCode: 200,
  body: JSON.stringify(links),
}

When the results page loads, we:

  • look for every .rc DOM element – best identifier of search results I could find
  • iterate through results
  • get the info we want from each

You can use the $eval trick to parse DOM nodes with the same API you'd use in a browser. It executes your method on the node it finds and returns the result.

6. hit deploy and try it out

You now have a bona fide web scraper. Wakes up on demand, runs Chrome, turns Google search results into easy-to-use JSON.

We left out project configuration boilerplate. You can find those details in other chapters or see example code on GitHub.
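You can call the deployed endpoint from anywhere. Here's a minimal sketch using node-fetch (an assumption – any HTTP client works) against my deployment:

// sketch: call the scraper endpoint and print the results
// node-fetch is an assumption – swap in your favorite HTTP client and your own URL
import fetch from "node-fetch"

async function main() {
  const res = await fetch(
    "https://4tydwq78d9.execute-api.us-east-1.amazonaws.com/dev/scraper?search=serverless"
  )
  const links = await res.json()

  // each entry has { url, title, description }, as returned by scrapeGoogle
  console.log(links)
}

main()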

Take screenshots

Taking screenshots is similar to scraping. Instead of parsing the page, you call .screenshot() and get an image.

Our example returns that image directly. In a real project you'll want to store it on S3 and return a URL. Lambda isn't a great fit for serving large files.

1. tell API Gateway to serve binary

First, we tell API Gateway that it's okay to serve binary data.

I don't recommend this in production unless you have a great reason. Like a dynamic image that changes every time.

# serverless.yml
provider:
  name: aws
  runtime: nodejs12.x
  stage: dev
  apiGateway:
    binaryMediaTypes:
      - "*/*"

You can limit binaryMediaTypes to specific types you intend to use. */* is easier.
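If you want to be strict, here's a sketch of the narrower version – only PNGs get the binary treatment:

# serverless.yml – stricter variant
provider:
  apiGateway:
    binaryMediaTypes:
      - "image/png"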

2. add a new function

Next we define a new Lambda function

# serverless.yml
functions:
  screenshot:
    handler: dist/screenshot.handler
    memorySize: 2536
    timeout: 30
    events:
      - http:
          path: screenshot
          method: GET
          cors: true

Same as before, different name. Needs lots of memory and a long timeout.

3. screenshotGoogle()

We're using the same machinery as before.

// src/screenshot.ts
import fs from "fs"
import { Browser } from "puppeteer"
import { createHandler } from "./util"

async function screenshotGoogle(browser: Browser, search: string) {
  const page = await browser.newPage()

  await page.goto("https://google.com", {
    waitUntil: ["domcontentloaded", "networkidle2"],
  })

  // this part is specific to the page you're screenshotting
  await page.type("input[type=text]", search)

  const [response] = await Promise.all([
    page.waitForNavigation(),
    page.click("input[type=submit]"),
  ])

  if (!response.ok()) {
    throw "Couldn't get response"
  }

  await page.goto(response.url())

  // this part is specific to the page you're screenshotting
  const element = await page.$("#main")

  if (!element) {
    throw "Couldn't find results div"
  }

  const boundingBox = await element.boundingBox()
  const imagePath = `/tmp/screenshot-${new Date().getTime()}.png`

  if (!boundingBox) {
    throw "Couldn't measure size of results div"
  }

  await page.screenshot({
    path: imagePath,
    clip: boundingBox,
  })

  const data = fs.readFileSync(imagePath).toString("base64")

  return {
    statusCode: 200,
    headers: {
      "Content-Type": "image/png",
    },
    body: data,
    isBase64Encoded: true,
  }
}

export const handler = createHandler(screenshotGoogle)
export const handler = createHandler(screenshotGoogle)

Same code up to when we load the results page. Type a query, hit submit, wait for reload.

Then we do something different – measure the size of our results div.

// this part is specific to the page you're screenshotting
const element = await page.$("#main")

if (!element) {
  throw "Couldn't find results div"
}

const boundingBox = await element.boundingBox()
const imagePath = `/tmp/screenshot-${new Date().getTime()}.png`

if (!boundingBox) {
  throw "Couldn't measure size of results div"
}

We look for the results element and grab its boundingBox(). That gives us x/y coordinates and width/height so we can take a more focused screenshot.

We set up an imagePath in /tmp. We can write to a file on Lambda's hard drive, but it will not stay there. When our lambda turns off, the file is gone.

await page.screenshot({
  path: imagePath,
  clip: boundingBox,
})

Take a screenshot with page.screenshot(). Saves to a file.
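If you'd rather skip the tmp file, Puppeteer can hand the image back directly. A sketch of the same step using the encoding option:

// alternative sketch: get the screenshot as a base64 string directly, no /tmp file
const data = await page.screenshot({
  clip: boundingBox,
  encoding: "base64",
})

We stick with the tmp file approach here.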

const data = fs.readFileSync(imagePath).toString("base64")

return {
  statusCode: 200,
  headers: {
    "Content-Type": "image/png",
  },
  body: data,
  isBase64Encoded: true,
}

Read the file into a Base64-encoded string and return a response. It must contain a content type – image/png in our case – and tell API Gateway that it's Base64-encoded.

This is where you'd upload to S3 in production.
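A minimal sketch of what that could look like with the aws-sdk S3 client. The SCREENSHOT_BUCKET environment variable is hypothetical – you'd configure the bucket and give your Lambda write permissions in serverless.yml:

// sketch: upload the screenshot to S3 and return a URL instead of raw bytes
// SCREENSHOT_BUCKET is a hypothetical env var pointing at your bucket
import fs from "fs"
import { S3 } from "aws-sdk"

export async function uploadScreenshot(imagePath: string, key: string) {
  const s3 = new S3()

  const result = await s3
    .upload({
      Bucket: process.env.SCREENSHOT_BUCKET!,
      Key: key,
      Body: fs.createReadStream(imagePath),
      ContentType: "image/png",
    })
    .promise()

  // hand back a small JSON body with the image URL
  return {
    statusCode: 200,
    body: JSON.stringify({ url: result.Location }),
  }
}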

You can try mine here: https://4tydwq78d9.execute-api.us-east-1.amazonaws.com/dev/screenshot

How to use this

The most common use cases for Chrome Puppeteer are:

  1. Running automated tests
  2. Scraping websites cheaply
  3. Generating dynamic HTML-to-PNG images
  4. Generating PDFs

3 and 4 are great because you can build a small website that renders a social card for your content and use this machinery to turn it into an image.

Same for PDFs – build a dynamic website, print-to-PDF with Chrome. Easier than generating PDFs by hand.
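Puppeteer has a built-in page.pdf() method for that. A minimal sketch, reusing the getChrome() helper from earlier:

// sketch: render any URL to a PDF, reusing getChrome() from src/util.ts
import { getChrome } from "./util"

export async function renderPdf(url: string) {
  const browser = await getChrome()

  if (!browser) {
    throw "Error launching Chrome"
  }

  const page = await browser.newPage()
  await page.goto(url, { waitUntil: ["domcontentloaded", "networkidle2"] })

  // returns a Buffer you could store on S3 or return base64-encoded
  return page.pdf({ format: "A4", printBackground: true })
}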

Have fun 😊

Next chapter we look at handling secrets.

Did you enjoy this chapter?

Thanks for supporting Serverless Handbook! Consider sharing it with friends ❤️

Cheers,
~Swizec
