Serverless Chrome Puppeteer
Say you want to build a scraper, automate manual testing, or generate custom social cards for your website. What do you do?
You could spin up a Docker container, set up headless Chrome, add Puppeteer, write a script to run it all, add a server to create an API, and ...
Or you can set up Serverless Chrome with AWS Lambda. Write a bit of code, hit deploy, and get a Chrome browser running on demand.
That's what this chapter is about 🤘
You'll learn how to:
- configure Chrome Puppeteer on AWS
- build a basic scraper
- take website screenshots
- run it on-demand
We build a scraper that goes to google.com, types in a phrase, and returns the first page of results. Then we reuse the same code to return a screenshot.
You can see the full code on GitHub.
Serverless Chrome
Chrome's engine ships as the open source Chromium browser. Other browsers use it and add their own UI and custom features.
You can use the engine for browser automation – scraping, testing, screenshots, etc. When you need to render a website, Chromium is your friend.
Using Chromium yourself means you need to:
- download a Chrome binary
- set up an environment that makes it happy
- run in headless mode
- configure processes that talk to each other via complex sockets
Others have solved this problem for you.
Rather than figure it out yourself, I recommend using chrome-aws-lambda. It's the most up-to-date package for running Serverless Chrome.
Here's what you need for a Serverless Chrome setup:
- install dependencies
```sh
$ yarn add chrome-aws-lambda@3.1.1 puppeteer@3.1.0 @types/puppeteer puppeteer-core@3.1.0
```
This installs everything you need to both run and interact with Chrome. ✌️
Check the chrome-aws-lambda README for the latest version of Chrome Puppeteer you can use. Make sure the two versions match.
- configure serverless.yml
```yaml
# serverless.yml
service: serverless-chrome-example

provider:
  name: aws
  runtime: nodejs12.x
  stage: dev

package:
  exclude:
    - node_modules/puppeteer/.local-chromium/**
```
Configure a new service, make it run on AWS, use the latest Node.
The `package` part is important. It tells Serverless not to package the Chromium binary with your code. AWS rejects builds of that size.
You are now ready to start running Chrome ✌️
Chrome Puppeteer 101
Chrome Puppeteer is a set of tools to interact with Chrome programmatically.
> Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium.
Write code that interacts with a website like a person would. Anything a person can do on the web, you can do with Puppeteer.
Core syntax feels like jQuery, but the objects are different than what you're used to. I've found it's best not to worry about the details.
Here's how you click on a link:
```js
const page = await browser.newPage() // open a "tab"
await page.goto("https://example.com") // navigate to a URL
const div = await page.$("div#some_content") // use page.$ to grab a div
const link = await div.$("a.target_link") // search inside it for a link
await link.click() // click the link
```
Always open a new page for every new browser context.
Navigate to your URL, then use jQuery-like selectors to interact with the page. You can feed selectors into `click()` and other methods, or use the `page.$` syntax to search around.
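Here's a quick sketch of a few more interactions you might reach for. The page and selectors are made up, but every method is standard Puppeteer:

```typescript
// pretend to be a person filling out a search form on a made-up page
await page.type("input[name=q]", "serverless chrome") // type into a field
await page.keyboard.press("Enter") // press a key
await page.waitForSelector(".results") // wait for something to render
await page.click(".results a") // click the first matching link
```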
Build a scraper
Web scraping is fiddly but sounds simple in theory:
- load website
- find content
- read content
- return content in new format
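In Puppeteer terms, the naive version of those four steps looks something like this. A minimal sketch that assumes you already have a `browser` instance; the URL and selectors are made up:

```typescript
// made-up example: pull article titles off a hypothetical blog
const page = await browser.newPage()
await page.goto("https://example.com/blog") // 1. load website
await page.waitForSelector("article h2") // 2. find content
const titles = await page.$$eval("article h2", (nodes) =>
  nodes.map((node) => node.textContent)
) // 3. read content
const body = JSON.stringify(titles) // 4. return content in new format
```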
But that doesn't generalize. Each website is different.
You adapt the core technique to each website you scrape, and there's no telling when the HTML might change.
You might even find websites that actively fight against scraping. Block bots, limit access speed, obfuscate HTML, ...
Please play nice and don't unleash thousands of parallel requests onto unsuspecting websites.
You can watch me work on this project on YouTube, if you prefer video.
And you can try the final result here: https://4tydwq78d9.execute-api.us-east-1.amazonaws.com/dev/scraper
1. more dependencies
Start with the `serverless.yml` and dependencies from earlier (chrome-aws-lambda and puppeteer).
Add `aws-lambda`:

```sh
$ yarn add aws-lambda @types/aws-lambda
```
This installs the code you need to interact with the AWS Lambda environment.
2. add a scraper function
Define a new scraper function in `serverless.yml`:
```yaml
# serverless.yml
functions:
  scraper:
    handler: dist/scraper.handler
    memorySize: 2536
    timeout: 30
    events:
      - http:
          path: scraper
          method: GET
          cors: true
```
We're saying the code lives in the `handler` method exported from `scraper`. We ask for lots of memory and a long timeout. Chrome is resource intensive and our code makes web requests, which might take a while.

All this fires from a GET request on `/scraper`.
3. getChrome()
The `getChrome` method instantiates a new browser context. I like to put this in a `util` file.
```typescript
// src/util.ts
import chrome from "chrome-aws-lambda"

export async function getChrome() {
  let browser = null

  try {
    browser = await chrome.puppeteer.launch({
      args: chrome.args,
      defaultViewport: {
        width: 1920,
        height: 1080,
        isMobile: true,
        deviceScaleFactor: 2,
      },
      executablePath: await chrome.executablePath,
      headless: chrome.headless,
      ignoreHTTPSErrors: true,
    })
  } catch (err) {
    console.error("Error launching chrome")
    console.error(err)
  }

  return browser
}
```
We launch a Chrome Puppeteer instance with default config and specify our own screen size.
The `isMobile` setting tricks many websites into loading faster. The `deviceScaleFactor: 2` helps create better screenshots. It's like using a retina screen.

Adding `ignoreHTTPSErrors` makes the process more robust.
If the browser fails to launch, we log debugging info.
4. a shared createHandler()
We're building 2 pieces of code that share a lot of logic – scraping and screenshots. Both need a browser, deal with errors, and parse URL queries.
We build a common `createHandler()` method that deals with boilerplate and calls the important function when ready.
```typescript
// src/util.ts
import { APIGatewayEvent } from "aws-lambda"
import { Browser } from "puppeteer"

// both scraper and screenshot have the same basic handler
// they just call a different method to do things
export const createHandler = (
  workFunction: (browser: Browser, search: string) => Promise<APIResponse>
) => async (event: APIGatewayEvent): Promise<APIResponse> => {
  const search =
    event.queryStringParameters && event.queryStringParameters.search

  if (!search) {
    return {
      statusCode: 400,
      body: "Please provide a ?search= parameter",
    }
  }

  const browser = await getChrome()

  if (!browser) {
    return {
      statusCode: 500,
      body: "Error launching Chrome",
    }
  }

  try {
    // call the function that does the real work
    const response = await workFunction(browser, search)

    return response
  } catch (err) {
    console.log(err)

    return {
      statusCode: 500,
      body: "Error scraping Google",
    }
  }
}
```
We read the `?search=` param, open a browser, and verify everything's set up.

Then we call the passed-in `workFunction`, which returns a response. If that fails, we return a 500 error.
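To make the shape concrete, here's a sketch of a work function that plugs into `createHandler`. The function name and URL are made up, not part of the project:

```typescript
// sketch: any async function with this signature can be wrapped by createHandler
// (assumes the same imports as util.ts above)
async function myWorkFunction(browser: Browser, search: string) {
  const page = await browser.newPage()
  await page.goto(`https://example.com/?q=${encodeURIComponent(search)}`)

  // do the real work here, then return an API Gateway-style response
  return {
    statusCode: 200,
    body: `you searched for ${search}`,
  }
}

export const handler = createHandler(myWorkFunction)
```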
5. scrapeGoogle()
We're ready to scrape Google search results.
```typescript
// src/scraper.ts
import { Browser } from "puppeteer"
import { createHandler } from "./util"

async function scrapeGoogle(browser: Browser, search: string) {
  const page = await browser.newPage()
  await page.goto("https://google.com", {
    waitUntil: ["domcontentloaded", "networkidle2"],
  })

  // this part is specific to the page you're scraping
  await page.type("input[type=text]", search)
  const [response] = await Promise.all([
    page.waitForNavigation(),
    page.click("input[type=submit]"),
  ])

  if (!response.ok()) {
    throw "Couldn't get response"
  }

  await page.goto(response.url())

  // this part is very specific to the page you're scraping
  const searchResults = await page.$$(".rc")

  let links = await Promise.all(
    searchResults.map(async (result) => {
      return {
        url: await result.$eval("a", (node) => node.getAttribute("href")),
        title: await result.$eval("h3", (node) => node.innerHTML),
        description: await result.$eval("span.st", (node) => node.innerHTML),
      }
    })
  )

  return {
    statusCode: 200,
    body: JSON.stringify(links),
  }
}

export const handler = createHandler(scrapeGoogle)
```
Lots going on here. Let's go piece by piece.
```typescript
const page = await browser.newPage()
await page.goto("https://google.com", {
  waitUntil: ["domcontentloaded", "networkidle2"],
})
```
Open a new page, navigate to google.com, wait for everything to load. I recommend waiting for `networkidle2`, which means all asynchronous requests have finished.
Useful when dealing with complex webapps.
```typescript
// this part is specific to the page you're scraping
await page.type("input[type=text]", search)
const [response] = await Promise.all([
  page.waitForNavigation(),
  page.click("input[type=submit]"),
])

if (!response.ok()) {
  throw "Couldn't get response"
}

await page.goto(response.url())
```
To scrape Google, we type a search into the input field, then hit submit and wait for the page to load.
This part is different for every website.
```typescript
// this part is very specific to the page you're scraping
const searchResults = await page.$$(".rc")

let links = await Promise.all(
  searchResults.map(async (result) => {
    return {
      url: await result.$eval("a", (node) => node.getAttribute("href")),
      title: await result.$eval("h3", (node) => node.innerHTML),
      description: await result.$eval("span.st", (node) => node.innerHTML),
    }
  })
)

return {
  statusCode: 200,
  body: JSON.stringify(links),
}
```
When the results page loads, we:
- look for every `.rc` DOM element – best identifier of search results I could find
- iterate through results
- get the info we want from each
You can use the `$eval` trick to parse DOM nodes with the same API you'd use in a browser. It executes your method on the nodes it finds and returns the result.
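As a standalone illustration (the selectors here are made up), `$eval` runs your callback on the first matching node and `$$eval` runs it on every match:

```typescript
// read the text of the first <h1> on the page
const heading = await page.$eval("h1", (node) => node.textContent)

// collect the href of every link on the page
const hrefs = await page.$$eval("a", (nodes) =>
  nodes.map((node) => node.getAttribute("href"))
)
```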
6. hit deploy and try it out
You now have a bona fide web scraper. It wakes up on demand, runs Chrome, and turns Google search results into easy-to-use JSON.
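After `serverless deploy` prints your endpoint, you can call it with a `?search=` query from any frontend (or a modern Node runtime with `fetch`). A sketch with a placeholder URL:

```typescript
// call the deployed scraper; swap in the URL serverless deploy gave you
const res = await fetch(
  "https://<api-id>.execute-api.us-east-1.amazonaws.com/dev/scraper?search=serverless"
)
const links: { url: string; title: string; description: string }[] =
  await res.json()

console.log(links[0].title)
```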
> this was fun, got a lambda that spits out JSON of the first page of google results
>
> Here's #javascript for example
>
> couldn't quite get the screenshot version to work yet pic.twitter.com/JzpMXVJqiU
>
> — Swizec Teller published ServerlessHandbook.dev (@Swizec) July 12, 2020
We left out project configuration boilerplate. You can find those details in other chapters or see example code on GitHub.
Take screenshots
Taking screenshots is similar to scraping. Instead of parsing the page, you call `.screenshot()` and get an image.
Our example returns that image directly. In a real project, you'll want to store the image on S3 and return its URL. Lambda isn't a great fit for large files.
1. tell API Gateway to serve binary
First, we tell API Gateway that it's okay to serve binary data.
I don't recommend this in production unless you have a great reason. Like a dynamic image that changes every time.
```yaml
# serverless.yml
provider:
  name: aws
  runtime: nodejs12.x
  stage: dev
  apiGateway:
    binaryMediaTypes:
      - "*/*"
```
You can limit `binaryMediaTypes` to the specific types you intend to use. `*/*` is easier.
2. add a new function
Next, we define a new Lambda function:
```yaml
# serverless.yml
functions:
  screenshot:
    handler: dist/screenshot.handler
    memorySize: 2536
    timeout: 30
    events:
      - http:
          path: screenshot
          method: GET
          cors: true
```
Same as before, different name. Needs lots of memory and a long timeout.
3. screenshotGoogle()
We're using similar machinery to before.
```typescript
// src/screenshot.ts
import fs from "fs"
import { Browser } from "puppeteer"
import { createHandler } from "./util"

async function screenshotGoogle(browser: Browser, search: string) {
  const page = await browser.newPage()
  await page.goto("https://google.com", {
    waitUntil: ["domcontentloaded", "networkidle2"],
  })

  // this part is specific to the page you're screenshotting
  await page.type("input[type=text]", search)
  const [response] = await Promise.all([
    page.waitForNavigation(),
    page.click("input[type=submit]"),
  ])

  if (!response.ok()) {
    throw "Couldn't get response"
  }

  await page.goto(response.url())

  // this part is specific to the page you're screenshotting
  const element = await page.$("#main")

  if (!element) {
    throw "Couldn't find results div"
  }

  const boundingBox = await element.boundingBox()
  const imagePath = `/tmp/screenshot-${new Date().getTime()}.png`

  if (!boundingBox) {
    throw "Couldn't measure size of results div"
  }

  await page.screenshot({
    path: imagePath,
    clip: boundingBox,
  })

  const data = fs.readFileSync(imagePath).toString("base64")

  return {
    statusCode: 200,
    headers: {
      "Content-Type": "image/png",
    },
    body: data,
    isBase64Encoded: true,
  }
}

export const handler = createHandler(screenshotGoogle)
```
Same code up to when we load the results page. Type a query, hit submit, wait for reload.
Then we do something different – measure the size of our results div.
```typescript
// this part is specific to the page you're screenshotting
const element = await page.$("#main")

if (!element) {
  throw "Couldn't find results div"
}

const boundingBox = await element.boundingBox()
const imagePath = `/tmp/screenshot-${new Date().getTime()}.png`

if (!boundingBox) {
  throw "Couldn't measure size of results div"
}
```
We look for the results element and grab its `boundingBox()`. That tells us the `x, y` coordinates and the `width, height` size for a more focused screenshot.
We set up an `imagePath` in `/tmp`. We can write to a file on Lambda's hard drive, but it will not stay there. When our lambda turns off, the file is gone.
```typescript
await page.screenshot({
  path: imagePath,
  clip: boundingBox,
})
```
Take a screenshot with `page.screenshot()`. It saves to a file.
```typescript
const data = fs.readFileSync(imagePath).toString("base64")

return {
  statusCode: 200,
  headers: {
    "Content-Type": "image/png",
  },
  body: data,
  isBase64Encoded: true,
}
```
Read the file into a Base64-encoded string and return a response. It must contain a content type – `image/png` in our case – and tell API Gateway that it's Base64-encoded.
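If you'd rather skip the `/tmp` file, Puppeteer can hand you the image directly: `page.screenshot()` returns a Buffer by default and accepts an `encoding: "base64"` option. A minimal alternative sketch:

```typescript
// alternative: keep the screenshot in memory instead of writing to /tmp
const base64 = await page.screenshot({
  clip: boundingBox,
  encoding: "base64",
})
```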
This is where you'd upload to S3 in production.
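Here's a hedged sketch of what that upload could look like with the AWS SDK. The bucket name is a placeholder, and your Lambda would need `s3:PutObject` permissions configured in `serverless.yml`:

```typescript
// sketch: upload the screenshot to S3 and return its URL instead of raw bytes
import { S3 } from "aws-sdk"

const s3 = new S3()

async function uploadScreenshot(data: string) {
  const Key = `screenshots/${Date.now()}.png`

  await s3
    .putObject({
      Bucket: "your-screenshot-bucket", // placeholder bucket name
      Key,
      Body: Buffer.from(data, "base64"),
      ContentType: "image/png",
    })
    .promise()

  return {
    statusCode: 200,
    body: `https://your-screenshot-bucket.s3.amazonaws.com/${Key}`,
  }
}
```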
You can try mine here: https://4tydwq78d9.execute-api.us-east-1.amazonaws.com/dev/screenshot
How to use this
The most common use cases for Chrome Puppeteer are:
- Running automated tests
- Scraping websites cheaply
- Generating dynamic HTML-to-PNG images
- Generating PDFs
The last two are great because you can build a small website that renders a social card for your content and use this machinery to turn it into an image.
Same for PDFs – build a dynamic website, print-to-PDF with Chrome. It's easier than generating PDFs by hand.
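A hedged sketch of the PDF version, reusing the same Chrome setup. `page.pdf()` is a real Puppeteer method; the options shown are just one reasonable configuration:

```typescript
// sketch: render any URL to a PDF with the same browser setup
// (assumes the Browser type import from earlier)
async function renderPdf(browser: Browser, url: string) {
  const page = await browser.newPage()
  await page.goto(url, { waitUntil: ["domcontentloaded", "networkidle2"] })

  // page.pdf() returns a Buffer with the printed document
  const pdf = await page.pdf({
    format: "A4",
    printBackground: true,
  })

  return {
    statusCode: 200,
    headers: { "Content-Type": "application/pdf" },
    body: pdf.toString("base64"),
    isBase64Encoded: true,
  }
}
```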
Have fun 😊
Next chapter we look at handling secrets.