Scraping

Scraping a web page.

Endpoint

Your First Request

Let's make GET request to scrape https://example.com:

bash

curl https://api.browserku.com \
    -G \
    --data-urlencode "url=https//example.com"

The result will be a JSON object which looks like this:

json

{
  "meta": {
    "lang": "en",
    "title": "NASA",
    "description": "NASA.gov brings you the latest news, images and videos from America's space agency, pioneering the future in space exploration, scientific discovery and aeronautics research."
  }
}

Browserku automatically extracts important information from this web page for you:

meta.title: The page title
meta.description: The page description
meta.lang: The page language, from lang attribute on the <html> element
meta.themeColor: The page theme color, from <meta name="theme-color"> element

If you want to retrive the entire html from this page, use includeHtml option, a new html property will be added to the result object.

Taking Screenshots

To take screenshots, using the screenshot option, the easiest way is to set it to true:

bash

curl https://api.browserku.com \
    -G \
    --data-urlencode "url=https//example.com" \
    --data-urlencode "screenshot=true"

Now the result will be a JSON object which looks like this:

json

{
  "meta": {
    "title": "NASA",
    "description": "NASA.gov brings you the latest news, images and videos from America's space agency, pioneering the future in space exploration, scientific discovery and aeronautics research.",
    "lang": "en"
  },
  "screenshot": {
    "url": "https://r2.browserku.com/beGjNFfLCLSanjFqIawPx.png",
    "width": 1280,
    "height": 720,
    "size": 794292
  }
}

The saved screenshot url will be valid for 7 days, after that it will be automatically deleted. Consult screenshot option to learn more.

If you want Browserku to directly return the screenshot image instead of the JSON object as the response, you can use response option and set it to screenshot.url:

bash

curl https://api.browserku.com \
    -G \
    --data-urlencode "url=https//example.com" \
    --data-urlencode "screenshot=true" \
    --data-urlencode "response=screenshot.url"

const { result } = await browserku.scrape({
  url: "https://example.com",
  screenshot: true,
  response: "screenshot.url",
})

Basically you can use response option to reference any value in the result object and make Browserku use it as the actuall response. Make sure the value you referenced is a URL.

Query Parameters / Body

url

Type: string?

The URL you want to scrape.

source

Type: string?

Instead of using an external url, you can provide a source HTML to scrape.

Learn more about Using Custom HTML.

includeHtml

Type: boolean?

Whether to include the scraped HTML in the response.

waitUntil

Type: One of load, domcontentloaded, networkidle0, networkidle2
Default: networkidle0

Define when the page is considered ready.

timeout

Type: int? >=0 <=60,000

Page nagivation timeout in milliseconds.

blockAds

Type: boolean?
Default: true

Whether to block ads.

proxy

Type: string?

Proxy server to use.

device

Type: string?

Simulate the viewport, userAgent of the given device.

userAgent

Type: string?

Override the user agent.

viewport

Type: object?

Property	Type	Description
width	`int`
height	`int`
deviceScaleFactor	`int?`	Must be betwwen 0 ~ 10
isMobile	`boolean?`
hasTouch	`boolean?`
isLandscape	`boolean?`

Override the viewport.

auth

Type: object?

Property	Type	Description
username	`string`
password	`string`

Basic authentication.

waitForSelector

Type: string?

Wait for a selector to appear to consider the page as ready.

waitForTimeout

Type: int? >=0 <=60,000

Wait for a timeout (in milliseconds) to consider the page as ready.

animations

Type: boolean?
Default: false

Whether to enable CSS animations.

scripts

Type: string[]?

Inject custom scripts into the page.

json

[
  "document.body.style.backgroundColor = 'red'",
  "https://your-website.com/custom-script.js"
]

Can be either a JavaScript string or a URL to your script.

styles

Type: string[]?

Inject custom styles into the page.

json

["body {background: red}", "https://your-website.com/custom-style.css"]

Can be either a CSS string or a URL to your CSS file.

click

Type: string?

Click an element when the page is ready.

cookies

Type: object[]?

Property	Type	Description
name	`string`
value	`string`
domain	`string?`
path	`string?`
secure	`boolean?`
httpOnly	`boolean?`
sameSite	`string?`	Can be either `Lax` or `Strict`

Attach cookies to the request.

ttl

Type: string?
Default: 1d

Define how long the response should be cached for.

We use ms to parse human-readable time into milliseconds.

noCache

Type: boolean

Skip cache (if exists).

rejectResourceTypes

Type: string[]?

Reject requests to the given resource types.

Available resource types:

type ResourceType =
  | "document"
  | "stylesheet"
  | "image"
  | "media"
  | "font"
  | "script"
  | "texttrack"
  | "xhr"
  | "fetch"
  | "eventsource"
  | "websocket"
  | "manifest"
  | "other"

allowResourceTypes

Type: string[]?

The reverse of rejectResourceTypes, only allow requests to the given resource types to be fetched. By default all requests are allowed.

javascript

Type: boolean?
Default: true

Whether to enable JavaScript.

screenshot

Type: boolean? object?

Property	Type	Description
type	`string?`	Can be either `jpeg` or `png`, `png` by default
quality	`int?`	Must be between 0 and 100
selector	`string?`	Only screenshot the specific element
fullPage	`boolean?`
omitBackground	`boolean?`

Take a screenshot of the page or the specified selector.

A new property screenshot will be added to the response:

json

{
  "screenshot": {
    "url": "https://r2.browserku.com/some_random_id.png",
    "width": 800,
    "height": 600,
    "size": 1024
  }
}

pdf

Type: boolean? object?

Property	Type	Description
scale	`int?`	Must be between 0 and 10
displayHeaderFooter	`boolean?`
headerTemplate	`string?`
footerTemplate	`string?`
format	`string?`	One of "Letter", "Legal", "Tabloid", "Ledger", "A0", "A1", "A2", "A3", "A4", "A5", "A6"
margin	`object?`
landscape	`boolean?`
pageRanges	`string?`
printBackground	`boolean?`
omitBackground	`boolean?`

response

Type: string?

Use a property in response data as the actual response.

For example, let's you have a request:

text

GET /?screenshot=true&url=https://example.com

You get a response like this:

json

{
  "screenshot": {
    "url": "https://r2.browserku.com/some_random_id.png"
  }
}

If you want to directly serve the the screenshot image as the response, you can append &response=screenshot.url to the request URL:

text

GET /?screenshot=true&url=https://example.com&response=screenshot.url

debug

Type: boolean?

Add debug information to the result.

Scraping #

Endpoint #

Your First Request #

Taking Screenshots #

Query Parameters / Body #

url #

source #

includeHtml #

waitUntil #

timeout #

blockAds #

proxy #

device #

userAgent #

viewport #

auth #

waitForSelector #

waitForTimeout #

animations #

scripts #

styles #

click #

cookies #

ttl #

noCache #

rejectResourceTypes #

allowResourceTypes #

javascript #

screenshot #

pdf #

response #

debug #

Scraping

Endpoint

Your First Request

Taking Screenshots

Query Parameters / Body

url

source

includeHtml

waitUntil

timeout

blockAds

proxy

device

userAgent

viewport

auth

waitForSelector

waitForTimeout

animations

scripts

styles

click

cookies

ttl

noCache

rejectResourceTypes

allowResourceTypes

javascript

screenshot

pdf

response

debug