Skip to content

Scraping

Scraping a web page.

Endpoint

GET or POST https://api.browserku.com

Your First Request

Let's make GET request to scrape https://example.com:

bash
curl https://api.browserku.com \
    -G \
    --data-urlencode "url=https//example.com"

The result will be a JSON object which looks like this:

json
{
  "meta": {
    "lang": "en",
    "title": "NASA",
    "description": "NASA.gov brings you the latest news, images and videos from America's space agency, pioneering the future in space exploration, scientific discovery and aeronautics research."
  }
}

Browserku automatically extracts important information from this web page for you:

  • meta.title: The page title
  • meta.description: The page description
  • meta.lang: The page language, from lang attribute on the <html> element
  • meta.themeColor: The page theme color, from <meta name="theme-color"> element

If you want to retrive the entire html from this page, use includeHtml option, a new html property will be added to the result object.

Taking Screenshots

To take screenshots, using the screenshot option, the easiest way is to set it to true:

bash
curl https://api.browserku.com \
    -G \
    --data-urlencode "url=https//example.com" \
    --data-urlencode "screenshot=true"

Now the result will be a JSON object which looks like this:

json
{
  "meta": {
    "title": "NASA",
    "description": "NASA.gov brings you the latest news, images and videos from America's space agency, pioneering the future in space exploration, scientific discovery and aeronautics research.",
    "lang": "en"
  },
  "screenshot": {
    "url": "https://r2.browserku.com/beGjNFfLCLSanjFqIawPx.png",
    "width": 1280,
    "height": 720,
    "size": 794292
  }
}

The saved screenshot url will be valid for 7 days, after that it will be automatically deleted. Consult screenshot option to learn more.

If you want Browserku to directly return the screenshot image instead of the JSON object as the response, you can use response option and set it to screenshot.url:

bash
curl https://api.browserku.com \
    -G \
    --data-urlencode "url=https//example.com" \
    --data-urlencode "screenshot=true" \
    --data-urlencode "response=screenshot.url"
ts
const { result } = await browserku.scrape({
  url: "https://example.com",
  screenshot: true,
  response: "screenshot.url",
})

Basically you can use response option to reference any value in the result object and make Browserku use it as the actuall response. Make sure the value you referenced is a URL.

Query Parameters / Body

url

  • Type: string?

The URL you want to scrape.

source

  • Type: string?

Instead of using an external url, you can provide a source HTML to scrape.

Learn more about Using Custom HTML.

includeHtml

  • Type: boolean?

Whether to include the scraped HTML in the response.

waitUntil

  • Type: One of load, domcontentloaded, networkidle0, networkidle2
  • Default: networkidle0

Define when the page is considered ready.

timeout

  • Type: int? >=0 <=60,000

Page nagivation timeout in milliseconds.

blockAds

  • Type: boolean?
  • Default: true

Whether to block ads.

proxy

  • Type: string?

Proxy server to use.

device

  • Type: string?

Simulate the viewport, userAgent of the given device.

userAgent

  • Type: string?

Override the user agent.

viewport

  • Type: object?
PropertyTypeDescription
widthint
heightint
deviceScaleFactorint?Must be betwwen 0 ~ 10
isMobileboolean?
hasTouchboolean?
isLandscapeboolean?

Override the viewport.

auth

  • Type: object?
PropertyTypeDescription
usernamestring
passwordstring

Basic authentication.

waitForSelector

  • Type: string?

Wait for a selector to appear to consider the page as ready.

waitForTimeout

  • Type: int? >=0 <=60,000

Wait for a timeout (in milliseconds) to consider the page as ready.

animations

  • Type: boolean?
  • Default: false

Whether to enable CSS animations.

scripts

  • Type: string[]?

Inject custom scripts into the page.

json
[
  "document.body.style.backgroundColor = 'red'",
  "https://your-website.com/custom-script.js"
]

Can be either a JavaScript string or a URL to your script.

styles

  • Type: string[]?

Inject custom styles into the page.

json
["body {background: red}", "https://your-website.com/custom-style.css"]

Can be either a CSS string or a URL to your CSS file.

click

  • Type: string?

Click an element when the page is ready.

cookies

  • Type: object[]?
PropertyTypeDescription
namestring
valuestring
domainstring?
pathstring?
secureboolean?
httpOnlyboolean?
sameSitestring?Can be either Lax or Strict

Attach cookies to the request.

ttl

  • Type: string?
  • Default: 1d

Define how long the response should be cached for.

We use ms to parse human-readable time into milliseconds.

noCache

  • Type: boolean

Skip cache (if exists).

rejectResourceTypes

  • Type: string[]?

Reject requests to the given resource types.

Available resource types:

ts
type ResourceType =
  | "document"
  | "stylesheet"
  | "image"
  | "media"
  | "font"
  | "script"
  | "texttrack"
  | "xhr"
  | "fetch"
  | "eventsource"
  | "websocket"
  | "manifest"
  | "other"

allowResourceTypes

  • Type: string[]?

The reverse of rejectResourceTypes, only allow requests to the given resource types to be fetched. By default all requests are allowed.

javascript

  • Type: boolean?
  • Default: true

Whether to enable JavaScript.

screenshot

  • Type: boolean? object?
PropertyTypeDescription
typestring?Can be either jpeg or png, png by default
qualityint?Must be between 0 and 100
selectorstring?Only screenshot the specific element
fullPageboolean?
omitBackgroundboolean?

Take a screenshot of the page or the specified selector.

A new property screenshot will be added to the response:

json
{
  "screenshot": {
    "url": "https://r2.browserku.com/some_random_id.png",
    "width": 800,
    "height": 600,
    "size": 1024
  }
}

pdf

  • Type: boolean? object?
PropertyTypeDescription
scaleint?Must be between 0 and 10
displayHeaderFooterboolean?
headerTemplatestring?
footerTemplatestring?
formatstring?One of "Letter", "Legal", "Tabloid", "Ledger", "A0", "A1", "A2", "A3", "A4", "A5", "A6"
marginobject?
landscapeboolean?
pageRangesstring?
printBackgroundboolean?
omitBackgroundboolean?

response

  • Type: string?

Use a property in response data as the actual response.

For example, let's you have a request:

text
GET /?screenshot=true&url=https://example.com

You get a response like this:

json
{
  "screenshot": {
    "url": "https://r2.browserku.com/some_random_id.png"
  }
}

If you want to directly serve the the screenshot image as the response, you can append &response=screenshot.url to the request URL:

text
GET /?screenshot=true&url=https://example.com&response=screenshot.url

debug

  • Type: boolean?

Add debug information to the result.