Crawl

Base Path: /v1/crawl

Note on Responsible Crawling: This API is designed for generating context in LLM applications, not bulk data collection. Please:

  • Respect the target website's robots.txt and crawl limits

  • Use reasonable delays between requests (we enforce minimum delays)

  • Only crawl publicly accessible pages

  • Consider using official APIs when available

  • Cache results when possible to minimize repeat crawls
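One way to follow the caching advice above is a small in-memory cache keyed by URL with a time-to-live. This is only a sketch: the `CrawlCache` class, its method names, and the one-hour default are illustrative assumptions, not part of the API.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class CrawlCache:
    """Hypothetical in-memory TTL cache for crawl results, keyed by start URL."""

    def __init__(self, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the cache can be tested without waiting
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def put(self, url: str, result: Any) -> None:
        # Store the result together with the time it was cached.
        self._entries[url] = (self._clock(), result)

    def get(self, url: str) -> Optional[Any]:
        entry = self._entries.get(url)
        if entry is None:
            return None
        stored_at, result = entry
        if stored_at + self._ttl <= self._clock():
            # Entry is stale: drop it so the caller knows to crawl again.
            del self._entries[url]
            return None
        return result
```

Checking `cache.get(url)` before starting a crawl, and calling `cache.put(url, result)` once it finishes, avoids repeat crawls of the same site within the TTL window.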

Endpoints

Start Crawl

Method: POST

Endpoint: https://api.tokensource.com/v1/crawl

Description: Initiates a multi-page crawl starting from a URL. Supports depth control, path filtering, and notifications.

Request Body:

{
  "url": "https://coinmarketcap.com",
  "include_paths": ["currencies/", "exchanges/", "nft/"],
  "exclude_paths": ["login/", "settings/", "api/"],
  "max_depth": 2,
  "ignore_sitemap": true,
  "limit": 10,
  "allow_backward_links": true,
  "allow_external_links": true,
  "scrape_options": {
    "formats": ["markdown", "html", "raw_html", "links", "screenshot", "extract"],
    "headers": { "Authorization": "Bearer XYZ" },
    "include_selectors": [".protocol-details", ".market-data", ".exchange-info"],
    "exclude_selectors": [".advertisement", ".user-menu"],
    "main_only_content": true,
    "wait_for": 2000,
    "timeout": 30000,
    "extract": {
      "schema": {
        "coin_name": "string",
        "price_usd": "number",
        "market_cap": "number",
        "volume_24h": "number",
        "change_24h": "number"
      },
      "prompt": "Extract cryptocurrency market data including price, market cap, and 24h volume."
    }
  },
  "callback_url": "https://api.yourservice.com/crypto-webhook"
}
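As a rough illustration, a request like the one above could be sent from Python using only the standard library. The authentication scheme is not stated in this section, so the bearer-token `Authorization` header below is an assumption, as are the helper names `build_crawl_request` and `start_crawl`; the response is returned as parsed JSON without assuming any particular fields.

```python
import json
import urllib.request

CRAWL_URL = "https://api.tokensource.com/v1/crawl"


def build_crawl_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for the Start Crawl endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        CRAWL_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token auth. Check your account docs for the real scheme.
            "Authorization": f"Bearer {api_key}",
        },
    )


def start_crawl(api_key: str, payload: dict, timeout: float = 30.0):
    """Send the request and return (HTTP status, parsed JSON body)."""
    request = build_crawl_request(api_key, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, json.loads(response.read().decode("utf-8"))
```

A call might look like `start_crawl(api_key, {"url": "https://coinmarketcap.com", "max_depth": 2, "limit": 10})`; a 202 status indicates the crawl was accepted.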

Response (202 Accepted):

Get Crawl Status

Method: GET

Endpoint: https://api.tokensource.com/v1/crawl/{crawl_id}

Description: Retrieves the current status of a crawl operation.

Parameters:

  • crawl_id (string, required)
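If a webhook is not practical, the status endpoint can be polled instead. The sketch below builds the status URL and polls with exponential backoff. Because the response fields are not shown in this section, the caller supplies both the `fetch` function and the `is_done` predicate; these names, and the interval defaults, are illustrative assumptions.

```python
import time
import urllib.parse
from typing import Any, Callable

BASE_URL = "https://api.tokensource.com/v1/crawl"


def status_url(crawl_id: str) -> str:
    """Return the Get Crawl Status URL, percent-encoding the crawl ID."""
    return f"{BASE_URL}/{urllib.parse.quote(crawl_id, safe='')}"


def poll_until(fetch: Callable[[], Any],
               is_done: Callable[[Any], bool],
               interval: float = 2.0,
               max_interval: float = 30.0,
               max_attempts: int = 20,
               sleep: Callable[[float], None] = time.sleep) -> Any:
    """Call fetch() until is_done(result) is true, doubling the wait each time."""
    delay = interval
    for attempt in range(max_attempts):
        result = fetch()
        if is_done(result):
            return result
        if attempt < max_attempts - 1:
            sleep(delay)
            # Back off so a long-running crawl is not polled aggressively.
            delay = min(delay * 2, max_interval)
    raise TimeoutError(f"crawl not finished after {max_attempts} status checks")
```

Here `fetch` would wrap a GET to `status_url(crawl_id)`, and `is_done` would inspect whichever field the real response uses to report completion.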

Response (200 OK):

Webhook Events

The crawl endpoint sends updates to your webhook URL (the callback_url supplied in the Start Crawl request) with the following event types:

crawl.started:

crawl.page:

crawl.completed:

crawl.failed:
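The event payloads are not reproduced in this section, so the following is only a sketch of how a receiver might route the four event types to handlers. It assumes the event name arrives in a top-level JSON field called `type`; that field name and the `CrawlEventRouter` class are hypothetical and should be adjusted to the real payload.

```python
import json
from typing import Any, Callable, Dict

EVENT_TYPES = ("crawl.started", "crawl.page", "crawl.completed", "crawl.failed")


class CrawlEventRouter:
    """Routes webhook bodies to per-event handlers (hypothetical helper)."""

    def __init__(self, type_field: str = "type") -> None:
        # Assumption: the event name is carried in this top-level field.
        self._type_field = type_field
        self._handlers: Dict[str, Callable[[Dict[str, Any]], None]] = {}

    def on(self, event_type: str, handler: Callable[[Dict[str, Any]], None]) -> None:
        if event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {event_type!r}")
        self._handlers[event_type] = handler

    def dispatch(self, raw_body: bytes) -> bool:
        """Parse a webhook body and run its handler; return False if unhandled."""
        event = json.loads(raw_body)
        if not isinstance(event, dict):
            return False
        handler = self._handlers.get(event.get(self._type_field))
        if handler is None:
            return False
        handler(event)
        return True
```

A web framework's request handler would pass the raw POST body to `dispatch` and return a 2xx response promptly, leaving slow work to the registered handlers.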

Error Responses

Response (400 Bad Request):

Response (401 Unauthorized):

Response (429 Too Many Requests):
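The error bodies are not shown here, but the three status codes suggest different client reactions: 400 and 401 need a corrected request or credentials, while 429 is worth retrying after a delay. The sketch below encodes that policy. The function names are illustrative, and whether this API sends a `Retry-After` header is an assumption; when the header is absent or not a number of seconds, the code falls back to capped exponential backoff.

```python
from typing import Optional


def classify_status(status: int) -> str:
    """Map an HTTP status code to a suggested client action."""
    if status == 429:
        return "retry"            # rate limited: wait, then try again
    if status == 400:
        return "fix_request"      # malformed body or parameters
    if status == 401:
        return "fix_credentials"  # missing or invalid API key
    if 200 <= status < 300:
        return "ok"
    return "unknown"


def retry_delay(attempt: int, retry_after: Optional[str] = None,
                base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # Honour a server-provided delay in seconds, if one was sent.
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # e.g. an HTTP-date value; fall through to backoff
    return min(base * (2 ** attempt), cap)
```

Only the `"retry"` case should loop; repeating a 400 or 401 request unchanged will fail again and adds needless load.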
