Scrape Web URL

Scrape Web URL

Extract data from web pages by scraping their content.

Using the Scrape Web URL Node

The Scrape Web URL node's functionality includes fetching the content of a web page for the provided url, the node accepts below parameters:

  • URL (required): The URL of the page you want to scrape. Example: https://docs.buildship.com (opens in a new tab).

  • Selector (optional): Specific HTML selector you want to extract text content from (by default body will be used).

  • Steps (optional): List of steps to follow after loading the page in given url.

    Usage Example: Suppose you want to scrape the below information:

    Buildship google search

    So, the steps input value would look something like below:

    {
      "url": "https://www.google.com/",
      "selector": "#result-stats",
      "steps": [
        // type "buildship" in google search input box
        {
          "action": "type",
          "params": ["#APjFqb", "buildship"]
        },
     
        // click on google search button
        {
          "action": "click",
          "params": [".gNO89b"]
        },
     
        // wait for searched query to load and if page doesn't load within 3 seconds, move to next step
        {
          "action": "waitForNavigation",
          "params": [
            {
              "timeout": 3000,
              "waitUntil": "load"
            }
          ]
        }
      ]
    }

    The root selector value is the selector from which you want to extract the text-content, after all steps are executed.

    Each step object in steps list consists of action and params.

    The action parameter is the name of any method from puppeteer-page-methods (opens in a new tab) list. And, the params is list of parameters required in the selected action (a puppeteer method name).

    For one of the action - type, the parameters for the puppeteer-type-method (opens in a new tab) are:

    Puppeteer type method parameters

    Hence, the step object for type action would look like:

    {
      // puppeteer method name
      "action": "type",
     
      // "#APjFqb" is the "selector" (the selector to find <input>)
      // "buildship" is the "text" (the value to be typed in <input>)
      // As per parameters list of "type" method, the third parameter is optional,
      // hence we can either include or exclude it from "params" list
      "params": ["#APjFqb", "buildship"]
    }