Web Scraper


Introduction

The Byteline Web Scraper node makes it easy to extract content and data from a website. This documentation shows how to extract data from any website by using its underlying HTML to specify the elements you want. We will use the Byteline Web Scraper Chrome Extension to configure the data to be scraped.

Note: The Web Scraper can extract elements such as text, links, rich text, and images from a website.

This documentation assumes the flow is initiated with a simple Scheduler node, but note that you can use the Web Scraper node with any trigger. For more details, see How to Create your First Flow Design.

Follow the steps outlined below to extract data from any website. 

Configure

Step 1: Select the Web Scraper node from the Select Node window. 

Step 2: Click on the Edit button to open the Web Scraper node configuration window. 

Step 3: Launch the website you want to scrape in a separate tab of your browser to copy its URL. For this documentation, we are scraping the prices of cryptocurrencies from Coinbase.  

Step 4: Enter the Website URL you want to scrape in the Web Scraper URL field in the Byteline console. 

Step 5: Download and install the Byteline Chrome extension in your browser. 

Note: Download the Byteline Web Scraper Chrome Extension from here.

Click on the puzzle piece-shaped extension button in the top-right corner of the browser. 

After that, click on the Pin button as shown below to pin the extension to your browser.  

Step 6: Click on the toggle button to enable the Byteline extension.

Step 7: Select either Capture Single Element or Capture List Elements, depending on your requirement.

Capture Single Element

Select Capture Single Element to copy a single element.

Once selected, simply hover the cursor over the element you want to copy and it will be highlighted in yellow.

Click on the element you want to copy, and a dialog box will appear. Select from the following options as per your requirements.

Here, the Text option is chosen because we have selected the cryptocurrency entry 'Bitcoin' to scrape.

Click on the Copy button to copy the XPath.
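
To see what such an XPath does under the hood, here is a minimal Python sketch using the lxml library. The markup and XPath are simplified stand-ins, not Coinbase's real page structure or the exact XPath the extension copies for you.

    from lxml import html

    # Simplified stand-in markup; a real page is far more complex.
    doc = html.fromstring("""
    <table>
      <tr><td class="name">Bitcoin</td><td class="symbol">BTC</td></tr>
      <tr><td class="name">Ethereum</td><td class="symbol">ETH</td></tr>
    </table>
    """)

    # A single-element XPath pins down exactly one node,
    # here via the positional index [1].
    print(doc.xpath('//tr[1]/td[@class="name"]/text()'))  # ['Bitcoin']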

Capture List Elements

Select Capture List Elements if you want to scrape repeating elements.

Once selected, simply hover the cursor over any element in the list and the whole list will be highlighted in yellow.

Once you're happy with the list selection, click once to confirm it, and the list turns green.

Now, to select an element within the list, click the element you want to copy, and a dialog box will appear. Select from the following options as per your requirements.

Here, Text is chosen because we have selected the cryptocurrency entry 'Bitcoin' to scrape.

Note: Make sure the selection includes all of the elements in the list. If some elements are not selected, move the cursor slightly until all of them are.

Click on the Copy button to copy the XPath.
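
For comparison, a list XPath drops the positional index so it matches every repeating row, which is effectively what Capture List Elements gives you. Same simplified stand-in markup as above:

    from lxml import html

    doc = html.fromstring("""
    <table>
      <tr><td class="name">Bitcoin</td></tr>
      <tr><td class="name">Ethereum</td></tr>
      <tr><td class="name">Solana</td></tr>
    </table>
    """)

    # Without the positional index, the XPath matches the whole list.
    print(doc.xpath('//tr/td[@class="name"]/text()'))
    # ['Bitcoin', 'Ethereum', 'Solana']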

Step 8: Switch to the Byteline console and click on the 'Paste from the Chrome Extension' button; the console will automatically paste the copied value of the element into the XPath field.  

Step 9: Once you click the 'Paste from the Chrome Extension' button, it creates a value based on the field you selected on the website you want to scrape.

Give a name to the list.

Step 10: Specify the name for the column you want to scrape.

Step 11: Now, switch back to the Coinbase tab to copy the next element you want to scrape. We have selected the cryptocurrency ticker symbol 'BTC'.

Step 12: Click on the 'Paste from the Chrome Extension' button and give a name to the field you are scraping.

Step 13: Repeat steps 11 and 12 to scrape the Coinbase prices column and fill in its Field Name.

Step 14: Click on the Save button to save changes.  

Deploy

If auto-deploy is enabled, you can skip this step.

After configuring the flow, deploy it by clicking the Deploy button in the top-right corner of the interface.

Run

Run the created flow by clicking on the Test Run button in the top right corner of the interface.

Now, click on the 'i' (more information) button in the top-right corner of the Web Scraper node to check the extracted data. 

You will see an output window as illustrated below: 

Scraping behind a login wall

Byteline supports scraping pages behind a login wall. It works by using the Byteline Web Scraper Chrome extension to copy your login cookies and then pasting them into the Byteline Console. Please follow the steps below:

Copying Cookies using the Chrome extension
  1. Log in to the website you want to scrape and go to the web page for which you're configuring the scraper.
  2. Click on the Byteline Web Scraper Chrome extension installed in your Chrome browser, and then hit the "Capture Cookies" button. You don't need to enable the toggle button to capture cookies.
Pasting Cookies on the Byteline Console

Now go to the Byteline Console, open your flow, and edit the Web Scraper task. Go to the Cookies tab, and click on the "Paste from chrome extension" button.

That's it! Now when you run the flow, it can scrape the data behind the login wall.
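
Conceptually, the captured cookies are replayed with every request the scraper makes, so the site sees your logged-in session. A minimal Python sketch of the idea, with placeholder cookie names, values, and URL:

    import requests

    # Placeholder cookies; in Byteline these come from "Capture Cookies".
    cookies = {
        "session_id": "PASTE-FROM-EXTENSION",
        "csrf_token": "PASTE-FROM-EXTENSION",
    }

    # The cookies ride along with the request, so the page behind the
    # login wall is served as if you were browsing it yourself.
    resp = requests.get("https://example.com/account/dashboard",
                        cookies=cookies, timeout=30)
    print(resp.status_code)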

Scraping a list of URLs

You can scrape a list of URLs in the following ways:

  1. Configure the Scheduler Trigger node with data retrieved from a Google Spreadsheet. You can then use any spreadsheet column in the Web Scraper URL field. For step-by-step instructions, check out the how to build a lists crawler use case page.
  2. Retrieve a list of URLs from any Byteline integration, e.g., Airtable, and then use the Byteline loop over in the Web Scraper config, as sketched below.
Instead of retrieving complete URLs, you can also retrieve just the dynamic URL parts and then create a URL using Byteline expressions.
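
The loop-over approach boils down to running the same scrape once per URL. A rough Python sketch, assuming a hypothetical list of URLs and a hypothetical XPath:

    import requests
    from lxml import html

    # Hypothetical URLs, e.g. pulled from a spreadsheet or Airtable.
    urls = [
        "https://example.com/item/1",
        "https://example.com/item/2",
    ]

    for url in urls:
        tree = html.fromstring(requests.get(url, timeout=30).content)
        # Use the XPath copied from the Chrome extension here.
        print(url, tree.xpath("//h1/text()"))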

Using expressions for the URL field

You can use an expression to specify the complete URL, as shown below.

Or you can specify just the changing part of the URL using an expression, as shown below.
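
The exact expression syntax is Byteline's own; as an analogy only, composing a URL from a fixed base and a changing part looks like this in Python, with a hypothetical URL pattern:

    # "symbol" stands in for a dynamic value, e.g. a spreadsheet column.
    symbol = "btc"
    url = f"https://example.com/price/{symbol}"  # hypothetical pattern
    print(url)  # https://example.com/price/btc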

Pagination

The Web Scraper supports pagination, which lets you scrape multiple pages from a website.

Pagination is used when all the data is not on a single page: a pagination mechanism serves the first page, then the second page, then the third page, and so on.

Web Scraper supports three different types of pagination.

  1. Horizontal Scrolling
  2. Vertical Scrolling
  3. Infinite Scrolling

Pagination - Horizontal Scrolling

Horizontal scrolling is used for websites where you have to scroll pages horizontally to access more data.

Step 1: Go to the website and enable the Byteline Web Scraper Chrome extension.

Step 2: Double click on the next page link/button.

Step 3: Select the Text radio button, and click the Single Element button to copy the XPath.

Step 4: Paste the XPath in the Next page button/link XPath box.

Step 5: Select the maximum number of pages you want to scrape.

Note: You can also select all pages for scraping by checking the box before All pages, in which case the “Max pages to scrape” value is ignored.

Step 6: Click on the Save button to save the changes.
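
Under the hood, next-page pagination (whether the control moves the page horizontally or vertically) is a click-scrape loop. A minimal sketch with Playwright, using a placeholder URL and placeholder XPaths:

    from playwright.sync_api import sync_playwright

    MAX_PAGES = 5  # corresponds to "Max pages to scrape"

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com/listings")  # placeholder URL
        for _ in range(MAX_PAGES):
            # Scrape the current page (placeholder XPath).
            print(page.locator("xpath=//tr/td[1]").all_text_contents())
            # Click the next-page control (placeholder XPath), if present.
            next_btn = page.locator("xpath=//a[@aria-label='Next']")
            if next_btn.count() == 0:
                break
            next_btn.click()
            page.wait_for_load_state("networkidle")
        browser.close()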

Pagination - Vertical Scrolling

Vertical scrolling is used for websites where you have to scroll pages vertically to access more data.

Step 1: Go to the website and enable the Byteline Web Scraper Chrome extension.

Step 2: Double click on the next page link/button.

Step 3: Select the Text radio button, and click the Single Element button to copy the XPath.

Step 4: Paste the XPath in the Next page button/link XPath box.

Step 5: Select the maximum number of pages you want to scrape.

Note: You can also select all pages for scraping by checking the box before All pages, in which case the “Max pages to scrape” value is ignored.

Step 6: Click on the Save button to save the changes.

Pagination - Infinite Scrolling

Infinite scrolling is used for websites where you have to keep scrolling to access more data. It is used to scrape data from websites like Twitter, Facebook, Quora, and similar sites. In simple words, it applies wherever the page keeps loading more content as you scroll.

With infinite scrolling there is no button to click, so you just need to enable it by checking the box before Enable Infinite Scrolling.
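
For reference, infinite scrolling amounts to repeatedly scrolling to the bottom so the page loads more content before scraping. A short Playwright sketch with placeholder values:

    from playwright.sync_api import sync_playwright

    SCROLLS = 5  # rough analogue of "Max pages to scrape"

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com/feed")  # placeholder URL
        for _ in range(SCROLLS):
            page.mouse.wheel(0, 10000)   # scroll down
            page.wait_for_timeout(1500)  # give new items time to load
        # Placeholder XPath for the loaded items.
        items = page.locator("xpath=//article//h2").all_text_contents()
        print(len(items), "items loaded")
        browser.close()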

Note: You can select the maximum number of pages you want to scrape.

Once done, click on the Save button to save the changes.

Note: Leave the “Max pages to scrape” text field blank to scrape data from all pages.

You have successfully saved your pagination settings.

Feel free to contact us if you have any questions.

