Web Crawler: How to Connect and Collect
  • 26 Jun 2024

In this article:

  • Web Crawler Overview

  • Web Crawler Requirements

  • How to Connect and Collect Using Web Crawler

Web Crawler Overview

Onna's web crawler was created to index web pages.

Connector Features

Authorized Connection Required? No

Is identity mapping supported? No

Audit logs available? Yes

Admin Access? No

Supports a full archive? No

Custodian based collections? No

Preserve in place with ILH? No

Resumable sync supported? No

Supports Onna preservation? No

Syncs future users automatically? No

Sync modes supported:

  • One-time sync

Is file versioning supported? No

Types of Data Collected

  • Headings

  • Subheadings

  • Paragraphs

  • Metadata

  • Pictures

  • Links

  • Text links

Metadata Collected

  • There is no Web Crawler-specific metadata.
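To make the content types listed above more concrete, the following minimal sketch shows how headings, paragraphs, links, and pictures map to elements of an ordinary HTML page. It uses only Python's standard library and is not Onna's implementation; the class name PageContentCollector and the sample page are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class PageContentCollector(HTMLParser):
    """Hypothetical helper: groups page content by type (illustration only)."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []    # text of <h1>..<h6> elements
        self.paragraphs = []  # text of <p> elements
        self.links = []       # href values of <a> elements
        self.pictures = []    # src values of <img> elements
        self._current_tag = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.HEADING_TAGS or tag == "p":
            # Start buffering the text of a heading or paragraph.
            self._current_tag = tag
            self._buffer = []
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.pictures.append(attrs["src"])

    def handle_data(self, data):
        # Keep text only while inside a heading or paragraph.
        if self._current_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current_tag:
            text = "".join(self._buffer).strip()
            if text:
                target = self.paragraphs if tag == "p" else self.headings
                target.append(text)
            self._current_tag = None


# Hypothetical sample page.
SAMPLE_HTML = """
<h1>Welcome</h1>
<h2>Getting started</h2>
<p>See <a href="https://example.com/a">the docs</a>.</p>
<img src="logo.png" alt="Logo">
"""

collector = PageContentCollector()
collector.feed(SAMPLE_HTML)
print("Headings:", collector.headings)
print("Paragraphs:", collector.paragraphs)
print("Links:", collector.links)
print("Pictures:", collector.pictures)
```

Note that the link inside the paragraph is recorded as a link and its text is still part of the paragraph, which is roughly the distinction between "Links" and "Text links" in the list above.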

Web Crawler Considerations

  • The web crawler does not currently support password-protected or CAPTCHA-protected websites.

  • The web crawler cannot collect files in their native format from the links on a web page. Links are collected as embedded links, so the files they point to are not pulled in their native formats.

Web Crawler Requirements

  • When adding a new Web Crawler sync, enter each URL with its protocol (http:// or https://).
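If you prepare a list of URLs ahead of time, a quick check like the one below can flag entries that lack a protocol before you add them. This is a minimal sketch using Python's standard library, not part of Onna; the function name has_required_protocol and the example URLs are hypothetical.

```python
from urllib.parse import urlparse

# Protocols named in the requirement above.
ALLOWED_SCHEMES = {"http", "https"}


def has_required_protocol(url: str) -> bool:
    """Return True if the URL has an http or https protocol and a host."""
    parsed = urlparse(url.strip())
    return parsed.scheme in ALLOWED_SCHEMES and bool(parsed.netloc)


# Hypothetical candidate URLs.
candidates = [
    "https://example.com/docs",
    "http://example.com",
    "example.com/docs",
    "ftp://example.com/file.txt",
]

for url in candidates:
    status = "ok" if has_required_protocol(url) else "missing or unsupported protocol"
    print(f"{url}: {status}")
```

The check only looks at the form of the URL. It does not confirm that the page exists or that the site can be crawled.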

How to Connect and Collect Using Web Crawler

To create a new Web Crawler collection, follow the steps below:

Step 1

Click on ‘Workspaces’ in the main menu (a), then click on the workspace where you’d like to add a new sync (b).

Step 2

Click on the ‘+’ icon in the upper right corner to add a new source.

Step 3

Select the Web Crawler connector from your list of available connectors.

Step 4

To configure your sync, start by entering a name for your source in the ‘Name’ field (a). Then, enter the URL you want to collect from (b). Finally, click the blue ‘Done’ button (c).

Step 5

You’ll now see your new source appear alphabetically in the list of ‘Connected sources’ in your workspace.
