Web Crawler: How to Connect and Collect
- Updated on 26 Jun 2024
In this article:
Web Crawler Overview
Web Crawler Requirements
How to Connect and Collect Using Web Crawler
Web Crawler Overview
Onna's web crawler was created to index web pages.
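Conceptually, indexing a web page means fetching its HTML and extracting the links to visit next. The sketch below is a minimal illustration of that idea using only Python's standard library; it is not Onna's implementation, and the example URLs are placeholders.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkIndexer(HTMLParser):
    """Collect the href targets of anchor tags, resolved against a base URL."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        # Relative links are resolved against the page's own URL,
        # so the crawler always works with absolute addresses.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

indexer = LinkIndexer("https://example.com/docs/")
indexer.feed('<a href="page.html">Page</a> <a href="https://example.com/other">Other</a>')
print(indexer.links)
# ['https://example.com/docs/page.html', 'https://example.com/other']
```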
| Connector Features | |
| --- | --- |
| Authorized Connection Required? No | Is identity mapping supported? No |
| Audit logs available? Yes | Admin Access? No |
| Supports a full archive? No | Custodian based collections? No |
| Preserve in place with ILH? No | Resumable sync supported? No |
| Supports Onna preservation? No | Syncs future users automatically? No |
| Sync modes supported: | Is file versioning supported? No |

| Types of Data Collected | Metadata Collected |
| --- | --- |
| | |
Web Crawler Considerations
The web crawler does not currently support password-protected or CAPTCHA-protected websites.
You can't collect files in their native format from the links on a web page during collection. Because links on a crawled page are embedded, the web crawler cannot pull the linked files in their native formats.
Web Crawler Requirements
When adding a new Web Crawler sync, you must enter each URL with its protocol included (http:// or https://).
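If you prepare URL lists programmatically, you can check that each one carries an explicit http/https scheme before adding it to a sync. A minimal sketch using Python's standard library (the function name is our own, not part of the product):

```python
from urllib.parse import urlparse

def has_supported_protocol(url: str) -> bool:
    """Return True if the URL starts with an explicit http or https scheme."""
    scheme = urlparse(url).scheme
    return scheme in ("http", "https")

print(has_supported_protocol("https://example.com"))  # True
print(has_supported_protocol("example.com"))          # False: no protocol given
```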
How to Connect and Collect Using Web Crawler
To create a new Web Crawler collection, follow the steps below:
Step 1
Click on ‘Workspaces’ in the main menu (a), then click on the workspace where you’d like to add a new sync (b).
Step 2
Click on the ‘+’ icon in the upper right corner to add a new source.
Step 3
Select the Web Crawler connector from your list of available connectors.
Step 4
To configure your sync, start by entering a name for your source in the ‘Name’ field (a). Then, enter the URL you want to collect from (b). Finally, click the blue ‘Done’ button (c).
Step 5
You’ll now see your new source appear alphabetically in the list of ‘Connected sources’ in your workspace.