Search Engine Indexing

Purpose

To give administrators control over the behaviour of search engines when they crawl the site.

Details

Search engines (e.g. Google) scan the internet for content to include in their index:

  • Crawling: visiting a website at its root, analysing the index.html page and following every link it contains; searching for a sitemap and visiting every URL it lists (a minimal sitemap is shown below)
  • Indexing: by following a link from a (remote) website, the crawler ends up on a local page (which may or may not be accessible through crawling)
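For illustration, this is what a minimal sitemap looks like in the standard sitemaps.org format (the URL is a placeholder):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://example.com/some-page</loc>
      </url>
    </urlset>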

The feature instructs crawlers whether or not to include the site's content in their index.

Requests for GET /robots.txt are answered dynamically:

  • enabled: access to /my/transfers/* and /my/drive is disallowed; everything else is not explicitly disallowed (see the example responses below)
  • disabled: all access is disallowed
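The corresponding robots.txt bodies would look roughly like this (Disallow rules are prefix matches, so /my/transfers/ also covers /my/transfers/*):

    # enabled: only the listed paths are excluded
    User-agent: *
    Disallow: /my/transfers/
    Disallow: /my/drive

    # disabled: the whole site is excluded
    User-agent: *
    Disallow: /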

Configuration

  • Scope: Configured at the Adminunit level; applies to all Storagehosts of an Adminunit.
  • Privileges: Configurable by an Admin.
  • Default: disabled; search engines are instructed not to store visited pages in their index or to follow links, regardless of the page (a sketch of the handler follows this list).
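A minimal sketch of the dynamic handler, assuming a Flask application and a hypothetical indexing_enabled_for() lookup for the Adminunit-level setting; the X-Robots-Tag header used here is one common way to deliver the noindex/nofollow instruction and is an assumption, not a confirmed implementation detail:

    from flask import Flask, Response, request

    app = Flask(__name__)

    ENABLED_BODY = (
        "User-agent: *\n"
        "Disallow: /my/transfers/\n"
        "Disallow: /my/drive\n"
    )
    DISABLED_BODY = "User-agent: *\nDisallow: /\n"

    def indexing_enabled_for(host: str) -> bool:
        # Hypothetical lookup of the Adminunit-level setting for the
        # Storagehost serving this request; the feature defaults to disabled.
        return False

    @app.route("/robots.txt")
    def robots_txt():
        # Answer GET /robots.txt dynamically based on the configuration.
        body = ENABLED_BODY if indexing_enabled_for(request.host) else DISABLED_BODY
        return Response(body, mimetype="text/plain")

    @app.after_request
    def add_robots_header(resp):
        # While the feature is disabled, also mark every page as
        # "noindex, nofollow" for crawlers that arrive via remote links.
        if not indexing_enabled_for(request.host):
            resp.headers["X-Robots-Tag"] = "noindex, nofollow"
        return resp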

The feature is enabled or disabled in the admin interface:

[Image: toggle for enabling/disabling the feature in the admin interface]

Dependencies

None

Conflicts

None