ZoomInfo Crawler FAQ.

What is Zoominfobot and why is it accessing my web server?

Zoominfobot is an indexing robot for a web search engine, similar to Google’s. Created by Zoom Information Inc. (www.zoominfo.com), Zoominfobot’s patented technology continually scans millions of corporate websites, press releases, electronic news services, SEC filings, and other online sources. Using advanced natural language processing algorithms, ZoomInfo has created a next-generation search engine focused on finding pages with information about businesses and business professionals.

It is very important to us that Zoominfobot respect the access restrictions of the websites it visits. To this end, it obeys all robot exclusion files (robots.txt), and when a website includes more than a few dozen pages, it spaces out its requests to reduce load. It also never opens more than one request to a website at a time. If Zoominfobot is accessing a private portion of your website, it is because it found a link to that area from one of your pages and there was no restriction in a robots.txt file (see below for more information on what a robots.txt file is). Zoominfobot also checks the IP address of each website it visits to ensure that different instances of Zoominfobot do not simultaneously access multiple websites hosted on the same machine.
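The politeness rules described above (one request at a time, spaced-out requests per host) can be sketched as a simple per-host rate limiter. This is only an illustrative sketch, not ZoomInfo’s actual code; the five-second delay is an assumed value.

```python
import time

class PoliteFetcher:
    """Illustrative per-host rate limiter: enforce a minimum delay
    between successive requests to the same host.
    (The 5-second default is an assumption, not ZoomInfo's value.)"""

    def __init__(self, min_delay=5.0):
        self.min_delay = min_delay
        self.last_request = {}  # host -> time of the last (scheduled) request

    def wait_before_fetch(self, host):
        """Return how many seconds to sleep before requesting from `host`."""
        now = time.monotonic()
        last = self.last_request.get(host)
        delay = 0.0
        if last is not None:
            elapsed = now - last
            if elapsed < self.min_delay:
                delay = self.min_delay - elapsed
        # Record when this request will actually go out.
        self.last_request[host] = now + delay
        return delay

fetcher = PoliteFetcher()
fetcher.wait_before_fetch("example.com")  # first visit to this host: no wait
```

A single-threaded crawler would call `wait_before_fetch` (and sleep for the returned interval) before every request, which also guarantees only one open request per site at a time.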

How can I get Zoominfobot to visit my website?

If you would like to include your website in Zoominfobot’s list of sites to visit, send an email to sitesubmit@zoominfo.com. Please note that it may take several weeks for Zoominfobot to visit and index your website.

Why doesn’t Zoominfobot visit all of the pages on my site? It seems to be skipping certain sections.

Zoominfobot is designed to search for business-related information, and may skip URLs from websites if it determines that the pages are unlikely to contain interesting content. If there are pages you particularly want indexed, make sure that these pages are either linked from the home page or have link text that describes what type of information is on the pages (for example, a link to press releases should have the words “Press Release” somewhere in the link or alt text).

Why does Zoominfobot sometimes try to access non-existent links from my server?

More and more web content is embedded inside JavaScript code and other client-side scripting languages in order to create more visually pleasing websites. Zoominfobot sometimes processes JavaScript code looking for navigation maps and links. Occasionally it will misinterpret a script and try to visit a non-existent link. This is unintentional and is not an attempt to penetrate the security of your web server.
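To illustrate why this happens (this is a hypothetical sketch, not ZoomInfo’s actual extraction logic), a crawler might pull quoted, link-like strings out of script text. Links assembled from concatenated fragments are easy to miss or to reassemble incorrectly, which is how guessed, non-existent URLs can end up being requested.

```python
import re

def extract_candidate_links(script_text):
    """Naive sketch: pull quoted strings that look like URLs or paths
    out of JavaScript source. Illustrative only."""
    pattern = r"""["'](https?://[^"']+|/[^"']*\.html?)["']"""
    return [m.group(1) for m in re.finditer(pattern, script_text)]

script = """
var base = '/news/';
function go(page) { location.href = base + page + '.html'; }
window.open('/about/team.html');
"""
# Finds the literal link, but the link built from base + page + '.html'
# can only be guessed at -- and a wrong guess is a non-existent URL.
print(extract_candidate_links(script))  # -> ['/about/team.html']
```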

How do I prevent Zoominfobot (or other robots) from indexing my site?

The quickest way to prevent Zoominfobot from visiting your site is to put these two lines into the /robots.txt file on your server:

User-agent: Zoominfobot
Disallow: /

For more information on how to exclude all robots or only exclude them from certain sections of your site, see below.
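A well-behaved robot reads this file before requesting any page. As an illustration (not ZoomInfo’s actual code), Python’s standard-library robots.txt parser shows how the two lines above are interpreted; the paths used here are placeholders:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Feed the two-line exclusion file directly instead of fetching it.
rp.parse([
    "User-agent: Zoominfobot",
    "Disallow: /",
])

# Zoominfobot is barred from every path on the site...
print(rp.can_fetch("Zoominfobot", "/private/page.html"))   # False
# ...but this record does not affect any other robot.
print(rp.can_fetch("SomeOtherBot", "/private/page.html"))  # True
```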

Why is Zoominfobot asking for a file called robots.txt that isn’t on my server?

robots.txt is a standard document that can tell robots not to index some or all information from your web server. For information on how to create a robots.txt file, see the section below.

Where do I find out how robots.txt files work?

For a complete overview of how robots.txt exclusion files work, visit: http://www.robotstxt.org/wc/norobots.html

The basic concept is simple. By writing a structured text file you can indicate to robots that certain parts of your server are off-limits to some or all robots. It is best explained with an example:

# /robots.txt file for controlling indexing
User-agent: webcrawler
Disallow:

User-agent: googlebot
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs

The first line, starting with ‘#’, specifies a comment.

The first paragraph specifies that the robot called ‘webcrawler’ has nothing disallowed: it may go anywhere.

The second paragraph indicates that the robot called ‘googlebot’ has all relative URLs starting with ‘/’ disallowed. Because all relative URLs on a server start with ‘/’, this means the entire site is closed off.

The third paragraph indicates that all other robots should not visit URLs starting with /tmp or /logs. Note that ‘*’ is a special token meaning “any other User-agent”; you cannot use wildcard patterns or regular expressions in either User-agent or Disallow lines.

Two common errors:

  • Wildcards are not supported: instead of ‘Disallow: /tmp/*’ just say ‘Disallow: /tmp/’.
  • You shouldn’t put more than one path on a Disallow line.
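If you want to double-check a file like the example above, Python’s standard-library robots.txt parser applies the same rules. The robot names are the ones from the example; ‘anybot’ and the paths stand in for any other robot and URLs:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "# /robots.txt file for controlling indexing",
    "User-agent: webcrawler",
    "Disallow:",
    "",
    "User-agent: googlebot",
    "Disallow: /",
    "",
    "User-agent: *",
    "Disallow: /tmp",
    "Disallow: /logs",
])

print(rp.can_fetch("webcrawler", "/tmp/scratch.html"))  # True: nothing disallowed
print(rp.can_fetch("googlebot", "/index.html"))         # False: whole site closed
print(rp.can_fetch("anybot", "/tmp/scratch.html"))      # False: matches the '*' record
print(rp.can_fetch("anybot", "/products.html"))         # True: path not disallowed
```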

Where do I find out more about robots?

The Web robots home page is at: http://www.robotstxt.org/wc/robots.html