| copyright |
|
||
|---|---|---|---|
| lastupdated | 2017-08-18 |
{:shortdesc: .shortdesc} {:new_window: target="_blank"} {:tip: .tip} {:pre: .pre} {:codeblock: .codeblock} {:screen: .screen} {:javascript: .ph data-hd-programlang='javascript'} {:java: .ph data-hd-programlang='java'} {:python: .ph data-hd-programlang='python'} {:swift: .ph data-hd-programlang='swift'}
The Data Crawler collects the raw data that is eventually used to form search results for the {{site.data.keyword.discoveryshort}} service. When crawling data repositories, the crawler downloads documents and metadata, starting from a user-specified seed URL. The crawler discovers documents in a hierarchy, or otherwise linked from the seed URL, and enqueues these for retrieval. {: shortdesc}
-
Java Runtime Environment version 8 or higher
Your
JAVA_HOMEenvironment variable must be set correctly, or not be set at all, in order to run the Crawler. {: tip} -
Red Hat Enterprise Linux 6 or 7, or Ubuntu Linux 15 or 16. For optimal performance, the Data Crawler should run on its own instance of Linux, whether it is a virtual machine, a container, or hardware.
-
Minimum 2 GB RAM on the Linux system
-
Open a browser and log into your {{site.data.keyword.Bluemix}} account
{: new_window}.
-
From your {{site.data.keyword.Bluemix_notm}} Dashboard, select the {{site.data.keyword.discoveryshort}} service you previously created.
-
Click the Download Data Crawler link to download the Data Crawler.
-
Verify that you are running Java Runtime Environment version 8 or higher. Run the command
java -version, and look for 1.8. If you are running something earlier than 1.8, you need to upgrade Java by installing the Java Developer Kit (JDK) 8 from your package management system, from the IBM JDK{: new_window} website, or from java.com
{: new_window}.
Your
JAVA_HOMEenvironment variable must be set correctly, or not be set at all, in order to run the Crawler. {: tip} -
As an administrator, use the appropriate commands to install the archive file that you downloaded:
- On systems such as Red Hat and CentOS that use rpm packages, use a command such as the following:
rpm -i /full/path/to/rpm/package/rpm-file-name - On systems such as Ubuntu and Debian that use deb packages, use a command such as the following:
dpkg -i /full/path/to/deb/package/deb-file-name - The Crawler scripts are installed into
{installation_directory}/bin; for example,/opt/ibm/crawler/bin. Ensure that{installation_directory}/binis in yourPATHenvironment variable for the Crawler commands to work correctly.
Crawler scripts are also installed to
/usr/local/bin, so this can be added to yourPATHenvironment variable as well. {: tip} - On systems such as Red Hat and CentOS that use rpm packages, use a command such as the following:
-
Create your working directory by copying the contents of the
{installation_directory}/share/examples/configdirectory to a working directory on your system, for example/home/config.Warning: Do not modify the provided configuration example files directly. Copy and then edit them. If you edit the example files in-place, your configuration may be overwritten when upgrading the Data Crawler, or may be removed when uninstalling it.
Note: References in the rest of this guide to files in the
configdirectory, such asconfig/crawler.conf, refer to that file in your working directory, and NOT in the installed{installation_directory}/share/examples/configdirectory. -
You are now ready to configure the Data Crawler to connect to your repository
The Data Crawler download places the following folders on your system:
doc- Contains files with copyright and licensing information.bin- Script files for running the crawler.connectorFramework- The files in this directory are what allow you to talk to your data, whether internal data within the enterprise, or external data on the web or in the cloud.lib- Library files used by the crawler.sharedoc- Provides both HTML and Markdown-formatted documentation files.examples/config- Files that let you tell the crawler which data to use for its crawl, where to send your collection of crawled data once the crawl has been completed, and other crawl management options.man- In-product manual page crawler documentation.
- The Data Crawler may hang when running the Filesystem connector with an invalid or missing URL.
- Configure the
urls_to_filtervalue in thecrawler.conffile, such that all the whitelist URLs or RegExes are included in a single RegEx expression. See Configuring crawl options for more information. - The path to the configuration file passed in the
--config -coption must be a qualified path. That is, it must be in the relative formatsconfig/crawler.confor./crawler.conf, or absolute path/path/to/config/crawler.conf. Specifying justcrawler.confis only possible if theorchestration_service.conffile is in-lined instead of referenced usingincludein thecrawler.conffile.