Table of Contents
· Overview
· Prerequisites
· About ZooKeeper
· Solr with ZooKeeper
· ZooKeeper on Windows Azure
· Distributed SolrCores using ZooKeeper and Solr
· Getting started with ZooKeeper on Azure
· Setting up Java, ZooKeeper and Solr on the VMs
· Start up ZooKeeper to sync SolrCores across Multiple VMs
· Crawling Sites and Building Indexes with Nutch
· Getting Nutch data into Solr
· Testing your configuration – bringing down a server
· Conclusion
Overview
This tutorial shows you how to set up multiple Solr instances (called SolrCores) synchronized across more than one server, with synchronization managed by Apache ZooKeeper. At the end of this tutorial you will have one ZooKeeper server synchronizing two Solr servers that index data crawled by Nutch.
Prerequisites
This tutorial requires a Windows Azure subscription. If you don’t have a subscription, you can sign up for a free trial subscription on WindowsAzure.com.
About ZooKeeper
ZooKeeper is a useful tool for synchronizing virtual machines and services on Windows Azure, as well as on-premises servers and even other cloud providers. ZooKeeper maintains the state of systems in memory and also stores status information in local log files for session and system persistence. It’s designed to coordinate large numbers of processes running on servers as nodes in a cluster.
ZooKeeper stores state in what it calls znodes, which are in their simplest form small files held in memory on each ZooKeeper server and persisted locally on disk. Clients in the cluster can update these znodes, and any client can register to “watch” a znode. Applications can synchronize tasks across the ZooKeeper cluster by updating a znode and having ZooKeeper inform the “watchers” of that specific change. Updates are serialized through a master server, so that tasks across a distributed set of clients are applied in a single, consistent order.
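As a rough illustration of the znode idea, the sketch below uses the zkCli.sh client from the ZooKeeper bin/ directory to create, read and update a znode. The znode path /demo-status, its values, and the ZK_HOME, ZK_SERVER and DRY_RUN variables are made up for this example. It also assumes that zkCli.sh accepts a single command after the -server option, which is our understanding of the 3.4.x releases. With DRY_RUN=1 (the default here) the commands are only printed, so you can review them before pointing them at a live server.

```shell
#!/bin/sh
# Hypothetical locations; adjust to your own install and server address.
ZK_HOME="${ZK_HOME:-$HOME/zookeeper-3.4.5}"
ZK_SERVER="${ZK_SERVER:-127.0.0.1:2181}"
DRY_RUN="${DRY_RUN:-1}"

# Run (or, in dry-run mode, just print) one zkCli.sh command.
zk() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$ZK_HOME/bin/zkCli.sh -server $ZK_SERVER $*"
    else
        "$ZK_HOME/bin/zkCli.sh" -server "$ZK_SERVER" "$@"
    fi
}

zk create /demo-status ready   # create a znode holding the value "ready"
zk get /demo-status            # read the value back
zk set /demo-status busy       # update it; clients watching the znode are notified
```

Set DRY_RUN=0 once the ZooKeeper server from the later sections is running to execute the same three commands for real.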
Solr with ZooKeeper
Solr uses a customized version of ZooKeeper for cluster configuration and coordination in its new distributed SolrCloud offering. SolrCloud distributes indexing and search capabilities across a cluster for availability and performance.
Solr calls each server instance a SolrCore, which contains a single index. Multiple SolrCores can form a collection, which shares a single logical index spanning those SolrCores. Collections can in turn be organized into a single SolrCloud, managed by the internal, customized ZooKeeper.
Solr’s own ZooKeeper instance may be good for testing, but there may be cases where you want to synchronize multiple Solr instances together with plug-ins, add-ons or processes from other applications, as well as server processes. This tutorial describes how to set up a ZooKeeper implementation that can address these requirements.
Distributed SolrCores using ZooKeeper and Solr
By default, Solr’s internal ZooKeeper will synchronize SolrCloud shards on the same server, but for our ZooKeeper example, we’re going to show multiple SolrCores distributed across servers. That means that multiple cores at multiple IP addresses are made to look like one server.
Our external implementation of ZooKeeper manages the SolrCores and keeps them in sync. That way, Solr data gets indexed, loaded and searched using multiple SolrCores, but if one of those SolrCores goes down, functionality is not affected.
Here’s a graphical representation of what we build in this example, to get us started:
Getting started with ZooKeeper on Azure
To get this example started, we created three VMs:
· One running ZooKeeper
· Two running Solr
Each is a separate, independent VM with its own IP address. ZooKeeper synchronizes the two Solr instances and manages the infrastructure, so that if one of the indexes goes down, the entire data set can still be searched, because the indexes on the Solr VMs are synchronized on a regular basis.
We set up each virtual machine from the Azure management portal, using the virtual machines tab, and created a VM using a Linux image from the gallery:
In this case we used the Ubuntu Server 12.10 image:
As this is a demo we used the small Azure VM size, which supports 1 core and 1.75 GB memory, and we automatically generated the storage account in the West US Region. Repeat these steps for the other two VMs.
Configuring your VM
Once the images are set up and you have access to them via the dashboard, you now need to configure the ports so that the VMs can see one another and the outside world. Here’s how we set up the ports for our VMs:
NAME        PROTOCOL   PUBLIC PORT   PRIVATE PORT   LOAD BALANCED
ZooKeeper   TCP        2181          2181           NO
Solr        TCP        8983          8983           NO
Solr2       TCP        9983          9983           NO
Setting up Java, ZooKeeper and Solr on the VMs
Next, we will be installing and configuring ZooKeeper and Solr on the VMs, along with Nutch on the Solr servers to crawl web sites for some sample data to work with.
Now we need to SSH into each VM to place and run files. There are as many command line interfaces for Linux as there are opinions about which is best. I tend to prefer PuTTY when I’m working on tasks that require pure command line processing, and for file transfers and package manipulation I like the UI in WinSCP. For both tools, you can get the connection info for host and IP address in the Azure dashboard, on the right under the quick glance section:
Oh, and one more thing - hopefully you recorded the user name and password when you created the VMs.
Installing Java
Before we can run ZooKeeper or Solr, Java needs to be installed on each machine. At the time of writing we used JDK 7 update 21 (the version reflected in the paths below), which is compatible with Solr and ZooKeeper. Choose the Linux x64 package from the list of download options.
There are a couple of Java environment variables that need to be set. You can put these in your shell profile (for example ~/.bashrc) or execute them from the command line:
export NUTCH_JAVA_HOME='/usr/lib/jvm/java/java/jdk1.7.0_21/jre'
export JAVA_HOME='/usr/lib/jvm/java/java/jdk1.7.0_21/jre'
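Before moving on, it can help to confirm that the path you exported really contains a Java launcher. The check below is a small sketch: the function name check_java_home is our own, and it only tests that bin/java exists and is executable under the given directory.

```shell
#!/bin/sh
# check_java_home DIR: succeed if DIR/bin/java exists and is executable.
check_java_home() {
    if [ -x "$1/bin/java" ]; then
        echo "OK: found $1/bin/java"
    else
        echo "Missing: $1/bin/java" >&2
        return 1
    fi
}

# Example: report on the current setting without aborting the shell.
check_java_home "${JAVA_HOME:-/nonexistent}" || echo "JAVA_HOME needs attention"
```

If the check fails, look at the directory layout under /usr/lib/jvm and adjust the two export lines to match what is actually installed.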
Setting up ZooKeeper
ZooKeeper will synchronize services between machines. The first step is download and placement. We used version 3.4.5 for this example.
Run the following command from your home directory on each VM:
wget http://apache.mirrors.hoobly.com/zookeeper/zookeeper-3.4.5/zookeeper-3.4.5.tar.gz
Then unpack the downloaded software:
tar zxfv zookeeper-3.4.5.tar.gz
From the directory where you want to run ZooKeeper, copy conf/zoo_sample.cfg to conf/zoo.cfg (the default config file name for ZooKeeper), then edit zoo.cfg so that the dataDir setting points to a data directory that you create (ours is /var/lib/zookeeper):
cp conf/zoo_sample.cfg conf/zoo.cfg
dataDir=/var/lib/zookeeper
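The copy and edit can also be scripted. The helper below is a sketch under two assumptions: the sample file contains a line starting with dataDir=, and the data directory path contains no "|" character (it is used as the sed delimiter). The function name make_zoo_cfg is ours.

```shell
#!/bin/sh
# make_zoo_cfg SAMPLE OUT DATADIR
# Copy the sample config to OUT, rewriting its dataDir line to DATADIR.
make_zoo_cfg() {
    sed "s|^dataDir=.*|dataDir=$3|" "$1" > "$2"
}

# Typical use from the unpacked ZooKeeper directory (data directory as in the text):
#   sudo mkdir -p /var/lib/zookeeper
#   make_zoo_cfg conf/zoo_sample.cfg conf/zoo.cfg /var/lib/zookeeper
```

All other settings from the sample file are carried over unchanged, so the client port stays at 2181, matching the endpoint configured earlier.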
Updating the hosts file with the Public Virtual IP Address
Solr uses the localhost URI for some internal functions. Because we’re sharing SolrCores across servers, we need to use the Public Virtual IP Address for all functions on the Solr server that rely on localhost. This can be done by editing the VM’s hosts file (located at /etc/hosts) and entering the Public Virtual IP Address for that VM.
In the Azure management portal, you can find the Public Virtual IP Address on the VM’s dashboard, in the location shown below:
Update the IP address in the hosts file:
<Public Virtual IP Address> localhost
Starting ZooKeeper
Once the files are set up on the server, you have a data directory ready to go, and you have configured zoo.cfg and edited your hosts file to show the Public Virtual IP Address as the localhost value, you’re ready to start ZooKeeper:
bin/zkServer.sh start
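To confirm that ZooKeeper actually came up, you can run bin/zkServer.sh status on the ZooKeeper VM, or send the server the four-letter command ruok over its client port, to which a healthy server answers imok. The sketch below wraps the second approach; it assumes nc (netcat) is installed, and the function name zk_is_ok is ours.

```shell
#!/bin/sh
# zk_is_ok HOST [PORT]: succeed if the server answers "imok" to "ruok".
zk_is_ok() {
    reply=$(echo ruok | nc -w 2 "$1" "${2:-2181}" 2>/dev/null)
    [ "$reply" = "imok" ]
}

# Example (127.0.0.1 stands in for your ZooKeeper VM's address):
if zk_is_ok 127.0.0.1; then
    echo "ZooKeeper is answering"
else
    echo "No imok reply; check zoo.cfg, the endpoint, and the server log"
fi
```

Running the same check from one of the Solr VMs against the ZooKeeper VM’s Public Virtual IP Address also verifies that the 2181 endpoint is reachable from outside.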
Setting up the Solr Servers
From the second VM, download and unpack Solr using the following commands:
wget http://mirror.cc.columbia.edu/pub/software/apache/lucene/solr/4.2.1/solr-4.2.1.tgz
tar zxfv solr-4.2.1.tgz
You need to explicitly specify the Public Virtual IP Address of the current VM as the default IP address. Otherwise the internal IP may be used on startup, which cannot be seen by the other ZooKeeper and Solr servers.
Open the solr.xml file and find the cores config settings. It should look like this:
<cores adminPath="/admin/cores" defaultCoreName="collection1" host="${host:}" hostPort="${jetty.port:}" hostContext="${hostContext:}" zkClientTimeout="${zkClientTimeout:15000}">
  <core name="collection1" instanceDir="collection1"/>
</cores>
Change host="${host:}" to match your Solr VM’s Public Virtual IP Address:
host="<enter Public Virtual IP Address>"
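If you would rather script this change than edit the file by hand, a sed substitution along these lines should work. It is a sketch: the function name set_solr_host is ours, 203.0.113.10 is a documentation-only address standing in for your VM’s Public Virtual IP Address, and the path in the usage comment assumes the stock solr-4.2.1 example layout.

```shell
#!/bin/sh
# set_solr_host FILE IP
# Replace the host="${host:}" placeholder in FILE with host="IP".
set_solr_host() {
    sed 's|host="\${host:}"|host="'"$2"'"|' "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Usage (path assumes the stock example layout):
#   set_solr_host solr-4.2.1/example/solr/solr.xml 203.0.113.10
```

Only the host attribute is touched; hostPort and hostContext keep their placeholders, so the port can still be set with -Djetty.port on the command line.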
Setting up the second Solr Server
When done, copy the entire solr-4.2.1/ directory to the other Solr server, then repeat the solr.xml change above using that server’s Public Virtual IP Address.
Starting the Servers
To start ZooKeeper, enter this command:
bin/zkServer.sh start
Start the first Solr server with this command:
java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkHost=<ZooKeeper VM Public Virtual IP Address>:2181 -DnumShards=2 -jar start.jar
Start the second Solr server with this command:
java -Djetty.port=8983 -DzkHost=<ZooKeeper VM Public Virtual IP Address>:2181 -jar start.jar
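The two start commands differ only in the bootstrap options that the first server uses to upload its configuration to ZooKeeper. The helper below is a sketch that assembles either form from a few arguments, so the same script can be kept on both Solr VMs. The function name solr_start_cmd and the 203.0.113.5 address are placeholders of ours, and the helper always passes -Djetty.port explicitly, which the first command above leaves at its default.

```shell
#!/bin/sh
# solr_start_cmd ZK_HOST JETTY_PORT [NUM_SHARDS]
# Print the java command line; when NUM_SHARDS is given, also include the
# bootstrap options that the first server uses to upload its configuration.
solr_start_cmd() {
    cmd="java -Djetty.port=$2 -DzkHost=$1:2181"
    if [ -n "${3:-}" ]; then
        cmd="$cmd -Dbootstrap_confdir=./solr/collection1/conf"
        cmd="$cmd -Dcollection.configName=myconf -DnumShards=$3"
    fi
    echo "$cmd -jar start.jar"
}

solr_start_cmd 203.0.113.5 8983 2   # first Solr server
solr_start_cmd 203.0.113.5 8983     # second Solr server
```

The helper only prints the command, which makes it easy to review before running it from the directory that contains start.jar.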
Start up ZooKeeper to sync SolrCores across Multiple VMs
As I mentioned in the introduction, we’re going to show multiple SolrCores distributed across servers, with an external implementation of ZooKeeper managing SolrCores and keeping them in sync, for redundancy and failover.
Let’s start up the cross-server SolrCores now. On Server 1 and 2, enter the following command:
java -Djetty.port=9983 -DzkHost=<ZooKeeper Public Virtual IP Address>:2181 -jar start.jar
You now have access to two Solr servers at the following addresses.
http://<Solr VM1 Public Virtual IP Address>:8983/solr/#/
http://<Solr VM2 Public Virtual IP Address>:8983/solr/#/
We chose 9983 for the SolrCloud shards but you can use any port that is not in use.
Here’s a screen shot of the finished configuration.
Crawling Sites and Building Indexes with Nutch
Now that we have the servers running, the next step is to crawl a couple of Web sites and build indexes of items to search. The logical choice for our crawler is Nutch, which was created for Solr’s predecessor search engine, Lucene. An interesting bit of history: Hadoop grew out of Nutch, whose MapReduce engine and distributed file system were later split off into a separate project.
Nutch is one of many tools that enhance the Solr ecosystem, which includes connectors for all kinds of data types. Each of these tools builds an index in a form that Solr can consume and provides the means to load that index into Solr for searching.
Download and Setup Nutch
We downloaded and unzipped Nutch 1.6 from the Apache Nutch site on the first Linux VM with these commands:
wget http://mirror.sdunix.com/apache/nutch/1.6/apache-nutch-1.6-bin.tar.gz
tar zxfv apache-nutch-1.6-bin.tar.gz
Configuring Nutch
Before indexing any data, you need to set some default properties on Nutch.
Open the nutch-site.xml file and insert the XML below between the configuration tags.
The http.agent.name property sets the agent name that Nutch sends in its HTTP request headers when it crawls a Web site.
<property>
<name>http.agent.name</name>
<value>nutch-solr-integration</value>
</property>
<property>
<name>generate.max.per.host</name>
<value>100</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
Defining Web sites to crawl
The first step is to tell Nutch where to crawl by creating a flat file containing the URLs you wish to crawl. Create a new directory under the /nutch folder called /urls, then create a text file in it called seed.txt. Your URLs for crawling go in this file, each on a separate line.
For our example, we crawled two URLs:
http://www.windowsazure.com
http://www.msopentech.com
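The seed file can be created by hand or with a couple of shell commands. The helper below is a minimal sketch (the function name write_seeds is ours) that creates the directory if needed and writes one URL per line.

```shell
#!/bin/sh
# write_seeds DIR URL...
# Create DIR if needed and write each URL on its own line to DIR/seed.txt.
write_seeds() {
    dir=$1
    shift
    mkdir -p "$dir" && printf '%s\n' "$@" > "$dir/seed.txt"
}

# Usage, from the Nutch install directory:
#   write_seeds urls http://www.windowsazure.com http://www.msopentech.com
```

Running it again overwrites seed.txt, so list every site you want crawled each time.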
Configuring URLs to Crawl
Nutch decides which URLs it is allowed to fetch using another configuration file called regex-urlfilter.txt in the /conf folder.
Add the following values below the line # accept anything else, replacing the default +. rule that follows it so that only these domains are accepted:
+^http://([a-z0-9]*\.)*windowsazure\.com/
+^http://([a-z0-9]*\.)*msopentech\.com/
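It is easy to get a filter pattern subtly wrong, so it can be worth trying a few URLs against the accept patterns before starting a crawl. The sketch below does this with grep -E, using the two patterns with the dots in the domain names escaped. Nutch itself evaluates the patterns with Java regular expressions, so this is only an approximation, although patterns this simple behave the same way in both. The function name url_allowed and the example.com test URL are ours.

```shell
#!/bin/sh
# url_allowed URL: succeed if URL matches one of the two accept patterns.
url_allowed() {
    printf '%s\n' "$1" | grep -Eq \
        -e '^http://([a-z0-9]*\.)*windowsazure\.com/' \
        -e '^http://([a-z0-9]*\.)*msopentech\.com/'
}

for u in http://www.windowsazure.com/ http://msopentech.com/blog/ http://example.com/; do
    if url_allowed "$u"; then echo "accept $u"; else echo "reject $u"; fi
done
```

Note that both patterns end in a slash, so a URL written without any path, such as http://www.windowsazure.com, does not match them as they stand.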
Nutch is now ready to run:
<Nutch Install Dir>/bin/nutch crawl urls -dir crawl -depth 2 -topN 50
-dir tells Nutch which directory to place crawl results in, -depth tells Nutch how many link levels to crawl, and -topN tells Nutch the maximum number of pages to retrieve at each level.
Getting Nutch data into Solr
Solr describes the documents it indexes using a schema, defined in a file called schema.xml. The schema lists the fields that will be filled with data and made ready to be searched. We need to tell Solr about the fields that Nutch produces, so we’ll add the following values to the schema.xml file on each server.
Editing the schema.xml file
Find the text below, which marks the beginning of the field section of schema.xml.
<!-- field names should consist of alphanumeric or underscore characters only and
not start with a digit. This is not currently strictly enforced,
but other field names will not have first class support from all components
and back compatibility is not guaranteed. Names with both leading and
trailing underscores (e.g. _version_) are reserved.
-->
Under this, add the following to define the fields for Solr to index and search:
<field name="digest" type="text_general" stored="true" indexed="true"/>
<field name="boost" type="text_general" stored="true" indexed="true"/>
<field name="segment" type="text_general" stored="true" indexed="true"/>
<field name="host" type="text_general" stored="true" indexed="true"/>
<field name="site" type="text_general" stored="true" indexed="true"/>
<field name="tstamp" type="text_general" stored="true" indexed="false"/>
<field name="anchor" type="text_general" stored="true" indexed="true" multiValued="true"/>
<field indexed="true" multiValued="true" name="body" omitNorms="false" stored="true" type="text_general"/>
<field indexed="true" multiValued="true" name="dateCreated" omitNorms="true" stored="true" type="text_general"/>
<field indexed="true" multiValued="false" name="lastModified" omitNorms="true" stored="true" type="text_general"/>
<field indexed="true" multiValued="false" name="pageCount" omitNorms="true" stored="true" type="int"/>
<field indexed="true" multiValued="false" name="mimeType" omitNorms="true" stored="true" type="string"/>
<field indexed="true" multiValued="true" name="author_display" omitNorms="true" stored="false" type="string"/>
Configure Solr Data Loading via Solr Web Services
To load data into Solr, we will use Solr’s Web service interface, configured by requestHandlers. First we add a new requestHandler so Solr knows how to listen for requests from Nutch.
Edit the solrconfig.xml file on each server, adding the following requestHandler:
<requestHandler name="/nutch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">
      content^0.5 anchor^1.0 title^1.2
    </str>
    <str name="pf"> content^0.5 anchor^1.5 title^1.2 site^1.5 </str>
    <str name="fl"> url </str>
    <int name="ps">100</int>
    <bool name="hl">true</bool>
    <str name="q.alt">*:*</str>
    <str name="hl.fl">title url content</str>
    <str name="f.title.hl.fragsize">0</str>
    <str name="f.title.hl.alternateField">title</str>
    <str name="f.url.hl.fragsize">0</str>
    <str name="f.url.hl.alternateField">url</str>
    <str name="f.content.hl.fragmenter">regex</str>
  </lst>
</requestHandler>
Load Nutch crawl data into Solr
Run the following command to load Nutch data into Solr:
<Nutch Install Dir>/bin/nutch solrindex http://<VMx Public Virtual IP Address>:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
You should get something like this as a response:
Testing your configuration – bringing down a server
To shut down the server 1 replica 2 SolrCloud shard, enter the following command:
java -DSTOP.PORT=9983 -jar start.jar --stop
Because ZooKeeper is running outside of Solr, if a replica goes down, your data can still be queried with full result sets. For example, if I shut down server 1 replica 2 using the above command and then take a look at the server tree in Solr, it should look like this:
But if I run a query I will still get full results as expected.
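One way to check this from the command line is to ask each remaining Solr server how many documents match a match-all query and compare the counts before and after the shutdown. The sketch below assumes curl is installed, that the collection is the default collection1 used in this tutorial, and that the JSON response contains a "numFound":N entry. The function names num_found and solr_count are ours.

```shell
#!/bin/sh
# num_found: read a Solr JSON response on stdin and print its numFound value.
num_found() {
    sed -n 's/.*"numFound":\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# solr_count HOST: ask one Solr server how many documents match *:*.
solr_count() {
    curl -s "http://$1:8983/solr/collection1/select?q=*:*&rows=0&wt=json" | num_found
}

# Usage (replace the placeholders with your VMs' addresses):
#   solr_count <Solr VM1 Public Virtual IP Address>
#   solr_count <Solr VM2 Public Virtual IP Address>
```

If the setup is working as described, the count reported by a surviving server should be the same as it was before the replica was stopped.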
Conclusion
For the purposes of this example we’ve configured one ZooKeeper instance and two Solr VMs as the minimum to test our configuration, but you could scale up much more than that. With ZooKeeper managing Solr, as long as at least one SolrCloud shard is accessible anywhere that ZooKeeper is keeping things in sync, you will still have the ability to index documents and run queries.
Try it out yourself, and let us know what you think!