A solr cloud consists of solr running on multiple hosts, the more the merrier. Coordination between the hosts is provided by Zookeepers running on multiple hosts, typically three. I have provided scripts to deploy and localize the solrs and Zookeepers, scripts to start and stop the solrs and Zookeepers, and a script to create a configset for the cord database.

To make the scripts simpler, I expect the hosts to have names that include their number in the set of hosts: e.g. solr-01, solr-02, ... The scripts parse the hostname of the machine they are running on to identify the host in the set. The deploy, start and stop scripts are run on each host. They have names that include the word "Node" in them to indicate that they run on the hosts. To simplify the running of the scripts on many hosts, there are scripts that include the word "Cloud" in them. Those scripts use the "parallel" command to run the "Node" scripts on the multiple hosts. There is a file, named solrCloudHostlist.txt, that lists the hosts, one per line, and is used by the "parallel" command.
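For a concrete picture, a "Cloud" wrapper might look something like this (a minimal sketch; passwordless ssh and the location of the Node scripts on the remote hosts are assumptions):

```bash
#!/bin/bash
# startSolrCloud.sh (sketch): run the matching Node script on every host.
# Reads one hostname per line from solrCloudHostlist.txt and substitutes
# it at {}; assumes the Node scripts sit in the login directory.
parallel -a solrCloudHostlist.txt ssh {} './startSolrNode.sh'
```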

All of the node specific scripts start by querying for the hostname and parsing that name to extract a hostnumber. They also set environment variables to point at the location of the java executable and to provide a list of the Zookeepers.
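That common prologue might look roughly like this (a sketch; the JDK path is an assumption, and 2181 is Zookeeper's default client port):

```bash
#!/bin/bash
# Common prologue for the Node scripts (sketch).
host=$(hostname -s)        # e.g. "solr-03"
hostnumber=${host##*-}     # everything after the last "-": "03"

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk   # assumed JDK location
PATH="$JAVA_HOME/bin:$PATH"

# The three Zookeepers, on Zookeeper's default client port.
zklist="solr-01:2181,solr-02:2181,solr-03:2181"
```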

The deploySolrNode.sh script downloads version 8.5.2 of solr from an Apache mirror and unpacks it. It edits the solrconfig.xml file to increase the amount of memory the solr can use. It makes a directory on /export for the files that this solr will use/create. Each solr gets its own directory, and the host number is used in the directory name. It creates a configset for the cord database, starting by copying solr's default solrconfig.xml file to the cord configset directory. That file is edited to point the solr at the directory it should use for its data. The script copies the DataImportHandler's configuration file, DIHconfigfile.xml, which Art created, to the cord configset directory. (This file controls how the records are imported from the Reader relational database into the solr database.) The solrconfig.xml file is edited to add a DataImportHandler and to point it at the DIHconfigfile. The (static, controlled) cord indexing schema is copied into the cord configset directory. Some utility files that the schema expects are copied from the solr distribution into the cord configset directory. The default solr.xml file is copied to the new solr data directory. Version 3.30.1 of the SQLite JDBC jar, required by the DataImportHandler, is downloaded from Maven and put into a directory on solr's default classpath. That completes the solr deployment.
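Condensed to its essentials, the solr half of the deploy does something like this (a sketch: the configset file names and the classpath directory are assumptions, the edits to solrconfig.xml are omitted, and the jar is the xerial sqlite-jdbc build from Maven Central):

```bash
#!/bin/bash
# deploySolrNode.sh, solr half (sketch). Uses $hostnumber from the prologue.
SOLR_VERSION=8.5.2
wget "https://archive.apache.org/dist/lucene/solr/${SOLR_VERSION}/solr-${SOLR_VERSION}.tgz"
tar xzf "solr-${SOLR_VERSION}.tgz"

# Per-node directory on /export; the host number keeps them distinct.
nodedir="/export/solr/node${hostnumber}"
mkdir -p "${nodedir}/data" "${nodedir}/configsets/cord/conf"

# Seed the cord configset from solr's default config, then add the DIH
# config and the controlled cord schema (file names assumed here).
cp "solr-${SOLR_VERSION}/server/solr/configsets/_default/conf/solrconfig.xml" \
   "${nodedir}/configsets/cord/conf/"
cp DIHconfigfile.xml cord-schema.xml "${nodedir}/configsets/cord/conf/"

# The default solr.xml goes into the new data directory.
cp "solr-${SOLR_VERSION}/server/solr/solr.xml" "${nodedir}/data/"

# SQLite JDBC driver for the DataImportHandler, onto solr's classpath.
wget -P "solr-${SOLR_VERSION}/server/lib" \
  "https://repo1.maven.org/maven2/org/xerial/sqlite-jdbc/3.30.1/sqlite-jdbc-3.30.1.jar"
```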

The deploySolrNode.sh script then downloads version 3.6.1 of Zookeeper from an Apache mirror. This and the subsequent Zookeeper deploy steps are only run on the first three solr nodes; that's Apache's recommended number of Zookeepers. Zookeeper's example configuration file becomes its real configuration file by being renamed to zoo.cfg. That file is then edited to give the Zookeeper the list of all the Zookeepers and their port numbers. That completes the Zookeeper deployment.
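The Zookeeper half, again condensed (a sketch; note that every ensemble member also needs a myid file under its dataDir, which the example config defaults to /tmp/zookeeper):

```bash
#!/bin/bash
# deploySolrNode.sh, Zookeeper half (sketch); only runs when hostnumber <= 3.
ZK_VERSION=3.6.1
wget "https://archive.apache.org/dist/zookeeper/zookeeper-${ZK_VERSION}/apache-zookeeper-${ZK_VERSION}-bin.tar.gz"
tar xzf "apache-zookeeper-${ZK_VERSION}-bin.tar.gz"
cd "apache-zookeeper-${ZK_VERSION}-bin"

# The shipped example config becomes the real config.
mv conf/zoo_sample.cfg conf/zoo.cfg

# List the whole ensemble; 2888/3888 are the standard quorum/election ports.
cat >> conf/zoo.cfg <<'EOF'
server.1=solr-01:2888:3888
server.2=solr-02:2888:3888
server.3=solr-03:2888:3888
EOF

# Each member identifies itself by number in a myid file.
mkdir -p /tmp/zookeeper
echo $((10#$hostnumber)) > /tmp/zookeeper/myid   # 10# strips any leading zero
```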

The startSolrNode.sh script starts by testing whether it is running on one of the first three solr nodes. If so, it cd's to the Zookeeper directory and runs "bin/zkServer.sh start". Then, regardless of the hostnumber, it cd's to the solr directory and runs "bin/solr start -c -z $zklist -s /export/solr/node$hostnumber/data -DzkClientTimeout=600000". This starts the solr in cloud mode ("-c"), points it at the list of Zookeepers ("-z $zklist"), tells it where its data goes ("-s /export/solr/node$hostnumber/data"), and tells it to wait for 10 minutes before giving up on a non-responsive Zookeeper ("-DzkClientTimeout=600000", in milliseconds).
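In outline (a sketch; the unpacked directory names are assumptions):

```bash
#!/bin/bash
# startSolrNode.sh (sketch). Uses $hostnumber and $zklist from the prologue.
if [ "$hostnumber" -le 3 ]; then
  (cd apache-zookeeper-3.6.1-bin && bin/zkServer.sh start)
fi
cd solr-8.5.2
bin/solr start -c -z "$zklist" \
  -s "/export/solr/node${hostnumber}/data" \
  -DzkClientTimeout=600000
```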

The stopSolrNode.sh script cd's to the solr directory and runs "bin/solr stop -all". Then, if it is running on one of the first three solr nodes, it cd's to the Zookeeper directory and runs "bin/zkServer.sh stop".
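And the mirror image (same assumed directory names):

```bash
#!/bin/bash
# stopSolrNode.sh (sketch).
(cd solr-8.5.2 && bin/solr stop -all)
if [ "$hostnumber" -le 3 ]; then
  (cd apache-zookeeper-3.6.1-bin && bin/zkServer.sh stop)
fi
```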

The createNewCollections.sh script tells the Zookeepers about the cord configset. The Zookeepers will tell the various solr instances about the configset. This script needs to be run once on the solr-01 server after a new solr/Zookeeper deploy, with at least the solr and Zookeeper on solr-01 running (though really, why aren't they all up and running?). The script cd's to the solr directory and runs "bin/solr zk upconfig -n cord -d /export/solr/node$hostnumber/configsets/cord/conf -z $zklist". I know that seems backwards. Why are we asking the solr to tell the Zookeepers something for us? Beats me, that's just how they do it. solr tells Zookeeper to update its configuration information ("zk upconfig"), the new configuration will be named "cord" ("-n cord"), and the directory with the configset for cord is /export/solr/node$hostnumber/configsets/cord/conf. The last parameter tells the solr where the Zookeepers can be found ("-z $zklist").
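In script form (a sketch, with the same assumed directory names):

```bash
#!/bin/bash
# createNewCollections.sh (sketch); run once on solr-01 after a deploy.
cd solr-8.5.2
bin/solr zk upconfig -n cord \
  -d "/export/solr/node${hostnumber}/configsets/cord/conf" \
  -z "$zklist"
```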

How do we use this stuff to run the solr cloud?

If this is a fresh system, or the system configuration has changed and you need to add or remove solr nodes, then run deploySolrCloud.sh.

Run startSolrCloud.sh to start the system.

If this is the first time you've started the system since a deploy, then ssh to solr-01 and run createNewCollections.sh.

You can test if the system is up by poking http://solr-01:8983/solr. If you point wget or curl at that URL, you should get some sort of response. If you point a browser at that URL, you'll get the solr administrator console. Poke the "Cloud" button on the left and you'll see the status of all the nodes in the cloud.
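From the command line, something along these lines works (CLUSTERSTATUS is part of solr's Collections API and returns the same information the Cloud page displays):

```bash
# Is anything answering at all?
curl -sf "http://solr-01:8983/solr/" > /dev/null && echo "solr is up"

# Cluster-wide view of the nodes and shards, like the Cloud page.
curl -s "http://solr-01:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json"
```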

If this is the first time you've run the system after a deploy, you need to create the cord database. This could be scripted in a static environment (unchanging names and number of hosts), but it's just as easy to use the administrator interface. Click on "Collections" on the left. If the page that comes up already has the cord database listed, then you don't have to do anything. If it is missing, click the "Add Collection" button. A dialog pops up asking for the name of the collection ("cord" without quotes), configset (pull down the list and select "cord"), numShards (set to 4 today, but should be the number of hosts in the cloud) and replicationFactor (1). Poke the "Add Collection" button in the dialog box and you're done.
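If you did want to script it, the equivalent Collections API call would look like this (numShards=4 matching today's four hosts):

```bash
# Create the cord collection against the uploaded cord configset.
curl -G "http://solr-01:8983/solr/admin/collections" \
  -d action=CREATE -d name=cord -d collection.configName=cord \
  -d numShards=4 -d replicationFactor=1
```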

You use the administrator interface to load the database (yes, this could be scripted too). Pull down the "Collection Selector" list on the left and select "cord". Click on "Dataimport" in the list that appears. The page that appears is all set to do a full import of the Reader database. Simply press the "Execute" button. It will seem as if nothing happens for a few minutes, but the DataImportHandler is busy querying the Reader database and assembling the list of records to be added. Eventually you'll see a status showing records being added. Last time I ran this, it took 7 minutes for the whole process to complete.
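The same import can be kicked off and watched from the command line (assuming the DataImportHandler was registered at the conventional /dataimport path):

```bash
# Start the full import; DIH runs it asynchronously.
curl "http://solr-01:8983/solr/cord/dataimport?command=full-import"

# Poll this until the response shows the import has finished.
curl "http://solr-01:8983/solr/cord/dataimport?command=status"
```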

Run stopSolrCloud.sh to stop the system. I'd do this if the system was misbehaving and needed to be restarted. I suppose you might want to do this if you'd been warned that the computer system was going down, but honestly, there's not much to corrupt. The database is static after it has been loaded. I'd just leave it running and start it all up again after the computers are up. The startSolrCloud script could probably be added to the system restart script, but given all the dependencies on all the other hosts, I probably wouldn't do that.

That replicationFactor of 1 that was set when the database was built is important. Replication is how you make the system robust if a server fails. But replication slows down the database build. After the database is built, the ModifyCollection and AddReplica commands can be used to replicate each node's data onto other nodes. This should all happen in the background and not impact the usability of the system. I've not implemented any of this, as the criticality of the database is low and the robustness of the servers is high. (It seems easier to just rebuild after a catastrophe than to try to make the system survive one.)
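If you ever do want replicas, the calls would look roughly like this (a sketch; shard1 is solr's default name for the first shard, and ADDREPLICA would be repeated for each shard):

```bash
# Raise the collection's nominal replication factor.
curl -G "http://solr-01:8983/solr/admin/collections" \
  -d action=MODIFYCOLLECTION -d collection=cord -d replicationFactor=2

# Add a replica of one shard; solr picks the node unless you pass node=...
curl -G "http://solr-01:8983/solr/admin/collections" \
  -d action=ADDREPLICA -d collection=cord -d shard=shard1
```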

date: 7/10/2020
author: Ralph LeVan