How to set up and exploit an Apache Solr environment on Amazon EC2
Solr is an open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world’s largest internet sites.
Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers. An Amazon Machine Image (AMI) is a special type of pre-configured operating system and virtual application software which is used to create a virtual machine within the Amazon Elastic Compute Cloud (EC2). It serves as the basic unit of deployment for services delivered using EC2.
This tutorial uses:
- An AMI consisting of a 32-bit base Fedora 8 install (most Linux-based AMIs will work fine)
- Java version: 1.6.0_29
- Solr version: 3.4.0
Let’s now look at the actual installation steps:
0) Open port 8983
Before starting it’s necessary to open port 8983, which is the port that Solr listens on by default.
This can be done by adding an inbound TCP rule for port 8983 to the security group that the instance launched from the chosen AMI belongs to.
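For readers who manage security groups from the command line, the rule could be sketched with the AWS CLI. That tool is an assumption here, since this tutorial does not say how the rule was added, and the group ID and address range are hypothetical placeholders. The helper only prints the command so that it can be reviewed before it is run.

```shell
#!/bin/sh
# Print (do not run) an AWS CLI command that opens Solr's default port.
# Assumption: the AWS CLI is installed and configured; the tutorial itself
# does not prescribe a tool for editing the security group.
open_solr_port_cmd() {
  # $1 = security group ID, $2 = CIDR range allowed to reach Solr
  printf 'aws ec2 authorize-security-group-ingress --group-id %s --protocol tcp --port 8983 --cidr %s\n' "$1" "$2"
}

# Hypothetical values: replace with your own group ID and client address.
open_solr_port_cmd "sg-0123456789abcdef0" "203.0.113.10/32"
```

Limiting the CIDR range to known client addresses, rather than opening the port to everyone, keeps the Solr admin page from being reachable by the whole internet.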
1) Install Java
To check whether Java is already installed on the machine, enter the command:
java -version
If the response is:
-bash: java: command not found
then Java is not installed on the machine.
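The same check can be scripted, which is handy when preparing several instances. The sketch below is one possible approach and not part of the original steps; the helper name have_command is made up for illustration.

```shell
#!/bin/sh
# Report whether a command is available on the PATH.
# (have_command is a hypothetical helper name.)
have_command() {
  command -v "$1" >/dev/null 2>&1
}

if have_command java; then
  # java -version writes its banner to stderr, so merge it into stdout.
  java -version 2>&1
else
  echo "java is not installed on this machine"
fi
```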
To download and install Java, fetch the JDK installer (jdk-6u29-linux-i586-rpm.bin), then make it executable and run it:
chmod +x jdk-6u29-linux-i586-rpm.bin
./jdk-6u29-linux-i586-rpm.bin
2) Install Solr
To get Solr, download apache-solr-3.4.0.tgz from an Apache mirror, then extract the downloaded archive:
tar xzf apache-solr-3.4.0.tgz
3) Start Solr
Solr comes with its own servlet container, Jetty, bundled with the package above.
To start Solr, go to the example directory (apache-solr-3.4.0/example) and run:
java -jar start.jar
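To verify that the server is running correctly, open a web browser and load a page served by Solr, for example the admin page:
http://ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com:8983/solr/admin/
(where ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com stands for the public DNS address of the instance)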
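Since the server can take a few seconds to come up, the browser check can also be scripted. The following is a minimal sketch that assumes curl is installed on the machine running it; the function names solr_url and wait_for_solr are hypothetical, and the host name is the same placeholder used in this tutorial.

```shell
#!/bin/sh
# Build the URL of a page on a Solr server listening on the default port.
solr_url() {
  # $1 = host name, $2 = path below /solr/
  printf 'http://%s:8983/solr/%s\n' "$1" "$2"
}

# Poll the admin page until it answers or the attempts run out.
wait_for_solr() {
  # $1 = host name, $2 = number of attempts (one per second)
  attempts="$2"
  while [ "$attempts" -gt 0 ]; do
    # -s: quiet, -f: treat HTTP errors as failure, -o: discard the body
    if curl -s -f -o /dev/null "$(solr_url "$1" "admin/")"; then
      echo "Solr is up"
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 1
  done
  echo "Solr did not answer" >&2
  return 1
}

# Example call (commented out so that the script has no side effects):
# wait_for_solr "ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com" 30
```

If the check keeps failing from outside the instance but succeeds with localhost on the instance itself, the security group rule for port 8983 is the first thing to re-check.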
4) Start Indexing
Now that the Solr server is running, it is possible to start building a simple index.
To create a new index go to the exampledocs folder and enter:
java -jar post.jar *.xml
This command will index all the .xml documents contained in the exampledocs folder.
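To make it clearer what the indexing step involves, here is a rough sketch that sends one hand-written document to Solr's XML update handler with curl and then commits it. It assumes that curl is installed, that Solr runs on the same machine on the default port, and that the schema has "id" and "name" fields as the bundled example schema does. The function names and the document values are made up for illustration.

```shell
#!/bin/sh
# Print a minimal Solr XML update message for one document.
# Assumption: the schema defines "id" and "name" fields, as the bundled
# example schema does; values must not contain XML special characters.
make_add_doc() {
  # $1 = document id, $2 = document name
  printf '<add><doc><field name="id">%s</field><field name="name">%s</field></doc></add>\n' "$1" "$2"
}

# Send an XML message to the update handler of a local Solr server.
post_to_solr() {
  # $1 = XML message
  curl -s "http://localhost:8983/solr/update" \
    -H "Content-Type: text/xml" --data-binary "$1"
}

# Example calls (commented out so that nothing is sent by accident):
# post_to_solr "$(make_add_doc "demo-1" "A hand-written test document")"
# post_to_solr "<commit/>"

# Show the message that would be sent.
make_add_doc "demo-1" "A hand-written test document"
```

A document only becomes searchable after a commit, which is why the second example call sends a separate commit message.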
Solr allows you to import data in many different ways and formats. It is possible to index CSV files and JSON documents, to index PDF and Word documents through Solr Cell (http://wiki.apache.org/solr/ExtractingRequestHandler), and to pull records directly from a database through the Data Import Handler (http://wiki.apache.org/solr/DataImportHandler).
One way to search the newly indexed files is to go to the admin page (http://ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com:8983/solr/admin/) and query the index with the Solr query language (http://wiki.apache.org/solr/SolrQuerySyntax).
This language is an extension of the Lucene query language and allows you to exploit the powerful searching features of Solr by simply adding some parameters to the query.
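As an illustration of adding parameters to a query, the sketch below asks the select handler for highlighting and facet counts alongside the search results. It assumes curl is installed and that "name" and "cat" are fields of the bundled example schema; the function names are hypothetical.

```shell
#!/bin/sh
# Print the URL of the select handler on a Solr server.
select_url() {
  # $1 = host name
  printf 'http://%s:8983/solr/select\n' "$1"
}

# Run a query with highlighting and faceting switched on.
# Assumption: "name" and "cat" are fields of the bundled example schema.
search_solr() {
  # $1 = host name, $2 = query string in Solr query syntax
  # -G sends the parameters in the URL; --data-urlencode escapes each value.
  curl -s -G "$(select_url "$1")" \
    --data-urlencode "q=$2" \
    --data-urlencode "hl=true" \
    --data-urlencode "hl.fl=name" \
    --data-urlencode "facet=true" \
    --data-urlencode "facet.field=cat"
}

# Example call (commented out; it needs a running server):
# search_solr "localhost" "name:ipod"
```

Letting curl encode each value avoids hand-escaping characters such as spaces and colons that appear in query strings.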
A more interactive way to search over the created index is to open the browser and go to the following URL:
An example search interface lets you search the documents and shows some of the most notable features of Solr (autocomplete, faceting, highlighting…).
Several client libraries can be used to talk to Solr from application code, depending on the developer's preferences. A list of some popular clients follows:
These simple steps are all you need to run a Solr instance on Amazon EC2.
Now that the project is set up you can take advantage of your Solr-AMI to perform all kinds of smart searches for your own projects and websites.
Don’t forget to visit and explore the Solr Wiki Page for more advanced uses and customizations of Solr.