Solr vs elasticsearch
If you need good full-text search or faceted search, Apache Solr has been the standard. I’ve used it myself for three projects in the past, and it is indeed a very powerful and stable product.
But there’s a new kid in town, elasticsearch, which one of my friends pointed me to recently. Although Solr isn’t broken, who can resist evaluating new technology? First impressions: the elasticsearch website looks very nice and clean, and more modern than Solr’s site. From a functional perspective I found the documentation to be less comprehensive than Solr’s, but I’m sure this will improve with time.
Bring on the benchmarks!
I decided to start simple and test out-of-the-box indexing performance without any tweaking of either system. I’ll describe the exact steps so that others can reproduce the results.
I downloaded the latest versions of each product, Solr 3.1.0 and elasticsearch 0.15.2. Both servers are really easy to start.
Solr
cd apache-solr-3.1.0/example/
java -jar start.jar
elasticsearch
cd elasticsearch-0.15.2/bin
./elasticsearch
I’ve used Solr’s handy CSV import feature in the past to load a database dump, but since elasticsearch only speaks JSON, I figured I’d use JSON for Solr as well in an attempt to keep things even. I whipped up a quick Python script to create 10 million fake documents to index into both systems. Solr’s JSON format is just one big structure, as opposed to elasticsearch’s bulk API, which takes individual newline-separated JSON structures.
The documents consist of just an id and a single string field value, which is really just a random number. For Solr, the file looks like this:
$ head -3 solr.json
{"add":{"doc":{ "id":"1582039702", "field1_s":"1184645701" }}
,"add":{"doc":{ "id":"937868144", "field1_s":"410491235" }}
,"add":{"doc":{ "id":"1754417430", "field1_s":"763134804" }}
For elasticsearch, the file looks like this:
$ head -3 es.json
{"index": {"_index":"test", "_type":"type1", "_id":"1582039702", "field1":"1184645701" }}
{"index": {"_index":"test", "_type":"type1", "_id":"937868144", "field1":"410491235" }}
{"index": {"_index":"test", "_type":"type1", "_id":"1754417430", "field1":"763134804" }}
For Solr, I called the data field “field1_s”. Solr has a (configurable) dynamic field mapping that by default treats everything ending with _s as type “string”. elasticsearch is schemaless, so I just used “field1” as the name. If I had wanted to use the exact same name, I could have added “field1” to Solr’s schema, but I’m trying to avoid tweaking config. You will also note that elasticsearch has an additional “_type” field. While the documentation indicated that it was optional, I received errors if I left it out.
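The generator script itself isn’t shown here, so what follows is only a minimal sketch of how two files in these shapes could be produced. The function names, the fixed seed, and the 31-bit range for the random numbers are assumptions made for illustration; only the line formats are taken from the samples above.

```python
import random


def solr_entry(doc_id: str, value: str, first: bool) -> str:
    """One line of the Solr file: the first line opens the big JSON
    object with '{', every later line continues it with ','."""
    lead = "{" if first else ","
    return f'{lead}"add":{{"doc":{{ "id":"{doc_id}", "field1_s":"{value}" }}}}'


def es_entry(doc_id: str, value: str) -> str:
    """One self-contained JSON line for the elasticsearch file,
    in the same shape as the es.json sample above."""
    return (
        f'{{"index": {{"_index":"test", "_type":"type1", '
        f'"_id":"{doc_id}", "field1":"{value}" }}}}'
    )


def write_files(n: int, solr_path: str = "solr.json",
                es_path: str = "es.json", seed: int = 0) -> None:
    """Write n (>= 1) fake documents to both files.
    The seed and the random range are illustrative assumptions."""
    rng = random.Random(seed)
    with open(solr_path, "w") as solr, open(es_path, "w") as es:
        for i in range(n):
            doc_id = str(rng.randrange(2**31))
            value = str(rng.randrange(2**31))
            solr.write(solr_entry(doc_id, value, first=(i == 0)) + "\n")
            es.write(es_entry(doc_id, value) + "\n")
        # Close the single top-level object that the Solr file forms.
        solr.write("}\n")
```

Calling `write_files(10_000_000)` would produce files of the size used in this test. Both files get the same ids and values, so the two systems index the same data.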
Solr indexing:
time curl 'http://localhost:8983/solr/update/json?commit=true' --data-binary @solr.json -H 'Content-type:application/json'
elasticsearch indexing:
time curl -XPUT 'http://localhost:9200/_bulk/' --data-binary @es.json
curl: (56) Failure when receiving data from the peer
Oops! After consulting the elasticsearch documentation, it appears that it can’t stream large files with its bulk API. But since it’s a newline-separated file, it’s easy to split it up into million-document chunks. I don’t know what the limit is, but this appeared to work.
split -l 1000000 es.json   # this splits into files xaa, xab, xac, etc.
time for f in x??; do
  curl -XPUT 'http://localhost:9200/_bulk/' --data-binary "@$f"
done
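For anyone who would rather not leave chunk files lying around, the same split-and-post loop can be sketched in Python. This is a rough equivalent under assumptions, not what I ran for the benchmark: the helper names `chunks` and `post_bulk` are made up for illustration, and the URL and chunk size are simply the ones from the commands above.

```python
import itertools
import urllib.request


def chunks(lines, size):
    """Yield successive lists of at most `size` items from an iterable."""
    it = iter(lines)
    while True:
        batch = list(itertools.islice(it, size))
        if not batch:
            return
        yield batch


def post_bulk(path, url="http://localhost:9200/_bulk/", size=1_000_000):
    """Send the newline-separated file at `path` to the bulk endpoint,
    `size` lines per request, mirroring the curl -XPUT loop."""
    with open(path, "rb") as f:
        # Iterating a binary file yields lines with their trailing newline,
        # so joining a batch reproduces that slice of the file byte for byte.
        for batch in chunks(f, size):
            req = urllib.request.Request(url, data=b"".join(batch), method="PUT")
            with urllib.request.urlopen(req) as resp:
                resp.read()  # drain the response before sending the next chunk
```

With elasticsearch running, `post_bulk("es.json")` would send the file in ten requests of a million lines each.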
Solr also allows you to bypass HTTP and directly stream a file from disk using stream.url=file:/path_to_file. Trying this method yielded less than a 2% improvement, which was a bit of a disappointment.
I ran each test four times, killing the JVM and removing the data directory between runs for both Solr and elasticsearch. The final averaged results, expressed as throughput, were 43204 docs/sec for Solr, 44052 docs/sec for Solr direct streaming, and 9823 docs/sec for elasticsearch.
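To put those numbers in perspective, here is a small sketch of the arithmetic. It uses only the throughputs reported above plus the 10 million document count. The implied wall-clock times are back-calculated estimates, not measured timings, and the function names are just for illustration.

```python
DOCS = 10_000_000  # documents indexed per run, from the setup above

# Averaged throughputs reported above, in docs/sec.
REPORTED = {
    "Solr": 43204,
    "Solr direct streaming": 44052,
    "elasticsearch": 9823,
}


def implied_seconds(docs: int, docs_per_sec: float) -> float:
    """Wall-clock time implied by a throughput (an estimate, not a measurement)."""
    return docs / docs_per_sec


def ratio(a: float, b: float) -> float:
    """How many times larger throughput a is than throughput b."""
    return a / b


def summary() -> dict:
    """Implied seconds per system, rounded to one decimal place."""
    return {name: round(implied_seconds(DOCS, rate), 1)
            for name, rate in REPORTED.items()}
```

Here `ratio(44052, 43204)` comes out just under 1.02, matching the "less than 2%" gain from direct streaming, and `ratio(43204, 9823)` is roughly 4.4, so Solr indexed this data set a little over four times faster than elasticsearch in this out-of-the-box setup.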
I was somewhat surprised at the results, but indexing performance isn’t super important for everyone, and I’m sure some tuning of the systems could help too. Stay tuned, as I plan to do more elasticsearch vs Solr performance posts in the future that will include queries and faceted search.
