Solr vs elasticsearch Deathmatch!

Solr vs elasticsearch

If you need good full-text search or faceted search, Apache Solr has been the standard.  I’ve used it myself for three projects in the past, and it is indeed a very powerful and stable product.

But there’s a new kid in town named elasticsearch, which one of my friends pointed me to recently. Although Solr isn’t broken, who can resist evaluating new technology? First impressions: the elasticsearch website looks very nice and clean, and more modern than Solr’s site. From a functional perspective I found the documentation to be less comprehensive than Solr’s, but I’m sure this will improve with time.

Bring on the benchmarks!

I decided to start simple and test out of the box indexing performance w/o any tweaking of either system.  I’ll try to describe the exact steps so that the results should be reproducible by others.

I downloaded the latest versions of each product, Solr 3.1 and elasticsearch 0.15.2. Both servers are really easy to start.

Solr

cd apache-solr-3.1.0/example/
java -jar start.jar

elasticsearch

cd elasticsearch-0.15.2/bin
./elasticsearch

I’ve used Solr’s handy CSV import feature in the past to load a database dump, but since elasticsearch only speaks JSON, I figured I’d use JSON for Solr as well in an attempt to keep things even. I whipped up a quick Python script to create 10 million fake documents to index into both systems. Solr’s JSON format is just one big structure, as opposed to elasticsearch’s bulk API, which takes individual newline-separated JSON structures.

The documents consist of just an id and a single string field value, which is really just a random number. For Solr, the file looks like this:

$ head -3 solr.json
{"add":{"doc":{ "id":"1582039702", "field1_s":"1184645701" }}
,"add":{"doc":{ "id":"937868144", "field1_s":"410491235" }}
,"add":{"doc":{ "id":"1754417430", "field1_s":"763134804" }}

For elasticsearch, the file looks like this:

$ head -3 es.json
{"index": {"_index":"test", "_type":"type1", "_id":"1582039702", "field1":"1184645701" }}
{"index": {"_index":"test", "_type":"type1", "_id":"937868144", "field1":"410491235" }}
{"index": {"_index":"test", "_type":"type1", "_id":"1754417430", "field1":"763134804" }}

For Solr, I called the data field “field1_s”. Solr has a (configurable) dynamic field mapping that by default treats everything ending with _s as type “string”. elasticsearch is schemaless, so I just used “field1” as the name. If I had wanted to use the exact same name, I could have added “field1” to Solr’s schema, but I’m trying to avoid tweaking config. You will also note that elasticsearch has an additional “_type” field. While the documentation indicated that it was optional, I received errors if I left it out.
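The generator script itself isn’t shown here, so the following is only a minimal sketch of one that would produce files shaped like the two samples above. The function names, the fixed seed option, and the ID range are my own assumptions, and the closing “}” on the Solr file is inferred from the “one big structure” description rather than shown in the sample.

```python
import random


def solr_line(doc_id, value, first):
    # The Solr file is one big JSON object: the first line opens it with "{",
    # every later line starts with "," (layout copied from the sample above).
    prefix = "{" if first else ","
    return '%s"add":{"doc":{ "id":"%s", "field1_s":"%s" }}' % (prefix, doc_id, value)


def es_line(doc_id, value):
    # One self-contained JSON structure per line, mirroring the es.json sample.
    return ('{"index": {"_index":"test", "_type":"type1", '
            '"_id":"%s", "field1":"%s" }}' % (doc_id, value))


def write_fake_docs(n, solr_path="solr.json", es_path="es.json", seed=None):
    """Write n fake documents to both files (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    with open(solr_path, "w") as solr, open(es_path, "w") as es:
        for i in range(n):
            # Assumption: ids and values are random 31-bit integers.
            # Random ids can collide, in which case a later doc replaces
            # an earlier one in the index.
            doc_id = rng.randint(0, 2**31 - 1)
            value = rng.randint(0, 2**31 - 1)
            solr.write(solr_line(doc_id, value, first=(i == 0)) + "\n")
            es.write(es_line(doc_id, value) + "\n")
        # Assumption: close the single top-level object at the end of the file.
        solr.write("}\n")
```

Calling write_fake_docs(10000000) would write both files with the same ids and values, so each system indexes identical data.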

Solr indexing:

time curl 'http://localhost:8983/solr/update/json?commit=true' --data-binary @solr.json -H 'Content-type:application/json'

elasticsearch indexing:

time curl -XPUT 'http://localhost:9200/_bulk/' --data-binary @es.json
curl: (56) Failure when receiving data from the peer

Oops! After consulting the elasticsearch documentation, it appears that it can’t stream large files with its bulk API. But since it’s a newline-separated file, it’s easy to split it up into million-document chunks. I don’t know what the limit is, but this appeared to work.

split -l 1000000 es.json   # this splits into files xaa, xab, xac, etc.
time for f in x??; do
  curl -XPUT 'http://localhost:9200/_bulk/' --data-binary "@$f"
done
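The same chunking can be done without intermediate files. This is a rough Python sketch, not what I ran for the benchmark: the helper names and the default chunk size are assumptions, and it uses only the standard library against the same _bulk URL and PUT method as the curl command.

```python
import itertools
import urllib.request


def chunks(lines, size):
    """Yield successive lists of at most `size` items from an iterable."""
    it = iter(lines)
    while True:
        batch = list(itertools.islice(it, size))
        if not batch:
            return
        yield batch


def bulk_index(path, url="http://localhost:9200/_bulk/", size=1000000):
    """PUT the file at `path` to the bulk endpoint, `size` lines per request.

    Hypothetical helper; returns the number of lines sent.
    """
    sent = 0
    with open(path, "rb") as f:
        for batch in chunks(f, size):
            # Each request body is a run of complete newline-terminated lines.
            req = urllib.request.Request(url, data=b"".join(batch), method="PUT")
            with urllib.request.urlopen(req) as resp:
                resp.read()  # drain the response before sending the next chunk
            sent += len(batch)
    return sent
```

With a local server running, bulk_index("es.json") would send the 10 million lines as ten requests. A smaller size lowers memory use at the cost of more round trips.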

Solr also allows you to bypass HTTP and directly stream a file from disk using stream.url=file:/path_to_file. Trying this method yielded less than a 2% improvement, a bit of a disappointment.

I ran each test 4 times, killing the JVM and removing the data directory between runs for both Solr and elasticsearch. The final averaged results, expressed as throughput, were 43204 docs/sec for Solr, 44052 docs/sec for Solr direct streaming, and 9823 docs/sec for elasticsearch.
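For anyone reproducing this, throughput here is just documents divided by wall-clock seconds. The short sketch below works backwards from the averaged figures to the run times and ratios they imply, assuming all 10 million documents were indexed in each run; the function names are mine.

```python
NUM_DOCS = 10_000_000  # documents indexed per run, per the post

# Averaged throughputs reported above, in docs/sec.
REPORTED = {
    "solr": 43204,
    "solr_direct": 44052,
    "elasticsearch": 9823,
}


def throughput(num_docs, elapsed_seconds):
    """Docs/sec from a document count and a wall-clock time."""
    return num_docs / elapsed_seconds


def implied_seconds(docs_per_sec, num_docs=NUM_DOCS):
    """Wall-clock time implied by a reported throughput."""
    return num_docs / docs_per_sec


def percent_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


if __name__ == "__main__":
    for name, rate in REPORTED.items():
        print("%-14s %6d docs/sec, about %.0f s per run"
              % (name, rate, implied_seconds(rate)))
    print("direct streaming gain: %.2f%%"
          % percent_gain(REPORTED["solr_direct"], REPORTED["solr"]))
    print("Solr vs elasticsearch: %.1fx"
          % (REPORTED["solr"] / REPORTED["elasticsearch"]))
```

The direct streaming gain comes out just under 2%, consistent with the figure mentioned earlier, and out-of-the-box Solr indexed roughly 4.4 times faster than elasticsearch in this test.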

I was somewhat surprised at the results, but indexing performance isn’t super important for everyone, and I’m sure some tuning of the systems could help too. Stay tuned, as I plan to do more elasticsearch vs Solr performance posts in the future that will include queries and faceted search.
