Solr vs elasticsearch Deathmatch!

If you need good full-text search or faceted search, Apache Solr has been the standard.  I’ve used it myself for three projects in the past, and it is indeed a very powerful and stable product.

But there’s a new kid in town named elasticsearch, which one of my friends pointed me to recently.  Although Solr isn’t broken, who can resist evaluating new technology!  First impressions: the elasticsearch website looks very nice and clean, and more modern than Solr’s site.  From a functional perspective I found the documentation to be less comprehensive, but I’m sure this will improve with time.

Bring on the benchmarks!

I decided to start simple and test out-of-the-box indexing performance without any tweaking of either system.  I’ll try to describe the exact steps so that the results are reproducible by others.

I downloaded the latest versions of each product, Solr 3.1.0 and elasticsearch 0.15.2.  Both servers are really easy to start.


cd apache-solr-3.1.0/example/
java -jar start.jar


cd elasticsearch-0.15.2/bin
./elasticsearch -f    # -f keeps it in the foreground

I’ve used Solr’s handy CSV import feature in the past to load a database dump, but since elasticsearch only speaks JSON, I figured I’d use JSON for Solr as well in an attempt to keep things even. I whipped up a quick python script to create 10 million fake documents to index into both systems. Solr’s JSON format is just one big structure, as opposed to elasticsearch’s bulk API, which takes individual newline-separated JSON structures.

The documents consist of just an id and a single string field whose value is really just a random number. For Solr, the file is one big JSON object (the brace that closes it comes at the very end of the file), and starts like this:

$ head -3 solr.json
{"add":{"doc":{ "id":"1582039702", "field1_s":"1184645701" }}
,"add":{"doc":{ "id":"937868144", "field1_s":"410491235" }}
,"add":{"doc":{ "id":"1754417430", "field1_s":"763134804" }}

For elasticsearch’s bulk API, each document takes a pair of lines: an action/metadata line followed by the document source itself. The file looks like this:

$ head -6 es.json
{"index": {"_index":"test", "_type":"type1", "_id":"1582039702"}}
{"field1":"1184645701"}
{"index": {"_index":"test", "_type":"type1", "_id":"937868144"}}
{"field1":"410491235"}
{"index": {"_index":"test", "_type":"type1", "_id":"1754417430"}}
{"field1":"763134804"}
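The quick python script itself isn’t shown here; a minimal sketch that emits both formats (the file names, field names, and 10M doc count come from this post, while the helper name and everything else is my own guess) could look like:

```python
import json
import random

def make_docs(n, solr_path, es_path):
    """Write n fake documents in Solr's JSON update format and in
    elasticsearch's bulk format (hypothetical helper, not from the post)."""
    with open(solr_path, "w") as solr, open(es_path, "w") as es:
        for i in range(n):
            doc_id = str(random.randint(0, 2**31 - 1))
            value = str(random.randint(0, 2**31 - 1))
            # Solr: one big JSON object; the repeated "add" keys are accepted
            # by Solr's JSON update handler
            prefix = "{" if i == 0 else ","
            solr.write('%s"add":{"doc":{ "id":"%s", "field1_s":"%s" }}\n'
                       % (prefix, doc_id, value))
            # elasticsearch bulk: an action/metadata line, then the source line
            es.write(json.dumps({"index": {"_index": "test", "_type": "type1",
                                           "_id": doc_id}}) + "\n")
            es.write(json.dumps({"field1": value}) + "\n")
        solr.write("}\n")  # close the outer Solr object

# Full-size run as used in this post:
#   make_docs(10_000_000, "solr.json", "es.json")
```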

For Solr, I called the data field “field1_s”. Solr has a (configurable) dynamic field mapping that by default treats everything ending with _s as type “string”. elasticsearch is schemaless, so I just used “field1” as the name. If I had wanted to use the exact same name, I could have added “field1” to Solr’s schema, but I’m trying to avoid tweaking config. You will also note that elasticsearch has an additional “_type” field. While the documentation indicated that it was optional, I received errors if I left it out.
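For reference, the *_s convention comes from a dynamicField declaration in the example schema.xml that ships with Solr, along these lines (paraphrased from the example schema, so double-check your copy):

```xml
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
```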

Solr indexing:

time curl 'http://localhost:8983/solr/update/json?commit=true' --data-binary @solr.json -H 'Content-type:application/json'

elasticsearch indexing:

time curl -XPUT 'http://localhost:9200/_bulk/' --data-binary @es.json
curl: (56) Failure when receiving data from the peer

Oops!  After consulting the elasticsearch documentation, it appears that it can’t stream large files through its bulk API. But since it’s a newline-separated file, it’s easy to split it up into smaller chunks. I don’t know what the size limit is, but this appeared to work.

split -l 1000000 es.json   #this splits into files xaa, xab, xac, etc
time for f in x??; do
  curl -XPUT 'http://localhost:9200/_bulk/' --data-binary @$f
done

Solr also allows you to bypass HTTP uploading and directly stream a file from disk using stream.url=file:/path_to_file. Trying this method yielded less than a 2% improvement, which was a bit of a disappointment.

I ran each test 4 times, killing the JVM and removing the data directory between runs for both Solr and elasticsearch. The final averaged results, expressed as throughput, were 43204 docs/sec for Solr, 44052 docs/sec for Solr direct streaming, and 9823 docs/sec for elasticsearch.

I was somewhat surprised at the results, but indexing performance isn’t super important for everyone, and I’m sure some tuning of the systems could help too. Stay tuned, as I plan to do more elasticsearch vs Solr performance posts in the future that will include queries and faceted search.

XORShift vs Random performance in Java

Java’s Random class uses a linear congruential generator, which has mediocre quality at best.  And then, to add insult to injury, it wraps more crap around it (a configurable number of bits per call, thread safety you usually don’t need) that makes it slow to boot!
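For context, the LCG behind java.util.Random is fully specified in its Javadoc: 48 bits of state, one multiply-add per next(bits) call, and two such calls glued together for every nextLong(). A minimal re-implementation of that documented algorithm (the class name is mine) shows how much work each long costs:

```java
// Re-implementation of the algorithm documented in java.util.Random's Javadoc
// (without the thread-safety machinery of the real class).
public class LcgSketch {
    private long seed;  // only the low 48 bits are ever used

    public LcgSketch(long seed) {
        // Same initial scrambling as java.util.Random's constructor
        this.seed = (seed ^ 0x5DEECE66DL) & ((1L << 48) - 1);
    }

    // One LCG step, returning the requested high bits of the new state
    private int next(int bits) {
        seed = (seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1);
        return (int) (seed >>> (48 - bits));
    }

    // nextLong() costs two LCG steps -- one reason Random is slow for longs
    public long nextLong() {
        return ((long) next(32) << 32) + next(32);
    }
}
```

Since the algorithm is mandated by the spec, this sketch produces bit-identical output to java.util.Random for the same seed.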

I had a need for medium quality 64 bit random numbers, but speed was of the essence!  Enter Xorshift: a class of clever random number generators that use just a few shift and xor operations, which are very fast on modern processors.

While there are many possible Xorshift generators, the simplest one I found for 64 bit numbers was here.  It’s incredibly short!

public long randomLong() {
  x ^= (x << 21);
  x ^= (x >>> 35);
  x ^= (x << 4);
  return x;
}

This is a full period generator that will generate all bit patterns except for 0.  The period is 2^64-1.  One must also avoid using 0 as a seed.  Now all we have to do is wrap this up in a class and benchmark it vs Java’s Random class:

class XORShift64 {
  long x;

  public XORShift64(long seed) {
    x = seed == 0 ? 0xdeadbeefL : seed;  // any nonzero seed works
  }

  public long randomLong() {
    x ^= (x << 21);
    x ^= (x >>> 35);
    x ^= (x << 4);
    return x;
  }
}

And the results on a 64 bit JVM:

Java’s Random:  45.7 million longs per second

XORShift64: 320.6 million longs per second, or 7 times faster!
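The benchmark harness itself isn’t shown above; a rough sketch along these lines (the loop count and the xor “sink” trick are my choices, and absolute numbers will vary by machine) reproduces the comparison:

```java
import java.util.Random;

public class RandomBench {
    // Copy of the XORShift64 generator from this post
    static final class XORShift64 {
        long x;
        XORShift64(long seed) { x = seed == 0 ? 0xdeadbeefL : seed; }
        long randomLong() {
            x ^= (x << 21);
            x ^= (x >>> 35);
            x ^= (x << 4);
            return x;
        }
    }

    public static void main(String[] args) {
        final int N = 10_000_000;  // draws per generator (my choice)
        Random random = new Random(42);
        XORShift64 xorshift = new XORShift64(42);

        long sink = 0;  // xor results into a sink so the JIT can't drop the loops
        long t0 = System.nanoTime();
        for (int i = 0; i < N; i++) sink ^= random.nextLong();
        long t1 = System.nanoTime();
        for (int i = 0; i < N; i++) sink ^= xorshift.randomLong();
        long t2 = System.nanoTime();

        System.out.printf("Random:     %.1f million longs/sec%n", N * 1000.0 / (t1 - t0));
        System.out.printf("XORShift64: %.1f million longs/sec%n", N * 1000.0 / (t2 - t1));
        if (sink == 42) System.out.println("sink");  // keep sink observable
    }
}
```

For more trustworthy numbers you’d want warmup iterations before timing, so the JIT has compiled both loops.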
