Adding a search engine to your project

#search

#lucene

#elasticsearch

#index

We have a project in which one of the main features relies on more or less advanced search operations over a large amount of information.
In the first iteration of the project, all this information needed to be indexed somehow, so it was stored in a MySQL table with database indexes on the key fields used for searching.

Databases can filter by ranges or match exact values with relatively good performance, but they aren't the best option for performing full-text search, analysing text, handling synonyms, scoring documents by relevance, highlighting search results and many other interesting search features. So, although the search performance was more than enough for the amount of information we had at that time, we decided to test how integrating a search engine would affect the application in terms of features and performance.
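To make the idea of relevance scoring more concrete, here is a minimal sketch of an inverted index with a simple TF-IDF style score, written with the Python standard library only. It is not how Lucene or MySQL work internally; the sample documents, the function names and the scoring formula are assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical sample documents, invented for this illustration.
DOCS = {
    1: "Lucene is a search engine library written in Java",
    2: "MySQL is a relational database with B-tree indexes",
    3: "Elasticsearch is a distributed search engine built on Lucene",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(tokenize(text)).items():
            index[term][doc_id] = freq
    return index


def search(index, total_docs, query):
    """Return (doc_id, score) pairs, best match first."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        # Terms that appear in fewer documents weigh more (simplified IDF).
        idf = math.log(1 + total_docs / len(postings))
        for doc_id, freq in postings.items():
            scores[doc_id] += freq * idf
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


index = build_index(DOCS)
results = search(index, len(DOCS), "distributed search engine")
```

In this sketch the document that matches all three query terms ranks first, the one that matches two terms ranks second, and the database document does not appear at all. A plain exact-match filter in SQL would give no such ordering.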

The first step was to decide which search engine to introduce. The open source search engines we found were Apache Lucene, Apache Solr and Elasticsearch. It is important to mention that Lucene is very different from Solr and Elasticsearch. Lucene is a search engine library that you integrate into your application by adding it as a dependency, whereas Solr and Elasticsearch are search engine platforms that are built on top of Lucene and wrap it. Although they can also be embedded into your app, they are really designed to run as an external application that offers an API to access them.

Lucene

Taking into account that both platforms rely on the Lucene library, I decided to start by building a small prototype of our indexer and search module with Lucene itself. Once it was finished, these were the concerns and problems I found:

Lucene is quite complex and has a broad set of concepts that you have to learn and understand in order to work with it. If you don't have a good knowledge of information indexing and retrieval techniques, you will probably need a long time to learn it, and you may still not be able to set it up properly, tune the configuration and take advantage of all the features it offers.
Lucene requires working with the low-level behaviour of indexing and retrieval (configuring when to commit changes, how to store information on disk, how exactly to map the indexed information, how to access and read the indexes, etc.), so you need to configure a lot of things that, for most projects, could be more or less standardized by convention.
If you are starting with Lucene, at the beginning you will go crazy trying to find out what exactly is stored in the index and how. Before I understood what was happening internally, it was like a black box to me: sometimes I obtained the desired result and sometimes it inexplicably failed. Everything started to make sense when I found Luke, a tool that opens existing Lucene indexes and lets you display and modify their content. I think it is a must if you are going to use Lucene. Updated builds of Luke are available for recent versions of Lucene.
Since Lucene is a library that has to be embedded in your app and stores the index either in memory or on disk, it doesn't easily allow you to scale your app by adding new nodes that share the index, and it is not suitable if you plan to deploy your application on a cloud platform like Heroku. There are some alternatives to solve this, like sharing the index files between nodes using NFS, or storing the index in a database shared by the nodes using something like JDBCDirectory, implemented in the Compass project (a previous project by the creator of Elasticsearch). Both of these solutions reduce performance, since Lucene was not designed to work this way, and Compass is no longer maintained since it was abandoned in 2010. There is a very interesting post by the creator of Elasticsearch explaining the motivation for going one step further and replacing Compass with Elasticsearch.
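The "black box" feeling described above often comes from text analysis: the terms stored in the index are not the original text. The following sketch uses a toy analyzer (lowercasing plus a small stop-word list, both assumptions for this example and not Lucene's real analyzers) to show why a lookup can fail inexplicably until you inspect the stored terms, which is roughly what a tool like Luke lets you do.

```python
import re

# Toy stop-word list, an assumption for this sketch; not Lucene's actual list.
STOP_WORDS = {"a", "an", "the", "of", "in", "is"}


def analyze(text):
    """Toy analyzer: lowercase, split on non-alphanumerics, drop stop words."""
    terms = re.findall(r"[a-z0-9]+", text.lower())
    return [term for term in terms if term not in STOP_WORDS]


def index_documents(docs):
    """Build term -> sorted list of doc ids, storing analyzed terms only."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def raw_term_lookup(index, term):
    """Look up a term exactly as given, bypassing the analyzer."""
    return index.get(term, [])


def analyzed_lookup(index, text):
    """Run the query text through the same analyzer used at index time."""
    hits = set()
    for term in analyze(text):
        hits.update(index.get(term, []))
    return sorted(hits)


# Hypothetical documents, invented for this illustration.
docs = {1: "The Internals of Lucene", 2: "Lucene in Action"}
index = index_documents(docs)

# The terms an index inspection tool would show for this toy index.
stored_terms = sorted(index)
```

Here `raw_term_lookup(index, "Lucene")` finds nothing because only the lowercased term `lucene` was stored, while `analyzed_lookup(index, "Lucene")` matches both documents. Seeing `stored_terms` makes the mismatch obvious.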

Although this Lucene prototype provided very valuable knowledge about the internals of search engines and their broad and complex set of concepts, we decided to look at the search engine platforms, which seemed a better approach for our project.

There is a very interesting series of blog posts analysing the differences between Solr and Elasticsearch.

After analysing the differences, I decided to choose Elasticsearch.

Elasticsearch

Elasticsearch is a real-time distributed search and analytics engine used for full-text search, structured search and analytics. It uses Lucene internally, but it aims to make full-text search features easy by hiding the complexities of Lucene behind a simple and coherent RESTful API.
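As a rough idea of what talking to that RESTful API looks like, the sketch below builds (but does not send) a search request using only the Python standard library. The `_search` endpoint, the `match` query and the `highlight` section are part of the Elasticsearch query DSL; the node address `http://localhost:9200`, the index name `articles` and the field `title` are assumptions made up for this example.

```python
import json
import urllib.request


def build_search_request(base_url, index_name, field, text):
    """Build an HTTP request for a full-text match query with highlighting."""
    body = {
        "query": {"match": {field: text}},
        "highlight": {"fields": {field: {}}},
    }
    return urllib.request.Request(
        url=f"{base_url}/{index_name}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical node address, index name and field.
request = build_search_request(
    "http://localhost:9200", "articles", "title", "search engine"
)

# Sending it requires a running Elasticsearch node, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     hits = json.load(response)["hits"]["hits"]
```

The whole query is plain JSON over HTTP, so any language with an HTTP client can use it, although in a real project you would probably pick one of the official client libraries instead.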

My conclusions about Elasticsearch, after having built the Lucene prototype in my project:

It provides a lot of default configurations suitable for most projects, following a convention over configuration paradigm, and hides complicated search theory from beginners. It is much easier to set up, it works right out of the box, and you can start being productive without a low-level understanding of search engines. However, that doesn't mean it lacks the advanced features of Lucene (sometimes it offers even more). The engine is very configurable and flexible, but you only need to deal with that if your domain requires it.
From the scalability point of view, Elasticsearch is designed to work well on a single node with small datasets and to scale out to distributed nodes handling big data. It offers an advanced cluster system that horizontally splits indexes into shards, which are automatically allocated and replicated across the cluster nodes, making scaling very easy. This post contains a very good explanation of the Elasticsearch cluster system.
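The sharding idea can be sketched in a few lines: each document is routed to a primary shard by hashing its id, and each shard gets a replica on a different node. This is a simplification under stated assumptions: Elasticsearch uses its own hash function and a much smarter allocator, while this sketch uses `zlib.crc32` as a stand-in hash and a naive round-robin placement. The node names and document id are made up.

```python
import zlib


def shard_for(doc_id, num_primary_shards):
    """Pick a primary shard from a stable hash of the document id."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def allocate(num_primary_shards, nodes):
    """Round-robin primaries over nodes; place each replica on the next node.

    Assumes at least two nodes, so a replica never shares a node
    with its own primary.
    """
    allocation = {}
    for shard in range(num_primary_shards):
        primary = nodes[shard % len(nodes)]
        replica = nodes[(shard + 1) % len(nodes)]
        allocation[shard] = {"primary": primary, "replica": replica}
    return allocation


# Hypothetical cluster of three nodes and an index with five primary shards.
nodes = ["node-a", "node-b", "node-c"]
allocation = allocate(5, nodes)
shard = shard_for("article-42", 5)
```

Because the shard is derived from the document id, any node can work out where a document lives without a central lookup. It also hints at why the number of primary shards is hard to change afterwards: a different shard count would route the same id to a different shard.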

In future posts I will explain further details of the process of adding Elasticsearch to the project: the steps, the libraries and tools used, and conclusions about features and performance improvements.