After a year of working with Elasticsearch I decided to write about it, mainly as a reminder to myself of what I liked, what I did not, and the choices I made along the way. Everything here is personal opinion; someone might disagree, and that's cool :). All this experience came from building a search implementation for an e-commerce application, which, along with a MySQL full-text search, was contributed back to Sylius, the “base” for our solution.
First things first: why Elasticsearch compared to other search engines?
What I love about Elasticsearch is that it’s “modern”. The REST API it exposes is just the way I thought a search engine for a web app would work, and Elasticsearch is a cluster by default, unlike other search engines. It can also work as a document database or caching layer, but my approach to caching is that you must be able to rebuild the cache at any given time, so I wouldn’t want to use it for persistence.
Elasticsearch is massive!! And it changes fast! The documentation is good enough, but I would like a bit more detail on day-to-day tasks. What confused me initially was that all the examples in the documentation must be wrapped in a top-level “query” clause.
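To make the “query” wrapping concrete, here is a minimal sketch in Python (not the PHP we actually used). The field name `name` and the helper `wrap_query` are made up for illustration; the resulting body is what you would send to the `_search` endpoint.

```python
import json


def wrap_query(clause, size=10):
    """Wrap a bare query clause in the top-level "query" key of a search body.

    `wrap_query` is a hypothetical helper, not part of any library.
    """
    return {"query": clause, "size": size}


# A bare clause, the way it often appears in documentation snippets.
# The field "name" is an assumption for this example.
match_clause = {"match": {"name": "t-shirt"}}

# The body that the search endpoint actually expects.
body = wrap_query(match_clause)
print(json.dumps(body, indent=2))
```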
Setting up ES:
Getting started with Elasticsearch is really easy. I prefer downloading the zip containing the executables, and if you are on a Mac I like Homebrew. When we started with ES I wanted to automate the setup of a simple cluster, and I chose Vagrant with Puppet for this. The first version just set up a two-node cluster in VMs for development. Later on I built a more generic environment using the latest practices.
Libraries:
Our application is a Symfony one, so I started out with FOSElasticaBundle, which is a wrapper for the Elastica library. It gives you YAML configuration for your indexes and mechanisms such as update/delete/insert listeners for your indexed data.
FOSElastica works amazingly well if you have a small application or a model that is not too complicated. By complicated I mostly mean relations between objects. Our application is an e-commerce one and our model is quite challenging to index. I started with the default settings and soon realised that the level of nested objects would make querying quite complicated. Using a mechanism called property accessor, I decided to gather everything I need at indexing time and store it on a single level, where attributes are either scalars or arrays. This helped me keep the query code really simple and straightforward. But it didn’t end there….
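The flattening idea can be sketched roughly like this. It is Python rather than our real PHP code, and the product shape, the `option_`/`attribute_` prefixes and the function name are all assumptions for the example, not what our indexer looked like.

```python
def flatten_product(product):
    """Collapse nested options/attributes into one level of scalars and arrays."""
    doc = {"id": product["id"], "name": product["name"]}
    # Each option becomes one array-valued field, e.g. option_size: ["S", "M"].
    for option in product.get("options", []):
        doc["option_" + option["name"]] = list(option["values"])
    # Each attribute becomes one scalar field, e.g. attribute_material: "cotton".
    for attribute in product.get("attributes", []):
        doc["attribute_" + attribute["name"]] = attribute["value"]
    return doc


# Hypothetical nested product, as an ORM might hand it over.
product = {
    "id": 1,
    "name": "T-shirt",
    "options": [
        {"name": "size", "values": ["S", "M"]},
        {"name": "color", "values": ["red"]},
    ],
    "attributes": [{"name": "material", "value": "cotton"}],
}

flat = flatten_product(product)
print(flat)
```

With a document like this, a filter on size or colour is a plain term filter on a top-level field, with no nested queries involved.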
Now imagine an object that has options and attributes… With listeners, when an option changes you have to find all the objects that include it and re-index them. Now think that you may have 3 or 4 options changing all the time. Suddenly indexing with listeners becomes totally inefficient. This is where we decided to drop FOSElastica for our app, use just the Elastica lib for querying ES, and build our own custom indexer. Another thing that makes indexing a bit slow is the ORM, Doctrine in this case. We dumped that as well. After a series of refactorings and optimisations, and by using raw MySQL queries, we managed to index the whole catalog for multiple regions in under 9 seconds. We accepted that the index will not have fresh data at every moment, but re-indexing is so fast that the delay is hardly noticeable to the end user. Not to mention that we always go to the database when something is added to the cart.
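A rough sketch of the “raw queries plus our own indexer” idea, again in Python and not our actual code. An in-memory SQLite table stands in for MySQL, the table, columns and index name are invented, and the output is the newline-delimited body that the `_bulk` endpoint takes (older ES versions also expect a `_type` in the action line).

```python
import json
import sqlite3


def fetch_rows(conn):
    """Read products with a plain SQL query, no ORM hydration involved."""
    cur = conn.execute("SELECT id, name, price FROM product ORDER BY id")
    columns = [col[0] for col in cur.description]
    return [dict(zip(columns, row)) for row in cur]


def build_bulk_body(rows, index_name):
    """Build a bulk body: one action line, then one source line, per row."""
    lines = []
    for row in rows:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": row["id"]}}))
        lines.append(json.dumps(row))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n" if lines else ""


# SQLite is only a stand-in for MySQL here; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO product VALUES (?, ?, ?)",
    [(1, "T-shirt", 19.9), (2, "Mug", 7.5)],
)

rows = fetch_rows(conn)
bulk_body = build_bulk_body(rows, "catalog")
print(bulk_body)
```

Sending the whole catalog in a few bulk requests like this, instead of one request per changed object, is the kind of thing that keeps a full re-index fast.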
Workflow:
I always have a REST client open for direct access to the indexed data. I tried several Chrome plugins, including Postman, but I ended up using the head plugin for Elasticsearch. It really has everything you need to execute queries and check the status of your indexes.
I started developing the app by building queries directly with the Elastica library. This proved to be slightly wrong… Elastica is a nice object-oriented library for creating arrays that are then JSON-encoded before the query is sent. This interface can be confusing to someone who cannot yet put together a complicated query properly, like me a year ago. What I like to do now is always start with the raw JSON query; once I get the result I want, it’s dead obvious how to transfer the query to the library.
A must for an e-commerce application is the faceted result set, or in Elasticsearch terms, aggregations. I was developing the app using facets right at the time when ES decided to deprecate facets in favour of aggregations. It was not a massive change to make in the code, but it added a level of frustration, since you want to deliver on time. A small issue also came up with the default behaviour of setFilter, which now points to setPostFilter and affects the counts of the aggregations. Again I had to adapt my code for this change: I wanted the aggregations to be applied to the unfiltered result set, and now ES does this for me.
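In raw request terms, the combination looks roughly like the sketch below (Python, with made-up field names `name` and `color` and a hypothetical helper). The `post_filter` narrows the hits only after the aggregations have been computed, so the facet counts still reflect the unfiltered query.

```python
import json


def build_faceted_search(text, selected_color=None):
    """Build a search body whose facet counts ignore the selected facet value."""
    body = {
        "query": {"match": {"name": text}},
        # Counts per colour are computed over everything the query matches.
        "aggs": {"colors": {"terms": {"field": "color"}}},
    }
    if selected_color is not None:
        # Applied to the hits only, after the aggregations are calculated.
        body["post_filter"] = {"term": {"color": selected_color}}
    return body


search_body = build_faceted_search("t-shirt", selected_color="red")
print(json.dumps(search_body, indent=2))
```

If the colour filter were placed inside the query instead, the other colours would disappear from the facet list as soon as the user picked one.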
Usage:
We use Elasticsearch for search, app logging and analytics. For search we index our catalog, store info, web content etc., and there is a smart mechanism that filters those different result types, loads a separate view for each and presents them in the same result set. We have an ObjectPresentationHandler which abstracts a bit the way things are shown in the GUI.
For app logging we use Monolog and we post data through channels, mostly for payments and internal app procedures. MonologBundle initially did not support ES, and because we needed it I contributed that support back to the bundle.
For analytics the idea is to throw all the raw data within a specific date range into ES and then use Kibana to extract useful information. We use it mostly for sales analytics, order states etc.
Elasticsearch is a great tool, easy to set up, with good documentation, and it fits really well in a lot of use cases!!
Cheers, Argyris