Sphinx Search - why and how to use sphinx search delta indexes
Problem:
Anybody who used sphinx search knows that reindexing big indexes takes a long time. The main problem here is that the whole index is recreated every time you execute a reindexing.
Solution:
The way of handling this is by using delta index updates with index merging. The idea is to have 2 indexes:
the "main" and big index for the old (unchanged) data, and
a small "delta" for the new (recently changed) data. So, instead of reindexing the "main" index, you should only reindex the "delta" one every few minutes.
After a while (once a day) you should merge the "delta" and the "main" index (depending of the size of the delta).
This is called "main+delta" scheme.
How-to:
Having a table "documents" with fields: "id, title, body", we create the sphinx search index as follows:
# in sphinx.conf
source main
{
type = mysql
...
sql_query = SELECT id, title, body FROM documents
}
index main
{
source = main
...
}
As you can see the main sql_query give us all the documents from the documents table. The idea of the "main+delta" scheme is that every time you reindex main, you will have to store the last id processed somewhere, so delta can start from there and process a little amount of records.
So first, create this table in mysql to store that id:
CREATE TABLE sph_counter
(
counter_id INTEGER PRIMARY KEY NOT NULL,
max_doc_id INTEGER NOT NULL
);
Then in sphinx.conf:
source main
{
# ...
sql_query_pre = SET NAMES utf8
sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM documents
sql_query = SELECT id, title, body FROM documents \
WHERE id<=( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )
}
source delta : main
{
sql_query_pre = SET NAMES utf8
sql_query = SELECT id, title, body FROM documents \
WHERE id>( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )
}
index main
{
source = main
path = /path/to/main
# ... all the other settings
}
# note how all other settings are copied from main,
# but source and path are overridden (they MUST be)
index delta : main
{
source = delta
path = /path/to/delta
}
How does this work? As you can see in main specification a pre-fetch query called sqlquerypre appears. That query is executed before the main query (you can have a bunch of those).
REPLACE INTO sph_counter SELECT 1, MAX(id) FROM documents
Our pre-fetch query, updates a record in the sph_counter table to be used later by delta. This record will store the max(id) from our documents table in the moment of indexing, so the main query will get documents with id less than or equal to that maximum and delta query with ids bigger than that max.
In brief, you update main once in a while, and delta every so often. main will get ALL the documents till the time you update it, and delta will get all the new ones.
In your code you will have to search in both indexes:
$sphinxClient->Query(“this is my search query”, “main delta”);
Indexing main will take a long time too, but you will be sort of "live" because delta will update very quickly.
Instead of reindexing main you can merge both indexes, and only update delta. I will write about merging in a future post.
Links:
http://av4tar.blogspot.com
http://sphinxsearch.com/
http://www.ivinco.com/
Written by Diego Sapriza
Related protips
1 Response
Thanks for an article. And how Sphinxsearch is doing now. Went to his site, it seems neglected