Last Updated: March 12, 2019
· polzme

Change the type of field used by Solr to index your data

People who have recently followed my latest tweets and who know me in real life are aware that I'm learning Solr and especially how it works with Drupal.

Solr has a lot of advantages, but it also has a few drawbacks.
One of them is that, out of the box, you cannot search for substrings of words.
For example, if you have a node titled "How to learn Drupal in a nutshell", you won't be able to find it by searching for the string nut or rupa. Partial matching like this is sometimes useful, especially for node titles.

Why isn't this possible?

Solr indexes your node title using the text type. If you look at the definition of the field type 'text' in the schema, you'll notice that the filter solr.EdgeNGramFilterFactory is not enabled, so n-gram search is not possible with that type.

Actually, it is possible, but you have to edit your schema.xml file manually and add some filters to it, which is not good practice. We'll see here that there's a way to do it without altering the core files.

What's an n-gram?

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. (source: Wikipedia)

Example with the French word 'praline':

  • 1-gram: p, r, a, l, i, n, e
  • 2-gram: pr, ra, al, li, in, ne
  • 3-gram: pra, ral, ali, lin, ine
  • 4-gram: pral, rali, alin, line
  • 5-gram: prali, ralin, aline
  • 6-gram: pralin, raline
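The list above can be reproduced with a few lines of code. This is only an illustrative sketch in Python (not something Solr or Drupal ships), and the helper name char_ngrams is made up for the example:

```python
def char_ngrams(word, n):
    """Return every contiguous run of n characters in `word`, left to right."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A word of length L has L - n + 1 n-grams (none if n > L).
    return [word[i:i + n] for i in range(len(word) - n + 1)]


# Print the 1-gram to 6-gram rows for the word used in the example.
for n in range(1, 7):
    print(f"{n}-gram:", ", ".join(char_ngrams("praline", n)))
```

Each row gets shorter by one item because a longer window has fewer positions to start from.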

How to enable it in Drupal

By default, the search_api_solr module allows you to use a list of pre-defined data types:

  • Fulltext
  • String
  • Integer
  • Decimal
  • Date
  • Duration
  • Boolean
  • URI

These data types are mapped to your fields, and that's how Solr indexes them.

With hook_search_api_data_type_info(), you can define your own data type with a couple of extra lines in your module, like this:

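/**
 * Implements hook_search_api_data_type_info().
 */
function mymodule_search_api_data_type_info() {
  return array(
    // The key matches a field type defined in schema.xml.
    'edge_n2_kw_text' => array(
      'name' => t('Fulltext (w/ partial matching)'),
      // Type to fall back to when this one is not supported.
      'fallback' => 'text',
      // Prefix of the Solr dynamic field (tem_*).
      'prefix' => 'tem',
      'always multiValued' => TRUE,
    ),
  );
}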

The prefix used by this data type is tem. You can check its definition in the schema.xml file:

<dynamicField name="tem_*" type="edge_n2_kw_text" indexed="true" stored="true" multiValued="true" omitTermFreqAndPositions="true" />

As you can see, fields with the prefix tem_ will use the edge_n2_kw_text type, which is also defined in the schema.xml file.
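To make the prefix idea concrete, here is a small Python sketch of how a name can be checked against the tem_* pattern. It assumes, for illustration only, that the indexed field name is built as prefix + "_" + machine name; the helper names are hypothetical and the exact naming done by the module may differ:

```python
from fnmatch import fnmatchcase

# Pattern copied from the dynamicField definition in schema.xml.
DYNAMIC_FIELD_PATTERN = "tem_*"


def solr_field_name(prefix, machine_name):
    # ASSUMPTION: the name is simply "<prefix>_<machine name>".
    return f"{prefix}_{machine_name}"


def uses_dynamic_field(field_name, pattern=DYNAMIC_FIELD_PATTERN):
    # Case-sensitive glob match, where "*" stands for any run of characters.
    return fnmatchcase(field_name, pattern)


name = solr_field_name("tem", "title")
print(name, uses_dynamic_field(name))
```

A name that starts with another prefix, such as tm_title, would not match the pattern and would therefore be handled by a different dynamic field.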

Your own data type then appears in the data type select list when you map types to your fields.
The key of the returned array is usually a field type defined in schema.xml.

By using this hook, you'll be able to map any field to the edge_n2_kw_text data type, which is defined in schema.xml:

<fieldType name="edge_n2_kw_text" class="solr.TextField" omitNorms="true" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

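As you can see, the filter solr.EdgeNGramFilterFactory is enabled on this field type, which means it will be applied to any field mapped to it. It's magic! Keep in mind, though, that this filter only produces grams anchored at the start of each token, and that solr.KeywordTokenizerFactory keeps the whole value as a single token, so matching happens against the beginning of the field value (between 2 and 25 characters with the settings above).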
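To get a feel for what this analyzer chain produces, here is a rough Python emulation. It is a simplification written for this post, not Solr code: the function names are invented, and str.lower() only approximates the lowercase filter. It keeps the whole value as one token (keyword tokenizer), lowercases it, and emits the prefixes of 2 to 25 characters at index time, while the query side is only lowercased:

```python
def edge_ngrams(token, min_size=2, max_size=25):
    """Prefixes of `token` that are min_size to max_size characters long."""
    longest = min(len(token), max_size)
    return [token[:n] for n in range(min_size, longest + 1)]


def index_terms(value):
    # Index analyzer: one token for the whole value, lowercased, then edge n-grams.
    return set(edge_ngrams(value.lower()))


def query_term(text):
    # Query analyzer: one token for the whole query, lowercased, no n-grams.
    return text.lower()


def matches(value, query):
    # A document matches when the query term equals one of the indexed terms.
    return query_term(query) in index_terms(value)


title = "How to learn Drupal in a nutshell"
for query in ("How to", "how to learn dr", "nut", "H"):
    print(repr(query), matches(title, query))
```

In this emulation, queries that are a beginning of the title match, while a string taken from the middle of the title does not, because edge n-grams always start at the first character of the token. Queries shorter than minGramSize or longer than maxGramSize cannot match either.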

I suggest you read issue #1846860, which is where I started asking questions about this and where I found the solution.

See the documentation on the parameters of the filter, and don't forget that you can also create your own types in the file schema_extra_types.xml.

Warning

Be aware that you should not use this on large text fields, as it might break your index. A good use case is to apply it to title fields only.

For large fields, the best approach is to use wildcards; I suggest reading issue #1879762 to understand how to implement them.

I hope this tutorial will help people!

1 Response

Hi, great post. I'm wondering if you can help me, as I have a similar but even simpler issue. My Solr implementation (Drupal 8 site with Search API Solr) seems to use StandardTokenizerFactory by default when I need it to use KeywordTokenizerFactory. In other words, when you search for a phrase, the results are ranked by the keywords in the phrase instead of treating the phrase as though it were wrapped in quotes. Do you know how to make that adjustment, so that it searches for the whole phrase first and then for the keywords if the exact phrase isn't found?

Thanks, Rick

2 months ago