Autocomplete in Elasticsearch: Getting Started

Autocomplete, also known as type-ahead or search-as-you-type, is a crucial feature for enhancing user experience in search applications. It provides suggestions to users as they type, improving search efficiency and accuracy, and reducing typos. Elasticsearch offers powerful and flexible mechanisms for implementing autocomplete functionality. This article provides a step-by-step guide to get started with autocomplete in Elasticsearch.

1. Choosing the Right Approach

Elasticsearch offers several approaches to implement autocomplete, each with its own trade-offs in terms of performance, complexity, and flexibility. Here are the most common:

  • Edge n-grams (Tokenizer/Filter): This is generally the recommended approach for most use cases. It breaks down words into smaller “edge n-grams” (prefixes) at index time, allowing for fast and efficient prefix matching.
  • Completion Suggester: Specifically designed for “completion” scenarios where suggestions are matched from their beginning (e.g., suggesting full titles or phrases). It uses a finite state transducer (FST) for very fast lookups, but it requires the data to be stored in a dedicated completion field type.
  • Search-as-you-type (Datatype): Introduced in Elasticsearch 7.2, this datatype is designed to simplify autocomplete implementations. It automatically creates the necessary subfields (shingle fields and an edge n-gram prefix field) under the hood. It’s a good choice for ease of use, but offers less fine-grained control than the edge n-gram approach.
  • N-grams (Tokenizer/Filter): Similar to edge n-grams, but generates n-grams of any position within the word, not just the leading edge. This is less efficient for autocomplete and is generally not recommended for simple prefix matching.

For this “getting started” guide, we’ll focus on the edge n-gram approach, as it offers a good balance of performance, flexibility, and ease of implementation for most common autocomplete use cases.
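For comparison, here is a minimal sketch of the search-as-you-type alternative listed above. The index name products_sayt is a hypothetical choice for illustration only; the name._2gram and name._3gram subfields are the ones this field type generates automatically.

```json
# Hypothetical index name, used only for this sketch
PUT /products_sayt
{
  "mappings": {
    "properties": {
      "name": {
        "type": "search_as_you_type"
      }
    }
  }
}

# Query the root field and its generated shingle subfields
GET /products_sayt/_search
{
  "query": {
    "multi_match": {
      "query": "elas",
      "type": "bool_prefix",
      "fields": ["name", "name._2gram", "name._3gram"]
    }
  }
}
```

No custom analyzer is needed in this variant, which is the trade-off described above: less setup, but less control over how prefixes are generated.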

2. Indexing Your Data (with Edge N-Grams)

The core of edge n-gram autocomplete is creating an index with an appropriate analyzer. Let’s illustrate with a simple example of indexing product names.

2.1. Create an Index with a Custom Analyzer:

```json
PUT /products
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      },
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "autocomplete_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}
```

Explanation:

  • PUT /products: Creates an index named “products”.
  • settings.analysis: Defines custom analyzers and filters.
  • analyzer.autocomplete_analyzer: Defines a custom analyzer for autocomplete.
    • tokenizer: "standard": Uses the standard tokenizer to break text into words.
    • filter: ["lowercase", "autocomplete_filter"]: Applies filters.
      • lowercase: Converts all tokens to lowercase (important for case-insensitive matching).
      • autocomplete_filter: This is our custom edge n-gram filter.
  • filter.autocomplete_filter: Defines the edge n-gram filter.
    • type: "edge_ngram": Specifies the filter type.
    • min_gram: 1: The minimum length of generated n-grams (start with single characters).
    • max_gram: 20: The maximum length of generated n-grams (limit to prevent excessive index size). Adjust this based on your data.
  • mappings.properties.name: Defines the mapping for the “name” field.
    • type: "text": Indicates that this field contains text.
    • analyzer: "autocomplete_analyzer": Uses our custom analyzer for indexing.
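    • search_analyzer: "standard": This is crucial. At search time the query text is analyzed with the standard analyzer, so the query itself is not broken into edge n-grams. Without this setting, a query such as “elas” would be expanded into “e”, “el”, “ela”, and “elas”, and would match every name containing a word that starts with “e”. It also means full-text search on complete words still works against the same field.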

2.2. Index Some Documents:

```json
POST /products/_doc/1
{
  "name": "Elasticsearch Handbook"
}

POST /products/_doc/2
{
  "name": "Elasticsearch in Action"
}

POST /products/_doc/3
{
  "name": "Learning Elasticsearch"
}
```

With the edge n-gram filter, “Elasticsearch Handbook” would be indexed with tokens like:

  • “e”
  • “el”
  • “ela”
  • “elas”
  • “elast”
  • “elastic”
  • “elastics”
  • “elasticse”
  • “elasticsea”
  • “elasticsear”
  • “elasticsearch”
  • “h”
  • “ha”
  • “han”
  • “hand”
  • “handb”
  • “handbo”
  • “handboo”
  • “handbook”
    …and so on for every prefix of each word. Because each prefix is stored as its own token, matching a typed prefix becomes a fast exact-term lookup.
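One way to check which tokens an analyzer actually produces is the _analyze API. This is a minimal sketch, assuming the products index from section 2.1 has already been created:

```json
# Run the index-time analyzer from section 2.1 against sample text
GET /products/_analyze
{
  "analyzer": "autocomplete_analyzer",
  "text": "Elasticsearch Handbook"
}

# Run the search-time analyzer for comparison
GET /products/_analyze
{
  "analyzer": "standard",
  "text": "elas"
}
```

The first request lists the lowercase prefixes generated for each word. The second returns the single token elas, which is why a typed prefix can match an indexed n-gram directly.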

3. Searching for Autocomplete Suggestions

Now that we have our data indexed with the edge n-gram analyzer, let’s implement the search query. We will use a match_phrase_prefix query.

```json
GET /products/_search
{
  "query": {
    "match_phrase_prefix": {
      "name": {
        "query": "elas"
      }
    }
  }
}
```

Explanation:

  • GET /products/_search: Searches the “products” index.
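  • query.match_phrase_prefix: Matches the query terms as a phrase and treats the last term as a prefix. Because the “name” field already stores edge n-grams, a plain match query would also find these documents, without the query-time prefix expansion.
  • name.query: "elas": Searches for documents in which a word in the “name” field starts with “elas”.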

This query will efficiently return all three documents because “elas” is a prefix of “Elasticsearch” in each of their names.
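Because the name field already stores every prefix as a token, a plain match query is a possible alternative here. The sketch below is not part of the walkthrough above; the two-word query text and the "operator": "and" setting are assumptions chosen to show that every typed word must match.

```json
# Alternative sketch: each typed word must match an indexed edge n-gram
GET /products/_search
{
  "query": {
    "match": {
      "name": {
        "query": "elas hand",
        "operator": "and"
      }
    }
  }
}
```

With the three sample documents, only “Elasticsearch Handbook” contains words starting with both “elas” and “hand”. Unlike a phrase query, this form does not depend on the order of the typed words.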

4. Highlighting Results (Optional)

To make the suggestions even more user-friendly, you can highlight the matched portion of the text.

```json
GET /products/_search
{
  "query": {
    "match_phrase_prefix": {
      "name": {
        "query": "elas"
      }
    }
  },
  "highlight": {
    "fields": {
      "name": {}
    }
  }
}
```

This will add a highlight section to the response, showing the matched text enclosed in <em> tags (by default):

```json
{
  "took" : … ,
  "timed_out" : false,
  "_shards" : … ,
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : … ,
    "hits" : [
      {
        "_index" : "products",
        "_id" : "1",
        "_score" : … ,
        "_source" : {
          "name" : "Elasticsearch Handbook"
        },
        "highlight" : {
          "name" : [
            "<em>Elasticsearch</em> Handbook"
          ]
        }
      },
      {
        "_index" : "products",
        "_id" : "2",
        "_score" : … ,
        "_source" : {
          "name" : "Elasticsearch in Action"
        },
        "highlight" : {
          "name" : [
            "<em>Elasticsearch</em> in Action"
          ]
        }
      },
      {
        "_index" : "products",
        "_id" : "3",
        "_score" : … ,
        "_source" : {
          "name" : "Learning Elasticsearch"
        },
        "highlight" : {
          "name" : [
            "Learning <em>Elasticsearch</em>"
          ]
        }
      }
    ]
  }
}
```
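If the default <em> tags do not suit your front end, the highlighter accepts custom tags. A minimal sketch, where the choice of <strong> is only an illustrative assumption:

```json
# Sketch: replace the default <em> tags with <strong>
GET /products/_search
{
  "query": {
    "match_phrase_prefix": {
      "name": {
        "query": "elas"
      }
    }
  },
  "highlight": {
    "pre_tags": ["<strong>"],
    "post_tags": ["</strong>"],
    "fields": {
      "name": {}
    }
  }
}
```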

5. Refining Your Implementation

  • Boosting: You might want to boost more recent or popular items in the suggestions. You can use a function_score query for this, for example with a field_value_factor function.

  • Fuzzy Matching: If you want to handle slight typos, note that match_phrase_prefix does not accept a fuzziness parameter; you would need a match query with fuzziness instead (and should be mindful of the performance implications). This is generally not recommended for autocomplete, where edge n-grams already handle prefixes. Fuzziness is more appropriate for full-text search.

  • Multiple Fields: You can extend this approach to include multiple fields (e.g., product description, category) by mapping each field with the autocomplete analyzer and using a multi_match query.

  • Performance Tuning: For very large datasets, consider:

    • Increasing the min_gram value to reduce the number of tokens generated.
    • Using the Completion Suggester if you need extreme speed and have specific data requirements.
    • Optimizing your Elasticsearch cluster (e.g., using dedicated master nodes, tuning the shard count).
  • Filtering Autocomplete Suggestions: You can apply filters to the suggestions returned. For instance, to suggest only products that cost $50 or less, add a “price” field to the documents and a range filter to the search query:

```json
POST /products/_doc/4
{
  "name": "Elasticsearch Essentials",
  "price": 35
}

POST /products/_doc/5
{
  "name": "Elasticsearch Mastery",
  "price": 75
}

GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase_prefix": {
            "name": {
              "query": "elas"
            }
          }
        }
      ],
      "filter": [
        {
          "range": {
            "price": {
              "lte": 50
            }
          }
        }
      ]
    }
  }
}
```

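This updated query uses a bool query combining must (for the prefix match) and filter (for the price range). Only products matching both criteria will be returned. Here that is “Elasticsearch Essentials” alone: “Elasticsearch Mastery” costs more than 50, and documents 1 to 3 have no price field, so the range filter excludes them. The filter clause does not affect the relevance score.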
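The boosting idea from the list above can be sketched with a function_score query. The popularity field is hypothetical (it is not part of the sample documents), and the factor, modifier, and missing values are illustrative assumptions rather than tuned recommendations.

```json
# Sketch: popularity is a hypothetical numeric field, not in the sample data
GET /products/_search
{
  "query": {
    "function_score": {
      "query": {
        "match_phrase_prefix": {
          "name": {
            "query": "elas"
          }
        }
      },
      "field_value_factor": {
        "field": "popularity",
        "factor": 1.2,
        "modifier": "log1p",
        "missing": 1
      },
      "boost_mode": "multiply"
    }
  }
}
```

The prefix match still decides which documents are returned; the function only adjusts their ordering. The missing value supplies a fallback for documents that lack the field.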

Conclusion

This guide demonstrates the basics of setting up autocomplete in Elasticsearch using edge n-grams. This approach offers a good starting point for most autocomplete implementations, balancing ease of use, flexibility, and performance. By understanding the concepts of analyzers, filters, and query types, you can build a robust and user-friendly search experience in your applications. Remember to experiment with different settings and approaches to find the best solution for your specific needs. The search-as-you-type field and the completion suggester are also good options to explore as you become more comfortable with Elasticsearch.
