Introduction to Searching with Elasticsearch: A Comprehensive Guide

In today’s data-driven world, the ability to quickly and accurately find information within vast datasets is not just a convenience—it’s a necessity. Whether you’re building an e-commerce platform needing lightning-fast product searches, a logging system requiring rapid analysis of operational data, or a content management system demanding relevant article retrieval, the underlying search technology is paramount. This is where Elasticsearch shines.

Elasticsearch is a powerful, distributed, open-source search and analytics engine built on Apache Lucene. It’s designed for horizontal scalability, reliability, and easy management. At its core, Elasticsearch allows you to store, search, and analyze large volumes of data in near real-time. While it offers a wide array of features, its heart lies in its sophisticated searching capabilities.

This guide serves as a detailed introduction to searching within Elasticsearch. We’ll explore the fundamental concepts, dive into the essential Query DSL (Domain Specific Language), examine various query types, and learn how to refine and control search results. By the end, you’ll have a solid foundation for leveraging Elasticsearch’s search power in your own applications.

Table of Contents

  1. Understanding the Basics: Core Elasticsearch Concepts
    • Documents and Indices
    • Nodes and Clusters
    • Shards and Replicas
    • The Inverted Index: The Secret Sauce of Fast Search
    • Mapping: Defining Your Data Structure
  2. Interacting with Elasticsearch: The Search API
    • RESTful API Principles
    • URI Search: Simple but Limited
    • Request Body Search: Power and Flexibility
    • Basic Search Request Structure
  3. The Heart of the Search: Introduction to the Query DSL
    • What is the Query DSL?
    • Query Context vs. Filter Context
  4. Fundamental Query Types: Finding What You Need
    • Full-Text Queries (Query Context): Searching Analyzed Text
      • match Query: The Standard Full-Text Search
      • match_phrase Query: Searching for Exact Phrases
      • match_phrase_prefix Query: Phrase Matching with Prefix on Last Term
      • multi_match Query: Searching Across Multiple Fields
    • Term-Level Queries (Filter Context / Query Context): Searching Exact Values
      • term Query: Finding Exact Terms (Not Analyzed)
      • terms Query: Finding Multiple Exact Terms
      • range Query: Searching Within a Range
      • exists Query: Finding Documents Containing a Field
      • prefix Query: Matching Document Fields with a Specific Prefix
      • wildcard Query: Using Wildcards (Use with Caution)
      • regexp Query: Using Regular Expressions (Use with Caution)
      • ids Query: Retrieving Documents by ID
  5. Combining Queries: The bool Query
    • Structure of a bool Query
    • must: Clauses MUST Match (AND)
    • should: Clauses SHOULD Match (OR, influences score)
    • must_not: Clauses MUST NOT Match (NOT)
    • filter: Clauses MUST Match (AND, but in Filter Context)
    • Combining Clauses for Complex Logic
  6. Controlling Search Results
    • Sorting: Ordering Your Results (sort)
    • Pagination: Retrieving Results in Batches (from, size)
    • Deep Pagination and search_after: Efficient Scrolling
    • Source Filtering: Selecting Which Fields to Return (_source)
  7. Understanding Relevance and Scoring
    • What is Relevance (_score)?
    • Brief Overview of TF/IDF and BM25
    • Factors Influencing Score
    • Debugging Scores with the explain Parameter
  8. Highlighting Search Results
    • Why Use Highlighting?
    • Basic Highlighting Usage (highlight)
    • Configuring Highlighters
  9. The Role of Text Analysis
    • What is Analysis?
    • Analyzers, Tokenizers, and Token Filters
    • Standard Analyzer vs. Other Built-in Analyzers
    • How Analysis Affects match vs. term Queries
    • Testing Analyzers with the _analyze API
  10. Putting It All Together: A More Complex Example
  11. Conclusion and Next Steps

1. Understanding the Basics: Core Elasticsearch Concepts

Before diving into search queries, it’s essential to understand how Elasticsearch organizes and stores data.

Documents and Indices

  • Document: The basic unit of information that can be indexed in Elasticsearch. It’s represented in JSON (JavaScript Object Notation) format. Think of a document as a row in a relational database table, but more flexible and hierarchical. Example: A document representing a user, a product, or a log entry.
  • Index: A collection of documents that have somewhat similar characteristics. An index is the highest-level entity you can query against. It’s analogous to a database in a relational system. For example, you might have an index for products, users, or logs-2023-10. Index names must be lowercase.

```json
// Example document for a 'products' index
{
  "product_id": "P12345",
  "name": "Elasticsearch Power Bank",
  "description": "A high-capacity power bank with fast charging for all your devices.",
  "price": 49.99,
  "category": "Electronics",
  "tags": ["power bank", "charger", "portable", "electronics"],
  "in_stock": true,
  "date_added": "2023-10-26T10:00:00Z",
  "features": {
    "capacity_mah": 20000,
    "ports": ["USB-A", "USB-C"],
    "weight_grams": 350
  }
}
```

Nodes and Clusters

  • Node: A single running instance of Elasticsearch. It participates in the cluster’s indexing and search capabilities.
  • Cluster: A collection of one or more nodes that work together, sharing their data and workload. A cluster provides high availability and scalability. Nodes communicate with each other to maintain a consistent state.

Shards and Replicas

To handle large amounts of data and provide fault tolerance, Elasticsearch divides indices into smaller pieces called shards.

  • Shard: Each shard is a fully functional and independent index (a Lucene index). When you index a document, Elasticsearch routes it to a specific primary shard based on a routing algorithm (often based on the document ID). When you search, Elasticsearch queries all relevant shards in parallel and combines the results. This distribution allows for horizontal scaling.
  • Primary Shard: The main shard where indexing operations first occur. The number of primary shards for an index is fixed when the index is created.
  • Replica Shard: A copy of a primary shard. Replicas serve two main purposes:
    1. High Availability: If a node holding a primary shard fails, a replica shard on another node can be promoted to become the primary.
    2. Increased Search Throughput: Search requests can be handled by either primary or replica shards, distributing the search load.
    Unlike the number of primary shards, the number of replicas can be changed dynamically.
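The routing step described above can be sketched in Python. This is a simplified stand-in, not Elasticsearch's actual implementation: Elasticsearch hashes the routing value (by default the document `_id`) with Murmur3, while this sketch uses MD5 purely as a deterministic hash.

```python
# Sketch of routing a document to a primary shard by hashing its ID.
# Elasticsearch uses Murmur3; hashlib.md5 here is just a deterministic stand-in.
import hashlib

def route_to_shard(doc_id: str, num_primary_shards: int) -> int:
    """Return the primary shard number for a document ID."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_primary_shards

# The same ID always lands on the same shard, which is why the number of
# primary shards is fixed at index creation time: changing the modulus
# would invalidate the routing of every existing document.
shard = route_to_shard("P12345", 3)
assert shard == route_to_shard("P12345", 3)
```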

The Inverted Index: The Secret Sauce of Fast Search

The reason Elasticsearch (and underlying Lucene) can perform fast full-text searches is primarily due to a data structure called the inverted index.

Instead of listing documents and the words they contain (like a traditional database), an inverted index lists unique terms (words) that appear in any document within the index and identifies all the documents each term appears in.

Consider these simple documents:

  1. { "text": "The quick brown fox" }
  2. { "text": "The lazy brown dog" }
  3. { "text": "The quick agile fox" }

A simplified inverted index for the text field might look like this:

| Term  | Document IDs | Frequency | Positions (in doc)  |
|-------|--------------|-----------|---------------------|
| agile | [3]          | 1         | [2]                 |
| brown | [1, 2]       | 2         | [2], [2]            |
| dog   | [2]          | 1         | [3]                 |
| fox   | [1, 3]       | 2         | [3], [3]            |
| lazy  | [2]          | 1         | [1]                 |
| quick | [1, 3]       | 2         | [1], [1]            |
| the   | [1, 2, 3]    | 3         | [0], [0], [0]       |

(Note: Stop words like “the” might be removed by analysis, and actual indices store more info like term positions for phrase queries)

When you search for “quick fox”, Elasticsearch:

  1. Looks up “quick” in the inverted index -> finds Docs [1, 3].
  2. Looks up “fox” in the inverted index -> finds Docs [1, 3].
  3. Combines these lists (e.g., finds the intersection for an AND query) -> identifies Docs [1, 3] as potential matches.
  4. Calculates a relevance score for each matching document based on factors like term frequency (how often the term appears in the document) and inverse document frequency (how rare the term is across all documents).

This structure makes searching incredibly fast, as it avoids scanning every document for the search terms.
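The lookup steps above can be sketched with a toy inverted index in Python. This is illustrative only: real analysis also handles stop words, stemming, and positions, and Lucene's on-disk structures are far more compact.

```python
from collections import defaultdict

docs = {
    1: "The quick brown fox",
    2: "The lazy brown dog",
    3: "The quick agile fox",
}

# Build the inverted index: term -> set of document IDs.
# (Real analysis would also strip stop words, record positions, etc.)
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted[term].add(doc_id)

# AND query for "quick fox": intersect the postings lists,
# avoiding a scan over every document.
matches = inverted["quick"] & inverted["fox"]
print(sorted(matches))  # [1, 3]
```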

Mapping: Defining Your Data Structure

While Elasticsearch can often infer data types (dynamic mapping), explicitly defining the structure and data types of your fields using a mapping is crucial for effective searching. Mapping tells Elasticsearch how to treat each field in your documents – whether it’s text to be analyzed for full-text search, a keyword for exact matching, a number, a date, a boolean, etc.

  • Data Types: text, keyword, integer, long, float, double, boolean, date, object, nested, geo_point, etc.
  • Analyzers: For text fields, you specify which analyzer should be used to process the text during indexing and searching (more on this later).

Defining mappings ensures data consistency and enables precise querying. For example, mapping a field as keyword ensures it’s treated as a single, exact value, suitable for filtering and aggregations, while mapping it as text enables full-text search capabilities.

```json
// Example mapping definition for the 'products' index
PUT /products
{
  "mappings": {
    "properties": {
      "product_id": { "type": "keyword" },  // Exact-match ID
      "name": {
        "type": "text",                     // Full-text search on name
        "fields": {
          "keyword": {                      // Also index as keyword for sorting/aggregations
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "description": { "type": "text" },    // Full-text search on description
      "price": { "type": "float" },         // Numeric type for range queries
      "category": { "type": "keyword" },    // Exact-match category
      "tags": { "type": "keyword" },        // Array of exact-match tags
      "in_stock": { "type": "boolean" },    // Boolean type
      "date_added": { "type": "date" },     // Date type for time-based queries
      "features": {                         // Object field (implicit 'object' type)
        "properties": {
          "capacity_mah": { "type": "integer" },
          "ports": { "type": "keyword" },
          "weight_grams": { "type": "integer" }
        }
      }
    }
  }
}
```


2. Interacting with Elasticsearch: The Search API

Elasticsearch provides a comprehensive and easy-to-use RESTful API for interacting with your cluster. Searching is primarily done via HTTP GET or POST requests to the _search endpoint.

RESTful API Principles

Interactions typically involve:

  • HTTP Verbs: GET (retrieve data), POST (send data, often used for complex searches), PUT (create or update data), DELETE (remove data).
  • Endpoints: URLs identifying resources (e.g., /{index}/_search, /{index}/_doc/{id}).
  • Request Body: Often JSON payloads containing instructions or data (especially for POST and PUT).
  • Response Body: JSON payloads containing results or status information.

URI Search: Simple but Limited

The simplest way to search is by passing query parameters directly in the URL. This is known as URI Search. It’s useful for quick tests or simple queries.

The primary parameter is q, which uses the Lucene query string syntax.

```bash
# Search for documents containing "power bank" in any field of the 'products' index
GET /products/_search?q=power%20bank

# Search for products where the category is "Electronics"
GET /products/_search?q=category:Electronics

# More complex query: name contains "elastic" AND price is greater than 30
GET /products/_search?q=name:elastic%20AND%20price:>30
```

Limitations of URI Search:

  • Can become complex and hard to read/debug.
  • Limited expressiveness compared to the Request Body Search.
  • URL length limits can be an issue for very complex queries.
  • Less control over specific query types and options.

While useful for exploration, most applications rely on the Request Body Search.

Request Body Search: Power and Flexibility

The most common and powerful way to search is by sending a JSON payload in the request body of a GET or POST request to the _search endpoint. This allows you to use the full Query DSL.

```bash
# Using GET with a request body
GET /products/_search
{
  "query": {
    "match": {
      "description": "fast charging"
    }
  }
}

# Using POST (often preferred for complex bodies)
POST /products/_search
{
  "query": {
    "match": {
      "description": "fast charging"
    }
  }
}
```

Basic Search Request Structure

A typical Request Body search request has the following structure:

```json
{
  "query": {
    // Query definition goes here (using the Query DSL)
  },
  "from": 0,    // Starting document offset (for pagination)
  "size": 10,   // Number of documents to return (for pagination)
  "sort": [
    // Sorting criteria go here
  ],
  "_source": [
    // Field patterns for which fields to include/exclude
  ],
  "highlight": {
    // Highlighting configuration goes here
  },
  "aggs": {
    // Aggregation definitions (for analytics/summarization)
  }
  // ... other top-level parameters such as timeout, explain, etc.
}
```

The most critical part is the query object, which contains the Query DSL specifying the search criteria.


3. The Heart of the Search: Introduction to the Query DSL

The Elasticsearch Query DSL (Domain Specific Language) is a flexible, expressive JSON-based language used to define queries. It provides a rich set of query types that can be combined in intricate ways to find exactly the data you need.

What is the Query DSL?

Instead of writing queries as plain text strings (like SQL or Lucene query syntax), you build queries by composing JSON objects. Each object represents a specific type of query clause (e.g., match, term, range, bool).

Benefits of Query DSL:

  • Expressiveness: Supports a wide range of query types, from simple term matching to complex geospatial queries.
  • Readability: Well-structured JSON can be easier to read and maintain than complex string queries.
  • Composability: Queries can be nested and combined logically (e.g., using the bool query).
  • Control: Provides fine-grained control over the querying process.

Query Context vs. Filter Context

This is a crucial concept in Elasticsearch querying. When you write a query clause, it executes in one of two contexts:

  1. Query Context:

    • Asks the question: “How well does this document match the query criteria?”
    • Clauses running in query context contribute to the relevance score (_score), determining the order of results. Higher scores mean better matches.
    • Examples: match, multi_match. Generally used for full-text search where relevance matters.
  2. Filter Context:

    • Asks the question: “Does this document match the query criteria?” (Yes/No).
    • Clauses running in filter context do not calculate a score. They simply include or exclude documents.
    • Performance: Filter context is generally faster than query context because scoring is skipped.
    • Caching: Filter results are often cached by Elasticsearch, leading to significant performance improvements for repeated queries.
    • Examples: term, terms, range, exists. Typically used for exact value matching, range filtering, or boolean checks.

When to use which?

  • Use Query Context when you need full-text search capabilities or when the relevance score is important (e.g., searching for “best power bank” in product descriptions).
  • Use Filter Context when you need exact matching or filtering based on structured data (e.g., finding products with category: "Electronics" or price between 30 and 50, or in_stock: true). Filters are ideal for faceting/aggregation scenarios.

The bool query (discussed later) provides a powerful way to combine clauses in both query and filter contexts within a single search request.
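The distinction can be caricatured in Python. This is purely conceptual, not how Elasticsearch is implemented, and the toy scoring formula stands in for BM25: a filter is a yes/no predicate whose result is cheap to cache, while a query returns a graded score.

```python
# Conceptual sketch only: filter context = boolean predicate (cacheable),
# query context = scoring function that says "how well" a document matches.
from functools import lru_cache

@lru_cache(maxsize=None)  # filter results are easy to cache: just yes/no
def in_stock_filter(in_stock: bool) -> bool:
    return in_stock is True

def match_score(field_text: str, query: str) -> float:
    # Toy relevance: fraction of query terms present in the field.
    terms = query.lower().split()
    hits = sum(term in field_text.lower() for term in terms)
    return hits / len(terms)

doc = {"description": "A high-capacity power bank with fast charging",
       "in_stock": True}

if in_stock_filter(doc["in_stock"]):                          # include or exclude
    score = match_score(doc["description"], "fast charging")  # how well does it match?
    print(round(score, 2))  # 1.0
```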


4. Fundamental Query Types: Finding What You Need

The Query DSL offers numerous query types. We’ll cover the most common and fundamental ones, categorized broadly into Full-Text and Term-Level queries.

(Assume we are querying the products index created earlier for the examples below.)

Full-Text Queries (Query Context): Searching Analyzed Text

These queries are typically used for searching natural language text in fields mapped as text. They work by first analyzing the query string (using the same analyzer applied to the field during indexing) and then finding documents that match the resulting terms. They operate in Query Context and contribute to the _score.

a) match Query: The Standard Full-Text Search

The match query is your go-to for standard full-text search. It analyzes the provided text and looks for the analyzed terms in the specified field. By default, it uses an OR operator between the terms, but this can be changed.

```json
POST /products/_search
{
  "query": {
    "match": {
      "description": "high capacity charging"
    }
  }
}
```

  • How it works: The text “high capacity charging” is analyzed (e.g., tokenized into high, capacity, charging). Elasticsearch finds documents whose description field contains any of these terms. Documents containing more terms or more frequent terms will generally score higher.
  • Operators: You can change the default OR behavior to AND:

    ```json
    {
      "query": {
        "match": {
          "description": {
            "query": "high capacity charging",
            "operator": "and" // All terms must be present
          }
        }
      }
    }
    ```
  • Fuzziness: Supports finding terms that are “close” (within a certain edit distance) using the fuzziness parameter (e.g., fuzziness: "AUTO" or fuzziness: 2). Useful for handling typos.
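The "edit distance" behind fuzziness can be illustrated with a classic Levenshtein implementation. (Elasticsearch's fuzzy matching is based on the Damerau-Levenshtein variant, which additionally counts a transposition of adjacent characters as a single edit.)

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]

# "chargin" is within edit distance 1 of "charging", so a match query
# with fuzziness 1 (or "AUTO") would still match the typo.
print(levenshtein("charging", "chargin"))  # 1
```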

b) match_phrase Query: Searching for Exact Phrases

This query analyzes the text and looks for the analyzed terms appearing in the exact specified order, potentially with a configurable slop (maximum number of positions terms can be apart).

```json
POST /products/_search
{
  "query": {
    "match_phrase": {
      "description": "fast charging" // Looks for "fast" immediately followed by "charging"
    }
  }
}
```

  • Slop: Allows for some flexibility in term positions. A slop of 1 lets the terms be one position apart; swapping two adjacent terms costs two position moves, so matching a reversed phrase needs a slop of 2.

    ```json
    {
      "query": {
        "match_phrase": {
          "description": {
            "query": "charging fast", // Terms are reversed
            "slop": 2                 // Allows them to match "fast charging"
          }
        }
      }
    }
    ```

c) match_phrase_prefix Query: Phrase Matching with Prefix on Last Term

Similar to match_phrase, but performs a prefix match on the last term in the phrase. Useful for auto-complete “search-as-you-type” scenarios.

```json
POST /products/_search
{
  "query": {
    "match_phrase_prefix": {
      "name": "Elasticsearch Pow" // Matches "Elasticsearch Power Bank"
    }
  }
}
// max_expansions controls how many terms the prefix part may expand to
```

d) multi_match Query: Searching Across Multiple Fields

Allows searching for the same text across multiple fields. Useful when you don’t know exactly where the information resides or want to boost relevance based on matches in certain fields.

```json
POST /products/_search
{
  "query": {
    "multi_match": {
      "query": "portable charger",
      "fields": ["name", "description", "tags"] // Search these fields
      // Optional: boost certain fields, e.g. "fields": ["name^3", "description", "tags"]
    }
  }
}
```

  • Types: multi_match has different execution types (best_fields, most_fields, cross_fields, phrase, phrase_prefix) controlling how scores from different fields are combined. best_fields (default) is often suitable: it finds the best matching field and uses its score. most_fields is useful when matching across fields containing different analyses of the same text.

Term-Level Queries (Filter Context / Query Context): Searching Exact Values

These queries operate on exact, non-analyzed terms. They are typically used on fields mapped as keyword, numbers, dates, or booleans. They are often used within a Filter Context for precise filtering (as they don’t usually require scoring), but can also be used in a Query Context if needed.

a) term Query: Finding Exact Terms (Not Analyzed)

Finds documents containing the exact term specified in the inverted index for a given field. Important: The value provided is NOT analyzed. This is key for keyword fields or when you need precise matching.

```json
POST /products/_search
{
  "query": {
    // Usually used inside a 'bool' filter for performance
    "bool": {
      "filter": [
        { "term": { "category": "Electronics" } }, // Match the exact category
        { "term": { "in_stock": true } }           // Match the exact boolean value
      ]
    }
  }
}
```

  • Case Sensitivity: term queries are case-sensitive for keyword fields unless a normalizer is defined in the mapping. If used on a text field, it looks for the exact term after analysis (e.g., term: { "description": "charging" } would match the analyzed lowercase term). Using term on text fields is often less useful than match.

b) terms Query: Finding Multiple Exact Terms

Similar to term, but allows matching any term from a provided list (logical OR).

```json
POST /products/_search
{
  "query": {
    "bool": {
      "filter": {
        "terms": {
          "tags": ["portable", "charger"] // Match if tags include EITHER "portable" OR "charger"
        }
      }
    }
  }
}
```

c) range Query: Searching Within a Range

Finds documents where a field’s value falls within a specified range. Works on numbers, dates, and even string ranges.

Operators:

  • gt: Greater than
  • gte: Greater than or equal to
  • lt: Less than
  • lte: Less than or equal to

```json
POST /products/_search
{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "price": {
            "gte": 30, // Price >= 30
            "lt": 50   // Price < 50
          }
        }
      }
    }
  }
}

// Range query on dates (using date math or absolute dates)
POST /products/_search
{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "date_added": {
            "gte": "now-7d/d", // From the start of the day 7 days ago
            "lte": "now/d"     // To the start of today
          }
        }
      }
    }
  }
}
```

d) exists Query: Finding Documents Containing a Field

Finds documents that have any non-null value for a specific field. Useful for finding documents where a field is present (or missing, when used inside must_not).

```json
POST /products/_search
{
  "query": {
    "bool": {
      "filter": {
        "exists": {
          "field": "features.capacity_mah" // Find products where capacity is specified
        }
      }
    }
  }
}
```

e) prefix Query: Matching Document Fields with a Specific Prefix

Finds documents where a field starts with the specified prefix. Like term, it operates on non-analyzed terms. Can be slow on text fields; more efficient on keyword fields.

```json
POST /products/_search
{
  "query": {
    "prefix": {
      "product_id": "P123" // Find products whose ID starts with "P123"
    }
  }
}
// Use with caution on highly variable text fields due to performance implications.
```

f) wildcard Query: Using Wildcards (Use with Caution)

Allows using wildcard characters (* for multiple characters, ? for a single character) in the term. Warning: Wildcard queries, especially with leading wildcards (*term), can be very slow and resource-intensive as they may need to scan many terms in the index. Avoid if possible, or ensure prefixes are long enough.

```json
POST /products/_search
{
  "query": {
    "wildcard": {
      "name.keyword": "Elas*arch" // Match name.keyword values like "Elas*arch"
    }
  }
}
// Use on keyword fields where possible. Avoid leading wildcards.
```

g) regexp Query: Using Regular Expressions (Use with Caution)

Allows matching based on regular expression patterns. Warning: Like wildcard queries, regexp queries can be very slow and resource-intensive. Use sparingly and carefully.

```json
POST /products/_search
{
  "query": {
    "regexp": {
      "product_id": "P[0-9]{4}" // Match product IDs like P followed by 4 digits
    }
  }
}
```

h) ids Query: Retrieving Documents by ID

A specialized query to retrieve documents based on their specific _id values. Very fast.

```json
POST /products/_search
{
  "query": {
    "ids": {
      "values": ["P12345", "P67890"] // Retrieve these specific documents
    }
  }
}
```


5. Combining Queries: The bool Query

Real-world search requirements often involve combining multiple criteria. The bool (boolean) query is the primary tool for this. It allows you to combine other query clauses using boolean logic (AND, OR, NOT).

Structure of a bool Query

A bool query has four main clause types:

```json
{
  "query": {
    "bool": {
      "must": [
        // Clauses here MUST match (like AND). Contribute to the score.
      ],
      "filter": [
        // Clauses here MUST match (like AND). Execute in filter context (no score, cacheable).
      ],
      "should": [
        // Clauses here SHOULD match (like OR). Contribute to the score.
        // If used alone, at least one must match; alongside must/filter, they act as score boosters.
      ],
      "must_not": [
        // Clauses here MUST NOT match (like NOT). Execute in filter context.
      ]
    }
  }
}
```

must: Clauses MUST Match (AND)

All query clauses within the must array must match for a document to be considered a hit. These clauses run in Query Context and contribute to the _score.

```json
{
  "query": {
    "bool": {
      "must": [
        { "match": { "description": "power bank" } },
        { "match": { "description": "fast charging" } }
        // Both 'match' queries must be satisfied
      ]
    }
  }
}
```

should: Clauses SHOULD Match (OR, influences score)

Clauses within the should array are optional matches. If a bool query contains only should clauses (no must or filter), then at least one should clause must match. If must or filter clauses are present, then should clauses act primarily as score boosters – documents matching should clauses get a higher _score.

```json
// Scenario 1: only 'should' - acts like OR
{
  "query": {
    "bool": {
      "should": [
        { "term": { "category": "Electronics" } },
        { "term": { "tags": "portable" } }
        // Match if category is Electronics OR tags include portable
      ],
      "minimum_should_match": 1 // Default is 1 when only 'should' is present
    }
  }
}

// Scenario 2: 'must' and 'should' - 'should' boosts the score
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "power bank" } } // Must be a power bank
      ],
      "should": [
        { "term": { "features.ports": "USB-C" } },                 // Higher score if it has USB-C
        { "range": { "features.capacity_mah": { "gte": 20000 } } } // Higher score if capacity >= 20000
      ]
    }
  }
}
```

The minimum_should_match parameter controls how many should clauses must match.

must_not: Clauses MUST NOT Match (NOT)

All clauses within the must_not array must not match for a document to be considered a hit. These clauses run in Filter Context (they don’t affect the score, just exclusion).

```json
{
  "query": {
    "bool": {
      "must": [
        { "match": { "description": "charger" } }
      ],
      "must_not": [
        { "term": { "category": "Clearance" } } // Must be a charger, but not in the Clearance category
      ]
    }
  }
}
```

filter: Clauses MUST Match (AND, but in Filter Context)

Clauses within the filter array must match, just like must. However, they execute in Filter Context. This means they don’t contribute to the score and are candidates for caching. This is the preferred place for term-level queries (term, range, exists, etc.) used for filtering.

```json
{
  "query": {
    "bool": {
      "must": [
        // Use 'must' for full-text search where the score matters
        { "match": { "description": "high capacity" } }
      ],
      "filter": [
        // Use 'filter' for exact matching / range filtering
        { "term": { "in_stock": true } },
        { "range": { "price": { "lte": 50 } } }
      ]
    }
  }
}
```

Performance Tip: Whenever possible, use filter instead of must for non-scoring, exact-match criteria.

Combining Clauses for Complex Logic

The real power comes from nesting bool queries and combining different clause types to represent complex business logic.

```json
// Find products that:
// - MUST mention "power bank" in name or description (full-text, scored)
// - AND MUST be in stock (filter)
// - AND MUST have a price <= 100 (filter)
// - AND MUST NOT be in the 'Obsolete' category (filter)
// - AND SHOULD ideally have a "USB-C" port OR capacity >= 20000 (boost score)

POST /products/_search
{
  "query": {
    "bool": {
      "must": [ // Scored query
        {
          "multi_match": {
            "query": "power bank",
            "fields": ["name", "description"]
          }
        }
      ],
      "filter": [ // Non-scoring filters
        { "term": { "in_stock": true } },
        { "range": { "price": { "lte": 100 } } }
      ],
      "must_not": [ // Exclusion filter
        { "term": { "category": "Obsolete" } }
      ],
      "should": [ // Score boosters
        { "term": { "features.ports": "USB-C" } },
        { "range": { "features.capacity_mah": { "gte": 20000 } } }
      ],
      "minimum_should_match": 0 // 'should' clauses are purely optional boosters here
    }
  }
}
```


6. Controlling Search Results

Finding matching documents is only part of the story. You also need to control how the results are presented.

Sorting: Ordering Your Results (sort)

By default, Elasticsearch sorts results by relevance score (_score) in descending order (most relevant first). You can change this using the sort parameter, specifying one or more fields to sort by.

```json
POST /products/_search
{
  "query": {
    "match": { "category": "Electronics" }
  },
  "sort": [
    { "price": "asc" },       // Sort primarily by price ascending
    { "_score": "desc" },     // Then by relevance score descending (for ties in price)
    { "name.keyword": "asc" } // Then by name ascending (using the keyword sub-field)
  ]
}
```

  • Use the .keyword sub-field for sorting on analyzed text fields if you want to sort alphabetically on the original, non-analyzed value.
  • Sorting by fields other than _score can impact performance as it requires accessing field data.
  • When sorting by fields other than _score, the _score is not calculated by default (as it’s not needed). You can force score calculation by setting track_scores: true.
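Multi-field sorting like the request above behaves like sorting on a composite key. A loose Python analogy (negating the numeric score to get its descending order within a single ascending sort):

```python
hits = [
    {"name": "Cable", "price": 9.99, "_score": 1.2},
    {"name": "Power Bank", "price": 49.99, "_score": 3.4},
    {"name": "Adapter", "price": 9.99, "_score": 2.8},
]

# Composite key: price ascending, then _score descending (negated), then name ascending.
# Ties on price (the two 9.99 items) are broken by the higher score.
ordered = sorted(hits, key=lambda h: (h["price"], -h["_score"], h["name"]))
print([h["name"] for h in ordered])  # ['Adapter', 'Cable', 'Power Bank']
```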

Pagination: Retrieving Results in Batches (from, size)

Elasticsearch returns the top 10 hits by default (size: 10). You can retrieve results in pages using:

  • size: The number of hits to return per page.
  • from: The starting offset (document index) from which to return hits. from is 0-based.

```json
// Get the first page (hits 0-9)
POST /products/_search
{
  "query": { "match_all": {} }, // Match all documents
  "size": 10,
  "from": 0
}

// Get the second page (hits 10-19)
POST /products/_search
{
  "query": { "match_all": {} },
  "size": 10,
  "from": 10
}

// Get the third page (hits 20-29)
POST /products/_search
{
  "query": { "match_all": {} },
  "size": 10,
  "from": 20
}
```

Warning: Deep Pagination
Using large from values (e.g., retrieving page 1000 with size: 10, meaning from: 9990) can be very inefficient. Elasticsearch needs to fetch from + size results from each relevant shard, sort them all on the coordinating node, and then discard the first from results. This consumes significant memory and CPU, especially in a distributed environment. Avoid deep pagination with from/size if possible (typically beyond the first few pages). The default limit for from + size is 10,000 (index.max_result_window), though this can be changed.
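The cost described above can be made concrete with a small simulation (illustrative only): for a given page, every shard must ship its own top from + size candidates to the coordinating node, which merges them all and discards everything before the requested offset.

```python
import heapq

def coordinate(shard_results, from_, size):
    """Merge per-shard top lists the way a coordinating node would:
    each shard contributes its own top (from_ + size) hits, the node
    merges them into global order, then discards the first from_ results."""
    per_shard = [shard[: from_ + size] for shard in shard_results]
    merged = list(heapq.merge(*per_shard))  # each shard list is already sorted
    return merged[from_: from_ + size], sum(map(len, per_shard))

# 3 shards of pre-sorted hits; ask for page 3 (from=20, size=10).
shards = [sorted(range(i, 300, 3)) for i in range(3)]
page, transferred = coordinate(shards, 20, 10)
print(len(page), transferred)  # 10 90 -> 90 candidates moved to return just 10 hits
```

With from=9990 the same request would force each shard to ship 10,000 candidates, which is exactly why deep from/size pagination gets expensive.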

Deep Pagination and search_after: Efficient Scrolling

For deep scrolling scenarios (like infinite scroll or processing all results of a query), search_after is the recommended approach. It avoids the overhead of from/size by using the sort values of the last hit from the previous page to fetch the next page.

How it works:

  1. Perform an initial search with a sort clause (must include a unique tie-breaker like _id or another unique field).
  2. The response will contain the hits for the first page, and each hit will have a sort array containing the values it was sorted by.
  3. To get the next page, repeat the same query and same sort, but add the search_after parameter, passing the sort array values from the last hit of the previous page.

```json
// 1. Initial search (get page 1)
POST /products/_search
{
  "size": 10,
  "query": { "match": { "category": "Electronics" } },
  "sort": [
    { "price": "asc" },
    { "product_id": "asc" } // Use a unique field like product_id as a tie-breaker
  ]
}

// Assume the last hit on page 1 had price 49.99 and product_id "P12345".
// Its 'sort' array in the response would be: [49.99, "P12345"]

// 2. Get page 2 using search_after
POST /products/_search
{
  "size": 10,
  "query": { "match": { "category": "Electronics" } },
  "sort": [
    { "price": "asc" },
    { "product_id": "asc" }
  ],
  "search_after": [49.99, "P12345"] // Pass the sort values from the last hit of page 1
}

// Repeat for subsequent pages, using the sort values from the last hit of the current page
```

search_after is stateless and much more efficient for deep scrolling than from/size.
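The paging loop can be imitated in a few lines of plain Python — an in-memory stand-in, not the real client API — to show how the `(price, product_id)` sort values act as a cursor:

```python
# Minimal simulation of search_after paging: hits are ordered by
# (price, product_id), and each page starts strictly after the sort
# values of the previous page's last hit. Data here is synthetic.
products = sorted(
    [{"product_id": f"P{i:05d}", "price": 5 + (i * 7) % 90}
     for i in range(25)],
    key=lambda p: (p["price"], p["product_id"]),
)

def search_page(search_after=None, size=10):
    if search_after is None:
        hits = products[:size]
    else:
        # Keep only docs sorting strictly after the previous last hit
        hits = [p for p in products
                if (p["price"], p["product_id"]) > tuple(search_after)][:size]
    # Each hit carries the sort values the next request will pass back
    return [dict(p, sort=[p["price"], p["product_id"]]) for p in hits]

page1 = search_page()
page2 = search_page(search_after=page1[-1]["sort"])
```

Because each request only needs "everything after this cursor", no page depends on how deep you already are — which is exactly why it scales where `from`/`size` does not.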

Source Filtering: Selecting Which Fields to Return (_source)

By default, Elasticsearch returns the entire JSON _source document for each hit. To reduce network traffic and client-side processing, you can specify which fields to include or exclude using the _source parameter.

```json
POST /products/_search
{
  "query": { "match_all": {} },
  "_source": false // Don't return the _source at all
}

POST /products/_search
{
  "query": { "match_all": {} },
  "_source": ["product_id", "name", "price"] // Only return these fields
}

POST /products/_search
{
  "query": { "match_all": {} },
  "_source": { // Use include/exclude patterns
    "includes": ["product*", "name"],
    "excludes": ["*.description"]
  }
}
```

Source filtering is highly recommended, especially when dealing with large documents, to retrieve only the necessary data.
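For flat documents, the include/exclude behavior can be approximated with shell-style wildcards — a toy sketch, not the real implementation, which also handles nested objects:

```python
# Toy _source filtering: apply include patterns first, then exclude
# patterns, using fnmatch-style wildcards on top-level field names.
from fnmatch import fnmatch

def filter_source(doc, includes=None, excludes=None):
    fields = list(doc.keys())
    if includes:
        fields = [f for f in fields if any(fnmatch(f, p) for p in includes)]
    if excludes:
        fields = [f for f in fields if not any(fnmatch(f, p) for p in excludes)]
    return {f: doc[f] for f in fields}

doc = {"product_id": "P12345", "product_name": "Power Bank",
       "price": 49.99, "long_description": "..."}
print(filter_source(doc, includes=["product*", "price"]))
# {'product_id': 'P12345', 'product_name': 'Power Bank', 'price': 49.99}
```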


7. Understanding Relevance and Scoring

When performing full-text searches (like match), Elasticsearch assigns a relevance score (_score) to each matching document. This score determines the default sort order. Understanding how this score is calculated helps you interpret results and fine-tune queries.

What is Relevance (_score)?

The _score is a floating-point number representing how relevant a document is to a given query. A higher score indicates higher relevance. The exact value is relative to the query and the documents in the index; it’s not an absolute measure.

Brief Overview of TF/IDF and BM25

Elasticsearch uses sophisticated algorithms to calculate relevance. Historically, TF/IDF (Term Frequency/Inverse Document Frequency) was common. Modern versions of Elasticsearch (since 5.0) default to Okapi BM25, which is generally considered superior.

Both algorithms are based on similar core concepts:

  1. Term Frequency (TF): How often does the search term appear in the field within this specific document? More frequent occurrences generally mean higher relevance.
  2. Inverse Document Frequency (IDF): How rare is the search term across all documents in the index? Common terms (like “the”, “a”) have low IDF scores, while rare terms have high IDF scores. Matches on rarer terms contribute more significantly to relevance.
  3. Field Length Norm: How long is the field? Matches in shorter fields are typically considered more relevant than matches in very long fields (the term constitutes a larger portion of the field). BM25 handles this more effectively than classic TF/IDF, preventing scores from being overly penalized for longer documents (saturation).

BM25 also includes parameters (k1 and b) that can be tuned to adjust how TF and field length influence the score, although the defaults are usually effective.
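The per-term BM25 score is compact enough to sketch directly. This uses the IDF form employed by Lucene and Elasticsearch's default `k1` and `b`, but treat it as an illustration of the shape of the formula rather than an exact reproduction of the engine's scoring:

```python
import math

# Per-term BM25 score: idf * saturated-tf, where k1 controls term-frequency
# saturation and b controls field-length normalization.
def bm25_term_score(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.75):
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + norm)

# A rare term (df=2) outscores a common one (df=900) at the same tf
rare = bm25_term_score(tf=3, df=2, doc_len=100, avg_doc_len=120, num_docs=1000)
common = bm25_term_score(tf=3, df=900, doc_len=100, avg_doc_len=120, num_docs=1000)
print(rare > common)  # True
```

Note the saturation effect: going from `tf=1` to `tf=10` does not multiply the score by 10, which is what keeps a single spammy field from dominating results.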

Factors Influencing Score

  • Matching terms from the query (match, multi_match, etc.).
  • Frequency of terms within the document (TF).
  • Rarity of terms across the index (IDF).
  • Length of the field being searched.
  • Query type (e.g., match_phrase scores exact phrases higher).
  • Boosting applied at query time (e.g., boosting fields in multi_match, boosting clauses in bool/should).
  • Boosting applied at index time (less common now).

Debugging Scores with the explain Parameter

If you need to understand why a specific document received a certain score, you can add "explain": true to your search request. The response will include a detailed breakdown of the score calculation for each hit, showing the contribution of TF, IDF, field norms, and boosts. This can be verbose but is invaluable for debugging relevance issues.

```json
POST /products/_search
{
  "explain": true, // Add this parameter
  "query": {
    "match": {
      "description": "fast charging"
    }
  }
}
```


8. Highlighting Search Results

When displaying search results to users, it’s often helpful to show why a document matched the query by highlighting the search terms within the relevant fields. Elasticsearch provides built-in highlighting capabilities.

Why Use Highlighting?

Highlighting provides context to the user, showing snippets of the document content where the query terms appear. This allows users to quickly assess the relevance of a result without reading the entire document.

Basic Highlighting Usage (highlight)

You enable highlighting by adding a highlight object to your search request, specifying which fields to highlight.

```json
POST /products/_search
{
  "query": {
    "match": {
      "description": "high-capacity power bank"
    }
  },
  "highlight": {
    "fields": {
      "description": {} // Request highlighting for the 'description' field
      // You can add more fields here: "name": {}
    }
  }
}
```

The search response will include a highlight section for each hit, containing HTML fragments (by default) with the matched terms wrapped in <em> tags.

```json
// Example snippet from response 'hits' array
{
  "_index": "products",
  "_id": "P12345",
  "_score": 1.234,
  "_source": { ... original document ... },
  "highlight": {
    "description": [
      "A <em>high</em>-<em>capacity</em> <em>power</em> <em>bank</em> with fast charging for all your devices."
    ]
  }
}
```

Configuring Highlighters

Elasticsearch offers several highlighter types (unified, plain, fvh – Fast Vector Highlighter) and numerous configuration options:

  • Tags: Change the HTML tags used for highlighting (pre_tags, post_tags).
    ```json
    "highlight": {
      "pre_tags": ["<mark>"],
      "post_tags": ["</mark>"],
      "fields": { "description": {} }
    }
    ```
  • Fragment Size and Number: Control the length (fragment_size) and maximum number (number_of_fragments) of snippets returned per field.
  • Require Field Match: (require_field_match: false) Allows highlighting on fields even if the query didn’t explicitly match that specific field (useful with multi_match across fields that are analyzed differently).
  • Order: Control the order of fragments (e.g., order: "score" to show the most relevant snippets first).

The fvh highlighter often provides better performance for large documents if term vectors are enabled in the mapping, while unified is a good general-purpose choice.
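The core tag-wrapping behavior is easy to imitate. Below is a toy, term-level sketch in plain Python that ignores analysis, fragmenting, and scoring — the real highlighters do all three:

```python
import re

# Toy highlighter: wrap whole-word matches of the query terms in
# pre/post tags, case-insensitively.
def highlight(text, terms, pre="<em>", post="</em>"):
    pattern = r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b"
    return re.sub(pattern, lambda m: pre + m.group(1) + post, text,
                  flags=re.IGNORECASE)

print(highlight("A high-capacity power bank.", ["power", "bank"],
                pre="<mark>", post="</mark>"))
# A high-capacity <mark>power</mark> <mark>bank</mark>.
```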


9. The Role of Text Analysis

Understanding text analysis is fundamental to using full-text search effectively in Elasticsearch. Analysis is the process Elasticsearch uses to convert raw text into a list of tokens (terms) suitable for inclusion in the inverted index.

What is Analysis?

When you index a document with a text field, or when you use a full-text query like match, the text undergoes an analysis process. This process breaks down the text, normalizes it, and prepares it for searching.

Analyzers, Tokenizers, and Token Filters

An analyzer orchestrates the analysis process. It typically consists of:

  1. Character Filters (Optional): Pre-process the raw text before tokenization. Examples: Stripping HTML tags (html_strip), replacing characters (mapping), pattern-based replacement (pattern_replace).
  2. Tokenizer: Splits the text stream into individual tokens (usually words). Examples:
    • standard: Grammar-based, good for most Western languages.
    • whitespace: Splits only on whitespace.
    • keyword: Treats the entire input string as a single token (no splitting).
    • pattern: Splits based on a regular expression pattern.
    • Language-specific tokenizers.
  3. Token Filters (Optional): Process the tokens generated by the tokenizer. Examples:
    • lowercase: Converts tokens to lowercase.
    • stop: Removes common stop words (“the”, “a”, “is”).
    • stemmer: Reduces words to their root form (e.g., “charging” -> “charg”). (porter_stem, snowball).
    • synonym: Adds synonyms for tokens.
    • ngram / edge_ngram: Creates n-grams for partial word matching / auto-complete.
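The three-stage pipeline above can be mimicked in plain Python. This is a rough imitation for intuition only — real analyzers are far more sophisticated:

```python
import re

# Minimal analyzer pipeline in the spirit of Elasticsearch analysis:
# a character filter, then a tokenizer, then token filters in order.
def analyze(text, stopwords=("a", "an", "the", "is")):
    # Character filter: strip anything resembling an HTML tag
    text = re.sub(r"<[^>]+>", " ", text)
    # Tokenizer: split on runs of letters/digits (standard-ish)
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    # Token filters: lowercase, then stop-word removal
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t not in stopwords]
    return tokens

print(analyze("<p>The Quick Brown Fox is Fast!</p>"))
# ['quick', 'brown', 'fox', 'fast']
```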

Standard Analyzer vs. Other Built-in Analyzers

Elasticsearch comes with several pre-configured analyzers:

  • standard Analyzer: (Default for text fields) Combines the grammar-based standard tokenizer with the lowercase token filter; a stop token filter is available but disabled by default. Good general-purpose choice.
  • simple Analyzer: Tokenizes on non-letter characters and lowercases.
  • whitespace Analyzer: Tokenizes on whitespace and does not lowercase.
  • keyword Analyzer: No-op analyzer. Outputs the entire input string as a single token. (Used internally by keyword fields).
  • english, french, etc.: Language-specific analyzers with appropriate stop words and stemming rules.

You can also define custom analyzers by combining different character filters, tokenizers, and token filters to precisely control how your text is processed.

How Analysis Affects match vs. term Queries

  • match Query: The query string provided to a match query is analyzed using the same analyzer configured for the target field. This ensures that the search terms are processed in the same way as the indexed text (e.g., lowercased, stemmed). This allows “Fast Charging” to match “fast charging” or even “faster charger” (if stemmed).
  • term Query: The value provided to a term query is NOT analyzed. It searches for the exact value as-is in the inverted index. This is why term queries are typically used on keyword fields (which are indexed as-is) or when you need to find the exact, post-analysis token from a text field (less common use case). A query term: { "description": "Charging" } would likely fail to match text analyzed by the standard analyzer because the indexed term would be charging (lowercase).
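The difference is easy to demonstrate with a toy inverted index whose only "analysis" is lowercasing and whitespace splitting (purely illustrative, not how Elasticsearch stores anything):

```python
# Tiny inverted index over lowercased tokens, showing why an unanalyzed
# term lookup for "Charging" misses while an analyzed match-style query hits.
docs = {1: "Fast Charging power bank", 2: "Slow trickle charger"}

inverted = {}
for doc_id, text in docs.items():
    for token in text.lower().split():   # standard-analyzer-ish: lowercase
        inverted.setdefault(token, set()).add(doc_id)

def term_query(value):                   # NOT analyzed: exact token lookup
    return inverted.get(value, set())

def match_query(text):                   # analyzed the same way as indexing
    hits = set()
    for token in text.lower().split():
        hits |= inverted.get(token, set())
    return hits

print(term_query("Charging"))   # set() -- 'Charging' was never indexed as-is
print(match_query("Charging"))  # {1} -- analyzed to 'charging' before lookup
```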

Testing Analyzers with the _analyze API

You can test how different analyzers process text using the _analyze API, without needing to index documents. This is extremely useful for debugging and understanding analysis.

```json
// Test the standard analyzer
POST /_analyze
{
  "analyzer": "standard",
  "text": "Elasticsearch Power Bank – Fast Charging!"
}
// Response shows the generated tokens: [elasticsearch, power, bank, fast, charging]

// Test a specific field's analyzer from an index
POST /products/_analyze
{
  "field": "name", // Use the analyzer configured for the 'name' field in 'products'
  "text": "Elasticsearch Power Bank"
}

// Test a custom analyzer definition
POST /_analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", { "type": "stop", "stopwords": ["a", "the"] }],
  "text": "The Quick Brown Fox"
}
// Response shows tokens: [quick, brown, fox]
```

Understanding analysis is key to crafting effective full-text search queries and ensuring your match queries behave as expected.


10. Putting It All Together: A More Complex Example

Let’s combine several concepts into a single, more realistic search request for our products index.

Goal: Find electronic products related to “portable charging”, priced between $20 and $75, added in the last 90 days, and currently in stock, prioritizing those with “USB-C” ports or high capacity. Return only the name, price, and description; highlight the matches in the description; sort by relevance and then by date added; and return the first 5 results.

```json
POST /products/_search
{
  "size": 5,
  "from": 0,
  "_source": ["name", "price", "description"], // Source filtering
  "query": {
    "bool": {
      "must": [ // Full-text query (scored)
        {
          "match": {
            "description": {
              "query": "portable charging",
              "operator": "and" // Both terms required
            }
          }
        }
      ],
      "filter": [ // Non-scoring filters for efficiency
        { "term": { "category": "Electronics" } },
        { "term": { "in_stock": true } },
        {
          "range": {
            "price": {
              "gte": 20,
              "lte": 75
            }
          }
        },
        {
          "range": {
            "date_added": {
              "gte": "now-90d/d" // Added in the last 90 days (start of day)
            }
          }
        }
      ],
      "should": [ // Boost score for desirable features
        { "term": { "features.ports": "USB-C" } },
        { "range": { "features.capacity_mah": { "gte": 15000 } } }
      ],
      "minimum_should_match": 0 // Should clauses are optional boosters
    }
  },
  "sort": [
    { "_score": "desc" }, // Primarily sort by relevance
    { "date_added": "desc" } // Then by date added (newest first)
  ],
  "highlight": {
    "pre_tags": ["<strong>"], // Use strong tags for highlighting
    "post_tags": ["</strong>"],
    "fields": {
      "description": { // Highlight matches in the description field
        "fragment_size": 150, // Max snippet length
        "number_of_fragments": 1 // Max number of snippets
      }
    }
  }
}
```

This single request demonstrates:

  • Combining full-text (must) and term-level (filter) criteria using bool.
  • Using should clauses for score boosting.
  • Filtering by category, boolean, price range, and date range.
  • Controlling pagination (size, from).
  • Custom sorting (_score, date_added).
  • Source filtering (_source).
  • Highlighting matches in specific fields with custom tags.

11. Conclusion and Next Steps

Elasticsearch offers an incredibly powerful and flexible platform for searching data. We’ve covered the foundational concepts, from understanding documents, indices, and the inverted index, to interacting with the Search API and mastering the core components of the Query DSL. You should now be comfortable with:

  • The difference between URI and Request Body search.
  • The crucial distinction between Query Context and Filter Context.
  • Using fundamental query types like match, term, range, and bool.
  • Combining queries logically with the bool query (must, filter, should, must_not).
  • Controlling results through sorting, pagination (from/size, search_after), and source filtering.
  • The basics of relevance scoring and highlighting.
  • The importance of text analysis and how it impacts search.

This comprehensive introduction provides a solid base, but the world of Elasticsearch search is vast. Potential next steps in your learning journey could include:

  • Aggregations: Performing complex analytics and summarizations alongside your searches (faceting, metrics, bucketing).
  • Advanced Query Types: Exploring nested queries (for arrays of objects), geo queries (for location-based search), fuzzy queries, more_like_this queries, and function score queries (for custom relevance tuning).
  • Mapping Deep Dive: Understanding dynamic mapping, explicit mapping options, multi-fields, index templates, and dynamic templates for fine-grained control over indexing.
  • Index Management: Learning about index aliases, rollover indices, shrinking/splitting indices, and managing cluster health.
  • Performance Tuning: Optimizing query performance, shard allocation, hardware provisioning, and caching strategies.
  • Client Libraries: Using official Elasticsearch clients for languages like Python, Java, JavaScript, Go, Ruby, .NET, etc., for easier application integration.

By building upon the fundamentals covered here and exploring these advanced topics, you can unlock the full potential of Elasticsearch to build sophisticated, fast, and relevant search experiences for any application. Happy searching!

