Introduction to Searching with Elasticsearch: A Comprehensive Guide
In today’s data-driven world, the ability to quickly and accurately find information within vast datasets is not just a convenience—it’s a necessity. Whether you’re building an e-commerce platform needing lightning-fast product searches, a logging system requiring rapid analysis of operational data, or a content management system demanding relevant article retrieval, the underlying search technology is paramount. This is where Elasticsearch shines.
Elasticsearch is a powerful, distributed, open-source search and analytics engine built on Apache Lucene. It’s designed for horizontal scalability, reliability, and easy management. At its core, Elasticsearch allows you to store, search, and analyze large volumes of data in near real-time. While it offers a wide array of features, its heart lies in its sophisticated searching capabilities.
This guide serves as a detailed introduction to searching within Elasticsearch. We’ll explore the fundamental concepts, dive into the essential Query DSL (Domain Specific Language), examine various query types, and learn how to refine and control search results. By the end, you’ll have a solid foundation for leveraging Elasticsearch’s search power in your own applications.
Table of Contents
- Understanding the Basics: Core Elasticsearch Concepts
  - Documents and Indices
  - Nodes and Clusters
  - Shards and Replicas
  - The Inverted Index: The Secret Sauce of Fast Search
  - Mapping: Defining Your Data Structure
- Interacting with Elasticsearch: The Search API
  - RESTful API Principles
  - URI Search: Simple but Limited
  - Request Body Search: Power and Flexibility
  - Basic Search Request Structure
- The Heart of the Search: Introduction to the Query DSL
  - What is the Query DSL?
  - Query Context vs. Filter Context
- Fundamental Query Types: Finding What You Need
  - Full-Text Queries (Query Context): Searching Analyzed Text
    - `match` Query: The Standard Full-Text Search
    - `match_phrase` Query: Searching for Exact Phrases
    - `match_phrase_prefix` Query: Phrase Matching with Prefix on Last Term
    - `multi_match` Query: Searching Across Multiple Fields
  - Term-Level Queries (Filter Context / Query Context): Searching Exact Values
    - `term` Query: Finding Exact Terms (Not Analyzed)
    - `terms` Query: Finding Multiple Exact Terms
    - `range` Query: Searching Within a Range
    - `exists` Query: Finding Documents Containing a Field
    - `prefix` Query: Matching Document Fields with a Specific Prefix
    - `wildcard` Query: Using Wildcards (Use with Caution)
    - `regexp` Query: Using Regular Expressions (Use with Caution)
    - `ids` Query: Retrieving Documents by ID
- Combining Queries: The `bool` Query
  - Structure of a `bool` Query
    - `must`: Clauses MUST Match (AND)
    - `should`: Clauses SHOULD Match (OR, influences score)
    - `must_not`: Clauses MUST NOT Match (NOT)
    - `filter`: Clauses MUST Match (AND, but in Filter Context)
  - Combining Clauses for Complex Logic
- Controlling Search Results
  - Sorting: Ordering Your Results (`sort`)
  - Pagination: Retrieving Results in Batches (`from`, `size`)
  - Deep Pagination and `search_after`: Efficient Scrolling
  - Source Filtering: Selecting Which Fields to Return (`_source`)
- Understanding Relevance and Scoring
  - What is Relevance (`_score`)?
  - Brief Overview of TF/IDF and BM25
  - Factors Influencing Score
  - Debugging Scores with the `explain` Parameter
- Highlighting Search Results
  - Why Use Highlighting?
  - Basic Highlighting Usage (`highlight`)
  - Configuring Highlighters
- The Role of Text Analysis
  - What is Analysis?
  - Analyzers, Tokenizers, and Token Filters
  - Standard Analyzer vs. Other Built-in Analyzers
  - How Analysis Affects `match` vs. `term` Queries
  - Testing Analyzers with the `_analyze` API
- Putting It All Together: A More Complex Example
- Conclusion and Next Steps
1. Understanding the Basics: Core Elasticsearch Concepts
Before diving into search queries, it’s essential to understand how Elasticsearch organizes and stores data.
Documents and Indices
- Document: The basic unit of information that can be indexed in Elasticsearch. It’s represented in JSON (JavaScript Object Notation) format. Think of a document as a row in a relational database table, but more flexible and hierarchical. Example: A document representing a user, a product, or a log entry.
- Index: A collection of documents that have somewhat similar characteristics. An index is the highest-level entity you can query against. It’s analogous to a database in a relational system. For example, you might have an index for `products`, `users`, or `logs-2023-10`. Index names must be lowercase.
```json
// Example Document for a 'products' index
{
  "product_id": "P12345",
  "name": "Elasticsearch Power Bank",
  "description": "A high-capacity power bank with fast charging for all your devices.",
  "price": 49.99,
  "category": "Electronics",
  "tags": ["power bank", "charger", "portable", "electronics"],
  "in_stock": true,
  "date_added": "2023-10-26T10:00:00Z",
  "features": {
    "capacity_mah": 20000,
    "ports": ["USB-A", "USB-C"],
    "weight_grams": 350
  }
}
```
Nodes and Clusters
- Node: A single running instance of Elasticsearch. It participates in the cluster’s indexing and search capabilities.
- Cluster: A collection of one or more nodes that work together, sharing their data and workload. A cluster provides high availability and scalability. Nodes communicate with each other to maintain a consistent state.
Shards and Replicas
To handle large amounts of data and provide fault tolerance, Elasticsearch divides indices into smaller pieces called shards.
- Shard: Each shard is a fully functional and independent index (a Lucene index). When you index a document, Elasticsearch routes it to a specific primary shard based on a routing algorithm (often based on the document ID). When you search, Elasticsearch queries all relevant shards in parallel and combines the results. This distribution allows for horizontal scaling.
- Primary Shard: The main shard where indexing operations first occur. The number of primary shards for an index is fixed when the index is created.
- Replica Shard: A copy of a primary shard. Replicas serve two main purposes:
- High Availability: If a node holding a primary shard fails, a replica shard on another node can be promoted to become the primary.
- Increased Search Throughput: Search requests can be handled by either primary or replica shards, distributing the search load.
The number of replicas can be changed dynamically.
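As a sketch of how these settings are applied (index name and values here are illustrative): the primary shard count is set once at index creation, while the replica count can be updated on a live index.

```json
// Fix the primary shard count at creation time
PUT /products
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}

// Later, raise the replica count dynamically
PUT /products/_settings
{
  "index": { "number_of_replicas": 2 }
}
```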
The Inverted Index: The Secret Sauce of Fast Search
The reason Elasticsearch (and underlying Lucene) can perform fast full-text searches is primarily due to a data structure called the inverted index.
Instead of listing documents and the words they contain (like a traditional database), an inverted index lists unique terms (words) that appear in any document within the index and identifies all the documents each term appears in.
Consider these simple documents:
```json
{ "text": "The quick brown fox" }  // Doc 1
{ "text": "The lazy brown dog" }   // Doc 2
{ "text": "The quick agile fox" }  // Doc 3
```
A simplified inverted index for the `text` field might look like this:
| Term | Document IDs | Frequency | Positions (in Doc) |
|---|---|---|---|
| agile | [3] | 1 | [2] |
| brown | [1, 2] | 2 | [2], [2] |
| dog | [2] | 1 | [3] |
| fox | [1, 3] | 2 | [3], [3] |
| lazy | [2] | 1 | [1] |
| quick | [1, 3] | 2 | [1], [1] |
| the | [1, 2, 3] | 3 | [0], [0], [0] |
(Note: Stop words like “the” might be removed by analysis, and actual indices store more info like term positions for phrase queries)
When you search for “quick fox”, Elasticsearch:
- Looks up “quick” in the inverted index -> finds Docs [1, 3].
- Looks up “fox” in the inverted index -> finds Docs [1, 3].
- Combines these lists (e.g., finds the intersection for an AND query) -> identifies Docs [1, 3] as potential matches.
- Calculates a relevance score for each matching document based on factors like term frequency (how often the term appears in the document) and inverse document frequency (how rare the term is across all documents).
This structure makes searching incredibly fast, as it avoids scanning every document for the search terms.
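The lookup-and-intersect steps above can be sketched in a few lines of Python. This is a toy model (whitespace tokenization, lowercasing, no stop-word removal), not what Lucene actually does internally, but it shows why term lookup avoids scanning every document:

```python
from collections import defaultdict

docs = {
    1: "The quick brown fox",
    2: "The lazy brown dog",
    3: "The quick agile fox",
}

# Build the inverted index: term -> list of (doc_id, position) postings.
index = defaultdict(list)
for doc_id, text in docs.items():
    for pos, token in enumerate(text.lower().split()):
        index[token].append((doc_id, pos))

def search_and(*terms):
    """Intersect the posting lists for an AND query."""
    result = None
    for term in terms:
        doc_ids = {doc_id for doc_id, _ in index.get(term, [])}
        result = doc_ids if result is None else result & doc_ids
    return sorted(result or [])

print(search_and("quick", "fox"))  # -> [1, 3]
```

A real engine would then score each surviving document (TF/IDF or BM25) instead of returning a bare ID list, and would use the stored positions to answer phrase queries.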
Mapping: Defining Your Data Structure
While Elasticsearch can often infer data types (dynamic mapping), explicitly defining the structure and data types of your fields using a mapping is crucial for effective searching. Mapping tells Elasticsearch how to treat each field in your documents – whether it’s text to be analyzed for full-text search, a keyword for exact matching, a number, a date, a boolean, etc.
- Data Types: `text`, `keyword`, `integer`, `long`, `float`, `double`, `boolean`, `date`, `object`, `nested`, `geo_point`, etc.
- Analyzers: For `text` fields, you specify which analyzer should be used to process the text during indexing and searching (more on this later).

Defining mappings ensures data consistency and enables precise querying. For example, mapping a field as `keyword` ensures it’s treated as a single, exact value, suitable for filtering and aggregations, while mapping it as `text` enables full-text search capabilities.
```json
// Example Mapping Definition for the 'products' index
PUT /products
{
  "mappings": {
    "properties": {
      "product_id": { "type": "keyword" }, // Exact match ID
      "name": {
        "type": "text", // Full-text search on name
        "fields": {
          "keyword": { // Also index as keyword for sorting/aggregation
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "description": { "type": "text" }, // Full-text search on description
      "price": { "type": "float" }, // Numeric type for range queries
      "category": { "type": "keyword" }, // Exact match category
      "tags": { "type": "keyword" }, // Array of exact match tags
      "in_stock": { "type": "boolean" }, // Boolean type
      "date_added": { "type": "date" }, // Date type for time-based queries
      "features": { // Object field with sub-properties
        "properties": {
          "capacity_mah": { "type": "integer" },
          "ports": { "type": "keyword" },
          "weight_grams": { "type": "integer" }
        }
      }
    }
  }
}
```
2. Interacting with Elasticsearch: The Search API
Elasticsearch provides a comprehensive and easy-to-use RESTful API for interacting with your cluster. Searching is primarily done via HTTP GET or POST requests to the `_search` endpoint.
RESTful API Principles
Interactions typically involve:
- HTTP Verbs: `GET` (retrieve data), `POST` (send data, often used for complex searches), `PUT` (create or update data), `DELETE` (remove data).
- Endpoints: URLs identifying resources (e.g., `/{index}/_search`, `/{index}/_doc/{id}`).
- Request Body: Often JSON payloads containing instructions or data (especially for `POST` and `PUT`).
- Response Body: JSON payloads containing results or status information.
URI Search: Simple but Limited
The simplest way to search is by passing query parameters directly in the URL. This is known as URI Search. It’s useful for quick tests or simple queries.
The primary parameter is `q`, which uses the Lucene query string syntax.
```bash
# Search for documents containing "power bank" in any field of the 'products' index
GET /products/_search?q=power%20bank

# Search for products where the category is "Electronics"
GET /products/_search?q=category:Electronics

# More complex query: name contains "elastic" AND price is greater than 30
GET /products/_search?q=name:elastic%20AND%20price:>30
```
Limitations of URI Search:
- Can become complex and hard to read/debug.
- Limited expressiveness compared to the Request Body Search.
- URL length limits can be an issue for very complex queries.
- Less control over specific query types and options.
While useful for exploration, most applications rely on the Request Body Search.
Request Body Search: Power and Flexibility
The most common and powerful way to search is by sending a JSON payload in the request body of a `GET` or `POST` request to the `_search` endpoint. This allows you to use the full Query DSL.
```bash
# Using GET with a request body
GET /products/_search
{
  "query": {
    "match": {
      "description": "fast charging"
    }
  }
}

# Using POST (often preferred for complex bodies)
POST /products/_search
{
  "query": {
    "match": {
      "description": "fast charging"
    }
  }
}
```
Basic Search Request Structure
A typical Request Body search request has the following structure:
```json
{
  "query": {
    // Query definition goes here (using Query DSL)
  },
  "from": 0, // Starting document offset (for pagination)
  "size": 10, // Number of documents to return (for pagination)
  "sort": [ // How to sort results
    // Sorting criteria go here
  ],
  "_source": [ // Which fields to include/exclude from the original document
    // Field patterns go here
  ],
  "highlight": { // How to highlight matching terms in results
    // Highlighting configuration goes here
  },
  "aggs": { // Aggregations (for analytics/summarization)
    // Aggregation definitions go here
  }
  // ... other top-level parameters like timeout, explain, etc.
}
```
The most critical part is the `query` object, which contains the Query DSL specifying the search criteria.
3. The Heart of the Search: Introduction to the Query DSL
The Elasticsearch Query DSL (Domain Specific Language) is a flexible, expressive JSON-based language used to define queries. It provides a rich set of query types that can be combined in intricate ways to find exactly the data you need.
What is the Query DSL?
Instead of writing queries as plain text strings (like SQL or Lucene query syntax), you build queries by composing JSON objects. Each object represents a specific type of query clause (e.g., `match`, `term`, `range`, `bool`).
Benefits of Query DSL:
- Expressiveness: Supports a wide range of query types, from simple term matching to complex geospatial queries.
- Readability: Well-structured JSON can be easier to read and maintain than complex string queries.
- Composability: Queries can be nested and combined logically (e.g., using the `bool` query).
- Control: Provides fine-grained control over the querying process.
Query Context vs. Filter Context
This is a crucial concept in Elasticsearch querying. When you write a query clause, it executes in one of two contexts:
- Query Context:
  - Asks the question: “How well does this document match the query criteria?”
  - Clauses running in query context contribute to the relevance score (`_score`), determining the order of results. Higher scores mean better matches.
  - Examples: `match`, `multi_match`. Generally used for full-text search where relevance matters.
- Filter Context:
  - Asks the question: “Does this document match the query criteria?” (Yes/No).
  - Clauses running in filter context do not calculate a score. They simply include or exclude documents.
  - Performance: Filter context is generally faster than query context because scoring is skipped.
  - Caching: Filter results are often cached by Elasticsearch, leading to significant performance improvements for repeated queries.
  - Examples: `term`, `terms`, `range`, `exists`. Typically used for exact value matching, range filtering, or boolean checks.
When to use which?
- Use Query Context when you need full-text search capabilities or when the relevance score is important (e.g., searching for “best power bank” in product descriptions).
- Use Filter Context when you need exact matching or filtering based on structured data (e.g., finding products with `category: "Electronics"`, `price` between 30 and 50, or `in_stock: true`). Filters are ideal for faceting/aggregation scenarios.
The `bool` query (discussed later) provides a powerful way to combine clauses in both query and filter contexts within a single search request.
4. Fundamental Query Types: Finding What You Need
The Query DSL offers numerous query types. We’ll cover the most common and fundamental ones, categorized broadly into Full-Text and Term-Level queries.
(Assume we are querying the `products` index created earlier for the examples below.)
Full-Text Queries (Query Context): Searching Analyzed Text
These queries are typically used for searching natural language text in fields mapped as `text`. They work by first analyzing the query string (using the same analyzer applied to the field during indexing) and then finding documents that match the resulting terms. They operate in Query Context and contribute to the `_score`.
a) `match` Query: The Standard Full-Text Search
The `match` query is your go-to for standard full-text search. It analyzes the provided text and looks for the analyzed terms in the specified field. By default, it uses an `OR` operator between the terms, but this can be changed.
```json
POST /products/_search
{
  "query": {
    "match": {
      "description": "high capacity charging"
    }
  }
}
```
- How it works: The text “high capacity charging” is analyzed (e.g., tokenized into `high`, `capacity`, `charging`). Elasticsearch finds documents whose `description` field contains any of these terms. Documents containing more terms or more frequent terms will generally score higher.
- Operators: You can change the default `OR` behavior to `AND`:

```json
{
  "query": {
    "match": {
      "description": {
        "query": "high capacity charging",
        "operator": "and" // All terms must be present
      }
    }
  }
}
```

- Fuzziness: Supports finding terms that are “close” (within a certain edit distance) using the `fuzziness` parameter (e.g., `fuzziness: "AUTO"` or `fuzziness: 2`). Useful for handling typos.
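For instance, a sketch of a fuzzy match that would tolerate the missing letter in “chargin” (the query value here is illustrative):

```json
POST /products/_search
{
  "query": {
    "match": {
      "description": {
        "query": "chargin",
        "fuzziness": "AUTO" // Tolerates small edit distances, e.g. matching "charging"
      }
    }
  }
}
```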
b) `match_phrase` Query: Searching for Exact Phrases
This query analyzes the text and looks for the analyzed terms appearing in the exact specified order, potentially with a configurable `slop` (maximum number of positions the terms can be apart).
```json
POST /products/_search
{
  "query": {
    "match_phrase": {
      "description": "fast charging" // Looks for "fast" immediately followed by "charging"
    }
  }
}
```
- Slop: Allows for some flexibility. The `slop` value is the number of position moves permitted when matching the phrase; transposing two adjacent terms counts as two moves.

```json
{
  "query": {
    "match_phrase": {
      "description": {
        "query": "charging fast", // Terms are reversed
        "slop": 2 // Transposed terms need a slop of at least 2 to match "fast charging"
      }
    }
  }
}
```
c) `match_phrase_prefix` Query: Phrase Matching with Prefix on Last Term
Similar to `match_phrase`, but performs a prefix match on the last term in the phrase. Useful for auto-complete “search-as-you-type” scenarios.
```json
POST /products/_search
{
  "query": {
    "match_phrase_prefix": {
      "name": "Elasticsearch Pow" // Matches "Elasticsearch Power Bank"
    }
  }
}
// Can also control max_expansions for the prefix part
```
d) `multi_match` Query: Searching Across Multiple Fields
Allows searching for the same text across multiple fields. Useful when you don’t know exactly where the information resides or want to boost relevance based on matches in certain fields.
```json
POST /products/_search
{
  "query": {
    "multi_match": {
      "query": "portable charger",
      "fields": ["name", "description", "tags"] // Search these fields
      // Optional: "boost" certain fields: "fields": ["name^3", "description", "tags"]
    }
  }
}
```
- Types: `multi_match` has different execution types (`best_fields`, `most_fields`, `cross_fields`, `phrase`, `phrase_prefix`) controlling how scores from different fields are combined. `best_fields` (the default) is often suitable: it finds the best matching field and uses its score. `most_fields` is useful when matching across fields containing different analyses of the same text.
Term-Level Queries (Filter Context / Query Context): Searching Exact Values
These queries operate on exact, non-analyzed terms. They are typically used on fields mapped as `keyword`, numbers, dates, or booleans. They are often used within a Filter Context for precise filtering (as they don’t usually require scoring), but can also be used in a Query Context if needed.
a) `term` Query: Finding Exact Terms (Not Analyzed)
Finds documents containing the exact term specified in the inverted index for a given field. Important: The value provided is NOT analyzed. This is key for `keyword` fields or when you need precise matching.
```json
POST /products/_search
{
  "query": {
    // Usually used inside a 'bool' filter for performance
    "bool": {
      "filter": [
        { "term": { "category": "Electronics" } }, // Match exact category
        { "term": { "in_stock": true } } // Match exact boolean value
      ]
    }
  }
}
```
- Case Sensitivity: `term` queries are case-sensitive for `keyword` fields unless a `normalizer` is defined in the mapping. If used on a `text` field, it looks for the exact term after analysis (e.g., `term: { "description": "charging" }` would match the analyzed lowercase term). Using `term` on `text` fields is often less useful than `match`.
b) `terms` Query: Finding Multiple Exact Terms
Similar to `term`, but allows matching any term from a provided list (logical OR).
```json
POST /products/_search
{
  "query": {
    "bool": {
      "filter": {
        "terms": {
          "tags": ["portable", "charger"] // Match if tags include EITHER "portable" OR "charger"
        }
      }
    }
  }
}
```
c) `range` Query: Searching Within a Range
Finds documents where a field’s value falls within a specified range. Works on numbers, dates, and even string ranges.
Operators:
- `gt`: Greater than
- `gte`: Greater than or equal to
- `lt`: Less than
- `lte`: Less than or equal to
```json
POST /products/_search
{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "price": {
            "gte": 30, // Price >= 30
            "lt": 50   // Price < 50
          }
        }
      }
    }
  }
}

// Range query on dates (using date math or absolute dates)
POST /products/_search
{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "date_added": {
            "gte": "now-7d/d", // From start of day, 7 days ago
            "lte": "now/d"     // To start of today
          }
        }
      }
    }
  }
}
```
d) `exists` Query: Finding Documents Containing a Field
Finds documents that have any non-null value for a specific field. Useful for finding documents where a field is present (or missing, when used inside `must_not`).
```json
POST /products/_search
{
  "query": {
    "bool": {
      "filter": {
        "exists": {
          "field": "features.capacity_mah" // Find products where capacity is specified
        }
      }
    }
  }
}
```
e) `prefix` Query: Matching Document Fields with a Specific Prefix
Finds documents where a field starts with the specified prefix. Like `term`, it operates on non-analyzed terms. Can be slow on `text` fields; more efficient on `keyword` fields.
```json
POST /products/_search
{
  "query": {
    "prefix": {
      "product_id": "P123" // Find products whose ID starts with "P123"
    }
  }
}
// Use with caution on highly variable text fields due to performance implications.
```
f) `wildcard` Query: Using Wildcards (Use with Caution)
Allows using wildcard characters (`*` for multiple characters, `?` for a single character) in the term. Warning: Wildcard queries, especially those with leading wildcards (`*term`), can be very slow and resource-intensive, as they may need to scan many terms in the index. Avoid them if possible, or ensure prefixes are long enough.
```json
POST /products/_search
{
  "query": {
    "wildcard": {
      "name.keyword": "Elas*arch" // Match name.keyword like "Elas*arch"
    }
  }
}
// Use on keyword fields where possible. Avoid leading wildcards.
```
g) `regexp` Query: Using Regular Expressions (Use with Caution)
Allows matching based on regular expression patterns. Warning: Like wildcard queries, regexp queries can be very slow and resource-intensive. Use sparingly and carefully.
```json
POST /products/_search
{
  "query": {
    "regexp": {
      "product_id": "P[0-9]{4}" // Match product IDs like P followed by 4 digits
    }
  }
}
```
h) `ids` Query: Retrieving Documents by ID
A specialized query to retrieve documents based on their specific `_id` values. Very fast.
```json
POST /products/_search
{
  "query": {
    "ids": {
      "values": ["P12345", "P67890"] // Retrieve these specific documents
    }
  }
}
```
5. Combining Queries: The `bool` Query
Real-world search requirements often involve combining multiple criteria. The `bool` (boolean) query is the primary tool for this. It allows you to combine other query clauses using boolean logic (`AND`, `OR`, `NOT`).
Structure of a `bool` Query
A `bool` query has four main clause types:
```json
{
  "query": {
    "bool": {
      "must": [
        // Clauses here MUST match (like AND). Contribute to score.
      ],
      "filter": [
        // Clauses here MUST match (like AND). Execute in Filter Context (no score, cached).
      ],
      "should": [
        // Clauses here SHOULD match (like OR). Contribute to score. If used alone, at least one MUST match.
        // If used with must/filter, they act as score boosters.
      ],
      "must_not": [
        // Clauses here MUST NOT match (like NOT). Execute in Filter Context.
      ]
    }
  }
}
```
`must`: Clauses MUST Match (AND)
All query clauses within the `must` array must match for a document to be considered a hit. These clauses run in Query Context and contribute to the `_score`.
```json
{
  "query": {
    "bool": {
      "must": [
        { "match": { "description": "power bank" } },
        { "match": { "description": "fast charging" } }
        // Both 'match' queries must be satisfied
      ]
    }
  }
}
```
`should`: Clauses SHOULD Match (OR, influences score)
Clauses within the `should` array are optional matches. If a `bool` query contains only `should` clauses (no `must` or `filter`), then at least one `should` clause must match. If `must` or `filter` clauses are present, `should` clauses act primarily as score boosters – documents matching them receive a higher `_score`.
```json
// Scenario 1: Only 'should' – acts like OR
{
  "query": {
    "bool": {
      "should": [
        { "term": { "category": "Electronics" } },
        { "term": { "tags": "portable" } }
        // Match if category is Electronics OR tag is portable
      ],
      "minimum_should_match": 1 // Default is 1 when only 'should' is present
    }
  }
}

// Scenario 2: 'must' and 'should' – 'should' boosts the score
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "power bank" } } // Must be a power bank
      ],
      "should": [
        { "term": { "features.ports": "USB-C" } }, // Higher score if it has USB-C
        { "range": { "features.capacity_mah": { "gte": 20000 } } } // Higher score if capacity >= 20000
      ]
    }
  }
}
```

The `minimum_should_match` parameter controls how many `should` clauses must match.
`must_not`: Clauses MUST NOT Match (NOT)
All clauses within the `must_not` array must not match for a document to be considered a hit. These clauses run in Filter Context (they don’t affect the score, just exclusion).
```json
{
  "query": {
    "bool": {
      "must": [
        { "match": { "description": "charger" } }
      ],
      "must_not": [
        { "term": { "category": "Clearance" } } // Must be a charger, but not in Clearance category
      ]
    }
  }
}
```
`filter`: Clauses MUST Match (AND, but in Filter Context)
Clauses within the `filter` array must match, just like `must`. However, they execute in Filter Context. This means they don’t contribute to the score and are candidates for caching. This is the preferred place for term-level queries (`term`, `range`, `exists`, etc.) used for filtering.
```json
{
  "query": {
    "bool": {
      "must": [
        // Use 'must' for full-text search where score matters
        { "match": { "description": "high capacity" } }
      ],
      "filter": [
        // Use 'filter' for exact matching / range filtering
        { "term": { "in_stock": true } },
        { "range": { "price": { "lte": 50 } } }
      ]
    }
  }
}
```
Performance Tip: Whenever possible, use `filter` instead of `must` for non-scoring, exact-match criteria.
Combining Clauses for Complex Logic
The real power comes from nesting `bool` queries and combining different clause types to represent complex business logic.
```json
// Find products that:
// - MUST mention "power bank" in name or description (full-text, scored)
// - AND MUST be in stock (filter)
// - AND MUST have a price <= 100 (filter)
// - AND MUST NOT be in the 'Obsolete' category (filter)
// - AND SHOULD ideally have a "USB-C" port OR capacity >= 20000 (boost score)
POST /products/_search
{
  "query": {
    "bool": {
      "must": [ // Scored query
        {
          "multi_match": {
            "query": "power bank",
            "fields": ["name", "description"]
          }
        }
      ],
      "filter": [ // Non-scoring filters
        { "term": { "in_stock": true } },
        { "range": { "price": { "lte": 100 } } }
      ],
      "must_not": [ // Exclusion filter
        { "term": { "category": "Obsolete" } }
      ],
      "should": [ // Score boosters
        { "term": { "features.ports": "USB-C" } },
        { "range": { "features.capacity_mah": { "gte": 20000 } } }
      ],
      "minimum_should_match": 0 // 'should' clauses are purely optional boosters here
    }
  }
}
```
6. Controlling Search Results
Finding matching documents is only part of the story. You also need to control how the results are presented.
Sorting: Ordering Your Results (`sort`)
By default, Elasticsearch sorts results by relevance score (`_score`) in descending order (most relevant first). You can change this using the `sort` parameter, specifying one or more fields to sort by.
```json
POST /products/_search
{
  "query": {
    "match": { "category": "Electronics" }
  },
  "sort": [
    { "price": "asc" },        // Sort primarily by price ascending
    { "_score": "desc" },      // Then by relevance score descending (for ties in price)
    { "name.keyword": "asc" }  // Then by name ascending (using the keyword version)
  ]
}
```
- Use the `.keyword` sub-field for sorting on analyzed `text` fields if you want to sort alphabetically on the original, non-analyzed value.
- Sorting by fields other than `_score` can impact performance, as it requires accessing field data.
- When sorting by fields other than `_score`, the `_score` is not calculated by default (as it’s not needed). You can force score calculation by setting `track_scores: true`.
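A sketch of forcing score calculation while sorting by a field:

```json
POST /products/_search
{
  "query": { "match": { "description": "charger" } },
  "sort": [ { "price": "asc" } ],
  "track_scores": true // Compute _score even though results are ordered by price
}
```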
Pagination: Retrieving Results in Batches (`from`, `size`)
Elasticsearch returns the top 10 hits by default (`size: 10`). You can retrieve results in pages using:
- `size`: The number of hits to return per page.
- `from`: The starting offset from which to return hits; `from` is 0-based.
```json
// Get the first page (hits 0-9)
POST /products/_search
{
  "query": { "match_all": {} }, // Match all documents
  "size": 10,
  "from": 0
}

// Get the second page (hits 10-19)
POST /products/_search
{
  "query": { "match_all": {} },
  "size": 10,
  "from": 10
}

// Get the third page (hits 20-29)
POST /products/_search
{
  "query": { "match_all": {} },
  "size": 10,
  "from": 20
}
```
Warning: Deep Pagination
Using large `from` values (e.g., retrieving page 1000 with `size: 10`, meaning `from: 9990`) can be very inefficient. Elasticsearch needs to fetch `from + size` results from each relevant shard, sort them all on the coordinating node, and then discard the first `from` results. This consumes significant memory and CPU, especially in a distributed environment. Avoid deep pagination with `from`/`size` if possible (typically beyond the first few pages). The default limit for `from + size` is 10,000 (`index.max_result_window`), though this can be changed.
Deep Pagination and `search_after`: Efficient Scrolling
For deep scrolling scenarios (like infinite scroll or processing all results of a query), `search_after` is the recommended approach. It avoids the overhead of `from`/`size` by using the sort values of the last hit from the previous page to fetch the next page.
How it works:
1. Perform an initial search with a `sort` clause (it must include a unique tie-breaker such as `_id` or another unique field).
2. The response contains the hits for the first page, and each hit has a `sort` array with the values it was sorted by.
3. To get the next page, repeat the same query and the same sort, but add the `search_after` parameter, passing the `sort` array values from the last hit of the previous page.
```json
// 1. Initial search (get page 1)
POST /products/_search
{
  "size": 10,
  "query": { "match": { "category": "Electronics" } },
  "sort": [
    { "price": "asc" },
    { "product_id": "asc" } // Use a unique field like product_id as a tie-breaker
  ]
}

// Assume the last hit on page 1 had price: 49.99 and product_id: "P12345".
// Its 'sort' array in the response would be: [49.99, "P12345"]

// 2. Get page 2 using search_after
POST /products/_search
{
  "size": 10,
  "query": { "match": { "category": "Electronics" } },
  "sort": [
    { "price": "asc" },
    { "product_id": "asc" }
  ],
  "search_after": [49.99, "P12345"] // Pass the sort values from the last hit of page 1
}

// Repeat for subsequent pages, using the sort values from the last hit of the current page
```
`search_after` is stateless and much more efficient for deep scrolling than `from`/`size`.
Source Filtering: Selecting Which Fields to Return (`_source`)
By default, Elasticsearch returns the entire JSON `_source` document for each hit. To reduce network traffic and client-side processing, you can specify which fields to include or exclude using the `_source` parameter.
```json
POST /products/_search
{
  "query": { "match_all": {} },
  "_source": false // Don't return the _source at all
}

POST /products/_search
{
  "query": { "match_all": {} },
  "_source": ["product_id", "name", "price"] // Only return these fields
}

POST /products/_search
{
  "query": { "match_all": {} },
  "_source": { // Use include/exclude patterns
    "includes": ["product*", "name"],
    "excludes": ["*.description"]
  }
}
```
Source filtering is highly recommended, especially when dealing with large documents, to retrieve only the necessary data.
7. Understanding Relevance and Scoring
When performing full-text searches (like `match`), Elasticsearch assigns a relevance score (`_score`) to each matching document. This score determines the default sort order. Understanding how this score is calculated helps you interpret results and fine-tune queries.
What is Relevance (`_score`)?
The `_score` is a floating-point number representing how relevant a document is to a given query. A higher score indicates higher relevance. The exact value is relative to the query and the documents in the index; it's not an absolute measure.
Brief Overview of TF/IDF and BM25
Elasticsearch uses sophisticated algorithms to calculate relevance. Historically, TF/IDF (Term Frequency/Inverse Document Frequency) was common. Modern versions of Elasticsearch (since 5.0) default to Okapi BM25, which is generally considered superior.
Both algorithms are based on similar core concepts:
- Term Frequency (TF): How often does the search term appear in the field within this specific document? More frequent occurrences generally mean higher relevance.
- Inverse Document Frequency (IDF): How rare is the search term across all documents in the index? Common terms (like “the”, “a”) have low IDF scores, while rare terms have high IDF scores. Matches on rarer terms contribute more significantly to relevance.
- Field Length Norm: How long is the field? Matches in shorter fields are typically considered more relevant than matches in very long fields (the term constitutes a larger portion of the field). BM25 handles this more effectively than classic TF/IDF, preventing scores from being overly penalized for longer documents (saturation).
BM25 also includes parameters (k1 and b) that can be tuned to adjust how TF and field length influence the score, although the defaults are usually effective.
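To make the interplay of TF, IDF, and field length concrete, here is an illustrative Python implementation of the per-term BM25 formula (the Lucene-style variant, with the default `k1 = 1.2` and `b = 0.75`). This is a teaching sketch, not Elasticsearch's actual code path, which also folds in query-time boosts:

```python
import math

def bm25_term_score(tf, df, num_docs, field_len, avg_field_len, k1=1.2, b=0.75):
    """Score contribution of one query term for one document (Lucene-style BM25).

    tf: term frequency in this document's field
    df: number of documents containing the term
    num_docs: total number of documents
    field_len: this field's length in tokens; avg_field_len: average over all docs
    """
    # IDF: rarer terms (small df) contribute more
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    # TF with saturation and field-length normalization
    norm = k1 * (1 - b + b * field_len / avg_field_len)
    return idf * (tf * (k1 + 1)) / (tf + norm)

# Rarer terms score higher (IDF)
rare = bm25_term_score(tf=2, df=5, num_docs=1000, field_len=50, avg_field_len=50)
common = bm25_term_score(tf=2, df=500, num_docs=1000, field_len=50, avg_field_len=50)
assert rare > common

# Matches in shorter fields score higher (field length norm)
short = bm25_term_score(tf=2, df=5, num_docs=1000, field_len=20, avg_field_len=50)
long_ = bm25_term_score(tf=2, df=5, num_docs=1000, field_len=200, avg_field_len=50)
assert short > long_

# Term frequency saturates: each extra occurrence adds less than the last
s1, s2, s3 = (bm25_term_score(tf=t, df=5, num_docs=1000, field_len=50,
                              avg_field_len=50) for t in (1, 2, 3))
assert (s3 - s2) < (s2 - s1)
```

The last assertion illustrates the saturation behavior mentioned above, which is the key practical difference from classic TF/IDF.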
Factors Influencing Score
- Matching terms from the query (`match`, `multi_match`, etc.).
- Frequency of terms within the document (TF).
- Rarity of terms across the index (IDF).
- Length of the field being searched.
- Query type (e.g., `match_phrase` scores exact phrases higher).
- Boosting applied at query time (e.g., boosting fields in `multi_match`, boosting clauses in `bool`/`should`).
- Boosting applied at index time (less common now).
Debugging Scores with the `explain` Parameter
If you need to understand why a specific document received a certain score, you can add `"explain": true` to your search request. The response will include a detailed breakdown of the score calculation for each hit, showing the contribution of TF, IDF, field norms, and boosts. This can be verbose but is invaluable for debugging relevance issues.
```json
POST /products/_search
{
  "explain": true, // Add this parameter
  "query": {
    "match": {
      "description": "fast charging"
    }
  }
}
```
8. Highlighting Search Results
When displaying search results to users, it’s often helpful to show why a document matched the query by highlighting the search terms within the relevant fields. Elasticsearch provides built-in highlighting capabilities.
Why Use Highlighting?
Highlighting provides context to the user, showing snippets of the document content where the query terms appear. This allows users to quickly assess the relevance of a result without reading the entire document.
Basic Highlighting Usage (`highlight`)
You enable highlighting by adding a `highlight` object to your search request, specifying which fields to highlight.
```json
POST /products/_search
{
  "query": {
    "match": {
      "description": "high-capacity power bank"
    }
  },
  "highlight": {
    "fields": {
      "description": {} // Request highlighting for the 'description' field
      // You can add more fields here: "name": {}
    }
  }
}
```
The search response will include a `highlight` section for each hit, containing HTML fragments (by default) with the matched terms wrapped in `<em>` tags.
```json
// Example snippet from response 'hits' array
{
  "_index": "products",
  "_id": "P12345",
  "_score": 1.234,
  "_source": { ... original document ... },
  "highlight": {
    "description": [
      "A <em>high</em>-<em>capacity</em> <em>power</em> <em>bank</em> with fast charging for all your devices."
    ]
  }
}
```
Configuring Highlighters
Elasticsearch offers several highlighter types (`unified`, `plain`, and `fvh`, the Fast Vector Highlighter) and numerous configuration options:
- Tags: Change the HTML tags used for highlighting (`pre_tags`, `post_tags`):

```json
"highlight": {
  "pre_tags": ["<mark>"],
  "post_tags": ["</mark>"],
  "fields": { "description": {} }
}
```

- Fragment Size and Number: Control the length (`fragment_size`) and maximum number (`number_of_fragments`) of snippets returned per field.
- Require Field Match: Setting `require_field_match: false` allows highlighting on fields even if the query didn't explicitly match that specific field (useful with `multi_match` across fields with different analyzers).
- Order: Control the order of fragments (e.g., `order: "score"` to show the most relevant snippets first).
The `fvh` highlighter often provides better performance for large documents if term vectors are enabled in the mapping, while `unified` is a good general-purpose choice.
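These options can all be combined in one request. The example below is an illustrative sketch against our hypothetical `products` index: it picks the `unified` highlighter for `description`, uses custom `<mark>` tags, returns up to three 100-character fragments ordered by score, and sets `require_field_match: false` so highlights appear even for fields the query did not directly target.

```json
POST /products/_search
{
  "query": {
    "multi_match": {
      "query": "power bank",
      "fields": ["name", "description"]
    }
  },
  "highlight": {
    "pre_tags": ["<mark>"],
    "post_tags": ["</mark>"],
    "require_field_match": false,
    "fields": {
      "description": {
        "type": "unified",        // explicitly pick the highlighter
        "fragment_size": 100,     // max characters per snippet
        "number_of_fragments": 3, // up to three snippets
        "order": "score"          // most relevant snippet first
      }
    }
  }
}
```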
9. The Role of Text Analysis
Understanding text analysis is fundamental to using full-text search effectively in Elasticsearch. Analysis is the process Elasticsearch uses to convert raw text into a list of tokens (terms) suitable for inclusion in the inverted index.
What is Analysis?
When you index a document with a `text` field, or when you use a full-text query like `match`, the text undergoes an analysis process. This process breaks down the text, normalizes it, and prepares it for searching.
Analyzers, Tokenizers, and Token Filters
An analyzer orchestrates the analysis process. It typically consists of:
- Character Filters (Optional): Pre-process the raw text before tokenization. Examples: stripping HTML tags (`html_strip`), replacing characters (`mapping`), pattern-based replacement (`pattern_replace`).
- Tokenizer: Splits the text stream into individual tokens (usually words). Examples:
  - `standard`: Grammar-based, good for most Western languages.
  - `whitespace`: Splits only on whitespace.
  - `keyword`: Treats the entire input string as a single token (no splitting).
  - `pattern`: Splits based on a regular expression pattern.
  - Language-specific tokenizers.
- Token Filters (Optional): Process the tokens generated by the tokenizer. Examples:
  - `lowercase`: Converts tokens to lowercase.
  - `stop`: Removes common stop words ("the", "a", "is").
  - `stemmer`: Reduces words to their root form (e.g., "charging" -> "charg"), via implementations like `porter_stem` and `snowball`.
  - `synonym`: Adds synonyms for tokens.
  - `ngram`/`edge_ngram`: Creates n-grams for partial word matching and autocomplete.
Standard Analyzer vs. Other Built-in Analyzers
Elasticsearch comes with several pre-configured analyzers:
- `standard` analyzer: (Default for `text` fields) Includes the grammar-based standard tokenizer, the lowercase token filter, and a stop token filter (disabled by default in recent versions). A good general-purpose choice.
- `simple` analyzer: Tokenizes on non-letter characters and lowercases.
- `whitespace` analyzer: Tokenizes on whitespace and does not lowercase.
- `keyword` analyzer: No-op analyzer; outputs the entire input string as a single token (used internally by `keyword` fields).
- `english`, `french`, etc.: Language-specific analyzers with appropriate stop words and stemming rules.

You can also define custom analyzers by combining different character filters, tokenizers, and token filters to precisely control how your text is processed.
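As a sketch of how the three stages are wired together (the index name `products_v2` and analyzer name `product_text` here are hypothetical), a custom analyzer is defined in the index settings and then referenced from a field mapping:

```json
PUT /products_v2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "product_text": {
          "type": "custom",
          "char_filter": ["html_strip"],       // 1. strip HTML from the raw text
          "tokenizer": "standard",             // 2. split into tokens
          "filter": ["lowercase", "snowball"]  // 3. normalize the tokens
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "description": { "type": "text", "analyzer": "product_text" }
    }
  }
}
```

You can then verify its behavior with the `_analyze` API described below, passing `"analyzer": "product_text"` against this index.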
How Analysis Affects `match` vs. `term` Queries
- `match` query: The query string is analyzed using the same analyzer configured for the target field. This ensures that the search terms are processed in the same way as the indexed text (e.g., lowercased, stemmed). This allows "Fast Charging" to match "fast charging" or even "faster charger" (if stemmed).
- `term` query: The value is NOT analyzed. It searches for the exact value as-is in the inverted index. This is why `term` queries are typically used on `keyword` fields (which are indexed as-is) or, less commonly, to find the exact post-analysis token from a `text` field. A query `term: { "description": "Charging" }` would likely fail to match text analyzed by the `standard` analyzer, because the indexed term would be `charging` (lowercase).
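The difference is easy to demonstrate against our hypothetical `products` index (assuming, as in the earlier examples, that `description` is a `text` field and `category` a `keyword` field):

```json
// Likely finds nothing: the term query leaves "Charging" unanalyzed,
// but the inverted index stores the lowercased token "charging"
POST /products/_search
{ "query": { "term": { "description": "Charging" } } }

// Matches: the match query runs "Charging" through the field's analyzer first
POST /products/_search
{ "query": { "match": { "description": "Charging" } } }

// term behaves as expected on a keyword field, which is indexed verbatim
POST /products/_search
{ "query": { "term": { "category": "Electronics" } } }
```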
Testing Analyzers with the `_analyze` API
You can test how different analyzers process text using the `_analyze` API, without needing to index documents. This is extremely useful for debugging and understanding analysis.
```json
// Test the standard analyzer
POST /_analyze
{
  "analyzer": "standard",
  "text": "Elasticsearch Power Bank - Fast Charging!"
}
// Response shows the generated tokens: [elasticsearch, power, bank, fast, charging]

// Test a specific field's analyzer from an index
POST /products/_analyze
{
  "field": "name", // Use the analyzer configured for the 'name' field in 'products'
  "text": "Elasticsearch Power Bank"
}

// Test a custom analyzer definition
POST /_analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", {"type": "stop", "stopwords": ["a", "the"]}],
  "text": "The Quick Brown Fox"
}
// Response shows tokens: [quick, brown, fox]
```
Understanding analysis is key to crafting effective full-text search queries and ensuring your `match` queries behave as expected.
10. Putting It All Together: A More Complex Example
Let’s combine several concepts into a single, more realistic search request for our `products` index.
Goal: Find electronic products related to “portable charging”, priced between $20 and $75, added in the last 90 days, are currently in stock, prioritizing those with “USB-C” or high capacity. Return only the name, price, and description, highlight the matches in the description, sort by relevance then date added, and get the first 5 results.
```json
POST /products/_search
{
  "size": 5,
  "from": 0,
  "_source": ["name", "price", "description"], // Source filtering
  "query": {
    "bool": {
      "must": [ // Full-text query (scored)
        {
          "match": {
            "description": {
              "query": "portable charging",
              "operator": "and" // Both terms required
            }
          }
        }
      ],
      "filter": [ // Non-scoring filters for efficiency
        { "term": { "category": "Electronics" } },
        { "term": { "in_stock": true } },
        {
          "range": {
            "price": {
              "gte": 20,
              "lte": 75
            }
          }
        },
        {
          "range": {
            "date_added": {
              "gte": "now-90d/d" // Added in the last 90 days (start of day)
            }
          }
        }
      ],
      "should": [ // Boost score for desirable features
        { "term": { "features.ports": "USB-C" } },
        { "range": { "features.capacity_mah": { "gte": 15000 } } }
      ],
      "minimum_should_match": 0 // Should clauses are optional boosters
    }
  },
  "sort": [
    { "_score": "desc" }, // Primarily sort by relevance
    { "date_added": "desc" } // Then by date added (newest first)
  ],
  "highlight": {
    "pre_tags": ["<strong>"], // Use strong tags for highlighting
    "post_tags": ["</strong>"],
    "fields": {
      "description": { // Highlight matches in the description field
        "fragment_size": 150, // Max snippet length
        "number_of_fragments": 1 // Max number of snippets
      }
    }
  }
}
```
This single request demonstrates:
- Combining full-text (`must`) and term-level (`filter`) criteria using `bool`.
- Using `should` clauses for score boosting.
- Filtering by category, boolean, price range, and date range.
- Controlling pagination (`size`, `from`).
- Custom sorting (`_score`, `date_added`).
- Source filtering (`_source`).
- Highlighting matches in specific fields with custom tags.
11. Conclusion and Next Steps
Elasticsearch offers an incredibly powerful and flexible platform for searching data. We’ve covered the foundational concepts, from understanding documents, indices, and the inverted index, to interacting with the Search API and mastering the core components of the Query DSL. You should now be comfortable with:
- The difference between URI and Request Body search.
- The crucial distinction between Query Context and Filter Context.
- Using fundamental query types like `match`, `term`, `range`, and `bool`.
- Combining queries logically with the `bool` query (`must`, `filter`, `should`, `must_not`).
- Controlling results through sorting, pagination (`from`/`size`, `search_after`), and source filtering.
- The basics of relevance scoring and highlighting.
- The importance of text analysis and how it impacts search.
This comprehensive introduction provides a solid base, but the world of Elasticsearch search is vast. Potential next steps in your learning journey could include:
- Aggregations: Performing complex analytics and summarizations alongside your searches (faceting, metrics, bucketing).
- Advanced Query Types: Exploring `nested` queries (for arrays of objects), `geo` queries (for location-based search), `fuzzy` queries, `more_like_this` queries, and function score queries (for custom relevance tuning).
- Mapping Deep Dive: Understanding dynamic mapping, explicit mapping options, multi-fields, index templates, and dynamic templates for fine-grained control over indexing.
- Index Management: Learning about index aliases, rollover indices, shrinking/splitting indices, and managing cluster health.
- Performance Tuning: Optimizing query performance, shard allocation, hardware provisioning, and caching strategies.
- Client Libraries: Using official Elasticsearch clients for languages like Python, Java, JavaScript, Go, Ruby, .NET, etc., for easier application integration.
By building upon the fundamentals covered here and exploring these advanced topics, you can unlock the full potential of Elasticsearch to build sophisticated, fast, and relevant search experiences for any application. Happy searching!