How to Use MATLAB strsplit: An Introduction



Introduction: The Ubiquitous Need for String Splitting

In the world of data analysis, scientific computing, and software development, dealing with text data is an inescapable reality. Whether you’re parsing log files, reading configuration settings, processing comma-separated value (CSV) data, analyzing natural language, or handling user input, the ability to break down strings into smaller, meaningful pieces is fundamental. Raw text data often arrives in a semi-structured format, where information is packed into longer strings, separated by specific characters or patterns known as delimiters.

MATLAB, a high-level language and interactive environment primarily designed for numerical computation, visualization, and programming, provides powerful tools for string manipulation. Among these tools, the strsplit function stands out as a versatile workhorse for dissecting strings based on specified delimiters. Understanding how to effectively use strsplit unlocks the potential to efficiently process and extract valuable information from textual data within the MATLAB environment.

This article serves as a comprehensive guide to the MATLAB strsplit function. We will explore its purpose, syntax variations, the nuances of different delimiter types, powerful options for controlling its behavior, and practical examples illustrating its application in various scenarios. We will also delve into handling edge cases, consider performance implications, and compare strsplit with other related MATLAB functions, including its modern successor, split. By the end of this article, you will have a thorough understanding of strsplit and be well-equipped to use it confidently in your MATLAB projects.

Note: While strsplit is widely used, be aware that MATLAB introduced the split function in R2016b. split works directly on string arrays, character vectors, and cell arrays of character vectors, and is often the better fit for new code built around the string data type. It is not a drop-in replacement, though: split never collapses consecutive delimiters and has no regular-expression delimiter option. Understanding strsplit remains important for working with legacy code and for appreciating the evolution of string manipulation tools in MATLAB. We will explicitly compare strsplit and split later in this guide.

What is strsplit? The Core Concept

At its heart, strsplit performs a simple yet essential task: it splits a single piece of text (a character vector or a string scalar) into smaller substrings at each occurrence of the specified delimiters.

Imagine you have a string like 'apple,banana,orange'. You want to separate the individual fruit names. The comma (,) acts as the delimiter. Using strsplit, you can instruct MATLAB to break the string apart wherever it encounters a comma. The result would be a collection of the individual substrings: 'apple', 'banana', and 'orange'.

Key Characteristics:

  1. Input: strsplit operates on a single character vector (the traditional MATLAB text type) or a string scalar (available since R2016b). It does not accept a cell array as its first input; to split several pieces of text, call it once per element, for example with cellfun.
  2. Delimiter: You provide the character(s) or pattern(s) that mark the boundaries between the substrings you want to extract.
  3. Output: The function returns a 1-by-N cell array of character vectors when the input is a character vector, or a string array when the input is a string scalar. Each element contains one of the resulting substrings. The delimiters themselves are not included in the output substrings; they are available separately through an optional second output.

This process is fundamental for parsing data where fields or items are separated by consistent markers.

Basic Syntax and Usage

The most basic syntax for strsplit is straightforward:

```matlab
C = strsplit(str)
C = strsplit(str, Delimiter)
```

Let’s break down these forms:

  1. C = strsplit(str):

    • str: The input character vector or string scalar you want to split.
    • Behavior: When no delimiter is specified, strsplit splits the input string str at whitespace characters. Consecutive whitespace characters are collapsed, meaning multiple spaces or tabs between words are treated as a single delimiter. Leading and trailing whitespace is not trimmed, however: it still counts as a delimiter and produces an empty element at the beginning or end of the output. Apply strtrim to the input first if you want only the words.
    • C: The output cell array containing the resulting substrings.

    Example:
    ```matlab
    mySentence = '  This   is a sample   sentence.  ';
    words = strsplit(strtrim(mySentence)); % Trim first to avoid empty first/last cells

    disp(words);
    % Output:
    % 1×5 cell array
    % {'This'} {'is'} {'a'} {'sample'} {'sentence.'}
    ```
    Because the input is trimmed first and consecutive whitespace is collapsed, only the actual words remain. Without strtrim, the leading and trailing spaces would each add an empty element to the result.

  2. C = strsplit(str, Delimiter):

    • str: The input character vector or string scalar.
    • Delimiter: This specifies the character(s) or pattern(s) to use for splitting. It can be:
      • A single character vector (e.g., ',', '; ').
      • A cell array of character vectors (e.g., {',', ';', ':'}). This allows splitting by any of the delimiters in the cell array.
    • Behavior: Splits str wherever an occurrence of Delimiter is found. By default, consecutive delimiters are collapsed into one (the 'CollapseDelimiters' option defaults to true), so they do not produce empty elements between them. Delimiters at the beginning or end of the string still produce empty character vectors ('') in the output cell array.
    • C: The output cell array of substrings.

    Example (Single Delimiter):
    ```matlab
    csvLine = 'value1,value2,value3,value4';
    fields = strsplit(csvLine, ',');

    disp(fields);
    % Output:
    % 1×4 cell array
    % {'value1'} {'value2'} {'value3'} {'value4'}
    ```

    Example (Multiple Delimiters in a Cell Array):
    ```matlab
    mixedData = 'itemA;itemB,itemC:itemD';
    items = strsplit(mixedData, {',', ';', ':'});

    disp(items);
    % Output:
    % 1×4 cell array
    % {'itemA'} {'itemB'} {'itemC'} {'itemD'}
    ```

    Example (Consecutive Delimiters and Leading/Trailing Delimiters):
    ```matlab
    dataWithGaps = ',field1,,field3,';
    % Turn collapsing off so that every comma marks a field boundary
    splitResult = strsplit(dataWithGaps, ',', 'CollapseDelimiters', false);

    disp(splitResult);
    % Output:
    % 1×5 cell array
    % {''} {'field1'} {''} {'field3'} {''}
    ```
    Observe the empty character vectors generated by the leading comma, the two consecutive commas, and the trailing comma. (The Command Window displays them as {0×0 char} or {1×0 char}; they are written as {''} here for brevity.) Keeping them is often important for preserving positional information in data. With the default 'CollapseDelimiters', true, the empty element between the two consecutive commas disappears, while the leading and trailing ones remain.

Understanding Delimiters: The Key to Precise Splitting

The choice and specification of the delimiter are critical to achieving the desired split. strsplit offers flexibility here:

1. Single Character Vector Delimiter

This is the most common scenario. You provide a character vector (like ',', '-', '|', or even multi-character sequences like '; ', '===') that acts as the separator.

```matlab
pathStr = 'C:\Users\Documents\MATLAB\myfile.m';
% '\\' is the documented escape sequence for a literal backslash delimiter
pathComponents = strsplit(pathStr, '\\');

disp(pathComponents);
% Output:
% 1×5 cell array
% {'C:'} {'Users'} {'Documents'} {'MATLAB'} {'myfile.m'}

configLine = 'parameter_name===value';
parts = strsplit(configLine, '===');

disp(parts);
% Output:
% 1×2 cell array
% {'parameter_name'} {'value'}
```

2. Cell Array of Character Vector Delimiters

This allows you to split the string using any of the delimiters provided in the cell array. This is useful when data might use inconsistent separators.

```matlab
logEntry = 'Error:Timestamp=12345;Source=ModuleA,Details=FileNotFound';
logParts = strsplit(logEntry, {':', '=', ';', ','});

disp(logParts);
% Output:
% 1×7 cell array
% {'Error'} {'Timestamp'} {'12345'} {'Source'} {'ModuleA'} {'Details'} {'FileNotFound'}
```
In this example, the string was split wherever `:`, `=`, `;`, or `,` occurred.

3. Whitespace Delimiter (Default Behavior)

As seen earlier, if you call strsplit(str) without specifying a delimiter, it defaults to splitting by whitespace. The definition of whitespace typically includes space (' '), tab ('\t'), newline ('\n'), carriage return ('\r'), vertical tab ('\v'), and form feed ('\f').

Key points about the default whitespace splitting:

  • Collapses: Consecutive whitespace characters are treated as a single delimiter.
  • Does not trim: Leading or trailing whitespace still counts as a delimiter, so it produces an empty element at the beginning or end of the output cell array. Call strtrim on the input first if that is unwanted.

```matlab
multiLineStr = sprintf('Line 1 \t has words\n Line 2 \r also');
splitByWhitespace = strsplit(multiLineStr);

disp(splitByWhitespace);
% Output:
% 1×7 cell array
% {'Line'} {'1'} {'has'} {'words'} {'Line'} {'2'} {'also'}
```

4. Regular Expression Delimiters (Advanced)

strsplit can also use regular expressions as delimiters, unlocking much more powerful and flexible pattern-based splitting. This requires using the full syntax involving Name-Value pairs, specifically 'DelimiterType', 'RegularExpression'. We will cover this in detail in the “Controlling Splitting Behavior” section.

Important Note on Empty Delimiters: Providing an empty character vector ('') or an empty cell array ({}) as the delimiter results in an error in most MATLAB versions. It’s not a valid way to split between every character, for instance.
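If the goal really is one element per character, no delimiter-based split is needed. A minimal sketch, using num2cell (a standard MATLAB function, not part of strsplit) on a hypothetical variable named word:

```matlab
word = 'abc';

% num2cell places each character of the row vector in its own cell
chars = num2cell(word);

disp(numel(chars)); % number of characters in word
```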

Controlling Splitting Behavior: Name-Value Pair Options

Beyond the basic syntax, strsplit allows fine-grained control over its operation using Name-Value pair arguments. The full syntax looks like this:

```matlab
C = strsplit(str, Delimiter, Name, Value, ...)
```

Where Name is the name of an option (as a character vector) and Value is its corresponding setting. Let’s explore the key options:

1. 'DelimiterType'

This option explicitly tells strsplit how to interpret the Delimiter argument.

  • 'DelimiterType', 'Simple' (Default): Interprets the Delimiter literally as characters or sequences of characters. The only special handling is for escape sequences such as \n, \t, and \\ (a literal backslash).

  • 'DelimiterType', 'RegularExpression': Interprets the Delimiter as a regular expression pattern. This enables splitting based on complex patterns rather than just fixed strings.

These are the only two values. The class of the Delimiter argument (character vector, string, or cell array) does not change how it is interpreted.

Example using Regular Expression: Split by one or more digits.
```matlab
dataString = 'Part123Next45Another6End';
% The regex '\d+' means "one or more digits"
parts = strsplit(dataString, '\d+', 'DelimiterType', 'RegularExpression');

disp(parts);
% Output:
% 1×4 cell array
% {'Part'} {'Next'} {'Another'} {'End'}
```
Without 'DelimiterType', 'RegularExpression', strsplit would not treat '\d+' as a pattern: in the default 'Simple' mode the delimiter is matched as literal text (apart from escape sequences such as \n and \t), so the digits would not be recognized as delimiters.

When to specify DelimiterType:
* You must specify 'DelimiterType', 'RegularExpression' when using a regex pattern as the delimiter.
* For literal delimiters, nothing needs to be specified: 'Simple' is the default, so characters that have special meaning in regex (like ., *, +) are matched literally. strsplit never infers the type from the delimiter itself. Passing 'DelimiterType', 'Simple' explicitly can still make the intent clearer to readers.
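The difference between the two modes is easiest to see with a delimiter that is also a regex metacharacter. The sketch below uses a hypothetical version string and the dot as delimiter; the variable names are illustrative only.

```matlab
versionStr = '10.4.2';

% Default ('Simple') mode: the dot is matched literally
simpleParts = strsplit(versionStr, '.');

% Regular-expression mode: escape the dot so it is not treated as a wildcard
regexParts = strsplit(versionStr, '\.', 'DelimiterType', 'RegularExpression');

% Unescaped in regular-expression mode, '.' matches every character,
% so the whole input is consumed as delimiter and only empty pieces remain
wildcardParts = strsplit(versionStr, '.', 'DelimiterType', 'RegularExpression');

disp(simpleParts);
disp(regexParts);
disp(all(cellfun(@isempty, wildcardParts)));
```

The first two calls give the same three parts; the third shows why an unescaped metacharacter in regular-expression mode is a common source of surprising results.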

2. 'CollapseDelimiters'

This option controls how strsplit handles sequences of two or more consecutive delimiters.

  • 'CollapseDelimiters', true (Default): Treats consecutive delimiters as a single delimiter. This prevents the creation of empty character vectors ('') between them in the output.
  • 'CollapseDelimiters', false: Treats each delimiter occurrence independently. Consecutive delimiters will result in one or more empty character vectors ('') in the output cell array.

Note: The default is true for every call form, whether you rely on the default whitespace delimiter or pass an explicit one. If empty fields carry meaning (as in CSV data), set it to false.

Example:
```matlab
rawData = 'A,,B, C ,D'; % Note double comma and space around C

% Default behavior (CollapseDelimiters = true)
splitDefault = strsplit(rawData, ',');
disp('Default (Collapse = true):');
disp(splitDefault);
% Output:
% Default (Collapse = true):
% 1×4 cell array
% {'A'} {'B'} {' C '} {'D'} % No empty string, but spaces around C remain

% Keep every delimiter occurrence
splitKeepEmpty = strsplit(rawData, ',', 'CollapseDelimiters', false);
disp('Not collapsed (Collapse = false):');
disp(splitKeepEmpty);
% Output:
% Not collapsed (Collapse = false):
% 1×5 cell array
% {'A'} {''} {'B'} {' C '} {'D'} % Note empty string and preserved spaces
```
In the default (collapsed) version, the empty string resulting from `,,` is gone. However, note that 'CollapseDelimiters' only affects *consecutive occurrences of the specified delimiter(s)*. It does *not* trim whitespace from the resulting substrings (like the spaces around ' C '). Trimming whitespace requires a separate step (e.g., using strtrim).

3. Quoted Fields: No Built-In Option

strsplit has no option for quote-aware splitting. Its only Name-Value options are 'CollapseDelimiters' and 'DelimiterType'; a name such as 'PreserveQuotes' is not recognized and raises an error. Delimiters are therefore matched everywhere, including inside quotation marks. For simple cases, a regular-expression delimiter that skips commas inside double quotes is a workable substitute.

Example: Parsing a CSV line where one field contains a comma.
```matlab
csvLineWithQuote = 'Field1,"Field 2, contains comma",Field3';

% Plain split: the comma inside the quotes is treated as a delimiter too
splitPlain = strsplit(csvLineWithQuote, ',');
disp('Plain split:');
disp(splitPlain);
% Output:
% Plain split:
% 1×4 cell array
% {'Field1'} {'"Field 2'} {' contains comma"'} {'Field3'} % Incorrect split

% Workaround: split only at commas followed by an even number of double quotes
quoteAware = ',(?=(?:[^"]*"[^"]*")*[^"]*$)';
splitQuoted = strsplit(csvLineWithQuote, quoteAware, ...
    'DelimiterType', 'RegularExpression', 'CollapseDelimiters', false);
splitQuoted = regexprep(splitQuoted, '^"|"$', ''); % Strip the enclosing quotes
disp('Quote-aware split:');
disp(splitQuoted);
% Output:
% Quote-aware split:
% 1×3 cell array
% {'Field1'} {'Field 2, contains comma'} {'Field3'} % Correct split, quotes removed
```

Caveats for the regex workaround:
* It assumes balanced, unnested double quotes.
* It does not handle escaped quotes within quoted sections (e.g., "He said ""Hello"""). For real CSV data, dedicated functions such as readtable or textscan are the more robust choice.

Understanding the Output: The Cell Array

strsplit returns a 1-by-N result whose type follows the input: a cell array of character vectors for a character vector input, or a string array for a string scalar input.

  • Cell Array: A MATLAB data structure that can hold different types and sizes of data in its elements (cells). With a character vector input, each cell holds a character vector (a substring).
  • 1-by-N: The output is always a row, where N is the number of substrings generated by the split.
  • Element Type: With a character vector input, the content of each cell is a standard MATLAB character vector (e.g., 'hello'). With a string scalar input, the output is a string array instead. (The newer split function follows the same type rule but returns a column.)
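The relationship between input type and output type can be checked directly. A small sketch, assuming R2017a or later for the double-quoted string literal; the variable names are illustrative:

```matlab
charParts = strsplit('a,b,c', ','); % character vector input
strParts = strsplit("a,b,c", ",");  % string scalar input

disp(class(charParts)); % class of the result for char input
disp(class(strParts));  % class of the result for string input
disp(size(charParts));  % both results are rows
```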

Accessing the Results:
You access the individual substrings using standard cell array indexing with curly braces {}.

```matlab
myString = 'first:second:third';
resultCell = strsplit(myString, ':');

firstElement = resultCell{1}; % Access the first substring
secondElement = resultCell{2}; % Access the second substring
numElements = numel(resultCell); % Get the number of substrings

fprintf('First element: %s\n', firstElement);
fprintf('Second element: %s\n', secondElement);
fprintf('Total elements: %d\n', numElements);

% Output:
% First element: first
% Second element: second
% Total elements: 3
```

Empty Character Vectors ('') in Output:
As previously highlighted, empty character vectors can appear in the output. Cases 1 and 2 occur regardless of CollapseDelimiters; case 3 requires it to be false:

  1. Delimiter at the Start: strsplit(',a,b', ',') -> {'', 'a', 'b'}
  2. Delimiter at the End: strsplit('a,b,', ',') -> {'a', 'b', ''}
  3. Consecutive Delimiters: strsplit('a,,b', ',', 'CollapseDelimiters', false) -> {'a', '', 'b'}

Understanding when and why these empty elements appear is crucial for correctly interpreting the split results, especially when dealing with potentially missing data or preserving positional information.
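When the empty elements carry no meaning, they can be filtered out after the split. A minimal sketch, using cellfun with isempty on a hypothetical input:

```matlab
raw = ',a,,b,';

% Keep every delimiter occurrence so that all empty fields are visible
parts = strsplit(raw, ',', 'CollapseDelimiters', false);

% Logical indexing drops the empty elements and keeps the rest in order
nonEmpty = parts(~cellfun(@isempty, parts));

disp(numel(parts));    % elements before filtering
disp(numel(nonEmpty)); % elements after filtering
```

Filtering discards positional information, so it only suits data where a missing field does not need to be distinguished from an absent one.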

Comprehensive Examples: strsplit in Action

Let’s solidify our understanding with more practical examples covering various use cases.

Example 1: Parsing Simple Key-Value Pairs

```matlab
configLine = ' FontSize = 12 ; Color = Blue ; Font = Arial ';
settings = strsplit(configLine, ';'); % Split by semicolon

% Trim whitespace from each resulting part
settings = strtrim(settings);
disp('Initial Split:');
disp(settings);

% Further process each key-value pair
numSettings = numel(settings);
configData = struct(); % Store results in a struct

for i = 1:numSettings
    if isempty(settings{i}) % Skip empty parts (e.g., after a trailing ';')
        continue;
    end
    pair = strsplit(settings{i}, '='); % Split key and value by '='
    if numel(pair) == 2
        key = strtrim(pair{1});
        value = strtrim(pair{2});
        % Basic type conversion attempt (optional)
        numValue = str2double(value);
        if ~isnan(numValue)
            configData.(key) = numValue; % Store as number if possible
        else
            configData.(key) = value; % Store as character vector
        end
    else
        warning('Skipping malformed setting: %s', settings{i});
    end
end

disp('Parsed Configuration Structure:');
disp(configData);

% Output:
% Initial Split:
% 1×3 cell array
% {'FontSize = 12'} {'Color = Blue'} {'Font = Arial'}
%
% Parsed Configuration Structure:
% FontSize: 12
% Color: 'Blue'
% Font: 'Arial'
```
This example demonstrates a common pattern: initial splitting by a primary delimiter (`;`), followed by trimming whitespace (strtrim), and then further splitting each part by a secondary delimiter (`=`). Error handling (checking numel(pair)) is also included.

Example 2: Splitting File Paths (Cross-Platform)

File paths can use different separators (\ on Windows, / on Unix/macOS). strsplit can handle this using a cell array delimiter.

```matlab
pathWin = 'C:\Folder\Subfolder\file.txt';
pathUnix = '/home/user/data/file.txt';

separators = {'\\', '/'}; % '\\' is the escape sequence for a literal backslash
componentsWin = strsplit(pathWin, separators); % Use both separators
componentsUnix = strsplit(pathUnix, separators);

% Note: A path that starts with a separator (such as an absolute Unix path)
% produces an empty first element, which is removed here.
if isempty(componentsWin{1})
    componentsWin(1) = [];
end
if isempty(componentsUnix{1})
    componentsUnix(1) = [];
end

disp('Windows Path Components:');
disp(componentsWin);

disp('Unix Path Components:');
disp(componentsUnix);

% Output:
% Windows Path Components:
% 1×4 cell array
% {'C:'} {'Folder'} {'Subfolder'} {'file.txt'}
%
% Unix Path Components:
% 1×4 cell array
% {'home'} {'user'} {'data'} {'file.txt'}
```
This shows how to use multiple delimiters and includes a simple post-processing step to handle potential leading empty strings caused by absolute paths.

Example 3: Handling Data with Quoted Fields

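```matlab
csvLine = '101,"Smith, John",Sales,"Level 5, Access All"';

% strsplit has no quote-aware option, so use a regex delimiter that matches
% only commas followed by an even number of double quotes (commas outside quotes)
quoteAware = ',(?=(?:[^"]*"[^"]*")*[^"]*$)';
fields = strsplit(csvLine, quoteAware, ...
    'DelimiterType', 'RegularExpression', 'CollapseDelimiters', false);
fields = regexprep(fields, '^"|"$', ''); % Remove the enclosing quotes

disp('Parsed Fields:');
disp(fields);

% Output:
% Parsed Fields:
% 1×4 cell array
% {'101'} {'Smith, John'} {'Sales'} {'Level 5, Access All'}
```
This shows how a regular-expression delimiter can parse data where delimiters appear within quoted text. The pattern assumes balanced double quotes; for real CSV files, readtable or textscan is more robust.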

Example 4: Splitting based on Multiple Whitespace Types

Using the default behavior for splitting by any whitespace.

```matlab
% sprintf converts the escape sequences (\t, \n, \r) into real whitespace characters
textBlock = sprintf(['First item \t Second item\n' ...
    'Third item Fourth item\r\nFifth item']);
items = strsplit(textBlock); % Default whitespace splitting

disp('Items split by whitespace:');
disp(items);

% Output:
% Items split by whitespace:
% 1×10 cell array
% {'First'} {'item'} {'Second'} {'item'} {'Third'} {'item'} {'Fourth'} {'item'} {'Fifth'} {'item'}
```
This highlights how default strsplit handles various whitespace characters (space, tab \t, newline \n, carriage return \r) and collapses them. Note that the escape sequences must first be converted with sprintf; inside a plain quoted character vector, \t is just a backslash followed by the letter t.

Example 5: Using strsplit on a Cell Array of Strings

strsplit accepts only a single character vector or string scalar as its first input, so passing a cell array raises an error. To split every element of a cell array, apply strsplit to each element with cellfun.

```matlab
listOfStrings = {'a,b,c', 'd,e', 'f,g,h,i'};
delimiter = ',';

% Apply strsplit to each element; results differ in length, so keep them in cells
splitCells = cellfun(@(s) strsplit(s, delimiter), listOfStrings, 'UniformOutput', false);

disp('Result of splitting cell array elements:');
disp(splitCells);

% Output:
% Result of splitting cell array elements:
% 1×3 cell array
% {1×3 cell} {1×2 cell} {1×4 cell}
%
% Let's inspect the contents:
disp('Contents of splitCells{1}:');
disp(splitCells{1}); % {'a'} {'b'} {'c'}
%
disp('Contents of splitCells{2}:');
disp(splitCells{2}); % {'d'} {'e'}
%
disp('Contents of splitCells{3}:');
disp(splitCells{3}); % {'f'} {'g'} {'h'} {'i'}
```
The output splitCells is a cell array where *each cell* contains the result (another cell array) of applying strsplit to the corresponding element of the input listOfStrings.

Advanced Usage with Regular Expressions

Using 'DelimiterType', 'RegularExpression' unlocks sophisticated splitting capabilities. Regular expressions (regex) provide a concise and powerful syntax for describing patterns in text.

Why Use Regex with strsplit?

  • Pattern-Based Splitting: Split by patterns, not just fixed strings (e.g., split by any number, any non-alphanumeric character).
  • Complex Delimiters: Define delimiters that have variations (e.g., split by “Error:”, “Warning:”, or “Info:”).
  • Contextual Splitting: Split based on context (though more complex context usually involves regexp for capturing).

Key Regex Concepts for Delimiters:

  • .: Matches any single character (in MATLAB this includes newline by default).
  • *: Matches the previous element zero or more times.
  • +: Matches the previous element one or more times.
  • ?: Matches the previous element zero or one time.
  • \d: Matches any digit (0-9).
  • \s: Matches any whitespace character.
  • \w: Matches any word character (alphanumeric plus _).
  • [...]: Matches any single character within the brackets (e.g., [abc] matches ‘a’, ‘b’, or ‘c’).
  • [^...]: Matches any single character not within the brackets.
  • |: Acts as an OR operator (e.g., cat|dog matches ‘cat’ or ‘dog’).
  • (): Groups parts of the expression.
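Several of these building blocks combine naturally in a delimiter pattern. The sketch below, on a hypothetical record, uses a character class for the separator and \s* to absorb any whitespace around it, so the resulting tokens need no trimming:

```matlab
record = 'alpha , beta;gamma ,delta';

% '\s*[,;]\s*' means: optional whitespace, a comma or semicolon, optional whitespace
fields = strsplit(record, '\s*[,;]\s*', 'DelimiterType', 'RegularExpression');

disp(fields);
```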

Example 6: Splitting by Any Non-Alphanumeric Character

```matlab
messyData = 'Value1;Value2#Value3@Value4/Value5';
% Regex '[^\w]+' means "one or more characters that are NOT word characters"
% Word characters (\w) are letters, numbers, and underscore.
cleanParts = strsplit(messyData, '[^\w]+', 'DelimiterType', 'RegularExpression');

disp(cleanParts);
% Output:
% 1×5 cell array
% {'Value1'} {'Value2'} {'Value3'} {'Value4'} {'Value5'}
```

Example 7: Splitting by Specific Words (Case-Insensitive)

```matlab
text = 'Start section ALPHA then continue section BRAVO finally end';

% Case-sensitive version: split at the marker words plus any surrounding whitespace.
% \s* (rather than \s+) lets the adjacent markers 'BRAVO finally' match back to back;
% the default CollapseDelimiters = true then merges them into a single delimiter.
sections = strsplit(text, '\s*(ALPHA|BRAVO|finally)\s*', 'DelimiterType', 'RegularExpression');
disp('Case-Sensitive Split:');
disp(sections);
% Output:
% Case-Sensitive Split:
% 1×3 cell array
% {'Start section'} {'then continue section'} {'end'}

% If case-insensitivity is needed, spelling out both cases in the pattern works:
delimiterPatternCI = '\s*([Aa][Ll][Pp][Hh][Aa]|[Bb][Rr][Aa][Vv][Oo]|[Ff][Ii][Nn][Aa][Ll][Ll][Yy])\s*';
sectionsCI = strsplit(text, delimiterPatternCI, 'DelimiterType', 'RegularExpression');
disp('Case-Insensitive Split (Manual Pattern):');
disp(sectionsCI);
% Output:
% Case-Insensitive Split (Manual Pattern):
% 1×3 cell array
% {'Start section'} {'then continue section'} {'end'}
```
The (?i) flag that regexp accepts for case-insensitive matching may also work inside a strsplit delimiter pattern, but this example does not rely on it. Spelling out the case variations (like [Aa]) or calling regexp directly are the conservative options.

Example 8: Splitting and Keeping Delimiters

The first output of strsplit never contains the delimiters, but the function returns the matched delimiters as an optional second output. regexp offers the same pair of results through its 'split' and 'match' options. Note that, unlike in some other languages, wrapping the delimiter pattern in capturing parentheses does not make regexp(..., 'split') insert the delimiters into the result.

```matlab
text = 'item1DELIMitem2DELIMitem3';
delimiter = 'DELIM';

% strsplit: the second output holds the delimiters that were matched
[parts, matches] = strsplit(text, delimiter);
% parts:   {'item1'} {'item2'} {'item3'}
% matches: {'DELIM'} {'DELIM'}

% regexp: outputs follow the order of the requested options
[partsRe, matchesRe] = regexp(text, delimiter, 'split', 'match');

% Interleave parts and delimiters to rebuild the original sequence
combined = cell(1, numel(parts) + numel(matches));
combined(1:2:end) = parts;
combined(2:2:end) = matches;

disp('Parts and delimiters interleaved:');
disp(combined);
% Output:
% Parts and delimiters interleaved:
% 1×5 cell array
% {'item1'} {'DELIM'} {'item2'} {'DELIM'} {'item3'}
```
The parts output always has exactly one more element than matches, which is what makes the interleaving above work. For more complex pattern logic (such as lookarounds or capture groups), regexp remains the more flexible tool.

Handling Edge Cases and Potential Pitfalls

When using strsplit, be mindful of these scenarios:

  1. Empty Input String (''):

    • The output always has one more element than the number of delimiters matched. An empty input contains no delimiters, so strsplit('') and strsplit('', ',') are expected to return a 1×1 cell array containing an empty character vector, not an empty cell array.
    • The 'CollapseDelimiters' setting makes no difference here, because there are no delimiters to collapse.
    • Handling of empty inputs is easy to overlook, so confirm the behavior on your MATLAB release if your code depends on it.

    ```matlab
    c = strsplit('', ',');
    disp(numel(c))      % Expected: 1
    disp(isempty(c{1})) % Expected: 1 (true)
    ```

  2. Delimiter Not Found: If the specified delimiter(s) do not exist in the input string, strsplit returns a 1×1 cell array containing the original, unmodified input string.
    ```matlab
    result = strsplit('abcde', 'X');
    disp(result); % Output: {'abcde'}
    ```

  3. Input is Not a Character Vector or String Scalar: Providing numeric input, a cell array, or other data types as the first argument will result in an error.

  4. Regex Syntax Errors: If using 'DelimiterType', 'RegularExpression', an invalid regex pattern will cause an error. Test your regex patterns carefully.

  5. Performance with Very Large Strings or Complex Regex: For extremely large strings or computationally intensive regular expressions used as delimiters, strsplit (and string manipulation in general) can become a bottleneck. Consider alternatives or optimization strategies if performance is critical (see next sections).

  6. Whitespace Handling: CollapseDelimiters defaults to true in every call form, so pass 'CollapseDelimiters', false when empty fields matter. strsplit never trims: leading and trailing delimiters produce empty elements, and whitespace around tokens is kept when an explicit delimiter is used. Use strtrim on the input or on the results if you need to remove it.
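The whitespace point in item 6 can be handled in one line, since strtrim accepts a cell array of character vectors. A minimal sketch on a hypothetical input:

```matlab
rawLine = ' A , B ,C ';

% Split at the explicit delimiter, then trim every token in one call
tokens = strtrim(strsplit(rawLine, ','));

disp(tokens);
```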

Performance Considerations

While strsplit is generally efficient for common tasks, performance can vary based on several factors:

  1. Input String Size: Larger strings naturally take longer to process.
  2. Number of Delimiters: More occurrences of the delimiter mean more splitting operations.
  3. Delimiter Complexity:
    • Simple character delimiters are usually fastest.
    • Cell arrays of delimiters add some overhead.
    • Regular expressions are the most powerful but can also be the slowest, especially complex patterns involving backtracking.
  4. CollapseDelimiters Option: Setting this to true might add a small overhead compared to false, as it requires checking for consecutive delimiters.
  5. MATLAB Version and Engine: Performance characteristics can change between MATLAB releases due to internal optimizations.

strsplit vs. split Performance:
The newer split function (introduced R2016b) was designed around string arrays and can split every element of an array in a single call (provided each element yields the same number of pieces), which avoids a loop or cellfun over strsplit. Relative speed depends on the input and the MATLAB release, so treat any blanket claim with caution and measure. Note that split has no regular-expression delimiter option, so regex-based splitting still means strsplit or regexp.

Recommendation: For new code in R2016b or later that splits on literal delimiters, consider the split function. If you are bound to older MATLAB versions, maintaining legacy code, or need regex delimiters or delimiter collapsing, strsplit remains the tool to use, but be mindful of potential bottlenecks with very large inputs or complex regex. Profiling your code (profile viewer) is always recommended to identify actual performance issues.
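Rather than relying on general claims, the two functions can be timed on representative data. The following is a rough sketch using timeit on a synthetic input (10,000 comma-separated integers, chosen arbitrarily for illustration); absolute numbers will differ by machine and release, and split requires R2016b or later.

```matlab
% Build a synthetic test input: '1,2,3,...,10000'
s = sprintf('%d,', 1:10000);
s(end) = []; % Drop the trailing comma

% timeit runs each function handle several times and returns a typical time in seconds
tStrsplit = timeit(@() strsplit(s, ','));
tSplit = timeit(@() split(s, ','));

fprintf('strsplit: %.5f s, split: %.5f s\n', tStrsplit, tSplit);
```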

Comparison with Alternatives: split, regexp, textscan

MATLAB offers several functions for breaking down strings. Choosing the right one depends on the specific task:

| Function | Primary Use | Input Type(s) | Output Type(s) | Delimiter Handling | Key Features / Differences | Recommended? |
|---|---|---|---|---|---|---|
| strsplit | Split text at delimiters (removed) | Char vector, String scalar | Cell array of char vectors, or string array for string input (1-by-N row) | Char, Cell array of chars, Regex (via option) | Collapses consecutive delimiters by default, Optional second output with the matched delimiters | Legacy/older code, regex delimiters |
| split | Modern split of strings/char vectors (removed) | String array, Char vector, Cell array of char vectors | String array for string input, otherwise cell array (column for scalar input) | Literal text (no regex option) | Handles string arrays naturally, Never collapses delimiters | Yes (R2016b+), for literal delimiters |
| regexp | Complex pattern matching and extraction | Char vector, String array | Cell array, Numeric array, Struct | Defined by regex pattern ('split' and 'match' outputs) | Most powerful pattern matching, Can capture groups, split, match. Steeper learning curve. | For complex patterns, capture groups |
| textscan | Read formatted data from text or file ID | Char vector, File ID | Cell array (columns of data) | Whitespace, Specified delimiters, Format specifiers | Designed for structured text/CSV, Handles data type conversion, Efficient for files | For reading structured files/text |
| splitlines | Split text into lines based on newline chars | String array, Char vector | String array, or cell array for char input | Newline characters (\n, \r\n, \r) | Specifically for splitting text into lines, simpler than strsplit for this task | For line splitting |
| readtable | Read tabular data from a file | File name | Table | Auto-detected or specified (CSV, TSV, etc.) | High-level function for tabular data, Handles headers, types, missing data | For tabular data files (CSV etc.) |

Summary of Choices:

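  • Need to split strings or character vectors at literal delimiters in modern MATLAB (R2016b+)? Use split. It is often faster and is the preferred choice for new code. For regular-expression delimiters, use strsplit with 'DelimiterType', 'RegularExpression', regexp with the 'split' option, or (R2020b+) split with a pattern object.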
  • Working with legacy code or MATLAB versions before R2016b? Use strsplit.
  • Need to split by complex patterns AND capture parts of the string or the delimiters themselves? Use regexp.
  • Reading structured data from a file (like CSV) or a large text block with known formats and need type conversion? Use textscan or readtable.
  • Just need to split text into separate lines? Use splitlines.
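  • Need to parse quoted fields? Neither strsplit nor split is quote-aware: strsplit has no 'PreserveQuotes' option (its only name-value options are 'CollapseDelimiters' and 'DelimiterType'), and split treats a delimiter inside quotes like any other. Use readtable for CSV files, textscan with the %q conversion, or a purpose-built regexp pattern.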
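The choices above can be compared on one small input. This is a minimal sketch: the sample strings are made up for illustration, and the split and splitlines calls assume R2016b or later.

```matlab
str = 'alpha,beta,,gamma';   % note the empty field between the two commas

% strsplit returns a 1-by-N cell row. 'CollapseDelimiters' defaults to true,
% so the empty field disappears unless collapsing is switched off.
a = strsplit(str, ',');                               % {'alpha','beta','gamma'}
b = strsplit(str, ',', 'CollapseDelimiters', false);  % {'alpha','beta','','gamma'}

% split returns an N-by-1 column and never collapses consecutive delimiters.
c = split(str, ',');            % 4-by-1 cell array (char input)
d = split(string(str), ',');    % 4-by-1 string array (string input)

% regexp with 'split' treats the delimiter as a regular expression and
% keeps empty fields.
e = regexp(str, ',', 'split');  % {'alpha','beta','','gamma'}

% splitlines breaks text at newline characters only.
txt = sprintf('first line\nsecond line');
f = splitlines(txt);            % 2-by-1 cell array

% textscan splits and converts to numeric in one step.
g = textscan('1,2,3', '%f', 'Delimiter', ',');   % g{1} is [1; 2; 3]
```

The differing shapes (row for strsplit and regexp, column for split and splitlines) matter when the result feeds indexing or concatenation code, so it is worth checking size() after switching functions.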

Best Practices and Tips for Using strsplit

  1. Know split: If using R2016b or later, strongly consider the newer split function for new code because of its performance and its natural handling of string arrays. Reach for strsplit when maintaining legacy code, when targeting releases before R2016b, or when you want regular-expression delimiters or delimiter collapsing in a single call.
  2. Be Explicit with Regex: When using regular expressions, always specify 'DelimiterType', 'RegularExpression' to avoid ambiguity and ensure the pattern is interpreted correctly.
  3. Handle Whitespace: strsplit does not trim whitespace around the pieces it returns. With the default whitespace delimiter, runs of whitespace are collapsed, but a leading or trailing run still produces an empty first or last element. With an explicit delimiter such as ',', any spaces around the delimiter stay in the output. Use strtrim on the input or on the output cells as needed.
  4. Understand CollapseDelimiters: The default is 'CollapseDelimiters', true, which treats consecutive delimiters as one and therefore drops the empty fields between them. Set 'CollapseDelimiters', false when positional empty fields are meaningful, for example in CSV-like records with missing values.
  5. Do Not Rely on Quote Handling: strsplit has no option for protecting delimiters inside quotes; its only name-value options are 'CollapseDelimiters' and 'DelimiterType'. If your data uses quotes to escape delimiters, use readtable or textscan with %q, which are built for robust CSV parsing.
  6. Check Output Size: After splitting, check numel() of the resulting cell array, especially if subsequent code assumes a fixed number of elements. This helps catch errors from unexpected input formats.
  7. Consider regexp for Complexity: If your splitting logic becomes very complex (e.g., needing lookarounds, keeping delimiters, intricate conditional splitting), regexp is likely a more appropriate and powerful tool.
  8. Cell Array Output: Remember strsplit returns a cell array. Use curly braces {} for accessing the content of each cell.
  9. Profile Critical Code: If string splitting is part of a performance-sensitive section of your code, use the MATLAB Profiler to measure its impact and compare strsplit with split or other alternatives if necessary.
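Several of these tips can be combined in one short routine. The following is a sketch only, assuming a made-up 'key = value' record format separated by semicolons; the variable names are illustrative.

```matlab
rec = 'id = 42 ; name = probe ;; mode = fast';   % hypothetical input record

% Tip 4: keep empty fields so a doubled delimiter is visible, not hidden.
parts = strsplit(rec, ';', 'CollapseDelimiters', false);

% Tip 3: strsplit does not trim, so strip surrounding whitespace explicitly.
parts = strtrim(parts);          % strtrim accepts a cell array of char vectors

% Drop the empty field deliberately, now that it has been seen.
parts = parts(~cellfun(@isempty, parts));

for k = 1:numel(parts)
    % Tip 8: use curly braces to get the char vector out of the cell.
    kv = strtrim(strsplit(parts{k}, '='));

    % Tip 6: check the piece count before indexing into the result.
    if numel(kv) ~= 2
        warning('Skipping malformed field: %s', parts{k});
        continue
    end
    fprintf('%s -> %s\n', kv{1}, kv{2});
end

% Tip 2: state the delimiter type explicitly when the delimiter is a regex.
letters = strsplit('a1b22c333d', '\d+', 'DelimiterType', 'RegularExpression');
% letters is {'a','b','c','d'}
```

Splitting first and trimming afterwards keeps each step easy to inspect, at the cost of an extra pass over the cell array.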

Conclusion: Mastering String Segmentation with strsplit

The strsplit function is a fundamental tool in the MATLAB arsenal for text manipulation. It provides a straightforward and flexible mechanism for breaking down character vectors into manageable substrings based on specified delimiters. From parsing simple comma-separated data to splitting on regular-expression patterns, strsplit lets users process and extract information from diverse textual sources, and knowing how it differs from the newer split function helps you pick the right one.

We have explored its basic syntax, the nuances of different delimiter types, and the control offered by the name-value options 'DelimiterType' and 'CollapseDelimiters'. We have also worked through practical application scenarios and discussed edge cases, performance, and comparisons with alternative functions like split, regexp, and textscan.

While the modern recommendation often leans towards the split function for new development in recent MATLAB versions, a solid grasp of strsplit remains invaluable. It equips you to work with existing codebases, understand the evolution of MATLAB’s string handling capabilities, and apply the right tool for specific text processing challenges, particularly when you need regular-expression delimiters or delimiter collapsing, or when working in older environments. By mastering strsplit and its context within MATLAB’s suite of string functions, you significantly enhance your ability to turn raw text data into structured, usable information.

