How to Use MATLAB strsplit: A Comprehensive Introduction and Deep Dive
Introduction: The Ubiquitous Need for String Splitting
In the world of data analysis, scientific computing, and software development, dealing with text data is an inescapable reality. Whether you’re parsing log files, reading configuration settings, processing comma-separated value (CSV) data, analyzing natural language, or handling user input, the ability to break down strings into smaller, meaningful pieces is fundamental. Raw text data often arrives in a semi-structured format, where information is packed into longer strings, separated by specific characters or patterns known as delimiters.
MATLAB, a high-level language and interactive environment primarily designed for numerical computation, visualization, and programming, provides powerful tools for string manipulation. Among these tools, the `strsplit` function stands out as a versatile workhorse for dissecting strings based on specified delimiters. Understanding how to use `strsplit` effectively lets you process and extract valuable information from textual data within the MATLAB environment.
This article serves as a comprehensive guide to the MATLAB `strsplit` function. We will explore its purpose, syntax variations, the nuances of different delimiter types, the options for controlling its behavior, and practical examples illustrating its application in various scenarios. We will also delve into handling edge cases, consider performance implications, and compare `strsplit` with other related MATLAB functions, including the newer `split`. By the end of this article, you will have a thorough understanding of `strsplit` and be well-equipped to use it confidently in your MATLAB projects.
Note: While `strsplit` is a widely used and functional tool, it’s important to be aware that MATLAB introduced the `split` function in R2016b. `split` works directly on string arrays and cell arrays of character vectors and is often the more convenient choice for new code. It does not, however, accept regular-expression delimiters or collapse consecutive delimiters the way `strsplit` does, so understanding `strsplit` remains important both for those features and for working with legacy code. We will explicitly compare `strsplit` and `split` later in this guide.
What is `strsplit`? The Core Concept
At its heart, `strsplit` performs a simple yet essential task: it splits a single piece of text (a character vector or a string scalar) into smaller substrings at each occurrence of the specified delimiters.
Imagine you have a string like `'apple,banana,orange'`. You want to separate the individual fruit names. The comma (`,`) acts as the delimiter. Using `strsplit`, you can instruct MATLAB to break the string apart wherever it encounters a comma. The result is a collection of the individual substrings: `'apple'`, `'banana'`, and `'orange'`.
Key Characteristics:
- Input: `strsplit` operates on a single character vector (the traditional MATLAB text type) or a string scalar. It does not accept a cell array as its first argument; to split several pieces of text, call it once per element (for example with `cellfun`).
- Delimiter: You provide the character(s) or pattern(s) that mark the boundaries between the substrings you want to extract.
- Output: For character vector input, the function returns a 1-by-N cell array of character vectors, one substring per cell (string scalar input is documented as producing a string array instead). The delimiters themselves are not included in the substrings; they are available through an optional second output.
This process is fundamental for parsing data where fields or items are separated by consistent markers.
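As a minimal sketch of the idea described above (the variable names are illustrative only), the fruit example can be run directly. The second output, which lists the delimiters that were matched, is optional.

```matlab
% Split a comma-separated character vector into its parts.
fruitList = 'apple,banana,orange';
fruits = strsplit(fruitList, ',');

% The optional second output holds the delimiters that were matched.
[~, matched] = strsplit(fruitList, ',');

disp(fruits);           % three cells: 'apple', 'banana', 'orange'
disp(numel(matched));   % 2, because two commas were found
```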
Basic Syntax and Usage
The most basic syntax for `strsplit` is straightforward:
```matlab
C = strsplit(str)
C = strsplit(str, Delimiter)
```
Let’s break down these forms:
- `C = strsplit(str)`:
  - `str`: The input character vector or string scalar you want to split.
  - Behavior: When no delimiter is specified, `strsplit` splits `str` at whitespace characters. Consecutive whitespace characters are collapsed, so multiple spaces or tabs between words act as a single delimiter. Leading or trailing whitespace still counts as a delimiter, however, and produces an empty element at the start or end of the output, so trim the input with `strtrim` first if you do not want those.
  - `C`: The output cell array containing the resulting substrings.

Example:
```matlab
mySentence = '  This is   a sample sentence.  ';
words = strsplit(strtrim(mySentence));   % trim first to avoid empty end elements
disp(words);
% Output (1×5 cell array):
%     {'This'}    {'is'}    {'a'}    {'sample'}    {'sentence.'}
```
Notice how the multiple spaces between words collapse into a single split point, and how trimming first keeps empty elements out of the result.
- `C = strsplit(str, Delimiter)`:
  - `str`: The input character vector or string scalar.
  - `Delimiter`: The character(s) or pattern(s) to use for splitting. It can be:
    - A single character vector (e.g., `','`, `'; '`).
    - A cell array of character vectors (e.g., `{',', ';', ':'}`), which splits at any of the listed delimiters.
  - Behavior: Splits `str` wherever `Delimiter` occurs. By default (`'CollapseDelimiters'` is `true`), a run of consecutive delimiters is treated as one split point. A delimiter at the very beginning or end of the text still produces an empty character vector (`''`) at that end of the output. Pass `'CollapseDelimiters', false` to get an empty element for every gap between consecutive delimiters.
  - `C`: The output cell array of substrings.
Example (Single Delimiter):
```matlab
csvLine = 'value1,value2,value3,value4';
fields = strsplit(csvLine, ',');
disp(fields);
% Output (1×4 cell array):
%     {'value1'}    {'value2'}    {'value3'}    {'value4'}
```
Example (Multiple Delimiters in a Cell Array):
```matlab
mixedData = 'itemA;itemB,itemC:itemD';
items = strsplit(mixedData, {',', ';', ':'});
disp(items);
% Output (1×4 cell array):
%     {'itemA'}    {'itemB'}    {'itemC'}    {'itemD'}
```
Example (Consecutive Delimiters and Leading/Trailing Delimiters):
```matlab
dataWithGaps = ',field1,,field3,';
% Turn collapsing off so every delimiter produces its own split point
splitResult = strsplit(dataWithGaps, ',', 'CollapseDelimiters', false);
disp(splitResult);
% Output (1×5 cell array; '' marks an empty character vector):
%     {''}    {'field1'}    {''}    {'field3'}    {''}
```
Observe the empty character vectors (`''`) generated by the leading comma, the two consecutive commas, and the trailing comma. With `'CollapseDelimiters', false`, this behavior preserves positional information in the data.
Understanding Delimiters: The Key to Precise Splitting
The choice and specification of the delimiter are critical to achieving the desired split. `strsplit` offers flexibility here:
1. Single Character Vector Delimiter
This is the most common scenario. You provide a character vector (like `','`, `'-'`, `'|'`, or even multi-character sequences like `'; '` or `'==='`) that acts as the separator.
```matlab
pathStr = 'C:\Users\Documents\MATLAB\myfile.m';
pathComponents = strsplit(pathStr, '\'); % Use backslash as delimiter
disp(pathComponents);
% Output (1×5 cell array):
%     {'C:'}    {'Users'}    {'Documents'}    {'MATLAB'}    {'myfile.m'}

configLine = 'parameter_name===value';
parts = strsplit(configLine, '===');
disp(parts);
% Output (1×2 cell array):
%     {'parameter_name'}    {'value'}
```
2. Cell Array of Character Vector Delimiters
This allows you to split the string using any of the delimiters provided in the cell array. This is useful when data might use inconsistent separators.
```matlab
logEntry = 'Error:Timestamp=12345;Source=ModuleA,Details=FileNotFound';
logParts = strsplit(logEntry, {':', '=', ';', ','});
disp(logParts);
% Output (1×7 cell array):
%     {'Error'}    {'Timestamp'}    {'12345'}    {'Source'}    {'ModuleA'}    {'Details'}    {'FileNotFound'}
```
In this example, the string was split wherever `:`, `=`, `;`, or `,` occurred.
3. Whitespace Delimiter (Default Behavior)
As seen earlier, if you call `strsplit(str)` without specifying a delimiter, it defaults to splitting at whitespace: space (`' '`), tab (`'\t'`), newline (`'\n'`), carriage return (`'\r'`), vertical tab (`'\v'`), and form feed (`'\f'`).
Key points about the default whitespace splitting:
- Collapses: Consecutive whitespace characters are treated as a single delimiter.
- Does not trim: Leading or trailing whitespace in the input yields an empty element at the start or end of the output cell array. Apply `strtrim` to the input first if that is unwanted.
```matlab
multiLineStr = sprintf('Line 1 \t has words\n Line 2 \r also');
splitByWhitespace = strsplit(multiLineStr);
disp(splitByWhitespace);
% Output (1×7 cell array):
%     {'Line'}    {'1'}    {'has'}    {'words'}    {'Line'}    {'2'}    {'also'}
```
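The effect of leading and trailing whitespace is easy to check. The following is a small sketch (variable names are illustrative) that compares splitting a padded character vector directly with trimming it first.

```matlab
padded = '  alpha  beta ';

raw     = strsplit(padded);            % leading/trailing blanks act as delimiters
trimmed = strsplit(strtrim(padded));   % trim first, then split

fprintf('raw has %d elements, trimmed has %d\n', numel(raw), numel(trimmed));
% raw is {'', 'alpha', 'beta', ''}; trimmed is {'alpha', 'beta'}
```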
4. Regular Expression Delimiters (Advanced)
`strsplit` can also use regular expressions as delimiters, which enables more flexible pattern-based splitting. This requires the Name-Value pair `'DelimiterType', 'RegularExpression'`. We will cover this in detail in the “Controlling Splitting Behavior” section.
Important Note on Empty Delimiters: An empty character vector (`''`) or an empty cell array (`{}`) is not a meaningful delimiter, and it is not a way to split between every character. Depending on the release it may raise an error, so avoid passing one.
Controlling Splitting Behavior: Name-Value Pair Options
Beyond the basic syntax, `strsplit` allows fine-grained control over its operation using Name-Value pair arguments. The full syntax looks like this:
```matlab
C = strsplit(str, Delimiter, Name, Value, ...)
```
Where `Name` is the name of an option (as a character vector) and `Value` is its corresponding setting. Let’s explore the key options:
1. `'DelimiterType'`
This option tells `strsplit` how to interpret the `Delimiter` argument. It has two documented values:
- `'DelimiterType', 'Simple'` (default): Interprets the delimiter as literal text. The only special handling is for escape sequences such as `\n`, `\t`, and `\\`, which are translated to the characters they represent.
- `'DelimiterType', 'RegularExpression'`: Interprets the delimiter as a regular expression pattern. This enables splitting based on complex patterns rather than just fixed text.
Example using Regular Expression: Split by one or more digits.
```matlab
dataString = 'Part123Next45Another6End';
% The regex '\d+' means "one or more digits"
parts = strsplit(dataString, '\d+', 'DelimiterType', 'RegularExpression');
disp(parts);
% Output (1×4 cell array):
%     {'Part'}    {'Next'}    {'Another'}    {'End'}
```
Without `'DelimiterType', 'RegularExpression'`, `strsplit` would treat the delimiter as literal text rather than as a pattern, so the string would not be split at the digits.
When to specify `DelimiterType`:
* You must specify `'DelimiterType', 'RegularExpression'` when using a regex pattern as the delimiter.
* For literal delimiters, nothing needs to be specified. The default `'Simple'` mode does not infer anything from the delimiter: it treats characters that are special in regex (like `.`, `*`, and `+`) literally, so they need no escaping. The backslash is the one character with a special role in this mode, because it introduces escape sequences; a literal backslash can be written as `\\`.
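To make the difference between literal and regex delimiters concrete, here is a short sketch (the sample text and variable names are made up for illustration). It splits the same text at a literal dot in both modes.

```matlab
tag = 'release.2024.b';

% Simple mode (default): the dot is literal text.
simpleParts = strsplit(tag, '.');

% Regular-expression mode: the dot must be escaped to stay literal.
regexParts = strsplit(tag, '\.', 'DelimiterType', 'RegularExpression');

disp(isequal(simpleParts, regexParts));   % 1: both give 'release', '2024', 'b'
```

An unescaped `'.'` in regular-expression mode would match every character, which is rarely what is intended.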
2. 'CollapseDelimiters'
This option controls how `strsplit` handles sequences of two or more consecutive delimiters.
- `'CollapseDelimiters', true` (default): Treats consecutive delimiters as a single delimiter. This prevents the creation of empty character vectors (`''`) between them in the output.
- `'CollapseDelimiters', false`: Treats each delimiter occurrence independently. Consecutive delimiters will result in one or more empty character vectors (`''`) in the output cell array.
Note: The default of `true` applies both to whitespace splitting (`strsplit(str)`) and to calls with an explicit delimiter.
Example:
```matlab
rawData = 'A,,B, C ,D'; % Note double comma and space around C
% Default behavior (CollapseDelimiters = true)
splitDefault = strsplit(rawData, ',');
disp('Default (Collapse = true):');
disp(splitDefault);
% Output (1×4 cell array):
%     {'A'}    {'B'}    {' C '}    {'D'}    % No empty element, but spaces around C remain
% Keep every delimiter as its own split point
splitExpanded = strsplit(rawData, ',', 'CollapseDelimiters', false);
disp('Not collapsed (Collapse = false):');
disp(splitExpanded);
% Output (1×5 cell array):
%     {'A'}    {''}    {'B'}    {' C '}    {'D'}    % Note empty element and preserved spaces
```
In the default (collapsed) version, the empty element between the two commas in `,,` does not appear. However, note that `'CollapseDelimiters'` only affects *consecutive occurrences of the specified delimiter(s)*. It does *not* trim whitespace from the resulting substrings (like the spaces around `' C '`). Trimming whitespace requires a separate step (e.g., using `strtrim`).
3. Quoted Fields: No Built-In Option
`strsplit` has only the two options above. It has no quote-aware mode (there is no `'PreserveQuotes'` option), so a delimiter inside single (`'`) or double (`"`) quotation marks is treated like any other delimiter.
Example: Parsing a CSV line where one field contains a comma.
```matlab
csvLineWithQuote = 'Field1,"Field 2, contains comma",Field3';
% strsplit splits at every comma, including the one inside the quotes
naiveSplit = strsplit(csvLineWithQuote, ',');
disp('strsplit result:');
disp(naiveSplit);
% Output (1×4 cell array):
%     {'Field1'}    {'"Field 2'}    {' contains comma"'}    {'Field3'}    % Incorrect split
% One workaround: match either a quoted run or a run of non-commas
fields = regexp(csvLineWithQuote, '"[^"]*"|[^,]+', 'match');
fields = regexprep(fields, '^"|"$', '');   % strip the surrounding quotes
disp('regexp result:');
disp(fields);
% Output (1×3 cell array):
%     {'Field1'}    {'Field 2, contains comma'}    {'Field3'}    % Correct split, quotes removed
```
Caveats for this workaround:
* It only recognizes matching pairs of unnested double quotes.
* It drops empty fields, because `[^,]+` needs at least one character.
* It doesn’t handle escaped quotes within quoted sections (e.g., `"He said ""Hello"""`). More complex parsing is better left to dedicated functions like `textscan` or `readtable`.
Understanding the Output: The Cell Array
`strsplit` consistently returns its substrings as a 1-by-N array.
- Cell Array: A MATLAB data structure that can hold different types and sizes of data in its elements (cells). For character vector input, each cell of the `strsplit` output holds one substring as a character vector.
- `1-by-N`: The output is always a row, where `N` is the number of substrings generated by the split.
- Element Type: With character vector input, each cell holds a standard MATLAB character vector (e.g., `'hello'`). With string scalar input, the output is documented as a string array instead. (The newer `split` function follows a similar rule: string input gives a string array and character vector input gives a cell array, but the result is arranged as a column.)
Accessing the Results:
You access the individual substrings using standard cell array indexing with curly braces `{}`.
```matlab
myString = 'first:second:third';
resultCell = strsplit(myString, ':');
firstElement = resultCell{1}; % Access the first substring
secondElement = resultCell{2}; % Access the second substring
numElements = numel(resultCell); % Get the number of substrings
fprintf('First element: %s\n', firstElement);
fprintf('Second element: %s\n', secondElement);
fprintf('Total elements: %d\n', numElements);
% Output:
% First element: first
% Second element: second
% Total elements: 3
```
Empty Character Vectors (`''`) in Output:
As previously highlighted, empty character vectors can appear in the output under certain conditions:
- Delimiter at the Start: `strsplit(',a,b', ',')` -> `{'', 'a', 'b'}`
- Delimiter at the End: `strsplit('a,b,', ',')` -> `{'a', 'b', ''}`
- Consecutive Delimiters (only when `CollapseDelimiters` is `false`): `strsplit('a,,b', ',', 'CollapseDelimiters', false)` -> `{'a', '', 'b'}`
Understanding when and why these empty elements appear is crucial for correctly interpreting the split results, especially when dealing with potentially missing data or preserving positional information.
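When the empty elements carry no meaning, they can be filtered out after the split. A minimal sketch, assuming the input below (chosen only for illustration):

```matlab
parts = strsplit(',a,,b,', ',', 'CollapseDelimiters', false);

% Keep only the cells that actually contain text.
nonEmpty = parts(~cellfun(@isempty, parts));

disp(numel(parts));   % 5: '', 'a', '', 'b', ''
disp(nonEmpty);       % 'a' and 'b'
```

Filtering discards positional information, so it suits lists of items better than fixed-column records.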
Comprehensive Examples: `strsplit` in Action
Let’s solidify our understanding with more practical examples covering various use cases.
Example 1: Parsing Simple Key-Value Pairs
```matlab
configLine = ' FontSize = 12 ; Color = Blue ; Font = Arial ';
settings = strsplit(configLine, ';'); % Split by semicolon
% Trim whitespace from each resulting part
settings = strtrim(settings);
disp('Initial Split:');
disp(settings);
% Further process each key-value pair
numSettings = numel(settings);
configData = struct(); % Store results in a struct
for i = 1:numSettings
    if isempty(settings{i}) % Skip empty parts (e.g., from a trailing ';')
        continue;
    end
    pair = strsplit(settings{i}, '='); % Split key and value by '='
    if numel(pair) == 2
        key = strtrim(pair{1});
        value = strtrim(pair{2});
        % Basic type conversion attempt (optional)
        numValue = str2double(value);
        if ~isnan(numValue)
            configData.(key) = numValue; % Store as number if possible
        else
            configData.(key) = value; % Store as text
        end
    else
        warning('Skipping malformed setting: %s', settings{i});
    end
end
disp('Parsed Configuration Structure:');
disp(configData);
% Output:
% Initial Split:
%     {'FontSize = 12'}    {'Color = Blue'}    {'Font = Arial'}
%
% Parsed Configuration Structure:
%     FontSize: 12
%        Color: 'Blue'
%         Font: 'Arial'
```
This example demonstrates a common pattern: initial splitting by a primary delimiter (`;`), followed by trimming whitespace (`strtrim`), and then further splitting each part by a secondary delimiter (`=`). Error handling (checking `numel(pair)`) is also included.
Example 2: Splitting File Paths (Cross-Platform)
File paths can use different separators (`\` on Windows, `/` on Unix/macOS). `strsplit` can handle this using a cell array delimiter.
```matlab
pathWin = 'C:\Folder\Subfolder\file.txt';
pathUnix = '/home/user/data/file.txt';
componentsWin = strsplit(pathWin, {'\', '/'}); % Use both separators
componentsUnix = strsplit(pathUnix, {'\', '/'});
% A path that starts with a separator (such as an absolute Unix path)
% yields a leading empty element. Drop it if present.
if isempty(componentsWin{1})
    componentsWin(1) = [];
end
if isempty(componentsUnix{1})
    componentsUnix(1) = [];
end
disp('Windows Path Components:');
disp(componentsWin);
disp('Unix Path Components:');
disp(componentsUnix);
% Output:
% Windows Path Components:
%     {'C:'}    {'Folder'}    {'Subfolder'}    {'file.txt'}
%
% Unix Path Components:
%     {'home'}    {'user'}    {'data'}    {'file.txt'}
```
This shows how to use multiple delimiters and includes a simple post-processing step to handle the leading empty element caused by a path that begins with a separator.
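When only the folder, base name, and extension are needed, MATLAB's built-in `fileparts` is a simpler alternative to splitting on separators. A brief sketch with an illustrative path:

```matlab
unixPath = '/home/user/data/file.txt';

% fileparts separates the folder, the base name, and the extension.
[folder, name, ext] = fileparts(unixPath);

fprintf('folder=%s name=%s ext=%s\n', folder, name, ext);
% folder=/home/user/data name=file ext=.txt
```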
Example 3: Handling Data with Quoted Fields
```matlab
csvLine = '101,"Smith, John",Sales,"Level 5, Access All"';
% strsplit is not quote-aware, so match quoted runs or runs of non-commas instead
fields = regexp(csvLine, '"[^"]*"|[^,]+', 'match');
fields = regexprep(fields, '^"|"$', '');   % strip the surrounding quotes
disp('Parsed Fields:');
disp(fields);
% Output:
% Parsed Fields:
%     {'101'}    {'Smith, John'}    {'Sales'}    {'Level 5, Access All'}
```
This demonstrates one way to parse data where delimiters appear within quoted text. `strsplit` alone would also split at the commas inside the quotes. The pattern shown skips empty fields and does not handle escaped quotes; for full CSV support, `textscan` or `readtable` is the safer choice.
Example 4: Splitting based on Multiple Whitespace Types
Using the default behavior for splitting by any whitespace. Note the `sprintf` call: without it, sequences such as `\t` and `\n` inside single quotes are literal backslash-letter pairs, not whitespace.
```matlab
textBlock = sprintf(['First item \t Second item\n' ...
                     'Third item   Fourth item\r\nFifth item']);
items = strsplit(textBlock); % Default whitespace splitting
disp('Items split by whitespace:');
disp(items);
% Output:
% Items split by whitespace (1×10 cell array):
%     {'First'}    {'item'}    {'Second'}    {'item'}    {'Third'}    {'item'}    {'Fourth'}    {'item'}    {'Fifth'}    {'item'}
```
This highlights how the default `strsplit` call conveniently handles various whitespace characters (space, tab `\t`, newline `\n`, carriage return `\r`) and collapses them.
Example 5: Applying `strsplit` to a Cell Array of Character Vectors
`strsplit` accepts only a single character vector or string scalar, so passing a cell array as the first input raises an error. To split every element, apply it element-wise with `cellfun`.
```matlab
listOfStrings = {'a,b,c', 'd,e', 'f,g,h,i'};
delimiter = ',';
% 'UniformOutput', false is required because each result is itself a cell array
splitCells = cellfun(@(s) strsplit(s, delimiter), listOfStrings, ...
                     'UniformOutput', false);
disp('Result of splitting cell array elements:');
disp(splitCells);
% Output:
% Result of splitting cell array elements:
%     {1×3 cell}    {1×2 cell}    {1×4 cell}
%
% Let's inspect the contents:
disp('Contents of splitCells{1}:');
disp(splitCells{1}); % {'a'} {'b'} {'c'}
%
disp('Contents of splitCells{2}:');
disp(splitCells{2}); % {'d'} {'e'}
%
disp('Contents of splitCells{3}:');
disp(splitCells{3}); % {'f'} {'g'} {'h'} {'i'}
```
The output `splitCells` is a cell array where *each cell* contains the result (another cell array) of applying `strsplit` to the corresponding element of the input `listOfStrings`.
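A common follow-up step is converting the split pieces to numbers. The sketch below (with made-up sample data) relies on `str2double` accepting a cell array of character vectors and returning `NaN` for anything it cannot convert.

```matlab
numericLine = '1.5, 2, 3.25, n/a';

tokens = strtrim(strsplit(numericLine, ','));   % split, then trim each piece
values = str2double(tokens);                    % non-numeric text becomes NaN

disp(values);   % 1.5  2  3.25  NaN
```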
Advanced Usage with Regular Expressions
Using `'DelimiterType', 'RegularExpression'` unlocks sophisticated splitting capabilities. Regular expressions (regex) provide a concise and powerful syntax for describing patterns in text.
Why Use Regex with `strsplit`?
- Pattern-Based Splitting: Split by patterns, not just fixed strings (e.g., split by any number, any non-alphanumeric character).
- Complex Delimiters: Define delimiters that have variations (e.g., split by “Error:”, “Warning:”, or “Info:”).
- Contextual Splitting: Split based on context (though more complex context usually involves `regexp` for capturing).
Key Regex Concepts for Delimiters:
- `.`: Matches any single character (in MATLAB this includes newline by default).
- `*`: Matches the previous element zero or more times.
- `+`: Matches the previous element one or more times.
- `?`: Matches the previous element zero or one time.
- `\d`: Matches any digit (`0-9`).
- `\s`: Matches any whitespace character.
- `\w`: Matches any word character (alphanumeric plus `_`).
- `[...]`: Matches any single character within the brackets (e.g., `[abc]` matches ‘a’, ‘b’, or ‘c’).
- `[^...]`: Matches any single character not within the brackets.
- `|`: Acts as an OR operator (e.g., `cat|dog` matches ‘cat’ or ‘dog’).
- `()`: Groups parts of the expression.
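As a small illustration of combining these building blocks (the input text is invented for the example), the pattern `\s*[,;]\s*` treats a comma or semicolon together with any surrounding whitespace as one delimiter, so the pieces come back already trimmed.

```matlab
raw = 'red , green;blue ;  yellow';

% A comma or semicolon, plus any whitespace on either side, is one delimiter.
colors = strsplit(raw, '\s*[,;]\s*', 'DelimiterType', 'RegularExpression');

disp(colors);   % 'red', 'green', 'blue', 'yellow'
```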
Example 6: Splitting by Any Non-Alphanumeric Character
```matlab
messyData = 'Value1;Value2#Value3@Value4/Value5';
% Regex '[^\w]+' means "one or more characters that are NOT word characters"
% Word characters (\w) are letters, numbers, and underscore.
cleanParts = strsplit(messyData, '[^\w]+', 'DelimiterType', 'RegularExpression');
disp(cleanParts);
% Output (1×5 cell array):
%     {'Value1'}    {'Value2'}    {'Value3'}    {'Value4'}    {'Value5'}
```
Example 7: Splitting by Specific Words (Case-Insensitive)
```matlab
text = 'Start section ALPHA then continue section BRAVO finally end';
% Delimiter: any of the listed words plus surrounding whitespace.
% \s* (rather than \s+) lets the adjacent delimiters 'BRAVO' and 'finally'
% match back to back; the default CollapseDelimiters then merges them.
delimiterPattern = '\s*(ALPHA|BRAVO|finally)\s*';
sections = strsplit(text, delimiterPattern, 'DelimiterType', 'RegularExpression');
disp('Case-Sensitive Split:');
disp(sections);
% Output:
% Case-Sensitive Split (1×3 cell array):
%     {'Start section'}    {'then continue section'}    {'end'}
% If case-insensitivity is needed, spelling out both cases in the pattern works:
delimiterPatternCI = '\s*([Aa][Ll][Pp][Hh][Aa]|[Bb][Rr][Aa][Vv][Oo]|[Ff][Ii][Nn][Aa][Ll][Ll][Yy])\s*';
sectionsCI = strsplit(text, delimiterPatternCI, 'DelimiterType', 'RegularExpression');
disp('Case-Insensitive Split (Manual Pattern):');
disp(sectionsCI);
% Output:
% Case-Insensitive Split (Manual Pattern, 1×3 cell array):
%     {'Start section'}    {'then continue section'}    {'end'}
```
A note on the `(?i)` flag: MATLAB regular expressions support an inline `(?i)` flag for case-insensitive matching, but whether it behaves as expected inside a `strsplit` delimiter may depend on the release, so test it before relying on it. Building the pattern to explicitly include case variations (like `[Aa]`) or using the `regexp` function with its `'ignorecase'` option is more predictable. The example above demonstrates the manual pattern approach.
Example 8: Splitting and Keeping Delimiters
The substrings returned by `strsplit` never contain the delimiters. If you need them, request the second output, which lists the matched delimiters in order. `regexp` with the `'split'` and `'match'` options provides the same pair of outputs. Note that putting capturing parentheses `()` around the delimiter pattern does not insert the delimiters into the split result.
```matlab
text = 'item1DELIMitem2DELIMitem3';
delimiter = 'DELIM';
% strsplit returns the matched delimiters as a second output
[parts, matches] = strsplit(text, delimiter);
% regexp offers the same pair of outputs via the 'split' and 'match' options
[partsRe, matchesRe] = regexp(text, delimiter, 'split', 'match');
% Interleave parts and delimiters (parts has one more element than matches)
combined = cell(1, numel(parts) + numel(matches));
combined(1:2:end) = parts;
combined(2:2:end) = matches;
disp('Parts with delimiters kept:');
disp(combined);
% Output (1×5 cell array):
%     {'item1'}    {'DELIM'}    {'item2'}    {'DELIM'}    {'item3'}
```
This example highlights that the delimiters are not lost: both `strsplit` and `regexp` can return them separately. `regexp` offers more flexibility when more complex pattern logic (like lookarounds or captured tokens) is needed.
Handling Edge Cases and Potential Pitfalls
When using `strsplit`, be mindful of these scenarios:
- Empty Input (`''`): The output always has one more element than the number of matched delimiters. An empty input contains no delimiters, so it is expected to produce a 1×1 cell array holding an empty character vector rather than an empty cell array. Confirm the behavior in your release before relying on it.
```matlab
C = strsplit('', ',');
disp(numel(C));        % expected: 1
disp(isempty(C{1}));   % expected: 1 (true)
```
- Delimiter Not Found: If the specified delimiter(s) do not exist in the input string, `strsplit` returns a 1×1 cell array containing the original, unmodified input string.
```matlab
result = strsplit('abcde', 'X');
disp(result); % Output: {'abcde'}
```
- Unsupported Input Type: Providing numeric input, a cell array, or other data types as the first argument results in an error.
- Regex Syntax Errors: If using `'DelimiterType', 'RegularExpression'`, an invalid regex pattern will cause an error. Test your regex patterns carefully.
- Performance with Very Large Strings or Complex Regex: For extremely large strings or computationally intensive regular expressions used as delimiters, `strsplit` (and string manipulation in general) can become a bottleneck. Consider alternatives or optimization strategies if performance is critical (see next sections).
- Whitespace Handling: Remember that `CollapseDelimiters` defaults to `true` in every call, and that neither calling form trims. Leading or trailing delimiters (including whitespace) produce empty end elements, and whitespace next to a non-whitespace delimiter stays in the substrings. Use `strtrim` on the input or on the results as needed.
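One way to see what collapsing can silently discard is a round trip through `strjoin`. This is only an illustrative sketch with a made-up record: with collapsing turned off the original text is recovered exactly, while the default loses the empty field.

```matlab
original = 'a,,b,';

% Without collapsing, every field (including empty ones) survives.
exactParts = strsplit(original, ',', 'CollapseDelimiters', false);
rebuiltExact = strjoin(exactParts, ',');

% With the default (collapsing), the empty middle field disappears.
collapsedParts = strsplit(original, ',');
rebuiltCollapsed = strjoin(collapsedParts, ',');

disp(rebuiltExact);       % a,,b,
disp(rebuiltCollapsed);   % a,b,
```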
Performance Considerations
While `strsplit` is generally efficient for common tasks, performance can vary based on several factors:
- Input String Size: Larger strings naturally take longer to process.
- Number of Delimiters: More occurrences of the delimiter mean more splitting operations.
- Delimiter Complexity:
  - Simple character delimiters are usually fastest.
  - Cell arrays of delimiters add some overhead.
  - Regular expressions are the most powerful but can also be the slowest, especially complex patterns involving backtracking.
- `CollapseDelimiters` Option: The setting changes the pattern that is searched for, so it may have a small effect on run time.
- MATLAB Version and Engine: Performance characteristics can change between MATLAB releases due to internal optimizations.
`strsplit` vs. `split` Performance:
The newer `split` function (introduced in R2016b) operates on whole string arrays and cell arrays in one call, which avoids explicit loops over many inputs and is often faster for large datasets. The relative speed depends on the data and the release, and `split` cannot replace `strsplit` when a regular-expression delimiter is needed.
Recommendation: For performance-critical code in R2016b or later that uses literal delimiters, try `split` and measure. If you are bound to older MATLAB versions or maintaining legacy code, `strsplit` remains the tool to use, but be mindful of potential bottlenecks with very large inputs or complex regex. Profiling your code (`profile viewer`) is always recommended to identify actual performance issues.
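A rough way to measure rather than guess is a `tic`/`toc` comparison. The sketch below uses an arbitrary test string and compares `strsplit` with `regexp(..., 'split')`. Timings vary by machine and release, so no expected numbers are given.

```matlab
% Build a long comma-separated test string (size chosen arbitrarily).
bigLine = repmat('field,', 1, 20000);

tic;
viaStrsplit = strsplit(bigLine, ',', 'CollapseDelimiters', false);
tStrsplit = toc;

tic;
viaRegexp = regexp(bigLine, ',', 'split');
tRegexp = toc;

fprintf('strsplit: %.4f s, regexp split: %.4f s\n', tStrsplit, tRegexp);
```

Single `tic`/`toc` readings are noisy; repeating the measurement several times gives a more reliable picture.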
Comparison with Alternatives: `split`, `regexp`, `textscan`
MATLAB offers several functions for breaking down strings. Choosing the right one depends on the specific task:
Function | Primary Use | Input Type(s) | Output Type(s) | Delimiter Handling | Key Features / Differences | Recommended? |
---|---|---|---|---|---|---|
`strsplit` | Split one piece of text by delimiters (removed) | Char vector, String scalar | Cell array of char vectors (string array for string input) | Literal text, Cell array of literals, Regex (via option) | Collapses consecutive delimiters by default, Optional second output with the matched delimiters. | Legacy code, Regex delimiters |
`split` | Modern split of strings/char vectors (removed) | String array, Char vector, Cell array | String array (string input), Cell array (char/cell input) | Literal text (no regular expressions) | Handles whole arrays in one call, Does not collapse consecutive delimiters, Returns a column for scalar input. | Yes (R2016b+) |
`regexp` | Complex pattern matching and extraction | Char vector, Cell array, String array | Cell array, Numeric array, Struct | Defined by regex pattern (matches returned via `'match'`) | Most powerful pattern matching, Can capture groups, split, and match (use `regexprep` to replace). Steeper learning curve. | For complex patterns, capture groups |
`textscan` | Read formatted data from text or file ID | Char vector, File ID | Cell array (columns of data) | Whitespace, Specified delimiters, Format specifiers | Designed for structured text/CSV, Handles data type conversion, Efficient for files. | For reading structured files/text |
`splitlines` | Split text into lines based on newline chars | String array, Char vector | String array or cell array | Newline characters (`\n`, `\r\n`, `\r`) | Specifically for splitting text into lines, simpler than `strsplit` for this task. | For line splitting |
`readtable` | Read tabular data from a file | File name | Table | Auto-detected or specified (CSV, TSV, etc.) | High-level function for tabular data, Handles headers, types, missing data. | For tabular data files (CSV etc.) |
Summary of Choices:
- Need to split strings or char vectors at literal delimiters in modern MATLAB (R2016b+)? Use `split`. It is the function recommended for new code.
- Need a regular-expression delimiter? Use `strsplit` with `'DelimiterType', 'RegularExpression'`, or `regexp` with the `'split'` option.
- Working with legacy code or MATLAB versions before R2016b? Use `strsplit`.
- Need to split by complex patterns AND capture parts of the string or the delimiters themselves? Use `regexp`.
- Reading structured data from a file (like CSV) or a large text block with known formats, and need type conversion? Use `textscan` or `readtable`.
- Just need to split text into separate lines? Use `splitlines`.
- Need to parse quoted fields? Neither `strsplit` nor `split` treats quotes specially, so a delimiter inside quotes still splits the field. Use `textscan` with the `%q` conversion for simple cases; `readtable` is robust for complex CSVs.
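Two of the less common choices above can be sketched briefly. The sample text and the record layout are invented for illustration; the calls themselves (`splitlines`, and `textscan` with the `%q` conversion and the `'Delimiter'` option) are standard MATLAB.

```matlab
% Line splitting: splitlines recognizes \n, \r\n and \r as line endings.
txt = sprintf('first\nsecond\r\nthird');
textLines = splitlines(txt);          % 3x1 cell array for char input

% Quoted fields: the %q conversion reads a double-quoted field as one
% value, so the comma inside the quotes is not treated as a delimiter.
row = 'id42,"Smith, Jane",engineer';  % illustrative record
scanned = textscan(row, '%q', 'Delimiter', ',');
rowFields = scanned{1};               % 3x1 cell; quotes are stripped
```

For whole files with headers and mixed column types, `readtable` is usually the less error-prone route.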
Best Practices and Tips for Using strsplit
- Know `split`: If using R2016b or later, strongly consider the newer `split` function for new code. Keep `strsplit` for legacy code or for behavior that `split` does not offer directly, such as regular-expression delimiters or collapsing of consecutive delimiters.
- Be Explicit with Regex: When using regular expressions, always specify `'DelimiterType', 'RegularExpression'`. Without it, the delimiter is treated as literal text and the pattern will not be interpreted as intended.
- Handle Whitespace: With no delimiter argument, `strsplit` splits at whitespace. With an explicit delimiter, whitespace around each field is kept. Use `strtrim` on the output cells if needed after splitting with an explicit delimiter.
- Understand `CollapseDelimiters`: The default is `true`, which treats consecutive delimiters as one and so drops the empty fields between them. Set `'CollapseDelimiters', false` when positional empty fields are meaningful.
- Don't Expect Quote Handling: `strsplit` does not treat quoted text specially, so a delimiter inside quotes still splits the field. It is not a general-purpose solution for complex parsing. For robust CSV parsing, consider `readtable`.
- Check Output Size: After splitting, check `numel()` of the resulting cell array, especially if subsequent code assumes a fixed number of elements. This helps catch errors from unexpected input formats.
- Consider `regexp` for Complexity: If your splitting logic becomes very complex (e.g., needing lookarounds, keeping delimiters, intricate conditional splitting), `regexp` is likely a more appropriate and powerful tool.
- Cell Array Output: Remember that `strsplit` returns a cell array for char vector input. Use curly braces `{}` to access the content of each cell.
- Profile Critical Code: If string splitting is part of a performance-sensitive section of your code, use the MATLAB Profiler to measure its impact and compare `strsplit` with `split` or other alternatives if necessary.
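Several of these tips can be combined in one short sketch. The record format (name; value; empty slot; unit) and the error identifier are assumptions made up for the example, not part of any MATLAB API.

```matlab
% Illustrative record; the four-field layout is an assumption.
rec = 'temperature ; 21.5 ;; C';

% Field positions matter here, so turn collapsing off explicitly.
parts = strsplit(rec, ';', 'CollapseDelimiters', false);

% An explicit delimiter leaves the surrounding spaces in place; trim them.
parts = strtrim(parts);

% Guard against unexpected input before indexing into the result.
expectedCount = 4;
if numel(parts) ~= expectedCount
    error('parseRecord:fieldCount', 'Expected %d fields, got %d.', ...
          expectedCount, numel(parts));
end

name  = parts{1};              % curly braces return the char vector itself
value = str2double(parts{2});  % convert the numeric field
unit  = parts{4};

% Regular-expression delimiters must be requested explicitly.
tokens = strsplit('a1b22c', '\d+', 'DelimiterType', 'RegularExpression');
```

Checking `numel(parts)` before indexing turns a malformed record into a clear error message instead of an out-of-range index failure further down.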
Conclusion: Mastering String Segmentation with strsplit
The `strsplit` function is a fundamental tool in the MATLAB arsenal for text manipulation. It provides a straightforward and flexible mechanism for breaking down character vectors into manageable substrings based on specified delimiters. From parsing simple comma-separated data to using regular expressions for complex pattern-based splitting, `strsplit`, understood alongside the newer `split` function, lets users process and extract information from diverse textual sources.
We have explored its basic syntax, the nuances of different delimiter types, and the control offered by the name-value options `'DelimiterType'` and `'CollapseDelimiters'`. We have also worked through practical examples and discussed edge cases, performance, and comparisons with alternatives such as `split`, `regexp`, and `textscan`.
While the modern recommendation often leans towards the `split` function for new development in recent MATLAB versions, a solid grasp of `strsplit` remains invaluable. It equips you to work with existing codebases, understand the evolution of MATLAB's string handling capabilities, and apply the right tool for specific text processing challenges, particularly when you need regular-expression delimiters or delimiter collapsing, or when working in older environments. By mastering `strsplit` and its context within MATLAB's suite of string functions, you significantly enhance your ability to turn raw text data into structured, usable information.