Tokenizing Strings in MATLAB with strtok

Tokenizing Strings in MATLAB with strtok: A Comprehensive Guide

Introduction

String manipulation is a fundamental aspect of programming, and a common task is breaking down a larger string into smaller, meaningful units called “tokens.” This process, known as tokenization, is crucial for tasks such as parsing data from files, processing user input, analyzing text, and implementing command-line interfaces. MATLAB provides several tools for string manipulation, and one of the most frequently used functions for tokenization is strtok.

This article provides a comprehensive guide to using strtok in MATLAB. We’ll cover its syntax, behavior, common use cases, limitations, and alternative approaches. We’ll also delve into advanced techniques and demonstrate practical examples to solidify your understanding.

1. Basic Syntax and Functionality of strtok

The strtok function in MATLAB has two primary forms:

[token, remainder] = strtok(str, delimiter): This is the most common form. It takes two input arguments:
- str: The input text that you want to tokenize.
- delimiter: (Optional) A string or character vector listing the characters that separate the tokens. Each character in it is a delimiter in its own right. If omitted, strtok uses whitespace characters (space, tab, newline, carriage return, vertical tab, form feed) as the default delimiters.

It returns two output arguments:
- token: The first token found in the text, up to (but not including) the first delimiter character that follows it. Leading delimiters are skipped.
- remainder: The rest of the text, starting with the delimiter that ended the token. If no delimiter follows the token, token contains the entire input (minus any leading delimiters) and remainder is empty.

token = strtok(str, delimiter): This form returns only the first token and discards the remainder. It is less commonly used, because the remainder is what makes iterative tokenization possible (discussed later).
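
To make the description above concrete, here is a minimal sketch that reproduces the scan with basic indexing and compares the result with strtok. The names manualToken and manualRemainder are hypothetical, and this is an illustration of the behavior described above, not MATLAB's actual implementation.

```matlab
% Sketch of the scan that strtok performs, written with basic indexing.
str = '  apple,banana';
delimiters = ' ,';

isDelim = ismember(str, delimiters);   % true where a character is a delimiter
first = find(~isDelim, 1);             % index of the first token character
if isempty(first)
    % The text holds only delimiters (or nothing): there is no token
    manualToken = '';
    manualRemainder = '';
else
    stop = find(isDelim(first:end), 1) + first - 1;  % first delimiter after the token
    if isempty(stop)
        stop = numel(str) + 1;         % no delimiter follows the token
    end
    manualToken = str(first:stop-1);
    manualRemainder = str(stop:end);   % starts with the delimiter that ended the token
end

[token, remainder] = strtok(str, delimiters);
disp(strcmp(manualToken, token) && strcmp(manualRemainder, remainder));  % true when both agree
```

Here both approaches give the token 'apple' and the remainder ',banana'.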

1.1. Simple Examples

Let’s start with some basic examples to illustrate how strtok works:

```matlab
% Example 1: Using the default delimiter (whitespace)
str = 'This is a test string.';
[token, remainder] = strtok(str);
disp(['Token: ', token]);         % Output: Token: This
disp(['Remainder: ', remainder]); % Output: Remainder:  is a test string.
                                  % (the remainder keeps its leading space)

% Example 2: Using a specific delimiter
str = 'apple,banana,orange';
[token, remainder] = strtok(str, ',');
disp(['Token: ', token]);         % Output: Token: apple
disp(['Remainder: ', remainder]); % Output: Remainder: ,banana,orange

% Example 3: No delimiter found
str = 'oneword';
[token, remainder] = strtok(str, ','); % Delimiter ',' not found
disp(['Token: ', token]);         % Output: Token: oneword
disp(['Remainder: ', remainder]); % Output: Remainder:

% Example 4: Leading and trailing delimiters
str = ' ,apple,banana, ';
% Both the space and the comma must be listed as delimiters. With ',' alone,
% the leading space would itself be returned as the first token.
[token, remainder] = strtok(str, ' ,');
disp(['Token: ', token]);         % Output: Token: apple
disp(['Remainder: ', remainder]); % Output: Remainder: ,banana,  (ends with a space)

% Example 5: Only retrieving the token
str = '123 456 789';
token = strtok(str);
disp(['Token: ', token]);         % Output: Token: 123
```

1.2. Understanding Whitespace as the Default Delimiter

When the delimiter argument is omitted, strtok uses whitespace characters as delimiters. It’s important to understand precisely which characters are considered whitespace:

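- Space (char(32))
- Tab (char(9), written '\t' in sprintf)
- Newline (char(10), '\n')
- Vertical Tab (char(11), '\v')
- Form Feed (char(12), '\f')
- Carriage Return (char(13), '\r')

The escape sequences are only interpreted by functions such as sprintf. In a plain character vector, '\t' is a backslash followed by the letter t.

Importantly, strtok skips every delimiter that precedes the token, so a run of consecutive whitespace characters never produces an empty token. The remainder keeps everything after the token, including any run of whitespace. This is a crucial aspect of strtok's behavior.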

```matlab
str = '   This   is a test.  ';
[token, remainder] = strtok(str);
disp(['Token: ', token]);               % Output: Token: This
disp(['Remainder: [', remainder, ']']); % Output: Remainder: [   is a test.  ]

% Notice how the leading spaces are skipped when finding the token,
% but the spaces after the token are kept in the remainder.
```
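
As a quick check of the default set, the sketch below builds the six whitespace characters from their character codes and confirms that passing them explicitly gives the same result as omitting the delimiter argument. The variable names are hypothetical, and the code range 9 to 13 plus 32 is an assumption based on the list above.

```matlab
% Build the six default delimiters from their character codes
whitespace = char([9:13 32]);   % tab, newline, vertical tab, form feed, carriage return, space

str = sprintf('\t\n  first second\r\n');   % sprintf expands the escape sequences
[tokenDefault, restDefault] = strtok(str);
[tokenExplicit, restExplicit] = strtok(str, whitespace);

disp(tokenDefault);   % first
disp(isequal(tokenDefault, tokenExplicit) && isequal(restDefault, restExplicit));  % true when both calls agree
```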

2. Iterative Tokenization: Processing the Entire String

The real power of strtok comes from its ability to be used iteratively to extract all tokens from a string. This is achieved by repeatedly calling strtok on the remainder obtained from the previous call. The loop continues until the remainder is empty.

Here’s a general pattern for iterative tokenization:

```matlab
str = 'one,two,three,four,five';
delimiter = ',';
remainder = str; % Initialize remainder with the original string

while ~isempty(remainder)
    [token, remainder] = strtok(remainder, delimiter);
    disp(['Token: ', token]);
end

% Output:
% Token: one
% Token: two
% Token: three
% Token: four
% Token: five
```

2.1. Handling Different Delimiters in a Loop

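You can change the delimiter between calls if needed, although a consistent delimiter is more common. Keep in mind that the remainder starts with the delimiter that ended the previous token. If the next call uses a different delimiter, that leading character is no longer skipped and ends up inside the next token, so remove it explicitly before the next call.

```matlab
str = 'one,two;three-four'; % A different delimiter after each field

[token, remainder] = strtok(str, ',');
disp(['Token 1: ', token]); % Output: Token 1: one

% remainder is ',two;three-four'. The leading ',' is not a delimiter for the
% next call, so drop it with remainder(2:end) before switching delimiters.
[token, remainder] = strtok(remainder(2:end), ';');
disp(['Token 2: ', token]); % Output: Token 2: two

[token, remainder] = strtok(remainder(2:end), '-');
disp(['Token 3: ', token]); % Output: Token 3: three

% Only '-four' is left, so the last token is the remainder without its delimiter.
disp(['Token 4: ', remainder(2:end)]); % Output: Token 4: four
```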

2.2. Storing Tokens in a Cell Array

For many applications, you’ll want to store the extracted tokens in a data structure for later use. A cell array is the ideal choice in MATLAB for storing strings of varying lengths.

```matlab
str = 'red,green,blue,yellow,orange';
delimiter = ',';
remainder = str;
tokens = {}; % Initialize an empty cell array

while ~isempty(remainder)
    [token, remainder] = strtok(remainder, delimiter);
    tokens{end+1} = token; % Append the token to the cell array
end

disp(tokens);
% Output:
%     {'red'}    {'green'}    {'blue'}    {'yellow'}    {'orange'}

% Accessing individual tokens:
disp(tokens{1}); % Output: red
disp(tokens{3}); % Output: blue
```
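
Once the tokens are in a cell array, they can be passed to other functions in one call. As a small sketch, assuming the input holds only numeric fields, str2double converts the whole cell array to a numeric vector. The variable values is a hypothetical name.

```matlab
str = '3.5,2,10';
remainder = str;
tokens = {};

while ~isempty(remainder)
    [token, remainder] = strtok(remainder, ',');
    tokens{end+1} = token; % Collect each field as text
end

values = str2double(tokens); % Convert all tokens at once: [3.5 2 10]
disp(sum(values));           % The sum of the three values is 15.5
```

str2double returns NaN for any token that is not a valid number, which makes malformed fields easy to spot with isnan.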

3. Advanced Usage and Considerations

3.1. Handling Empty Tokens

If your string contains consecutive delimiters, strtok will effectively skip over them, as leading delimiters are ignored. This means you won’t get empty tokens. If you need to detect empty tokens, you’ll have to use a different approach (discussed in the Alternatives section).

```matlab
str = 'apple,,banana,,orange';
delimiter = ',';
remainder = str;
tokens = {};

while ~isempty(remainder)
    [token, remainder] = strtok(remainder, delimiter);
    tokens{end+1} = token;
end

disp(tokens); % Output: {'apple'}    {'banana'}    {'orange'}   (No empty tokens)
```

3.2. Multiple Delimiters

You can specify multiple delimiters by providing a string or character vector containing all the desired delimiter characters. strtok will treat any of these characters as a separator.

```matlab
str = '1a2b3c4d5';
delimiters = 'abcd';
[token, remainder] = strtok(str, delimiters);
disp(['Token: ', token]);         % Output: Token: 1
disp(['Remainder: ', remainder]); % Output: Remainder: a2b3c4d5

% Iterative example
str = 'one,two;three-four';
delimiters = ',-;'; % Multiple delimiters
remainder = str;
tokens = {};

while ~isempty(remainder)
    [token, remainder] = strtok(remainder, delimiters);
    tokens{end + 1} = token;
end

disp(tokens); % Output: {'one'}    {'two'}    {'three'}    {'four'}
```

3.3. Delimiter at the End of the String

If the string ends with a delimiter, the last real token is still extracted correctly, but the remainder at that point is the trailing delimiter rather than an empty string. The loop therefore runs one more time and returns an empty token.

```matlab
str = 'apple,banana,orange,';
delimiter = ',';
[token, remainder] = strtok(str, delimiter);
disp(['Token: ', token]);         % Output: Token: apple
disp(['Remainder: ', remainder]); % Output: Remainder: ,banana,orange,

remainder = str;
tokens = {};
while ~isempty(remainder)
    [token, remainder] = strtok(remainder, delimiter);
    tokens{end + 1} = token;
end
disp(tokens); % Output: {'apple'}  {'banana'}  {'orange'}  followed by one empty char cell
```
Notice that the last token is empty. After “orange” is extracted, the remainder is ',', which is not empty, so the loop calls strtok once more. That call finds only a delimiter and returns an empty token and an empty remainder. Leading delimiters do not have this effect, because strtok skips them before the token starts.
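
One way to avoid storing that trailing empty token is to test each token before appending it. A minimal sketch of the guarded loop:

```matlab
str = 'apple,banana,orange,';
delimiter = ',';
remainder = str;
tokens = {};

while ~isempty(remainder)
    [token, remainder] = strtok(remainder, delimiter);
    if ~isempty(token)          % Skip the empty token produced by the trailing ','
        tokens{end + 1} = token;
    end
end

disp(numel(tokens)); % 3
```

The same guard also protects against input that consists only of delimiters.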

3.4 strtok and Character Arrays vs. String Arrays

While this article primarily uses character vectors (single quotes), strtok also accepts MATLAB strings (double quotes). The tokenization rules are identical. The difference is that the outputs are strings too, so join display text with the + operator; square brackets would build a string array instead of one piece of text.

```matlab
str = "This is a string."; % Double quotes create a string
[token, remainder] = strtok(str);
disp("Token: " + token);         % Output: Token: This
disp("Remainder: " + remainder); % Output: Remainder:  is a string.

str = "apple,banana,orange";
delimiter = ",";
[token, remainder] = strtok(str, delimiter);
disp("Token: " + token);         % Output: Token: apple
disp("Remainder: " + remainder); % Output: Remainder: ,banana,orange
```
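
strtok also works element by element when the input is a string array with several entries (or a cell array of character vectors), returning one token and one remainder per element. The sketch below, using made-up sample names, splits each entry at its first space:

```matlab
names = ["Ada Lovelace"; "Alan Turing"; "Grace Hopper"];

[first, rest] = strtok(names);   % One token and one remainder per element
last = strtrim(rest);            % Remove the leading space kept in each remainder

disp(first); % "Ada", "Alan", "Grace" as a 3×1 string array
disp(last);  % "Lovelace", "Turing", "Hopper" as a 3×1 string array
```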

4. Limitations of strtok

While strtok is a versatile function, it has some limitations:

- No Empty Tokens: As mentioned earlier, strtok doesn’t return empty tokens when consecutive delimiters are encountered.
- Single Character Delimiters (Effectively): Although you can pass a string of delimiters, strtok treats each character individually as a delimiter. It does not recognize multi-character delimiters as a single unit. For example, if you try to use ',;' as a delimiter, it will split on either , or ;, not the sequence ',;'.
- Remainder Keeps the Delimiter: The remainder output starts with the delimiter that ended the token. This is harmless when the same delimiters are used on every call, but in some cases you might want the remainder without the delimiter.
- Only Finds the First Token: strtok inherently finds only the first token. Iteration is required to find subsequent tokens.
- No Regular Expressions: strtok doesn’t support regular expressions for more complex pattern matching.
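
The single-character limitation is easy to trip over. The sketch below uses a made-up 'key->value-1' input to show strtok splitting at a lone '-' even though '->' was meant as one separator, followed by one possible workaround that locates the two-character sequence with strfind.

```matlab
str = 'key->value-1';

% '-' and '>' are treated as two independent delimiter characters
[token, remainder] = strtok(str, '->');
% token is 'key', remainder is '->value-1'

[token2, remainder2] = strtok(remainder, '->');
% token2 is 'value', remainder2 is '-1': the lone '-' also ended the token

% Workaround: find the two-character sequence and index around it
idx = strfind(str, '->');      % Start position of each '->' occurrence
left = str(1:idx(1)-1);        % 'key'
right = str(idx(1)+2:end);     % 'value-1'

disp(left);
disp(right);
```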

5. Alternative Tokenization Methods

Because of these limitations, MATLAB offers other functions and techniques for string tokenization that may be more suitable for certain situations.

5.1. split (MATLAB R2016b and later)

The split function, introduced in MATLAB R2016b, is often a more convenient alternative to strtok for basic tokenization. It addresses several of strtok's limitations:

Syntax:
- C = split(str): Splits str at whitespace characters.
- C = split(str, delimiter): Splits str at the specified delimiter. The delimiter can be a string, a character vector, a string array, or a cell array of character vectors.
- C = split(str, delimiter, dim): Splits each element of str along the dimension dim of the output.

Returns All Tokens at Once: split returns a string array when the input is a string array, and a cell array of character vectors when the input is a character vector or a cell array. No loop is needed.
Multi-Character Delimiters: split does support multi-character delimiters.
Empty Tokens: split does not collapse consecutive delimiters, so two adjacent delimiters produce an empty token between them. split has no option to change this and no regular-expression mode. To collapse consecutive delimiters, use strsplit (Section 5.2); for regular-expression delimiters, use regexp (Section 5.3).

```matlab
% Example 1: Basic splitting
str = "This is a test string."; % String input gives a string array output
tokens = split(str);
disp(tokens);
% Output (a 5×1 string array):
% "This"
% "is"
% "a"
% "test"
% "string."

% Example 2: Using a specific delimiter
str = "apple,banana,orange";
tokens = split(str, ",");
disp(tokens);
% Output (a 3×1 string array):
% "apple"
% "banana"
% "orange"

% Example 3: Multi-character delimiter
str = "apple::banana;;orange";
tokens = split(str, ["::", ";;"]); % Split on "::" or ";;"
disp(tokens);
% Output (a 3×1 string array):
% "apple"
% "banana"
% "orange"

% Example 4: Empty tokens are kept (split never collapses delimiters)
str = "apple,,banana,,orange";
tokens = split(str, ",");
disp(tokens);
% Output (a 5×1 string array):
% "apple"
% ""
% "banana"
% ""
% "orange"

% Example 5: Regular expressions require regexp, since split has no such option
str = 'The price is $12.99 today, but $15.50 tomorrow.';
tokens = regexp(str, '\$[0-9]+\.[0-9]+', 'split'); % '\.' matches a literal dot
disp(tokens);
% Output (a 1×3 cell array):
% {'The price is '}    {' today, but '}    {' tomorrow.'}
```

5.2. strsplit (Older MATLAB Versions)

strsplit, introduced in R2013a, predates split and is the closest equivalent in older MATLAB releases. It differs from split in a few ways:

Returns a Cell Array: strsplit returns a 1-by-N cell array of character vectors, not a string array.
Collapses Delimiters by Default: strsplit treats consecutive delimiters as one (like strtok), so empty tokens are dropped unless you pass 'CollapseDelimiters', false.
Regular Expression Delimiters: strsplit accepts 'DelimiterType', 'RegularExpression' to interpret the delimiter as a regular expression.

```matlab
str = 'apple,banana,orange';
tokens = strsplit(str, ',');
disp(tokens); % Output: {'apple'}    {'banana'}    {'orange'}

str = 'apple,,banana,,orange';
tokens = strsplit(str, ',');
disp(tokens); % Output: {'apple'}    {'banana'}    {'orange'}   (No empty tokens by default)
```
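
strsplit does accept name-value options that change this default behavior. The sketch below uses 'CollapseDelimiters' to keep empty tokens and 'DelimiterType' to split on a regular expression; the sample inputs are made up for illustration.

```matlab
% Keep empty tokens by turning off delimiter collapsing
str = 'apple,,banana,,orange';
kept = strsplit(str, ',', 'CollapseDelimiters', false);
disp(numel(kept)); % 5 tokens, the 2nd and 4th are empty

% Treat the delimiter as a regular expression: split on runs of digits
parts = strsplit('a1b22c333d', '\d+', 'DelimiterType', 'RegularExpression');
disp(parts); % {'a'}    {'b'}    {'c'}    {'d'}
```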

5.3. regexp (Regular Expressions)

For advanced tokenization scenarios involving complex patterns, regular expressions are essential. The regexp function provides powerful pattern-matching capabilities.

tokens = regexp(str, expression, 'split'): This is the key form for tokenization. It splits the string str based on the regular expression expression.

```matlab
str = 'The quick brown fox jumps over the lazy dog.';
tokens = regexp(str, '\s+', 'split'); % Split on one or more whitespace characters
disp(tokens);
% Output (a 1×9 cell array):
% {'The'} {'quick'} {'brown'} {'fox'} {'jumps'} {'over'} {'the'} {'lazy'} {'dog.'}

% Example: Extracting numbers from a string
str = 'There are 12 apples, 3 oranges, and 100 bananas.';
numbers = regexp(str, '\d+', 'match'); % 'match' extracts the matching substrings
disp(numbers);
% Output:
% {'12'}    {'3'}    {'100'}

% Example: Splitting on ';' or ',' (the separators are discarded)
str = 'var1=value1;var2=value2,var3=value3';
tokens = regexp(str, '[;,]', 'split');
disp(tokens);
% Output:
% {'var1=value1'}    {'var2=value2'}    {'var3=value3'}

% Example: Keeping each separator next to its field.
% The 'tokens' option returns capture groups, not split results, so the
% result is a cell array of cells, one {field, separator} pair per match.
pairs = regexp(str, '([^;,]+)([;,]|$)', 'tokens');
disp([pairs{:}]); % Flatten the nested cells for display
% Output:
% {'var1=value1'} {';'} {'var2=value2'} {','} {'var3=value3'} and an empty separator

% Cleaner approach when the separators are not needed, using 'match':
tokens = regexp(str, '[^;,]+', 'match');
disp(tokens);
% Output:
% {'var1=value1'}    {'var2=value2'}    {'var3=value3'}
```

Regular expressions are a vast topic, and a full tutorial is beyond the scope of this article. However, the examples above demonstrate their basic use for tokenization. The key is to learn the syntax of regular expressions (metacharacters, quantifiers, character classes, etc.) to construct patterns that match your specific needs. MATLAB’s documentation provides extensive information on regular expressions.
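
As one more illustration of what regular expressions add over strtok, named capture groups can tokenize and label fields in a single call. This is a minimal sketch, assuming key=value pairs separated by ';' or ','; the group names key and value are arbitrary choices.

```matlab
str = 'var1=value1;var2=value2,var3=value3';

% 'names' returns a struct array with one element per match and
% one field per named group
pairs = regexp(str, '(?<key>[^=;,]+)=(?<value>[^;,]*)', 'names');

for k = 1:numel(pairs)
    fprintf('%s -> %s\n', pairs(k).key, pairs(k).value);
end
% Prints:
% var1 -> value1
% var2 -> value2
% var3 -> value3
```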

5.4. textscan (Formatted Data)

For reading data from formatted text files or strings, textscan is often the best choice. It allows you to specify the format of each field, including delimiters and data types.

```matlab
% Example: Reading comma-separated values (CSV) data
str = sprintf('1,John,Doe,30\n2,Jane,Smith,25'); % sprintf turns \n into a real newline
% The commas are handled by the 'Delimiter' option, so the format lists only the field types
data = textscan(str, '%d %s %s %d', 'Delimiter', ',');
disp(data);
% Output: a 1×4 cell array holding a 2×1 int32 column, two 2×1 cell columns
% of text, and another 2×1 int32 column

% Accessing the data:
ids = data{1};        % [1; 2]
firstNames = data{2}; % {'John'; 'Jane'}
lastNames = data{3};  % {'Doe'; 'Smith'}
ages = data{4};       % [30; 25]
```

textscan is highly flexible and can handle various data formats, including strings, numbers, dates, and more. It’s particularly useful when dealing with structured data.

5.5. Manual Looping and Character-by-Character Analysis

In some rare cases, you might need to implement your own tokenization logic using manual looping and character-by-character analysis. This gives you complete control over the process but is generally more complex and error-prone than using built-in functions.

```matlab
str = 'apple,,banana,,orange';
delimiter = ',';
tokens = {};
currentToken = '';

for i = 1:length(str)
    if str(i) == delimiter
        tokens{end+1} = currentToken; % Delimiter reached: store the token, even if empty
        currentToken = '';
    else
        currentToken = [currentToken, str(i)]; % Extend the current token
    end
end
tokens{end+1} = currentToken; % Add the last token

disp(tokens); % Output: {'apple'}    {0×0 char}    {'banana'}    {0×0 char}    {'orange'}
```

This approach keeps the empty tokens and is included to illustrate the underlying mechanism. It is almost always better to use the built-in functions (split, regexp, etc.).

6. Practical Examples and Applications

Let’s look at some practical examples of how tokenization can be used in real-world scenarios.

6.1. Parsing a CSV File

```matlab
% Assume you have a file named 'data.csv' with the following content:
% Name,Age,City
% John,30,New York
% Jane,25,London
% Peter,40,Paris

fileID = fopen('data.csv', 'r');
if fileID == -1
    error('Could not open data.csv');
end
header = fgetl(fileID);              % Read the header line
header_tokens = split(header, ',')'; % Transpose to a 1×3 row of column names

data = {};
textLine = fgetl(fileID);
while ischar(textLine)               % fgetl returns -1 at the end of the file
    tokens = split(textLine, ',')';  % Transpose so each line becomes one row
    data = [data; tokens];           % Append as a new row (all rows need the same field count)
    textLine = fgetl(fileID);
end
fclose(fileID);

disp(header_tokens)
disp(data);

% Accessing specific data:
names = data(:, 1);
ages = str2double(data(:, 2)); % Convert ages to numbers
cities = data(:, 3);

disp(names);
disp(ages);
disp(cities);
```

6.2. Processing User Input (Command-Line Interface)

```matlab
while true
    userInput = input('Enter a command (e.g., "plot x,y"): ', 's');

    if strcmp(userInput, 'exit')
        break;
    end

    tokens = split(userInput); % Cell array: command first, then its arguments
    command = tokens{1};

    switch command
        case 'plot'
            if length(tokens) > 1
                variables = split(tokens{2}, ',');
                % ... (code to plot the variables) ...
                % split returns a column, so transpose it into a row for strjoin
                disp(['Plotting: ', strjoin(variables', ', ')]);
            else
                disp('Error: Missing variables for plot command.');
            end
        case 'help'
            disp('Available commands: plot, exit, help');
        otherwise
            disp(['Unknown command: ', command]);
    end
end
```

6.3. Analyzing Text Data

```matlab
text = 'This is a sample text. It contains some words, and some punctuation.';

% Tokenize the text into words (removing punctuation):
words = regexp(text, '[a-zA-Z]+', 'match'); % Match sequences of letters
disp(words);

% Count the frequency of each word:
wordCounts = containers.Map(); % Use a map to store word counts
for i = 1:length(words)
    word = lower(words{i}); % Convert to lowercase for case-insensitivity
    if isKey(wordCounts, word)
        wordCounts(word) = wordCounts(word) + 1;
    else
        wordCounts(word) = 1;
    end
end

% Display the word counts nicely (keys come back in sorted order):
wordList = keys(wordCounts);
for iKey = 1:numel(wordList)
    fprintf('%s\t:\t%d\n', wordList{iKey}, wordCounts(wordList{iKey}));
end
```
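
The same word counts can also be computed without an explicit map. The following sketch is an alternative to the loop above, not a replacement the article prescribes: unique assigns each word an index, and accumarray tallies how often each index occurs. The variable names are hypothetical.

```matlab
sampleText = 'This is a sample text. It contains some words, and some punctuation.';

% Tokenize into lowercase words, ignoring punctuation
words = lower(regexp(sampleText, '[a-zA-Z]+', 'match'));

[uniqueWords, ~, idx] = unique(words); % idx maps each word to its entry in uniqueWords
counts = accumarray(idx(:), 1);        % Number of occurrences per unique word

for k = 1:numel(uniqueWords)
    fprintf('%s\t:\t%d\n', uniqueWords{k}, counts(k));
end
```

For this sample text there are 12 words and 11 distinct ones, with "some" counted twice.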

7. Conclusion

Tokenization is a fundamental string processing technique, and strtok is a useful tool in MATLAB for this purpose, particularly in older versions or when a simple iterative approach is sufficient. However, for more complex scenarios or when dealing with modern MATLAB code, split and regexp offer greater flexibility, power, and often better readability. Understanding the strengths and limitations of each method allows you to choose the most appropriate tool for the task at hand, ensuring efficient and robust string manipulation in your MATLAB programs. The textscan function provides a structured method for handling formatted data. By mastering these various tokenization approaches, you’ll be well-equipped to handle a wide range of string processing challenges in MATLAB.
