Splitting Up Text in C

A question I come across often is:

How can I split this incoming data into parts?

It's especially asked in conjunction with reading data through serial. So I thought I'd introduce you to two completely different approaches, each with benefits and drawbacks depending on the kind of data you are splitting.

So you have some data coming in through serial, or some similar stream, and you need to cut it up into different parts. The two methods basically consist of either:

  • Read all the data into a string and slice it up (parse it) afterwards
  • Parse the data as it arrives, character by character (or perhaps a few characters at a time)

The first method is best for processing textual data with delimiters. The second is better for numeric data.

For instance, if you have the string:

463.98,328.14,2.49,-7.41

you may choose to read each digit as it arrives and "shift" it into a floating point variable, switching to a new variable every time you receive a ",". That falls into the second category. However, if you have the string:

set pin 3 high

you may decide to instead read the entire string into memory then slice it up on a space character so you can examine each word individually and decide what should be done with it.

The advantage of using the as-it-arrives parsing for the first string is that it is very efficient as regards memory usage. You don't store any more than the current character, and the actual values you are interested in. Another good example of a string format suited to parsing as each character arrives is an NMEA (GPS) string:

$GPGGA,123519,4807.038,N,01131.000,E,1,08,0.9,545.4,M,46.9,M,,*47

There you have regular delimiters (","), specific things you can look out for (N/S, E/W, etc) and you can just read the numbers into the right variables as they arrive.

One good trick for reading those numbers one character at a time is to do two things:

  • Multiply-and-add, and
  • Count the decimal places.

Or, to put it another way: When you see a numeric character ("0" to "9") you simply multiply your current variable by 10 and add the numeric value of the character to it. If you see a decimal point you create a "divisor" variable initialised to 1, and for every digit after that you multiply the divisor by 10 as well. Then at the end of it all you divide your value by the divisor to get the floating point value.

Here's a snippet from some code I wrote recently that does just that. It's from a switch statement which looks at the incoming character (c):

case '.':
    dp = 1;
    break;
case '0':
case '1':
case '2':
case '3':
case '4':
case '5':
case '6':
case '7':
case '8':
case '9':
    data *= 10;
    data += c - '0';
    if (dp > 0) dp *= 10;
    break;

Basically, "dp" is the divisor (named for the decimal point), and "data" is the value, both integers that start out at 0. If a decimal point is seen then "dp" becomes 1. Any digits get appended to the data (data is multiplied by 10, so "23" becomes "230", then the incoming character is converted from its ASCII code to its numeric value and added to it). Once "dp" is non-zero it gets multiplied by 10 for every digit as well.

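Finally, when you find your delimiter, you convert the integers to a floating point value. Assuming "dp" starts at 0, it will still be 0 if no decimal point was seen, so use a divisor of 1 in that case to avoid dividing by zero:

val = data / (float)(dp > 0 ? dp : 1) * sign;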

You may notice the "sign" variable I slipped in there. That's recording the sign (+/-) of the value. Basically it starts out as 1, and if a "-" is seen it changes to -1.
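
To show how those pieces might fit together, here is a minimal sketch of a complete as-it-arrives parser for a line like the comma separated one shown earlier. The names (parse_state, parser_init, parser_feed) and the limit of four values are my own inventions for illustration, not part of the code the snippet came from, and it assumes well-formed input:

```c
#include <stddef.h>

#define MAX_VALUES 4          /* assumption: at most 4 values per line */

/* Hypothetical parser state; all names here are illustrative. */
struct parse_state {
    long   data;               /* digits seen so far, as an integer */
    long   dp;                 /* divisor: 0 until a '.' is seen */
    int    sign;               /* 1 or -1 */
    float  values[MAX_VALUES]; /* completed values */
    size_t count;              /* number of completed values */
};

/* Get ready for the next value on the line. */
static void parser_reset_value(struct parse_state *s)
{
    s->data = 0;
    s->dp = 0;
    s->sign = 1;
}

void parser_init(struct parse_state *s)
{
    s->count = 0;
    parser_reset_value(s);
}

/* Feed one incoming character; ',' or '\n' completes the current value. */
void parser_feed(struct parse_state *s, char c)
{
    switch (c) {
    case '-':
        s->sign = -1;
        break;
    case '.':
        s->dp = 1;
        break;
    case ',':
    case '\n':
        if (s->count < MAX_VALUES) {
            /* No decimal point seen means a divisor of 1, not 0. */
            long divisor = (s->dp > 0) ? s->dp : 1;
            s->values[s->count++] =
                (float)s->data / (float)divisor * (float)s->sign;
        }
        parser_reset_value(s);
        break;
    default:
        if (c >= '0' && c <= '9') {
            s->data = s->data * 10 + (c - '0');   /* multiply-and-add */
            if (s->dp > 0) s->dp *= 10;           /* count decimal places */
        }
        break;
    }
}
```

You would call parser_init() once, then parser_feed() for every character as it comes in from the stream. Note that nothing but the current character and the running values is ever stored.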

But that's about all I want to say about that method for now. I'd like to dedicate most of this post to the full string parsing method.

This method does have one disadvantage, especially on smaller MCUs - it uses more memory. However it does give you the most flexibility.

So you have read in a string of data into memory. First off, don't use the Arduino String class, since it actually makes things harder, and makes a complete mess of your memory (as you can read here). To learn more about how you can properly read data into memory from Serial you can cast your eye over this tutorial, which includes a handy function that works well with this method.

Now you have your string in a character array in memory. Let's parse it. I want to introduce you to a favourite function of mine, but one that needs some deep understanding to be able to use it right:

char *strtok(char *str, const char *delim);

To quote the manual page:

The strtok() function breaks a string into a sequence of zero or more nonempty tokens. On the first call to strtok(), the string to be parsed should be specified in str. In each subsequent call that should parse the same string, str must be NULL.

The delim argument specifies a set of bytes that delimit the tokens in the parsed string. The caller may specify different strings in delim in successive calls that parse the same string.

Each call to strtok() returns a pointer to a null-terminated string containing the next token. This string does not include the delimiting byte. If no more tokens are found, strtok() returns NULL.

A sequence of calls to strtok() that operate on the same string maintains a pointer that determines the point from which to start searching for the next token. The first call to strtok() sets this pointer to point to the first byte of the string. The start of the next token is determined by scanning forward for the next non-delimiter byte in str. If such a byte is found, it is taken as the start of the next token. If no such byte is found, then there are no more tokens, and strtok() returns NULL. (A string that is empty or that contains only delimiters will thus cause strtok() to return NULL on the first call.)

The end of each token is found by scanning forward until either the next delimiter byte is found or until the terminating null byte ('\0') is encountered. If a delimiter byte is found, it is overwritten with a null byte to terminate the current token, and strtok() saves a pointer to the following byte; that pointer will be used as the starting point when searching for the next token. In this case, strtok() returns a pointer to the start of the found token.

From the above description, it follows that a sequence of two or more contiguous delimiter bytes in the parsed string is considered to be a single delimiter, and that delimiter bytes at the start or end of the string are ignored. Put another way: the tokens returned by strtok() are always nonempty strings. Thus, for example, given the string "aaa;;bbb,", successive calls to strtok() that specify the delimiter string ";," would return the strings "aaa" and "bbb", and then a null pointer.

Ok, so I don't expect you to take all that in and understand it immediately. If you did, what would be the point of this post?

So what exactly is the role of strtok() then? Simple: it scans through a string looking for any delimiter characters you specify. Each time it finds one (or multiple copies of one) it returns the portion of the string up to that point. Or kind of, anyway.

What actually happens is that it replaces the delimiter within the string with the "end of string" marker (the NUL character, or ASCII code 0) and returns a pointer to the start of the token before that delimiter.

That means it actively modifies the data in memory. It is a destructive function - you can never get the original data including delimiters back once strtok() has done its work. However it does mean that it is efficient when it comes to using the memory it has. While it may use more memory than direct character-at-a-time parsing it's not as memory hungry as, say, splitting String objects up into new String objects.

After finishing with strtok() our string above would actually look like:

set\0pin\03\0high\0

Where "\0" denotes the NUL character.
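
If you want to see that for yourself, a small sketch like the following can help. Both function names (show_bytes and tokenise_all) are made up for this example; all they do is run strtok() over a buffer and then build a printable picture of the bytes that are left behind:

```c
#include <string.h>

/* Write a printable picture of len bytes of buf into out, showing each
   NUL byte as the two characters \0.
   out must have room for at least 2 * len + 1 bytes. */
void show_bytes(const char *buf, size_t len, char *out)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\0') {
            *out++ = '\\';
            *out++ = '0';
        } else {
            *out++ = buf[i];
        }
    }
    *out = '\0';
}

/* Run strtok() over the whole of buf, ignoring the tokens themselves. */
void tokenise_all(char *buf)
{
    char *tok = strtok(buf, " ");
    while (tok != NULL) {
        tok = strtok(NULL, " ");
    }
}
```

With the buffer declared as char buffer[] = "set pin 3 high"; calling tokenise_all(buffer) and then show_bytes(buffer, sizeof buffer, picture) leaves the text set\0pin\03\0high\0 in picture, with the final "\0" being the string's own terminator.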

So let's look at how we actually use strtok() to get those substrings, shall we?

First off we'll have our data in an array called "buffer". The first time we call strtok() we pass it that buffer array along with the delimiter(s) we want to look for, and it returns the first word it finds. For example:

char *command = strtok(buffer, " \t");

That scans through the "buffer" looking for either a space or a tab character ("\t"). When it finds one it replaces it with "\0" and returns a pointer to the start of the token. (Any further delimiters straight after it are left alone for now and simply skipped over on the next call.) Now the clever thing here is that it remembers where in the buffer it is. And because the buffer has been modified, you can't just make the same call again.

Right now our buffer looks like:

set\0pin 3 high

The "\0" is an "end of string" marker, so if we were to print the buffer out we'd just get "set". Similarly if we print the "command" pointer we would just get "set", since the string only exists up as far as "\0".

And that is why strtok() has to remember where it is. To make it more graphical, let's add some actual pointers:

buffer
|    strtok_ptr
|    |
v    v
set\0pin 3 high
^
|
command

The trick now is that you call strtok() with NULL as the string you want to parse. That tells strtok() to "continue from where you left off last time". It will basically use "strtok_ptr" as the string (name invented by me) and continue from where it was. So we call:

char *param1 = strtok(NULL, " \t");

Again, strtok() scans forward and finds the first delimiter, then returns the pointer, which we assign to "param1". So let's add that to our drawing:

buffer
|         strtok_ptr
|         |
v         v
set\0pin\03 high
^    ^
|    |
|    param1
command

So you can see how it progresses. Each time you call strtok() with NULL as the string it moves on through the string finding a new token.

So after calling it a few more times we reach the actual end of the string. But there is no delimiter there to trigger a new token. That doesn't matter. strtok() always treats the end of the string as an implicit delimiter. When it reaches the end it knows it must treat it as if it found a delimiter and return a pointer to the start of the previous token.

However, if there is no previous token (either we already reached the end of the string before, or the end is just made up of a selection of delimiters) then there is no token to return. So it just returns NULL, to say "There are no more tokens".

And that is something you must watch for if you are wanting to pass substrings to such things as atoi() which will completely barf if they are passed NULL.
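
One way to guard against that is to wrap the conversion in a little helper. This is only a sketch, and token_to_int is a name I have made up; it swaps atoi() for strtol(), which lets you tell whether any digits were actually found:

```c
#include <stdlib.h>

/* Convert a token to an int, returning fallback when the token is NULL
   (strtok() found nothing) or does not start with a number. */
int token_to_int(const char *tok, int fallback)
{
    if (tok == NULL) {
        return fallback;
    }
    char *end;
    long value = strtol(tok, &end, 10);
    if (end == tok) {
        return fallback;   /* no digits were consumed */
    }
    return (int)value;
}
```

So token_to_int(strtok(NULL, " \t"), -1) gives you -1 rather than a crash when there are no tokens left. Pick a fallback value that can't be confused with real data.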

So we have reached the end of our string and assigned each token to different variables. Let's look at how it looks now:

buffer
|            param3
|            |    strtok_ptr
v            v    v
set\0pin\03\0high\0
^    ^    ^
|    |    param2
|    param1
command

We kept reading and assigning variables until one (param4 here) told us we were at the end of the string by returning NULL, which is why it doesn't point anywhere in the drawing. The best way of doing that would be with an array. So let's take a look at an actual bit of code using an array of character pointers:

char *argv[6]; // Max 6 arguments
int argc = 0;  // Number of tokens stored so far
char *token = strtok(buffer, " \t");
while ((token != NULL) && (argc < 6)) {
    argv[argc++] = token;
    token = strtok(NULL, " \t");
}

And there's an array of up to 6 tokens using as little memory as is reasonably possible. Now you can use such things as strcmp(), atoi(), isdigit(), etc to examine them and do different things with them.
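
As a rough idea of what that examination might look like for the "set pin 3 high" example, here is a sketch of a handler. The function name handle_command, the "low" keyword and the way it reports its results are all assumptions of mine for illustration; a real sketch would go on to drive the pin:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical command handler: understands "set pin <n> high|low".
   Returns 0 on success and fills in *pin and *level, or -1 if the
   tokens do not form a valid command. argc is the number of tokens. */
int handle_command(char *argv[], int argc, int *pin, int *level)
{
    if (argc != 4) return -1;
    if (strcmp(argv[0], "set") != 0) return -1;
    if (strcmp(argv[1], "pin") != 0) return -1;

    /* The pin number must be a whole, non-negative number. */
    char *end;
    long n = strtol(argv[2], &end, 10);
    if (end == argv[2] || *end != '\0' || n < 0) return -1;

    int new_level;
    if (strcmp(argv[3], "high") == 0) {
        new_level = 1;
    } else if (strcmp(argv[3], "low") == 0) {
        new_level = 0;
    } else {
        return -1;
    }

    *pin = (int)n;
    *level = new_level;
    return 0;
}
```

Checking argc first means none of the later strcmp() calls can ever be handed a NULL pointer.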

Of course you don't have to use " \t" for your delimiter. If your string is comma separated (like the NMEA string above, or a line from a CSV file), you can use "," as your delimiter.

One thing to note though with delimiters is that strtok() will consume multiple delimiters in sequence. So if you have the NMEA string I showed above:

$GPGGA,123519,4807.038,N,01131.000,E,1,08,0.9,545.4,M,46.9,M,,*47

Using strtok() may not give the actual results you would expect. You'd think tokenising it on "," would give you these tokens:

  • $GPGGA
  • 123519
  • 4807.038
  • N
  • 01131.000
  • E
  • 1
  • 08
  • 0.9
  • 545.4
  • M
  • 46.9
  • M
  • (an empty token)
  • *47

But what it will actually give you is:

  • $GPGGA
  • 123519
  • 4807.038
  • N
  • 01131.000
  • E
  • 1
  • 08
  • 0.9
  • 545.4
  • M
  • 46.9
  • M
  • *47

You notice there's one fewer. That's because there are two delimiters next to each other: ",M,,*47". strtok() munches both of those up together and just returns "M" followed by "*47".
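
You can check that difference with a couple of small counting functions. This is just a sketch and both names are invented for the purpose: one counts what strtok() actually hands back, the other counts fields the way you would on paper, where every delimiter separates two fields:

```c
#include <string.h>

/* Count the tokens strtok() finds in buf. buf is modified in place. */
int count_strtok_tokens(char *buf, const char *delims)
{
    int n = 0;
    for (char *tok = strtok(buf, delims); tok != NULL;
         tok = strtok(NULL, delims)) {
        n++;
    }
    return n;
}

/* Count the fields in s, where every delimiter separates two fields,
   so n delimiters always mean n + 1 fields, empty or not. */
int count_fields(const char *s, char delim)
{
    int n = 1;
    for (; *s != '\0'; s++) {
        if (*s == delim) n++;
    }
    return n;
}
```

For the GPS string above count_fields() with ',' reports 15 while count_strtok_tokens() with "," reports 14. Call count_fields() first, since the strtok() version chops up the buffer.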

To get around this a similar function to strtok() exists, called strsep(). It functions like strtok() except that it treats every delimiter as a field boundary (so two delimiters in a row give you an empty string), and it doesn't itself remember where in the string it is. Instead it updates the pointer you pass it (so you have to pass it the "address of the address", or the address of the pointer variable that points to the buffer), so it takes a little more care. Note that strsep() is not part of standard C; it comes from BSD and is also provided by glibc. Here's an example using the same code as above, but tweaked to better fit the NMEA string:

char *argv[16]; // Max 16 fields
int argc = 0;   // Number of fields stored so far
char *bufptr = buffer;
char *field = strsep(&bufptr, ",");
while ((field != NULL) && (argc < 16)) {
    argv[argc++] = field;
    field = strsep(&bufptr, ",");
}

You see we had to make a copy of the pointer to our buffer, otherwise we'd have lost where the buffer actually was.
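
If your C library doesn't provide strsep(), you can get the same empty-field-preserving behaviour with nothing more than strchr(). The following is a sketch of one way to do it, not a drop-in replacement: split_fields is a name I have made up, and unlike strsep() it only accepts a single delimiter character:

```c
#include <string.h>

/* Split buf on a single delimiter character, keeping empty fields.
   Stores up to max pointers in fields and returns how many were stored.
   buf is modified in place, as with strtok() and strsep().
   delim must not be '\0'. */
int split_fields(char *buf, char delim, char *fields[], int max)
{
    int n = 0;
    char *start = buf;
    while (start != NULL && n < max) {
        char *end = strchr(start, delim);
        if (end != NULL) {
            *end = '\0';        /* terminate this field */
            end++;              /* next field starts after the delimiter */
        }
        fields[n++] = start;
        start = end;            /* NULL once the last field is stored */
    }
    return n;
}
```

Run over the GPS string with ',' as the delimiter and room for 16 pointers, this stores 15 fields, with fields[13] pointing at an empty string and fields[14] at "*47".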

