debuggable

Streaming UTF-8 (with node.js)

Posted on 18/5/10 by Felix Geisendörfer

UTF-8 is a variable-length character encoding for Unicode. This text is encoded in UTF-8, so the characters you are reading can consist of 1 to 4 bytes. As long as the characters fit within the ASCII range (0-127), there is exactly 1 byte used per character.

But if I want to express a character outside the ASCII range, such as '¢', I need more bytes. The character '¢', for example, consists of two bytes: 0xC2 and 0xA2. The first byte, 0xC2, indicates that '¢' is a 2-byte character. This is easy to understand if you look at the binary representation of 0xC2:

11000010

As you can see, the bit sequence begins with '110', which as per the UTF-8 specification means: "2 byte character ahead!". Another character such as '€' (0xE2, 0x82, 0xAC) would work the same way. The first byte, 0xE2, looks like this in binary:

11100010

The prefix '1110' specifies that there are 3 bytes forming the current character. More exotic characters may even start with '11110', which indicates a 4 byte character.
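These lead-byte prefixes are easy to verify from node itself. A quick sketch inspecting the bytes of '¢' and '€' (using Buffer.from(), a newer API than the one in this post):

```javascript
// Inspect the UTF-8 lead bytes of '¢' and '€'.
// Buffer.from() is the modern Buffer API; this post predates it.
var cent = Buffer.from('¢', 'utf8');
console.log(cent.length);                // 2 bytes: 0xC2 0xA2
console.log((cent[0] >> 5).toString(2)); // '110' -> 2-byte character

var euro = Buffer.from('€', 'utf8');
console.log(euro.length);                // 3 bytes: 0xE2 0x82 0xAC
console.log((euro[0] >> 4).toString(2)); // '1110' -> 3-byte character
```

Shifting the lead byte right by 5 (or 4) bits leaves exactly the prefix, which is the same check the decoder below performs.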

As you can guess, UTF-8 text is not trivial to stream. Networks and file systems are not UTF-8 aware, so they will often split a chunk of text in the middle of a character.

To make sure you don't process a partial character, you have to analyze the last 3 bytes of any given chunk in your stream to check for the bit-prefixes that are used to announce a multibyte character. If you detect an incomplete character, you need to buffer the bytes you have for it, and then prepend them to the next chunk that comes in.

This way you can completely avoid breaking apart multibyte characters within a UTF-8 text, while still getting great performance and memory usage (only the last 3 bytes need checking / buffering).
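To see the problem in action, here is a small sketch with a hypothetical chunk boundary that lands inside '€':

```javascript
// 'a€' encodes to 4 bytes: 0x61 0xE2 0x82 0xAC.
var bytes = Buffer.from('a€', 'utf8');

// A network read may hand us only the first 3 bytes -- '€' is cut in half.
var chunk = bytes.slice(0, 3);

// Decoding the partial chunk yields 'a' plus replacement garbage, not 'a€'.
console.log(chunk.toString('utf8') === 'a€'); // false
```

Buffering the two trailing bytes (0xE2 0x82) and prepending them to the next chunk is exactly what the decoder below does.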

As of yesterday, node.js's net and http modules are fully UTF-8 safe, thanks to the streaming Utf8Decoder (undocumented; API may change) you can see below:

var Buffer = require('buffer').Buffer;

var Utf8Decoder = exports.Utf8Decoder = function() {
  this.charBuffer = new Buffer(4);
  this.charReceived = 0;
  this.charLength = 0;
};

Utf8Decoder.prototype.write = function(buffer) {
  var charStr = '';
  // if our last write ended with an incomplete multibyte character
  if (this.charLength) {
    // determine how many remaining bytes this buffer has to offer for this char
    var i = (buffer.length >= this.charLength - this.charReceived)
      ? this.charLength - this.charReceived
      : buffer.length;

    // add the new bytes to the char buffer
    buffer.copy(this.charBuffer, this.charReceived, 0, i);
    this.charReceived += i;

    if (this.charReceived < this.charLength) {
      // still not enough chars in this buffer? wait for more ...
      return;
    }

    // get the character that was split
    charStr = this.charBuffer.slice(0, this.charLength).toString();
    this.charReceived = this.charLength = 0;

    if (i == buffer.length) {
      // if there are no more bytes in this buffer, just emit our char
      this.onString(charStr);
      return;
    }

    // otherwise cut off the character's end from the beginning of this buffer
    buffer = buffer.slice(i, buffer.length);
  }


  // determine how many bytes we have to check at the end of this buffer
  var i = (buffer.length >= 3)
    ? 3
    : buffer.length;

  // figure out if one of the last i bytes of our buffer announces an incomplete char
  for (; i > 0; i--) {
    var c = buffer[buffer.length - i];

    // See http://en.wikipedia.org/wiki/UTF-8#Description

    // 110XXXXX
    if (i == 1 && c >> 5 == 0x06) {
      this.charLength = 2;
      break;
    }

    // 1110XXXX
    if (i <= 2 && c >> 4 == 0x0E) {
      this.charLength = 3;
      break;
    }

    // 11110XXX
    if (i <= 3 && c >> 3 == 0x1E) {
      this.charLength = 4;
      break;
    }
  }

  if (!this.charLength) {
    // no incomplete char at the end of this buffer, emit the whole thing
    this.onString(charStr+buffer.toString());
    return;
  }

  // buffer the incomplete character bytes we got
  buffer.copy(this.charBuffer, 0, buffer.length - i, buffer.length);
  this.charReceived = i;

  if (buffer.length - i > 0) {
    // buffer had more bytes before the incomplete char, emit them
    this.onString(charStr+buffer.slice(0, buffer.length - i).toString());
  } else if (charStr) {
    // or just emit the charStr if any
    this.onString(charStr);
  }
};

I feel like this implementation could still be somewhat simplified, so if you have any suggestions or comments, please let me know!

--fg

PS: Another buffer-based project I'm working on is a fast multipart parser - stay tuned for another post!

 

Marco Rogers  said on May 18, 2010:

Nice. This is definitely a problem I was aware of in node, but wasn't looking forward to handling myself. So does this decoder get applied to any stream if the encoding gets set as UTF8? Is that the only time it gets applied?

Felix Geisendörfer said on May 18, 2010:

Marco Rogers: Currently it gets applied to any net or http stream if you use setEncoding('utf-8'). I still have to add it for the fs module.

Matt  said on May 18, 2010:

That's excellent! It's a must-have for Node.js for dealing with localized applications.

Thanks for this great addition Felix !

Matt

David Bender said on May 18, 2010:

Felix,
Great work, but you need to address a few issues:

-For what use cases is your library intended? Keep in mind that XML and JSON parsing libraries already do their own UTF-8 verification so duplicating their work does not enhance performance.

-UTF-8 safeness also requires that no invalid characters such as 0xFF show up in the content. Therefore, safeness requires checking each byte.

-Consequently, I would not call your implementation UTF-8 "safe". It is not sufficient to just check the last 3 bytes. In this case, you are leaving it up to the user to check for invalid UTF-8 characters anyway.

Eli Grey said on May 18, 2010:

All of this is unnecessary, as JavaScript already has functions which can facilitate encoding and decoding of UTF-8. Encoding UTF-8 to a bytestring is unescape(encodeURIComponent(string)) and decoding UTF-8 is decodeURIComponent(escape(string)).
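(For readers unfamiliar with the trick Eli describes, a quick sketch -- note it only works on complete strings, which is the limitation discussed further down:)

```javascript
// Encode '€' to a byte string (one character per UTF-8 byte) and back.
var byteString = unescape(encodeURIComponent('€'));
console.log(byteString.length); // 3 -- one "character" per UTF-8 byte

var decoded = decodeURIComponent(escape(byteString));
console.log(decoded); // '€'
```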

Matthew  said on May 18, 2010:

One suggestion that might or might not be any use: you might be able to match the UTF8 byte ranges at the end of the string with a single regexp.

Felix Geisendörfer said on May 18, 2010:

David Bender: Eli Grey: This class is merely splitting a bytes stream into portions that are guaranteed to not be in the middle of a multibyte character. The actual conversion to (and verification of) UTF-8 happens in Buffer.toString().

David Bender said on May 18, 2010:

@Felix Thanks for the clarification. Consequently, your code does not just read the last 3 bytes though, as toString() is an O(n) operation.

I'm still not sure when I would actually use this though. Verifying input tends to be done when parsing structured text, and libraries that do that handle partial multibyte characters. As an example, I would never put this between my socket and my JSON parser. For what situation do you use this?

Felix Geisendörfer said on May 18, 2010:

David Bender: If you wanted to write a streaming JSON parser that operates on a character level (rather than byte level), this would be very useful to put in between your parser and your socket.

Eli Grey said on Aug 20, 2010:

Felix: decodeURIComponent(escape(string)) not only decodes the UTF-8, but also verifies it. Why exactly would I want to use your code instead of this simple trick? Is it substantially faster?

Felix Geisendörfer said on Aug 20, 2010:

Eli Grey: Because decodeURIComponent() can only handle a fixed-length string, not a stream of data. This code is really only useful for node.js, not client side stuff (in case that's what you're referring to).

Eli Grey said on Aug 20, 2010:

Ah, thanks. I thought this was just for fixed-length strings too.
