m17n-Safe Low-Level Text Processing
In Ruby 1.8 and earlier, strings were essentially sequences of bytes rather than characters. This meant the following code seldom caused anyone to bat an eyelash:
File.open("hello.txt") { |f|
  loop do
    break if f.eof?
    chunk = "CHUNK: #{f.read(5)}"
    puts chunk unless chunk.empty?
  end
}
The purpose of the previous example is to print out the contents of the file in chunks of five bytes, which, when it comes to ASCII, means five characters. However, multibyte character encodings, especially variable-length ones such as UTF-8, cannot be processed using this approach. The reason is fairly simple.
Imagine this code running against a two-character, six-byte string in UTF-8 such as “吴佳”. If we read five bytes of this string, we end up breaking the second character’s byte sequence, resulting in the mangled string “吴\xE4\xBD”. Of course, whether this is a problem depends on your reason for reading a file in chunks.
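To make the breakage concrete, here is a minimal sketch that reproduces it in memory. It uses a StringIO in place of hello.txt (an assumption made purely so the example is self-contained) and writes the two characters as \u escapes so the snippet does not depend on the source file's encoding.

```ruby
require "stringio"

# "吴佳" written with escapes: two characters, three bytes each in UTF-8.
text = "\u5434\u4F73"
puts text.length    # number of characters
puts text.bytesize  # number of bytes

# Reading five bytes stops in the middle of the second character.
io      = StringIO.new(text)
mangled = io.read(5).force_encoding(Encoding::UTF_8)

# The first three bytes form a complete character; the last two are
# only the beginning of the second one, so the string is not valid UTF-8.
puts mangled.valid_encoding?   # false
p mangled.bytes.map { |b| format("%02X", b) }
```

The trailing bytes E4 and BD are the first two of the three bytes that make up the second character, which is exactly the mangled tail described above.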
If we are processing binary data, we probably don’t need to worry about character encodings or anything like that. Instead, we just read a fixed amount of data according to our needs, processing it however we’d like. But many times, the reason we read data in chunks is not to process it at the byte level, but to break it up into small parts as we work on it.
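When the goal is only to split text into manageable pieces, the chunk boundaries can be nudged so they fall between characters. The following is a rough sketch of one way to do that, not code taken from any library: the helper name read_whole_chars, its parameters, and the limit of three extra bytes (enough to complete a UTF-8 character) are all assumptions for illustration.

```ruby
require "stringio"

# Hypothetical helper: read about `size` bytes, then pull in one byte at a
# time until the chunk is valid in `encoding`, we hit EOF, or we give up.
def read_whole_chars(io, size, encoding = Encoding::UTF_8, max_extra = 3)
  chunk = io.read(size)
  return nil if chunk.nil?          # nothing left to read

  chunk.force_encoding(encoding)
  extra = 0
  until chunk.valid_encoding? || io.eof? || extra >= max_extra
    # Tag the extra byte with the same encoding so it can be appended.
    chunk << io.read(1).force_encoding(encoding)
    extra += 1
  end
  chunk
end

# Usage: the same two-character string, followed by an ASCII "!".
io     = StringIO.new("\u5434\u4F73!")
chunks = []
while (chunk = read_whole_chars(io, 5))
  chunks << chunk
end

chunks.each { |c| puts "CHUNK: #{c}" }
```

With a five-byte request, the first chunk grows by one byte to finish the second character, and the remaining "!" arrives as a chunk of its own. If the input is not valid in the given encoding at all, the helper stops after max_extra bytes and returns the chunk as-is, so callers that care should still check valid_encoding?.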
A perfect example of this, and a source of a good solution to the problem, is found within the CSV standard library. As we’ve seen before, this library is fully m17n-capable and ...