Saturday, July 3, 2010

UTF-8 files and Perl

I ran into some problems trying to load files containing UTF-8 text (German) using Perl. The thing is, these files start with a three-byte byte order mark (BOM), ef bb bf [here is another useful link], and Perl freaks out: it does not strip the BOM for you, so the three bytes show up as junk at the start of the first line. You have to check the first line for those three bytes; if they are present, you toss them and set a flag. Then, for the rest of the file, you have to set the UTF-8 flag on each line you read.

The code:
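use Encode;

# $d and $file are assumed to have been set earlier.
open my $fh, '<', "$d/$file" or die "Cannot open $d/$file: $!";
my $utf8_file = 0;
my $firstline = <$fh>;
if (defined $firstline) {
    # A BOM at the very start marks the whole file as UTF-8; strip it.
    if ($firstline =~ s/^\xef\xbb\xbf//) {
        $utf8_file = 1;
        # _utf8_on only sets the flag; it does not validate the bytes.
        Encode::_utf8_on($firstline);
    }
    # [ consume $firstline ]
}
while (<$fh>) {
    Encode::_utf8_on($_) if $utf8_file;
    # [ consume $_ ]
}
close $fh;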
This is tedious.
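
For comparison, here is a minimal sketch of the same BOM check using Encode's public decode function instead of flipping the flag by hand. The helper name decode_bom_lines is made up for illustration, and it works on lines already read into memory as raw bytes, which is an assumption on my part rather than what the code above does.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of any module): takes raw byte lines and
# returns them decoded. If the first line starts with a UTF-8 BOM, the BOM
# is dropped and every line is decoded as UTF-8; otherwise the lines are
# returned untouched.
sub decode_bom_lines {
    my @lines = @_;
    # The substitution both tests for the BOM and removes it.
    return @lines unless @lines && $lines[0] =~ s/^\xef\xbb\xbf//;
    return map { decode('UTF-8', $_) } @lines;
}
```

Unlike _utf8_on, decode actually checks the bytes, so malformed input ends up as replacement characters instead of a silently broken string.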

I see two approaches to dealing with this. The first is to create Yet Another Module (this includes adding code to Class::Declarative), then always use that module when coding. This is kind of the default, and ultimately it is unsatisfying.
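
To make the first approach concrete, here is a rough sketch of what the core of such a module might look like. The name open_maybe_utf8 is hypothetical, and it takes a slightly different route from the code above: instead of flagging each line, it peeks at the first three bytes and, if they are a BOM, pushes Perl's standard :encoding(UTF-8) layer onto the filehandle so every later read comes back decoded.

```perl
use strict;
use warnings;

# Hypothetical helper: open a file for reading and, if it starts with a
# UTF-8 BOM, skip the BOM and decode everything after it as UTF-8.
# Files without a BOM are returned as plain byte-oriented handles.
sub open_maybe_utf8 {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $got = read($fh, my $head, 3);
    die "Cannot read $path: $!" unless defined $got;
    if ($got == 3 && $head eq "\xef\xbb\xbf") {
        # BOM found: the handle is already positioned just past it,
        # so decode from here on.
        binmode $fh, ':encoding(UTF-8)';
    }
    else {
        # No BOM: rewind so the caller sees the file from the start.
        seek $fh, 0, 0 or die "Cannot rewind $path: $!";
    }
    return $fh;
}
```

With something like this, the loop shrinks to my $fh = open_maybe_utf8("$d/$file"); followed by an ordinary while (<$fh>), but you still have to remember to use it everywhere.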

The other approach is some kind of pattern / macro / template system that would include this knowledge and would somehow generate the appropriate code as needed. That's where semantic programming needs to be headed.

Boy, that's vague.
