Semantic programming: UTF-8 files and Perl

Saturday, July 3, 2010

UTF-8 files and Perl

I ran into some problems loading files of UTF-8 text (German) with Perl. The thing is, these files start with a three-byte byte order mark (BOM), ef bb bf, and Perl does not strip or interpret it on its own: read as plain bytes, the BOM stays stuck to the front of the first line and the text is not treated as UTF-8. You have to check the first line for those three bytes; if they are present, you toss them and set a flag. Then, for the rest of the file, you have to set the UTF-8 flag on each line read.

The code:

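use Encode;

# $d and $file are set elsewhere; consume() stands in for whatever processes a line.
open my $fh, '<', "$d/$file" or die "Cannot open $d/$file: $!";
my $utf8_file = 0;
my $firstline = <$fh>;
if (defined $firstline) {
  # s/// returns true only if the BOM was there to strip.
  if ($firstline =~ s/^\xef\xbb\xbf//) {
     $utf8_file = 1;
     Encode::_utf8_on($firstline);
  }
  consume($firstline);
}
while (my $line = <$fh>) {
  Encode::_utf8_on($line) if $utf8_file;
  consume($line);
}
close $fh;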
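
For comparison, here is a minimal sketch of a shorter route, assuming the file is known to be UTF-8 whether or not it carries a BOM. Instead of flipping the flag by hand, it lets a PerlIO encoding layer decode the text. After decoding, the three bytes ef bb bf show up as the single character U+FEFF, which is stripped from the first line. The helper name read_utf8_lines is hypothetical and not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: read a UTF-8 file into a list of decoded lines,
# dropping a leading BOM if there is one.
sub read_utf8_lines {
    my ($path) = @_;
    open my $fh, '<:encoding(UTF-8)', $path
        or die "Cannot open $path: $!";
    my @lines = <$fh>;
    close $fh;
    # Once decoded, the BOM bytes ef bb bf are the one character U+FEFF.
    $lines[0] =~ s/^\x{FEFF}// if @lines;
    return @lines;
}
```

Unlike the code above, this does not use the BOM to decide whether the file is UTF-8; it assumes so up front, which may or may not suit a directory of mixed files.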

This is tedious.

I see two approaches to dealing with this. The first is to create Yet Another Module (this includes adding code to Class::Declarative), then always use that module when coding. This is kind of the default, and ultimately it is unsatisfying.
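
To make the first approach concrete, here is a rough sketch of what such a module might look like. It keeps the same rule as the code above (a file counts as UTF-8 only if it starts with the BOM) but hides the sniffing behind one call. The package name Local::BOMFile and the function open_bom_aware are made up for illustration; this is not how Class::Declarative does it.

```perl
package Local::BOMFile;   # hypothetical module name
use strict;
use warnings;

# Open $path for reading. If it starts with the UTF-8 BOM, skip the BOM and
# decode the rest as UTF-8; otherwise leave the handle reading raw bytes.
# Returns the handle and a flag saying whether a BOM was found.
sub open_bom_aware {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $got = read($fh, my $head, 3);
    die "Cannot read $path: $!" unless defined $got;
    if ($head eq "\xef\xbb\xbf") {
        binmode $fh, ':encoding(UTF-8)';   # decode everything after the BOM
        return ($fh, 1);
    }
    seek $fh, 0, 0 or die "Cannot rewind $path: $!";   # no BOM: start over
    return ($fh, 0);
}

1;
```

The caller would then write something like my ($fh, $is_utf8) = Local::BOMFile::open_bom_aware("$d/$file"); and read lines from $fh as usual, with no per-line flag fiddling.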

The other approach is some kind of pattern / macro / template system that would include this knowledge and would somehow generate the appropriate code as needed. That's where semantic programming needs to be headed.

Boy, that's vague.
