Source: http://www.latiumsoftware.com/en/pascal/0003.php
3. PORTING ISSUES: UTF-8 STRINGS
This article is aimed mainly at programmers who will be targeting the
Linux environment. It presents some of the differences in string
processing that will exist between Windows and Linux.
String types in Delphi
======================
A string (as you probably know by now :-) is a sequence of characters.
Delphi has three types of strings:
* Short strings
Short strings are declared using the ShortString keyword. This string
type comes from the old times of Turbo Pascal and is supported for
backwards compatibility. A short string variable normally uses 256
bytes in total, although its length (stored in the first byte) can
vary from 0 to 255.
For example:
  var
    s: shortstring;
  begin
    s := 'Hello!';

The string s takes 256 bytes. s[0] holds the length of the string, so in
the example its value would be #6. Rather than reading or writing s[0]
directly, you should use Length and SetLength. s[1] is the first
character ('H'), s[2] is the second character ('e'), and so on.
From s[7] to s[255] the values would be undefined.
* ANSI strings
Usually called "long strings", ANSI strings are declared using the
AnsiString keyword. ANSI strings are actually pointers to a data
structure consisting of two integers (that hold the length of the
string and the reference count) and the sequence of bytes allocated
for the string, that can range from 1 byte to almost 2 GB (providing
you have enough memory).
For example:
  var
    s: ansistring;
  begin
    s := 'Hello!';

The variable s itself takes 4 bytes (a 32-bit pointer). The data
structure it points to takes 8 bytes for the two integers and in this
case 6 bytes for the 6 characters, giving 14 bytes in total. Like
with the short string, s[1] is the first character ('H'), and so on.
* Wide strings
Wide strings, also called Unicode strings, are strings in which each
character (of type WideChar) takes two bytes (a word). In the Unicode
character set, the first 256 values correspond to ISO 8859-1 (Latin-1),
which the Western ANSI character set closely follows. Wide strings are
pointers, like ANSI strings, but they are not reference counted: when
you assign one wide-string variable to another, the string is actually
copied, whereas for ANSI strings only the reference count is
incremented. This makes wide strings less efficient in comparison, but
the COM and OLE APIs use this type of string, and so do ActiveX objects.
For example:
  var
    s: widestring;
  begin
    s := 'Hello!';

Here, the variable s takes 4 bytes for the pointer, and the data
structure takes 4 bytes for the length and 12 bytes for the 6
characters (2 bytes each), giving 16 bytes in total. s[1] is the
first character ('H'), except it is of type WideChar instead of
AnsiChar and takes two bytes instead of one. s[2] is the second
character ('e') and starts in the third byte (the first two bytes
are for s[1]).
The type String is mapped by default to AnsiString. Char is mapped to
AnsiChar, and PChar is mapped to PAnsiChar.
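
The sizes described above can be checked with a small console program.
This is only a sketch: the program and variable names are illustrative,
and the conditional directives at the top assume you build with either
Delphi or Free Pascal. The pointer size is printed rather than assumed,
since it is 4 bytes only on 32-bit targets.

```pascal
program StringSizes;
{$IFDEF FPC}{$mode objfpc}{$H+}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

var
  ss: ShortString;
  a, b: AnsiString;
  w: WideString;
begin
  ss := 'Hello!';
  a := 'Hello!';
  w := 'Hello!';

  // A ShortString is a fixed 256-byte buffer; byte 0 holds the length.
  WriteLn('SizeOf(ss) = ', SizeOf(ss));   // 256
  WriteLn('Ord(ss[0]) = ', Ord(ss[0]));   // 6
  WriteLn('Length(ss) = ', Length(ss));   // 6

  // AnsiString and WideString variables are only pointers to the data.
  WriteLn('SizeOf(a)  = ', SizeOf(a));
  WriteLn('SizeOf(w)  = ', SizeOf(w));
  WriteLn('SizeOf(Pointer) = ', SizeOf(Pointer));

  // Length counts elements, not bytes: each WideChar takes two bytes.
  WriteLn('Length(w)  = ', Length(w));                         // 6
  WriteLn('Bytes in w = ', Length(w) * SizeOf(WideChar));      // 12

  // Assigning an AnsiString shares the data; writing to the copy
  // makes it unique first, so the original is left untouched.
  b := a;
  b[1] := 'J';
  WriteLn(a, ' ', b);                                          // Hello! Jello!
end.
```

The last three lines illustrate why reference counting is cheap for
assignments: the characters are duplicated only when one of the two
variables is actually modified.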
MultiByte Character Strings in Windows (MBCS)
=============================================
When working with Ansi strings, normally we consider that each character
occupies one byte, which is true for Western European languages, but for
most Asian languages, 256 characters are simply not enough.
A possible solution is using wide strings. Another is encoding some
characters in one byte and others in two (DBCS: Double-Byte Character
Strings). For this to work, there must be a way to know whether a byte
in a string is a complete character or the "lead byte" of a two-byte
character. Delphi defines a character set named LeadBytes that contains
the characters that are lead bytes in the current Windows locale. For
Western locales this set is empty (there are no lead bytes, since bytes
and characters are equivalent). For other locales, in general, a byte in
the range 0 to 127 is an ASCII character, while a byte greater than 127
is a lead byte and the byte that follows it is called the "trail byte"
(it may range from 0 to 255).
For reasons of efficiency and backwards compatibility, Delphi comes with
different versions of string functions for SBCS (Single-Byte Character
Strings) and DBCS. For SBCS (one byte = one character) there is no point
in going through the overhead of checking whether each byte is a
character or a lead byte (since there are no lead bytes), so for SBCS
you can use the standard functions like Pos, LowerCase, etc. For DBCS
you should use functions like AnsiPos, AnsiLowerCase, etc., which take
into account that some characters may be represented by more than one
byte (and thus these functions are slower).
Length of an ANSI string
========================
Indexing a DBCS string can be tricky, since s[i] represents the i-th
byte, which is not necessarily the i-th character: earlier characters
may have taken two bytes each. For the same reason, the number of bytes
returned by the Length function may or may not be the actual number of
characters contained in a DBCS string. To determine this number you can
use a function like the following:
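
  { The function name is illustrative; LeadBytes is declared in SysUtils. }
  function DBCSLength(const s: string): integer;
  var
    i, n: integer;
  begin
    Result := 0;
    n := Length(s);
    i := 1;
    while i <= n do begin
      inc(Result);
      { A lead byte and its trail byte together form one character. }
      if s[i] in LeadBytes then inc(i);
      inc(i);
    end;
  end;
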
Introduction to UTF-8 (UCS Transformation Format)
=================================================
Windows can work with Unicode strings, as well as SBCS and DBCS, but the
Linux kernel works with UTF-8 strings, where one character may take up
to six bytes! Normally a character takes one or two bytes in Western
languages and from one to three in Asian languages. (The later RFC 3629
definition of UTF-8 limits a character to four bytes.) UTF-8 is a
multibyte character encoding that can accommodate all the characters of
the UCS (Universal Character Set), which contains 31-bit characters that
can represent practically all the characters of known languages, living
and dead, including scripts like Hiragana and Katakana. It also leaves
space for more languages, scripts and hieroglyphics, so in the future we
can expect to be able to read Klingon poetry, the Ferengi Acquisition
Rules and Bajoran prophecies in their original versions... :-)
UTF-8 has these important features:
* Variable-length encoding for UCS characters
UTF-8 can encode UCS (ISO 10646) characters in up to 6 bytes.
* Transparency and uniqueness for ASCII characters
7-bit ASCII characters (#0..#127) are encoded as plain 7-bit ASCII
(1 byte per character). All non-ASCII characters (those above #127) are
represented purely with non-ASCII 8-bit values (#128..#255), so that no
byte of a non-ASCII character can be mistaken for an ASCII character,
and ASCII-based text processing tools can be used on UTF-8 text as long
as they pass 8-bit characters without interpretation.
* Null character
Character #0 (ASCII NUL) only appears where a NUL is intended. It
can't be a trail byte, for instance.
* Self-synchronization for fast speed processing
The high bit patterns disambiguate character boundaries and make it easy
to know whether a byte is a single-byte character (0xxxxxxx), a lead
byte (11xxxxxx) or a fill byte (10xxxxxx). This feature is very
important because it allows UTF-8 string processing functions to be far
more efficient than their Windows DBCS counterparts. For example, a
UTF-8 string can be parsed backwards, and a search for a multibyte
character, which begins with a lead byte, will never match on a fill
byte in the middle of an unwanted multibyte character. And since the
lead byte announces the length of the multibyte character, you can
quickly tell how many bytes to skip for fast forward parsing.
* Processor-friendliness
UTF-8 can be read and written quickly with simple bitmask and bitshift
operations, without any multiplication or division (which are slow CPU
operations).
* Reasonable compression
UTF-8 is not as compact as Windows DBCS, but for Western languages it
is more compact than two-byte Unicode, and in the worst case (Eastern
languages) it is no worse than UCS-4.
* Canonical sort-order
Comparing UTF-8 strings byte by byte, as plain 8-bit comparison
routines like strcmp (a C standard function) do, gives the same order
as comparing the UCS values of their characters.
* Flag characters
The octets #$FE and #$FF never appear, so you can use them as flags
to signal a special meaning (avoiding the possibility of mistaking a
flag with a real character).
* Detectability
It's easy to detect UTF-8 input with high probability if you see the
UTF-8 signature #$EF#$BB#$BF ('ï»¿' when displayed as Latin-1) or if
you see valid UTF-8 multibyte characters, since it is very unlikely
that they accidentally appear in ISO 8859-1 (Latin-1) text.
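
The self-synchronization property can be sketched in a few lines. The
names ByteKind and CharStart below are hypothetical helpers, not Delphi
routines, and the sample is given as a byte array (an assumption of this
sketch) so that it does not depend on the code page of the source file.
CharStart walks backwards from any byte until it leaves the run of fill
bytes, which is all it takes to find where a character begins.

```pascal
program Utf8Classify;
{$IFDEF FPC}{$mode objfpc}{$H+}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

type
  TUtf8ByteKind = (bkSingle, bkLead, bkFill, bkInvalid);

// Classify one byte by looking at its high bits only.
function ByteKind(b: Byte): TUtf8ByteKind;
begin
  if (b and $80) = 0 then
    Result := bkSingle        // 0xxxxxxx
  else if (b and $C0) = $80 then
    Result := bkFill          // 10xxxxxx
  else if b >= $FE then
    Result := bkInvalid       // $FE and $FF never occur in UTF-8
  else
    Result := bkLead;         // 11xxxxxx
end;

// Return the 0-based index where the character containing
// Bytes[Index] starts, by stepping back over fill bytes.
function CharStart(const Bytes: array of Byte; Index: Integer): Integer;
begin
  Result := Index;
  while (Result > 0) and (ByteKind(Bytes[Result]) = bkFill) do
    Dec(Result);
end;

const
  // 'a', copyright sign (C2 A9), euro sign (E2 82 AC), 'z'
  Sample: array[0..6] of Byte = ($61, $C2, $A9, $E2, $82, $AC, $7A);
begin
  WriteLn(CharStart(Sample, 5));   // 3: last byte of the euro sign
  WriteLn(CharStart(Sample, 2));   // 1: fill byte of the copyright sign
  WriteLn(CharStart(Sample, 6));   // 6: 'z' is its own start
end.
```

Nothing comparable is possible with Windows DBCS, where a trail byte can
have the same value as an ASCII character or a lead byte, so a string
has to be scanned from the beginning to find character boundaries.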
UTF-8 encoding
==============
This is the general format used to encode UCS characters in UTF-8:

  Bits  Bytes  Representation
     7      1  0xxxxxxx
    11      2  110xxxxx 10xxxxxx
    16      3  1110xxxx 10xxxxxx 10xxxxxx
    21      4  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    26      5  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    31      6  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

Notice that the number of leading 1 bits in the lead byte is the number
of bytes in a multibyte sequence.
The copyright sign ('©' = #169 = #$A9) in binary is 10101001. Since it
needs 8 bits, one byte (which holds only 7) is not enough and we have
to use two bytes:

  110xxxxx 10xxxxxx

We have to fill 11 bits (x), so we add three zeroes to the left of
10101001 and split the result into groups of 5 and 6 bits:

     00010   101001

The UTF-8 representation for the copyright character would then be:

  11000010 10101001
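
The same steps can be written with shifts and masks. The following is a
minimal sketch, not a library routine: EncodeUtf8 is a hypothetical
name, it handles only the first four rows of the table (up to 21 bits),
and it assumes the caller supplies a buffer of at least four bytes.

```pascal
program Utf8Encode;
{$IFDEF FPC}{$mode objfpc}{$H+}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;  // IntToHex

// Encode one character value into Buf using the shortest form.
// Returns the number of bytes written, or 0 if the value needs
// more than 21 bits (not handled in this sketch).
function EncodeUtf8(CodePoint: LongWord; var Buf: array of Byte): Integer;
begin
  if CodePoint < $80 then
  begin
    Buf[0] := Byte(CodePoint);                            // 0xxxxxxx
    Result := 1;
  end
  else if CodePoint < $800 then
  begin
    Buf[0] := Byte($C0 or (CodePoint shr 6));             // 110xxxxx
    Buf[1] := Byte($80 or (CodePoint and $3F));           // 10xxxxxx
    Result := 2;
  end
  else if CodePoint < $10000 then
  begin
    Buf[0] := Byte($E0 or (CodePoint shr 12));            // 1110xxxx
    Buf[1] := Byte($80 or ((CodePoint shr 6) and $3F));
    Buf[2] := Byte($80 or (CodePoint and $3F));
    Result := 3;
  end
  else if CodePoint < $200000 then
  begin
    Buf[0] := Byte($F0 or (CodePoint shr 18));            // 11110xxx
    Buf[1] := Byte($80 or ((CodePoint shr 12) and $3F));
    Buf[2] := Byte($80 or ((CodePoint shr 6) and $3F));
    Buf[3] := Byte($80 or (CodePoint and $3F));
    Result := 4;
  end
  else
    Result := 0;
end;

var
  Buf: array[0..3] of Byte;
  n, i: Integer;
begin
  n := EncodeUtf8($A9, Buf);          // the copyright sign
  for i := 0 to n - 1 do
    Write(IntToHex(Buf[i], 2), ' ');  // C2 A9
  WriteLn;
end.
```

$C2 $A9 is 11000010 10101001 in binary, matching the result worked out
by hand above.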
It could also be represented with more bytes than needed, in an overlong
sequence. For example, with four bytes it would be:

  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
       000   000000   000010   101001
  -----------------------------------
  11110000 10000000 10000010 10101001

Overlong sequences are typically used to "camouflage" characters and
cheat UTF-8 substring tests. For example, if you look for the copyright
sign exactly as 11000010 10101001 (the shortest possible encoding), you
won't find the overlong form above. For this reason a decoder should
treat overlong sequences as invalid input and reject them.
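
One way to guard against this is to decode strictly and refuse any
sequence that is longer than its value requires. The sketch below
assumes a hypothetical helper named DecodeStrict that handles sequences
of up to four bytes and takes its input as a byte array; it is meant to
show the check, not to be a complete validator.

```pascal
program Utf8Overlong;
{$IFDEF FPC}{$mode objfpc}{$H+}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

// Decode the sequence that starts at Bytes[0]. Returns the character
// value, or -1 if the input is truncated, malformed or overlong.
function DecodeStrict(const Bytes: array of Byte): LongInt;
const
  // Smallest value that legitimately needs 1, 2, 3 or 4 bytes.
  MinValue: array[1..4] of LongInt = (0, $80, $800, $10000);
var
  Len, i: Integer;
  Value: LongInt;
  b: Byte;
begin
  Result := -1;
  if Length(Bytes) = 0 then Exit;
  b := Bytes[0];
  if (b and $80) = 0 then begin Len := 1; Value := b; end
  else if (b and $E0) = $C0 then begin Len := 2; Value := b and $1F; end
  else if (b and $F0) = $E0 then begin Len := 3; Value := b and $0F; end
  else if (b and $F8) = $F0 then begin Len := 4; Value := b and $07; end
  else Exit;  // fill byte or a longer lead byte: not handled here

  if Length(Bytes) < Len then Exit;           // truncated sequence
  for i := 1 to Len - 1 do
  begin
    if (Bytes[i] and $C0) <> $80 then Exit;   // not a fill byte
    Value := (Value shl 6) or (Bytes[i] and $3F);
  end;

  if Value < MinValue[Len] then Exit;         // overlong: reject
  Result := Value;
end;

begin
  WriteLn(DecodeStrict([$C2, $A9]));            // 169
  WriteLn(DecodeStrict([$F0, $80, $82, $A9]));  // -1 (overlong form)
end.
```

Both inputs carry the same payload bits, but only the two-byte form is
accepted, so a substring test on the shortest encoding can no longer be
bypassed.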
Length of a UTF-8 string
========================
In Delphi for Linux, long strings will be in UTF-8 format, while wide
strings will remain as two-byte Unicode, although they will be reference
counted. To know the number of characters stored in a UTF-8 string we
could use a function like the following:
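
  { The function name is illustrative; Exception is declared in SysUtils. }
  function UTF8Length(const s: string): integer;
  var
    i, n: integer;
    c: byte;
  begin
    Result := 0;
    n := Length(s);
    i := 1;
    while i <= n do begin
      inc(Result);
      c := byte(s[i]);
      { The lead byte tells how many bytes the character occupies. }
      if (c and $80) = 0 then inc(i)
      else if (c and $E0) = $C0 then inc(i, 2)
      else if (c and $F0) = $E0 then inc(i, 3)
      else if (c and $F8) = $F0 then inc(i, 4)
      else if (c and $FC) = $F8 then inc(i, 5)
      else if (c and $FE) = $FC then inc(i, 6)
      else
        raise Exception.Create('Not a UTF-8 string!');
    end;
    { A truncated final sequence pushes i past the end of the string. }
    if i > n + 1 then
      raise Exception.Create('Not a UTF-8 string!');
  end;
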
Of course this function should be written using pointers and a bit of
assembler to improve its performance, but let's leave that for the
pros... :)
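
Short of pointers and assembler, a simpler loop is possible when the
input is already known to be valid UTF-8: every character contributes
exactly one byte that is not a fill byte, so counting those bytes gives
the character count. This is a sketch under that assumption. It does no
validation at all, Utf8CharCount is a hypothetical name, and the input
is taken as a byte array rather than a string.

```pascal
program Utf8Count;
{$IFDEF FPC}{$mode objfpc}{$H+}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

// Count characters by counting every byte that is not a
// fill byte (10xxxxxx). Assumes the input is valid UTF-8.
function Utf8CharCount(const Bytes: array of Byte): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 0 to High(Bytes) do
    if (Bytes[i] and $C0) <> $80 then
      Inc(Result);
end;

begin
  // 'a', copyright sign, euro sign, 'z': 7 bytes, 4 characters
  WriteLn(Utf8CharCount([$61, $C2, $A9, $E2, $82, $AC, $7A]));  // 4
end.
```

The trade-off is that malformed input goes unnoticed, whereas the
function above raises an exception for it.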