[Tip] Porting UTF-8 Strings in Delphi
Delphi | 2007/08/17 15:57

Source: http://www.latiumsoftware.com/en/pascal/0003.php

3. PORTING ISSUES: UTF-8 STRINGS


This article is aimed mainly at programmers who will target the Linux
environment. It presents some of the differences in string processing
between Windows and Linux.


String types in Delphi
======================

A string (as you probably know by now :-) is a sequence of characters.
Delphi has three types of strings:

 * Short strings
   Short strings are declared using the ShortString keyword. This string
   type comes from the old times of Turbo Pascal and is supported for
   backwards compatibility. A short string variable normally uses 256
   bytes in total, although its length (stored in the first byte) can
   vary from 0 to 255.

   For example:

     var
       s: shortstring;
     begin
       s := 'Hello!';

   The string s takes 256 bytes. s[0] holds the length of the string, so
   in the example its value would be #6. Although s[0] can be accessed
   directly, you should use Length and SetLength instead. s[1] is the
   first character ('H'), s[2] is the second character ('e'), and so on.
   From s[7] to s[255] the values would be undefined.

 * ANSI strings
   Usually called "long strings", ANSI strings are declared using the
   AnsiString keyword. ANSI strings are actually pointers to a data
   structure consisting of two integers (that hold the length of the
   string and the reference count) and the sequence of bytes allocated
   for the string, that can range from 1 byte to almost 2 GB (providing
   you have enough memory).

   For example:

     var
       s: ansistring;
     begin
       s := 'Hello!';

   The variable s itself takes 4 bytes (a 32-bit pointer). The data
   structure it points to takes 8 bytes for the two integers and, in
   this case, 6 bytes for the 6 characters, giving 14 bytes in total
   (plus one byte for the terminating null character that Delphi appends
   so the string can be passed as a PChar). As with the short string,
   s[1] is the first character ('H'), and so on.

 * Wide strings
   Wide strings, also called UNICODE strings, are special strings where
   each character (of type WideChar) takes two bytes (a word). In the
   UNICODE character set, the first 256 values correspond to the
   ISO 8859-1 (Latin-1) character set. Wide strings are pointers, like
   ANSI strings, but they are not reference counted. When you make an
   assignment between two wide-string variables, the string is actually
   copied (for ANSI strings only the reference count is incremented),
   so wide strings are less efficient in comparison. However, the COM
   and OLE APIs use this type of string, and so do ActiveX objects.

   For example:

     var
       s: widestring;
     begin
       s := 'Hello!';

   Here, the variable s takes 4 bytes for the pointer, and the data
   structure takes 4 bytes for the length and 12 bytes for the 6
   characters (2 bytes each), giving 16 bytes in total (plus a two-byte
   terminating null character). s[1] is the first character ('H'),
   except that it is of type WideChar instead of AnsiChar and takes two
   bytes instead of one. s[2] is the second character ('e') and starts
   at the third byte (the first two bytes are for s[1]).

The type String is mapped by default to AnsiString. Char is mapped to
AnsiChar, and PChar is mapped to PAnsiChar.
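
The sizes described above can be checked with a small console program.
This is only a sketch: it assumes Free Pascal in Delphi mode or a Delphi
compiler, and the program and variable names are made up for the
example. The pointer size printed for the long and wide string
variables is 4 on a 32-bit target, as in the text, and 8 on a 64-bit
one.

```pascal
program StringLayouts;
{$ifdef FPC}{$mode delphi}{$else}{$APPTYPE CONSOLE}{$endif}

var
  ss: ShortString;
  a: AnsiString;
  w: WideString;
begin
  ss := 'Hello!';
  a := 'Hello!';
  w := 'Hello!';

  // Short string: fixed 256-byte buffer, length kept in byte 0.
  WriteLn('ShortString variable: ', SizeOf(ss), ' bytes');
  WriteLn('Length byte ss[0]:    ', Ord(ss[0]));

  // Long and wide strings: the variable itself is only a pointer.
  WriteLn('AnsiString variable:  ', SizeOf(a), ' bytes');
  WriteLn('WideString variable:  ', SizeOf(w), ' bytes');

  // Character payload: one byte per AnsiChar, two per WideChar.
  WriteLn('AnsiString payload:   ', Length(a) * SizeOf(AnsiChar), ' bytes');
  WriteLn('WideString payload:   ', Length(w) * SizeOf(WideChar), ' bytes');
end.
```

The payload lines print 6 and 12 bytes for the same six characters,
which is the difference between AnsiChar and WideChar storage.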


MultiByte Character Strings in Windows (MBCS)
=============================================

When working with Ansi strings, normally we consider that each character
occupies one byte, which is true for Western European languages, but for
most Asian languages, 256 characters are simply not enough.

A possible solution is using wide strings, and another solution is
encoding some characters in one byte and others in two (DBCS:
Double-Byte Character Strings). For this to work, there must be a way
to know whether a byte in a string is a complete character or the
"lead byte" of a two-byte character. Delphi defines a character set
named LeadBytes that contains the bytes that are lead bytes in the
current Windows locale. For Western locales this set is empty (there
are no lead bytes, since bytes and characters are equivalent). For
other locales, in general, a byte in the range 0 to 127 is an ASCII
character, while a byte greater than 127 may be a lead byte, in which
case the byte that follows it is called the "trail byte". Which values
are lead bytes depends on the code page, and a trail byte can fall in
the ASCII range.

For reasons of efficiency and backwards compatibility, Delphi comes with
different versions of string functions for SBCS (Single-Byte Character
Strings) and DBCS. For SBCS (one byte = one character) there is no point
in paying the overhead of checking whether each byte is a character or
a lead byte (since there are no lead bytes), so for SBCS you can use
the standard functions like Pos, LowerCase, etc. For DBCS you should
use functions like AnsiPos, AnsiLowerCase, etc., which take into
account that some characters may be represented by more than one byte
(and are therefore slower).


Length of an ANSI string
========================

Indexing a DBCS string can be tricky, since s[i] represents the i-th
byte, not necessarily the i-th character, because earlier characters
may have taken two bytes each. The number of bytes returned by the
Length function may or may not be the actual number of characters
contained in a DBCS string. To determine this number you can use a
function like the following:

  function AnsiLength(const s: string): integer;
  var
    i, n: integer;
  begin
    Result := 0;
    n := Length(s);
    i := 1;
    while i <= n do begin
      inc(Result);
      if s[i] in LeadBytes then inc(i);
      inc(i);
    end;
  end;
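
Because LeadBytes depends on the Windows locale, the function above
gives different answers on different machines. The sketch below runs
the same loop with the lead-byte set passed in as a parameter, so the
result is reproducible. It assumes Free Pascal in Delphi mode or a
Delphi compiler. The names DbcsLength, BytesToStr and SjisLeadBytes are
made up for this example, and the set is assumed to hold the lead-byte
ranges of Windows code page 932 (Shift-JIS).

```pascal
program DbcsLengthSketch;
{$ifdef FPC}{$mode delphi}{$else}{$APPTYPE CONSOLE}{$endif}

type
  TByteSet = set of Byte;

const
  // Assumption: lead-byte ranges of Windows code page 932 (Shift-JIS).
  SjisLeadBytes: TByteSet = [$81..$9F, $E0..$FC];

// Build a byte string from raw values (avoids source-encoding issues).
function BytesToStr(const b: array of Byte): AnsiString;
var
  i: Integer;
begin
  SetLength(Result, Length(b));
  for i := 0 to High(b) do
    Result[i + 1] := AnsiChar(b[i]);
end;

// Same loop as AnsiLength, but with the lead-byte set passed in.
function DbcsLength(const s: AnsiString; const Leads: TByteSet): Integer;
var
  i, n: Integer;
begin
  Result := 0;
  n := Length(s);
  i := 1;
  while i <= n do begin
    Inc(Result);
    if Byte(s[i]) in Leads then Inc(i);  // skip the trail byte
    Inc(i);
  end;
end;

var
  s: AnsiString;
begin
  // 'A' followed by two double-byte characters ($93 $FA and $96 $7B).
  s := BytesToStr([$41, $93, $FA, $96, $7B]);
  WriteLn(Length(s));                     // number of bytes: 5
  WriteLn(DbcsLength(s, SjisLeadBytes));  // number of characters: 3
  WriteLn(s[5] = '{');                    // TRUE: trail byte $7B looks like ASCII '{'
end.
```

The last line shows why byte-oriented functions such as Pos are unsafe
on DBCS text: a trail byte can be indistinguishable from an ASCII
character.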


Introduction to UTF-8 (UCS Transformation Format)
=================================================

Windows can work with Unicode strings, as well as SBCS and DBCS, but the
Linux kernel works with UTF-8 strings, where one character may take up
to six bytes: normally one or two in Western languages and from one to
three in Asian languages. UTF-8 is a multibyte character encoding that
can accommodate all the characters of the UCS (Universal Character Set),
which contains 31-bit characters that can represent practically all the
characters of known languages, living and dead, including scripts like
Hiragana and Katakana. It also leaves space for more languages,
scripts and hieroglyphics, so in the future we can expect to be able to
read Klingon poetry, the Ferengi Acquisition Rules and Bajoran
prophecies in their original versions... :-)

UTF-8 has these important features:

* Variable-length encoding for UCS characters
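  UTF-8 can encode UCS (ISO 10646) characters in up to 6 bytes. (RFC
  3629 later limited UTF-8 to 4 bytes, which is enough for the code
  points up to U+10FFFF that Unicode actually uses.)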

* Transparency and uniqueness for ASCII characters
  7-bit ASCII characters (#0..#127) are encoded as plain 7-bit ASCII
  (1 byte per character). All non-ASCII characters (code points above
  #127) are represented purely with non-ASCII 8-bit values
  (#128..#255), so non-ASCII characters cannot be mistaken for ASCII
  characters, and ASCII-based text processing tools can be used on
  UTF-8 text as long as they pass 8-bit characters without
  interpretation.

* Null character
  Character #0 (ASCII NULL) only appears where a NULL is intended. It
  can't be a trail byte for instance.

* Self-synchronization for fast speed processing
  The high bits of each byte mark character boundaries unambiguously,
  which makes it easy to know whether a byte is a single-byte character
  (0xxxxxxx), a lead byte (11xxxxxx) or a fill byte (10xxxxxx). This
  feature is very important because it allows UTF-8 string processing
  functions to be far more efficient than their Windows DBCS
  counterparts. For example, a UTF-8 string can be parsed backwards,
  and a search for a multibyte character beginning with a lead byte
  will never match on a fill byte in the middle of an unrelated
  multibyte character. And since the lead byte announces the length of
  the multibyte character, you can quickly tell how many bytes to skip
  when parsing forward.

* Processor-friendliness
  UTF-8 can be read and written quickly with simple bitmask and bitshift
  operations, without any multiplication or division (which are slow
  CPU operations).

* Reasonable compression
  UTF-8 is not as compact as Windows DBCS, but for Western languages it
  is more compact than two-byte Unicode, and for Eastern languages (its
  worst case in practice) it is no worse than UCS-4.

* Canonical sort-order
  Comparing UTF-8 strings byte by byte, with plain 8-bit comparison
  routines like strcmp (a C standard function), gives the same order as
  comparing the UCS character values.

* Flag characters
  The octets #$FE and #$FF never appear, so you can use them as flags
  to signal a special meaning (avoiding the possibility of mistaking a
  flag with a real character).

* Detectability
  It's easy to detect UTF-8 input with high probability if you see
  the UTF-8 signature #$EF#$BB#$BF ('ï»¿' when displayed as Latin-1)
  or if you see valid UTF-8 multibyte characters, since it is very
  unlikely that they appear by accident in ISO 8859-1 (Latin-1) text.
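
The self-synchronization property can be illustrated with a short
sketch: starting from any byte, walking back over fill bytes (10xxxxxx)
reaches the first byte of the character. It assumes Free Pascal in
Delphi mode or a Delphi compiler, and that the input is valid UTF-8.
IsFillByte, CharStart and BytesToStr are names made up for this example.

```pascal
program Utf8SyncSketch;
{$ifdef FPC}{$mode delphi}{$else}{$APPTYPE CONSOLE}{$endif}

// Build a byte string from raw values (avoids source-encoding issues).
function BytesToStr(const b: array of Byte): AnsiString;
var
  i: Integer;
begin
  SetLength(Result, Length(b));
  for i := 0 to High(b) do
    Result[i + 1] := AnsiChar(b[i]);
end;

// True for a fill byte, i.e. a byte of the form 10xxxxxx.
function IsFillByte(b: Byte): Boolean;
begin
  Result := (b and $C0) = $80;
end;

// Index of the first byte of the character that contains byte i.
// Assumes 1 <= i <= Length(s) and that s is valid UTF-8.
function CharStart(const s: AnsiString; i: Integer): Integer;
begin
  Result := i;
  while (Result > 1) and IsFillByte(Byte(s[Result])) do
    Dec(Result);
end;

var
  s: AnsiString;
begin
  // 'a', U+00A9 (C2 A9), U+20AC (E2 82 AC), 'b'
  s := BytesToStr([$61, $C2, $A9, $E2, $82, $AC, $62]);
  WriteLn(CharStart(s, 3));  // 2: byte 3 is the fill byte of C2 A9
  WriteLn(CharStart(s, 6));  // 4: byte 6 is the last fill byte of E2 82 AC
  WriteLn(CharStart(s, 7));  // 7: 'b' is a single-byte character
end.
```

No lookup table or locale information is needed, which is the contrast
with the LeadBytes approach required for Windows DBCS.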


UTF-8 encoding
==============

This is the general format used to encode UCS characters in UTF-8:

 Bits  Bytes  Representation
   7     1    0xxxxxxx
  11     2    110xxxxx  10xxxxxx
  16     3    1110xxxx  10xxxxxx  10xxxxxx
  21     4    11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
  26     5    111110xx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
  31     6    1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx

Notice that the number of leading 1 bits in the lead byte is the number
of bytes in a multibyte sequence.

The copyright sign ('©' = #169 = #$A9) in binary would be 10101001 and
since it needs 8 bits, we would have to use two bytes:

                            110xxxxx  10xxxxxx

We have to fill 11 bits (x), so we add three zeroes to the left of
10101001:

                               00010    101001

The UTF-8 representation for the copyright character would then be:

                            11000010  10101001
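
The bit-filling steps above can be written as a small encoder. This is
a sketch rather than a library routine: it assumes Free Pascal in
Delphi mode or a Delphi compiler, follows the six-row table given
earlier, and uses the made-up names EncodeUTF8 and HexBytes. Values
above 31 bits are rejected by returning an empty string.

```pascal
program Utf8EncodeSketch;
{$ifdef FPC}{$mode delphi}{$else}{$APPTYPE CONSOLE}{$endif}

uses
  SysUtils;

// Encode one UCS value as the shortest UTF-8 sequence of the table.
function EncodeUTF8(cp: Cardinal): AnsiString;
var
  n, i: Integer;
begin
  Result := '';
  if cp > $7FFFFFFF then Exit;          // more than 31 bits: not encodable
  // Number of bytes needed for the value.
  if cp < $80 then n := 1
  else if cp < $800 then n := 2
  else if cp < $10000 then n := 3
  else if cp < $200000 then n := 4
  else if cp < $4000000 then n := 5
  else n := 6;
  SetLength(Result, n);
  if n = 1 then begin
    Result[1] := AnsiChar(Byte(cp));    // 0xxxxxxx
    Exit;
  end;
  // Fill bytes, last to first: each 10xxxxxx carries 6 payload bits.
  for i := n downto 2 do begin
    Result[i] := AnsiChar(Byte($80 or (cp and $3F)));
    cp := cp shr 6;
  end;
  // Lead byte: n leading 1 bits, a 0 bit, then the remaining payload.
  Result[1] := AnsiChar(Byte((($FF00 shr n) and $FF) or cp));
end;

// Format the bytes of s as hexadecimal, separated by spaces.
function HexBytes(const s: AnsiString): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(s) do begin
    if i > 1 then Result := Result + ' ';
    Result := Result + IntToHex(Ord(s[i]), 2);
  end;
end;

begin
  WriteLn(HexBytes(EncodeUTF8($A9)));    // C2 A9 (11000010 10101001)
  WriteLn(HexBytes(EncodeUTF8($20AC)));  // E2 82 AC
  WriteLn(HexBytes(EncodeUTF8($41)));    // 41
end.
```

The first line of output matches the two bytes derived by hand for the
copyright sign.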

It could also be represented with more bytes than needed, in an overlong
sequence. For example, with four bytes it would be:

                  11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
                       000    000000    000010    101001
                 ----------------------------------------
                  11110000  10000000  10000010  10101001

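Overlong sequences can be used to "camouflage" characters and cheat
UTF-8 substring tests. For example, if you search for the copyright sign
exactly as 11000010 10101001 (the shortest possible encoding), you
won't find a copy that was encoded in the overlong four-byte form. For
this reason a decoder should reject overlong sequences as invalid.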
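
One way to guard against overlong sequences is to decode each character
and compare the value with the smallest value that legitimately needs
that many bytes. The following is a minimal sketch under the same
six-row table, assuming Free Pascal in Delphi mode or a Delphi
compiler. DecodeUTF8, MinForLen and BytesToStr are names made up for
this example.

```pascal
program Utf8OverlongSketch;
{$ifdef FPC}{$mode delphi}{$else}{$APPTYPE CONSOLE}{$endif}

const
  // Smallest value that legitimately needs 1..6 bytes.
  MinForLen: array[1..6] of Cardinal =
    (0, $80, $800, $10000, $200000, $4000000);

// Build a byte string from raw values (avoids source-encoding issues).
function BytesToStr(const b: array of Byte): AnsiString;
var
  i: Integer;
begin
  SetLength(Result, Length(b));
  for i := 0 to High(b) do
    Result[i + 1] := AnsiChar(b[i]);
end;

// Decode the character starting at s[i]. Returns False for a malformed,
// truncated or overlong sequence; on success i moves past the character.
function DecodeUTF8(const s: AnsiString; var i: Integer;
  out cp: Cardinal): Boolean;
var
  b: Byte;
  n, k: Integer;
begin
  Result := False;
  cp := 0;
  if (i < 1) or (i > Length(s)) then Exit;
  b := Byte(s[i]);
  if (b and $80) = 0 then begin n := 1; cp := b; end
  else if (b and $E0) = $C0 then begin n := 2; cp := b and $1F; end
  else if (b and $F0) = $E0 then begin n := 3; cp := b and $0F; end
  else if (b and $F8) = $F0 then begin n := 4; cp := b and $07; end
  else if (b and $FC) = $F8 then begin n := 5; cp := b and $03; end
  else if (b and $FE) = $FC then begin n := 6; cp := b and $01; end
  else Exit;                               // fill byte, $FE or $FF
  if i + n - 1 > Length(s) then Exit;      // truncated sequence
  for k := 1 to n - 1 do begin
    b := Byte(s[i + k]);
    if (b and $C0) <> $80 then Exit;       // not a fill byte
    cp := (cp shl 6) or Cardinal(b and $3F);
  end;
  if cp < MinForLen[n] then Exit;          // overlong encoding
  Inc(i, n);
  Result := True;
end;

var
  s: AnsiString;
  i: Integer;
  cp: Cardinal;
  ok: Boolean;
begin
  s := BytesToStr([$C2, $A9]);             // shortest form of #$A9
  i := 1;
  ok := DecodeUTF8(s, i, cp);
  WriteLn(ok, ' ', cp);                    // TRUE 169

  s := BytesToStr([$F0, $80, $82, $A9]);   // overlong four-byte form
  i := 1;
  ok := DecodeUTF8(s, i, cp);
  WriteLn(ok);                             // FALSE
end.
```

Searching on decoded values, after rejecting overlong input, avoids the
camouflage problem described above.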


Length of a UTF-8 string
========================

In Delphi for Linux, long strings will be in UTF-8 format, while wide
strings will remain as two-byte Unicode, although they will be reference
counted. To know the number of characters stored in a UTF-8 string we
could use a function like the following:

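  // Requires SysUtils in the uses clause (for Exception).
  function UTF8Length(const s: string): integer;
  var
    i, n: integer;
    c: byte;
  begin
    Result := 0;
    n := Length(s);
    i := 1;
    while i <= n do begin
      inc(Result);
      c := byte(s[i]);
      // The high bits of the lead byte give the sequence length.
      // Note: the fill bytes that follow are skipped, not checked.
      if (c and $80) = 0 then        inc(i)      // 0xxxxxxx
      else if (c and $E0) = $C0 then inc(i, 2)   // 110xxxxx
      else if (c and $F0) = $E0 then inc(i, 3)   // 1110xxxx
      else if (c and $F8) = $F0 then inc(i, 4)   // 11110xxx
      else if (c and $FC) = $F8 then inc(i, 5)   // 111110xx
      else if (c and $FE) = $FC then inc(i, 6)   // 1111110x
      else
        // A fill byte (10xxxxxx), $FE or $FF cannot start a character.
        raise Exception.Create('Not a UTF-8 string!');
    end;
    // The last sequence ran past the end of the string (truncated).
    if i > n + 1 then
      raise Exception.Create('Not a UTF-8 string!');
  end;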

Of course this function should be written using pointers and a bit of
assembler to improve its performance, but let's leave that for the
pros... :)
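
As a modest step in that direction, here is a pointer-based sketch
without any assembler. It relies on the fact that every character has
exactly one byte that is not a fill byte (10xxxxxx), so counting those
bytes counts the characters. It assumes Free Pascal in Delphi mode or a
Delphi compiler, and that the input is already known to be valid UTF-8,
since nothing is validated. UTF8LengthFast and BytesToStr are names
made up for this example.

```pascal
program Utf8LengthFastSketch;
{$ifdef FPC}{$mode delphi}{$else}{$APPTYPE CONSOLE}{$endif}

// Build a byte string from raw values (avoids source-encoding issues).
function BytesToStr(const b: array of Byte): AnsiString;
var
  i: Integer;
begin
  SetLength(Result, Length(b));
  for i := 0 to High(b) do
    Result[i + 1] := AnsiChar(b[i]);
end;

// Count characters by counting every byte that is not a fill byte.
// No validation is performed: the input must already be valid UTF-8.
function UTF8LengthFast(const s: AnsiString): Integer;
var
  p, pEnd: PAnsiChar;
begin
  Result := 0;
  p := PAnsiChar(s);
  pEnd := p + Length(s);
  while p < pEnd do begin
    if (Byte(p^) and $C0) <> $80 then Inc(Result);
    Inc(p);
  end;
end;

var
  s: AnsiString;
begin
  // 'a', U+00A9 (C2 A9), U+20AC (E2 82 AC), 'b'
  s := BytesToStr([$61, $C2, $A9, $E2, $82, $AC, $62]);
  WriteLn(Length(s));          // 7 bytes
  WriteLn(UTF8LengthFast(s));  // 4 characters
end.
```

The trade-off is that malformed input is counted silently instead of
raising an exception, so this version only suits text that has been
validated elsewhere.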

