Switch to: V9V8V7V6V5

About Normalization of Unicode Text

It can be important to be aware about normalization of unicode text. This page gives required theory, descriptions of possible problems and solutions.

USC-2 and UTF-16 Encodings

UCS-2: Universal Character Set coded in 2 octets, which means that characters are represented in 16-bits per character.

UCS-2 is a subset of UTF-16. UCS-2 is identical to UTF-16 except that UTF-16 also supports combining characters and surrogates. Since UCS-2 is a simpler form of UTF-16, UCS-2 data will typically perform better than UTF-16.8

UTF-16: Unicode Transformation Format, a 16-bit encoding form designed to provide code values for over a million characters and a superset of UCS-2. UTF-16 data is stored in graphic data types. The CCSID value for data in UTF-16 format is 1200.

Both UTF-8 and UTF-16 data can contain combining characters. Combining character support allows a resulting character to be comprised of more than one character. After the first character, up to 300 different non-spacing accent characters (umlauts, accents, etc.) can follow in the data string. The resulting character may already be defined in the character set. In this case, there are multiple representations for the same character. For example, in UTF-16, an é can be represented either by X'00E9' (the normalized representation) or X'00650301' (the non-normalized combining character representation).

Since multiple representations of the same character will not compare equal, it is usually not a good idea to store both forms of the characters in the database

Normalization

Normalization is a process that replaces the string of combining characters with the hex value of the defined character. After normalization has occurred, only one representation of the character will exist in the data. If the resulting character is not a defined character, the combining character string is unchanged after normalization. For example, in UTF-16, any instances of X'00650301' (the non-normalized combining character representation of é ) will be converted to X'00E9' (the normalized representation of é ).

Normalization in Valentina 4.9.1

We have realized this problem at version 4.9 and have implement tools into 4.9.1 to work with normalization issue.

We have introduce two SQL functions IsNormalized() and Normalize().

These functions allows you:

1) check existing records

SELECT RecID, * 
FROM T
WHERE IsNormalized( fld ) = false

2) Normalize existed data in any String/VarChar/Text field using command as:

UPDATE T
    SET fld = Normalize(fld)
WHERE IsNormalized( fld ) = false   -- we use WHERE to minimize changed in the Field and its index.

3) Correct inserted/update values if you want that with help of Triggers.

CREATE TRIGGER T_BIU_Normalize
   BEFORE INSERT OR UPDATE ON T 
   REFERENCING NEW AS newRow
   FOR EACH ROW
BEGIN
      newRow.f1 = Normalize(newRow.f1)   -- some string field of this table
      newRow.f5 = Normalize(newRow.f5)   -- some varchar field of this table
      newRow.f7 = Normalize(newRow.f7)   -- some text field field of this table
END

3a) For local databases you can use also “SELECT normalize(str)” syntax to “calculate” normalized string before insert or update if your language/IDE do not provide similar feature. For example, Cocoa API, NET API and Java give you normalize() function.

Stored Procedures to Normalize

The following couple of stored procedures allows you to easily normalize records of an existed Table. If you want to normalize the whole database you can easily make proc_for_each_table_do(), using given SPs as example.

You should call these procedures as

CALL proc_for_each_column_of_table_do( 'my_table_name', 'proc_normalize_column' )
-- This function gets the name of the table and field.
-- Then UPDATE that field using normalize() function.
-- Note, that fld_name is wrapped by "" in the UPDATE command.
--
CREATE OR REPLACE procedure proc_normalize_column( IN tbl_name STRING, IN fld_name STRING )
BEGIN
 DECLARE query String;
 
 SET query = 'UPDATE ' || tbl_name || ' SET "' || fld_name || '" = normalize( "' || fld_name || '" ) 
              WHERE isnormalized( "' || fld_name || '" ) = false';
 PRINT query;
 
 EXECUTE query; 
end
-- This function gets the name of a Table.
-- Then for each its column it executes the specified proc_name procedure,
-- which should take two parameters: table and field names.
-- 
CREATE OR REPLACE procedure proc_for_each_column_of_table_do( IN tbl_name STRING, IN proc_name STRING )
BEGIN
    DECLARE cmd_show STRING;
 
    SET cmd_show = 
        'SELECT fld_name FROM (SHOW columns OF ' || tbl_name || ') WHERE fld_method_text = \'\';   ';    
 
    DECLARE curs CURSOR FOR cmd_show;    
    OPEN curs;
 
    BEGIN
        DECLARE fld_name STRING;
        FETCH FIRST curs INTO fld_name; 
        LOOP
            DECLARE cmd_call STRING;    
	        SET cmd_call = 'CALL ' || proc_name || '( \'' || tbl_name || '\', \'' || fld_name || '\' );' ;
 
            EXECUTE cmd_call;
 
            FETCH NEXT curs INTO fld_name;
        END LOOP 
    EXCEPTION WHEN ERR_CURSOR_WRONG_POSITION THEN 
    END
 
    CLOSE curs;
end

After this procedures done, you should do REINDEX DATABASE. This is because engine usually does not see that value of field was changed.