Table of Contents
About Normalization of Unicode Text
It can be important to be aware about normalization of unicode text. This page gives required theory, descriptions of possible problems and solutions.
USC-2 and UTF-16 Encodings
UCS-2: Universal Character Set coded in 2 octets, which means that characters are represented in 16-bits per character.
UCS-2 is a subset of UTF-16. UCS-2 is identical to UTF-16 except that UTF-16 also supports combining characters and surrogates. Since UCS-2 is a simpler form of UTF-16, UCS-2 data will typically perform better than UTF-16.8
UTF-16: Unicode Transformation Format, a 16-bit encoding form designed to provide code values for over a million characters and a superset of UCS-2. UTF-16 data is stored in graphic data types. The CCSID value for data in UTF-16 format is 1200.
Both UTF-8 and UTF-16 data can contain combining characters. Combining character support allows a resulting character to be comprised of more than one character. After the first character, up to 300 different non-spacing accent characters (umlauts, accents, etc.) can follow in the data string. The resulting character may already be defined in the character set. In this case, there are multiple representations for the same character. For example, in UTF-16, an é can be represented either by X'00E9' (the normalized representation) or X'00650301' (the non-normalized combining character representation).
Normalization
Normalization is a process that replaces the string of combining characters with the hex value of the defined character. After normalization has occurred, only one representation of the character will exist in the data. If the resulting character is not a defined character, the combining character string is unchanged after normalization. For example, in UTF-16, any instances of X'00650301' (the non-normalized combining character representation of é ) will be converted to X'00E9' (the normalized representation of é ).
Normalization in Valentina 4.9.1
We have realized this problem at version 4.9 and have implement tools into 4.9.1 to work with normalization issue.
We have introduce two SQL functions IsNormalized() and Normalize().
These functions allows you:
1) check existing records
SELECT RecID, * FROM T WHERE IsNormalized( fld ) = FALSE
2) Normalize existed data in any String/VarChar/Text field using command as:
UPDATE T SET fld = Normalize(fld) WHERE IsNormalized( fld ) = FALSE -- we use WHERE to minimize changed in the Field and its index.
3) Correct inserted/update values if you want that with help of Triggers.
CREATE TRIGGER T_BIU_Normalize BEFORE INSERT OR UPDATE ON T REFERENCING NEW AS newRow FOR EACH ROW BEGIN newRow.f1 = Normalize(newRow.f1) -- some string field of this table newRow.f5 = Normalize(newRow.f5) -- some varchar field of this table newRow.f7 = Normalize(newRow.f7) -- some text field field of this table END
3a) For local databases you can use also “SELECT normalize(str)” syntax to “calculate” normalized string before insert or update if your language/IDE do not provide similar feature. For example, Cocoa API, NET API and Java give you normalize() function.
Stored Procedures to Normalize
The following couple of stored procedures allows you to easily normalize records of an existed Table. If you want to normalize the whole database you can easily make proc_for_each_table_do(), using given SPs as example.
You should call these procedures as
CALL proc_for_each_column_of_table_do( 'my_table_name', 'proc_normalize_column' )
-- This function gets the name of the table and field. -- Then UPDATE that field using normalize() function. -- Note, that fld_name is wrapped by "" in the UPDATE command. -- CREATE OR REPLACE PROCEDURE proc_normalize_column( IN tbl_name STRING, IN fld_name STRING ) BEGIN DECLARE query String; SET query = 'UPDATE ' || tbl_name || ' SET "' || fld_name || '" = normalize( "' || fld_name || '" ) WHERE isnormalized( "' || fld_name || '" ) = false'; PRINT query; EXECUTE query; END
-- This function gets the name of a Table. -- Then for each its column it executes the specified proc_name procedure, -- which should take two parameters: table and field names. -- CREATE OR REPLACE PROCEDURE proc_for_each_column_of_table_do( IN tbl_name STRING, IN proc_name STRING ) BEGIN DECLARE cmd_show STRING; SET cmd_show = 'SELECT fld_name FROM (SHOW columns OF ' || tbl_name || ') WHERE fld_method_text = \'\'; '; DECLARE curs CURSOR FOR cmd_show; OPEN curs; BEGIN DECLARE fld_name STRING; FETCH FIRST curs INTO fld_name; LOOP DECLARE cmd_call STRING; SET cmd_call = 'CALL ' || proc_name || '( \'' || tbl_name || '\', \'' || fld_name || '\' );' ; EXECUTE cmd_call; FETCH NEXT curs INTO fld_name; END LOOP EXCEPTION WHEN ERR_CURSOR_WRONG_POSITION THEN END CLOSE curs; END
REINDEX DATABASE
. This is because engine usually does not see that value of field was changed.