Part III: PL/SQL Program Data
Chapter 8. Strings
8.1 The Impact of Character Sets
8.1.4 Character Set Issues
One byte or two? Fixed or variable? Does all this really matter? Yes, it does. When you are declaring string variables, the character set matters in different ways:
• It affects the maximum number of characters you can fit into a string.
• It impacts the interpretation of results returned by certain string functions.
• It can affect sort order (i.e., whether one string is greater or less than another).
8.1.4.1 Bytes versus characters
One of the first issues you need to wrap your mind around is that of bytes versus characters.
Consider the following declaration for a string variable:
feature_name CHAR(50);
Does this declaration allow for 50 characters or for 50 bytes? If your database character set is of the single-byte variety, then the distinction hardly matters: 50 characters is the same as 50 bytes.
But if your database uses a multibyte character set, then you really do need to know whether you are getting a 50-byte variable, or a variable capable of holding 50 multibyte characters.
In Oracle8i and earlier releases, string variables are always declared in terms of bytes, so in an Oracle8i database, CHAR(50) would result in a variable capable of holding 50 bytes. The number of characters that would fit into these 50 bytes would depend on the character set. Oracle's JA16EUC character set supports both ASCII and Japanese characters, is variable-width, and uses up to three bytes per character. If all characters in a given string were three bytes, a CHAR(50) column would hold only 16 of those characters.
To further complicate matters, most multibyte character sets are variable-width, so the number of characters that will fit into 50 bytes will vary. For example, the Unicode UTF-8 character set is a variable-width character set where characters are represented by up to three bytes. To prepare for the case in which the UTF-8 characters all have three bytes (surrogate pairs aren't considered here), you'd need to declare that variable as a CHAR(150) to be absolutely certain a variable could hold any 50 UTF-8 characters.
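For example, in a pre-Oracle9i database (or anywhere byte semantics are in force), you would size the declaration for the worst case. The following is only a sketch of that arithmetic; the assigned value is hypothetical:
DECLARE
   --50 characters x 3 bytes per UTF-8 character = 150 bytes, the worst case
   feature_name CHAR(150);
BEGIN
   feature_name := 'Amazon River';   --hypothetical value, well under 50 characters
END;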
8.1.4.2 Oracle9i string declarations
Beginning with Oracle9i, you have the option when declaring string variables of specifying whether your length is in terms of bytes or in terms of characters. For example, to declare a CHAR variable capable of holding 50 bytes, you can do the following:
feature_name CHAR(50 BYTE)
However, if you want to be certain that your variable can hold 50 characters from the underlying character set, you would declare as follows:
feature_name CHAR(50 CHAR)
And what of the following simple declaration?
feature_name CHAR(50)
This is ambiguous unless you know a critical piece of information about how the database has been configured. You need to know the setting of the NLS_LENGTH_SEMANTICS parameter, which you can find using the following query:
SQL> SELECT *
2 FROM nls_session_parameters
3 WHERE parameter = 'NLS_LENGTH_SEMANTICS';
PARAMETER                      VALUE
------------------------------ --------------------
NLS_LENGTH_SEMANTICS           BYTE
A value of BYTE is the default NLS_LENGTH_SEMANTICS setting, indicating that byte semantics are the default and that CHAR(50) is the same as CHAR(50 BYTE). A value of CHAR indicates character semantics, in which case CHAR(50) would be interpreted as CHAR(50 CHAR). Be aware that your DBA can change the NLS_LENGTH_SEMANTICS setting from its default using an ALTER SYSTEM command; you yourself can change it at the session level via the ALTER SESSION command. If you have any doubts about NLS_LENGTH_SEMANTICS, query the NLS_SESSION_PARAMETERS view for the current setting.
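For example, the following statements (a sketch you would run from SQL*Plus) switch the current session to character semantics and then confirm the new setting:
ALTER SESSION SET NLS_LENGTH_SEMANTICS = CHAR;

SELECT value
  FROM nls_session_parameters
 WHERE parameter = 'NLS_LENGTH_SEMANTICS';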
Your use of the BYTE and CHAR qualifiers has an effect that goes beyond a variable's declared length. The qualifiers also affect the way in which certain string functions treat the variables, as well as whether CHAR variables are padded (or not) with spaces to their maximum length. Consider the following two tables, created on a system using UTF-8 as the database character set:
SQL> DESCRIBE utest
 Name                                      Null?    Type
 ----------------------------------------- -------- --------------
 UCHAR                                              CHAR(1 CHAR)

SQL> DESCRIBE utest2
 Name                                      Null?    Type
 ----------------------------------------- -------- --------------
 UCHAR2                                             CHAR(3)
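The two tables shown in these DESCRIBE listings could have been created with DDL along the following lines (a sketch; the original CREATE TABLE statements are not shown here):
CREATE TABLE utest  (uchar  CHAR(1 CHAR));
CREATE TABLE utest2 (uchar2 CHAR(3));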
One column has been declared using character semantics, while the other has been declared using byte semantics. Yet both columns consume three bytes, as the following data dictionary query demonstrates:
SQL> SELECT table_name, column_name, data_length, char_length, char_used
2 FROM user_tab_columns
3 WHERE column_name IN ('UCHAR','UCHAR2');
TABLE_NAME COLUMN_NAM DATA_LENGTH CHAR_LENGTH CHAR_USED
---------- ---------- ----------- ----------- ---------
UTEST      UCHAR                3           1 C
UTEST2     UCHAR2               3           3 B
You can see that both columns use three bytes (DATA_LENGTH), and that Oracle has kept track of whether character or byte semantics were used to declare the columns (CHAR_USED).
The following PL/SQL code snippet shows the effect of the differing semantics on the operation of the LENGTH function:
DECLARE
   uchar  utest.uchar%TYPE;
   uchar2 utest2.uchar2%TYPE;
BEGIN
   uchar := 'ã';
   uchar2 := 'ã';
   DBMS_OUTPUT.PUT_LINE(LENGTH(uchar));
   DBMS_OUTPUT.PUT_LINE(LENGTH(uchar2));
   DBMS_OUTPUT.PUT_LINE(LENGTHB(uchar));
   DBMS_OUTPUT.PUT_LINE(LENGTHB(uchar2));
END;
The output is:
1
2
2
3
The key to understanding this output lies in knowing that the LENGTH function (which you'll learn more about later in this chapter) always returns the length of a string in terms of characters.
Likewise, LENGTHB counts the number of bytes occupied by the characters in a string, not the number of bytes used to declare the string. Here are some things to note about this code example:
• The uchar and uchar2 variable declarations are based on the underlying database columns. Therefore, each variable occupies three bytes.
• Both variables are set to the single character ã, which is represented in Unicode UTF-8 using two bytes.
• Because uchar was declared using character semantics, its length is reported in terms of characters. ã is one character. The length of uchar will always be one character.
• Because uchar2 was declared using byte semantics, there is one byte left over after the two-byte character ã. This extra byte is filled with a space character (as is always the case with extra bytes in CHAR strings). The resulting length, however, is still reported in terms of characters.
• The fact that both variables consume three bytes is not reflected in the values returned by LENGTH because that function is counting characters, not bytes.
• LENGTHB returns the length of each variable in terms of bytes, but the semantics used to declare each variable still affect the results. LENGTHB counts not the number of physical bytes occupied by the variable, but rather the number of bytes occupied by each variable's value. The value ã in uchar is two bytes long. The value in uchar2, however, is ã followed by a space character, for a total of three bytes.
Because of the confusion that can arise when character and byte semantics are mixed, Oracle recommends that you not use the CHAR and BYTE qualifiers and instead rely on the database default, which you can set using NLS_LENGTH_SEMANTICS. There isn't yet a large enough body of knowledge on this issue to enable us to agree or disagree with Oracle's recommendation.
However, we strongly believe that if you are using multibyte character sets, you must familiarize yourself with the issues we've described in this section, and understand how the CHAR and BYTE qualifiers affect the operation of string variables.
8.1.4.3 Character function semantics
Character functions such as SUBSTR and INSTR deal with positions inside a string. SUBSTR, for example, allows you to specify the starting position, and optionally the length, of a substring you want to extract from a containing string. You can choose to express such positions in terms of bytes or characters.
Functions such as SUBSTR and INSTR always deal in characters; functions such as SUBSTRB and INSTRB (note the trailing B) always deal in bytes. If you're working with multibyte character sets, it matters which function you use in any given situation. When using Unicode, you have even more character function variations to learn about.
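The following sketch illustrates the difference, assuming a UTF-8 database character set in which the character ã occupies two bytes:
DECLARE
   s VARCHAR2(30 CHAR) := 'ãb';             --two characters, three bytes
BEGIN
   DBMS_OUTPUT.PUT_LINE(SUBSTR(s, 2, 1));   --'b': the second character
   DBMS_OUTPUT.PUT_LINE(SUBSTRB(s, 3, 1));  --'b': the third byte
   DBMS_OUTPUT.PUT_LINE(INSTR(s, 'b'));     --2: position counted in characters
   DBMS_OUTPUT.PUT_LINE(INSTRB(s, 'b'));    --3: position counted in bytes
END;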
8.1.4.4 Code points and code units
When working with Unicode, there is more than just bytes and characters to deal with. You must also understand something about code points and code units. In Unicode, a code point is a numeric value corresponding to an entry in an encoding table. For example, 0x0061 is the code point for the letter "a". Code points are sometimes combined in order to form one character. The code point for the letter "a" can be combined with 0x0303, the code point corresponding to the tilde, to form the character ã. The 0x0303 code point happens to represent a combining diacritical mark, meaning that it is always used to modify a character, never used standalone.
In Unicode, it's not only possible for multiple code points to represent one character, but one code point can represent many characters. If you find that confusing, don't feel bad—we do too.
While a code point represents a specific mapping of a character to a numeric value in an encoding table, a code unit refers to the actual representation of the code point. Take, for example, the Unicode code point 0x0061 representing the letter "a". The UTF-8 representation of that code point uses just one byte: 0x61. That one byte is a code unit. The UTF-16 representation, on the other hand, uses two bytes to represent the code point: 0x0061. Those two bytes together form one code unit. As you can see, the code unit size (i.e., the number of bytes in the code unit) varies depending on which form of Unicode is being used. Sometimes, a code point value is too large for the underlying code unit. In such cases, the code point is represented using two or more code units. For example, the code point 0x1D11E, which represents the musical symbol G clef, is represented in UTF-16 using two code units: 0xD834 and 0xDD1E. Neither value represents a code point by itself; only together do the two code units represent a code point, and in this case the code point in turn represents a single character.
The distinction between bytes, characters, code points, and code units becomes important when you work with string functions such as LENGTH and INSTR. When you want the length of a string, do you want it in terms of bytes, characters, code points, or code units? Oracle9i supports variations of these functions that allow you to choose the semantics that you wish to use for a given invocation; for example, you can use LENGTHC to look at the length of a string in terms of the number of Unicode characters it holds. Certain Unicode characters can be represented in multiple ways. In UTF-16, the character ã can be represented as the single code point 0x00E3 or as the two code points 0x0061 and 0x0303. The standard LENGTH function will see the two code points as two characters. The LENGTHC function will recognize that 0x0061 followed by 0x0303 represents only a single character.
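The following sketch, which assumes an AL16UTF16 national character set, shows the difference (the variable name is ours):
DECLARE
   nm NVARCHAR2(10) := UNISTR('\0061\0303');  --'a' plus combining tilde: ã
BEGIN
   DBMS_OUTPUT.PUT_LINE(LENGTH(nm));   --2: two code points
   DBMS_OUTPUT.PUT_LINE(LENGTHC(nm));  --1: one Unicode character
END;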
Unicode has many subtle facets. We've been told there may be unusual cases where a single Unicode character could be interpreted as multiple characters by a person who speaks the language in question. Because of subtle issues like this one, we strongly recommend that you become familiar with the resources at http://unicode.org if you work with Unicode.
In addition to LENGTH and LENGTHC, you should also consider LENGTH2 and LENGTH4.
LENGTH2 counts the number of code units in a string, and LENGTH4 counts the number of code points in a string. This is why it's important to understand the distinction between code units and code points. Other character string functions, such as SUBSTR and INSTR, also have the same variations as LENGTH.
In Oracle's documentation you'll see references to UCS-2 and UCS-4. These acronyms are where the 2 and 4 in LENGTH2 and LENGTH4 come from. Originally defined by an ISO standard that mirrored the Unicode specification, the UCS-2 and UCS-4 acronyms are now obsolete. Whenever you see UCS-2, think code unit. Whenever you see UCS-4, think code point.
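Here is a short sketch showing the two functions applied to the G clef character mentioned earlier; it assumes an AL16UTF16 national character set and that UNISTR is given the character's surrogate pair:
DECLARE
   g_clef NVARCHAR2(10) := UNISTR('\D834\DD1E');  --surrogate pair for code point 0x1D11E
BEGIN
   DBMS_OUTPUT.PUT_LINE(LENGTH2(g_clef));  --2: two code units (UCS-2)
   DBMS_OUTPUT.PUT_LINE(LENGTH4(g_clef));  --1: one code point (UCS-4)
END;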
8.1.4.5 Equality of Unicode strings
When testing two character strings for equality, PL/SQL doesn't take into account the fact that a single Unicode character may have more than one representation. A string containing the single code point 0x00E3 to represent the character ã is not considered the equivalent of a string containing the two code points 0x0061 and 0x0303. For example:
DECLARE
   x NVARCHAR2(30);
   y NVARCHAR2(30);
BEGIN
   x := UNISTR('\00E3');
   y := UNISTR('\0061\0303');
   IF x = y THEN
      DBMS_OUTPUT.PUT_LINE('x = y');
   ELSIF x <> y THEN
      DBMS_OUTPUT.PUT_LINE('x <> y');
   END IF;
END;
The output is:
x <> y
You can use the COMPOSE function, described later in this chapter, to deal with this particular situation. In this specific instance, you would write:
IF COMPOSE(x) = COMPOSE(y)
There is one situation where COMPOSE won't help, though, and that is when a character can be represented both by a surrogate pair and by its code point value.
In Oracle9i Release 1, you must call COMPOSE from within a SQL statement. In Release 2, you can also invoke COMPOSE in a PL/SQL statement.
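For example, under Release 1 you could rewrite the earlier comparison along the following lines (a sketch; the SELECT ... FROM dual is there only so that COMPOSE is invoked from within a SQL statement):
DECLARE
   x      NVARCHAR2(30) := UNISTR('\00E3');
   y      NVARCHAR2(30) := UNISTR('\0061\0303');
   result VARCHAR2(10);
BEGIN
   SELECT CASE WHEN COMPOSE(x) = COMPOSE(y) THEN 'x = y' ELSE 'x <> y' END
     INTO result
     FROM dual;
   DBMS_OUTPUT.PUT_LINE(result);  --x = y
END;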
8.1.4.6 Sort order
Sort order is perhaps the only character set issue that is not an extension of the basic bytes-versus-characters issue. Different character sets use different numeric values for the various characters represented. This affects comparison statements in which one string or character is compared to another. Is the uppercase letter "A" greater than or less than the lowercase letter "a"? It depends. If you're using an ASCII-based character set, you'll find that uppercase letters are "less than" lowercase letters. If you're using EBCDIC, however, you'll find the opposite to be true.
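For example, on a database using an ASCII-based character set and the default binary comparison semantics (NLS_COMP = BINARY), the following sketch prints the first message; on an EBCDIC-based system it would print the second:
BEGIN
   IF 'A' < 'a' THEN
      DBMS_OUTPUT.PUT_LINE('Uppercase sorts before lowercase');
   ELSE
      DBMS_OUTPUT.PUT_LINE('Lowercase sorts before uppercase');
   END IF;
END;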