SOUNDEX - Oracle PL/SQL Book (Third Edition)

The SOUNDEX function allows you to perform string comparisons based on phonetics (the way a word sounds) as opposed to orthography (the way a word is spelled).^[1]

[1] Oracle Corporation uses the algorithm in Donald Knuth's The Art of Computer Programming, Volume 3, to generate the phonetic representation.

SOUNDEX returns a character string that is the "phonetic representation" of the argument. The specification of the SOUNDEX function is as follows:

FUNCTION SOUNDEX (string1 IN VARCHAR2) RETURN VARCHAR2

Here are some of the values SOUNDEX generated, and their variations according to the input string:

SOUNDEX ('smith') --> 'S530'
SOUNDEX ('SMYTHE') --> 'S530'
SOUNDEX ('smith smith') --> 'S532'
SOUNDEX ('smith z') --> 'S532'
SOUNDEX ('feuerstein') --> 'F623'
SOUNDEX ('feuerst') --> 'F623'

Keep the following SOUNDEX rules in mind when using this function:

• The SOUNDEX value always begins with the first letter in the input string.

• SOUNDEX always returns a four-character value (a letter followed by three digits), so only the first few consonants in the string contribute to the return value.

• Only consonants are used to compute the numeric portion of the SOUNDEX value. Except for leading vowels, all vowels are ignored.

• SOUNDEX is not case-sensitive. Upper- and lowercase letters return the same SOUNDEX value.
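The rules above can be checked with a short anonymous block. This is a minimal sketch, assuming server output is enabled in your session; the sample words are chosen only for illustration.

```sql
BEGIN
   -- The value begins with the first letter, even when it is a vowel.
   DBMS_OUTPUT.PUT_LINE (SOUNDEX ('Amy'));      -- A500

   -- Vowels after the first letter are ignored, so these two match.
   DBMS_OUTPUT.PUT_LINE (SOUNDEX ('smith'));    -- S530
   DBMS_OUTPUT.PUT_LINE (SOUNDEX ('smooth'));   -- S530

   -- Case makes no difference.
   DBMS_OUTPUT.PUT_LINE (SOUNDEX ('Robert'));   -- R163
   DBMS_OUTPUT.PUT_LINE (SOUNDEX ('ROBERT'));   -- R163
END;
```

Short inputs such as 'Amy' are padded with zeros to fill out the four-character value.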

The SOUNDEX function is useful for ad hoc queries, and any other kinds of searches where the exact spelling of a database value is not known or easily determined.
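As an illustration of such a search, the following sketch compares a search term against a small in-memory list of names. The names, the name_list_t type, and the variable names are hypothetical; in a real application the comparison would more likely appear in the WHERE clause of a query.

```sql
DECLARE
   -- Hypothetical search term and candidate values, for illustration only.
   TYPE name_list_t IS TABLE OF VARCHAR2(30);

   search_name   VARCHAR2(30) := 'Smythe';
   candidates    name_list_t
      := name_list_t ('Smith', 'Smithe', 'Jones', 'Feuerstein');
BEGIN
   FOR i IN 1 .. candidates.COUNT
   LOOP
      -- Two names "sound alike" when their SOUNDEX values are equal.
      IF SOUNDEX (candidates (i)) = SOUNDEX (search_name)
      THEN
         DBMS_OUTPUT.PUT_LINE (candidates (i) || ' sounds like ' || search_name);
      END IF;
   END LOOP;
END;
```

With these values the block reports that Smith and Smithe sound like Smythe, because all three map to 'S530'.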

The SOUNDEX algorithm is English-centric and may not work well (or at all) for other languages.

SUBSTR, SUBSTRB, SUBSTRC, SUBSTR2, and SUBSTR4

The SUBSTR family of functions is one of the most common and useful sets of character functions. The SUBSTR functions allow you to extract a subset of contiguous characters from a string. The substring is specified by starting position and length, and the functions differ in the units they use:

SUBSTR

Starting position and length are in terms of characters.

SUBSTRB

Starting position and length are in terms of bytes. When you use a single-byte character set, SUBSTRB and SUBSTR will return the same results.

SUBSTRC

Starting position and length are in terms of Unicode characters, after any decomposed characters have been composed.

SUBSTR2

Starting position and length are in terms of code units.

SUBSTR4

Starting position and length are in terms of code points.

All of the function specifications follow the same pattern:

FUNCTION SUBSTR
   (string_in IN VARCHAR2,
    start_position_in IN NUMBER
    [, substr_length_in IN NUMBER])
RETURN VARCHAR2

where the arguments are as follows:

string_in

The source string

start_position_in

The starting position of the substring in string_in

substr_length_in

The length of the substring desired (the number of characters to be returned in the substring)

The last parameter, substr_length_in, is optional. If you do not specify a substring length, then SUBSTR returns all the characters to the end of string_in (from the starting position specified).
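A brief sketch of the difference between supplying and omitting the length follows. The sample variable is introduced here only for illustration, and server output is assumed to be enabled.

```sql
DECLARE
   sample VARCHAR2(20) := 'Sample string';
BEGIN
   -- With a length: six characters starting at position 1.
   DBMS_OUTPUT.PUT_LINE (SUBSTR (sample, 1, 6));   -- Sample

   -- With a length: three characters starting at position 8.
   DBMS_OUTPUT.PUT_LINE (SUBSTR (sample, 8, 3));   -- str

   -- Without a length: everything from position 8 to the end.
   DBMS_OUTPUT.PUT_LINE (SUBSTR (sample, 8));      -- string
END;
```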

The substr_length_in argument, if present, must be greater than zero.

A starting position of zero is treated as 1. If the starting position is less than zero, then the substring is retrieved from the back of the string: SUBSTR counts backwards ABS(start_position_in) characters from the end of string_in to find the first character of the substring. In this case, however, the characters that are extracted are still to the right of the starting position. See Figure 8-3 for an illustration of how the different arguments are used by SUBSTR.

Figure 8-3. How arguments are used by SUBSTR
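In the absence of the figure, a negative starting position can be sketched in code. The sample variable is an assumption made for this example.

```sql
DECLARE
   sample VARCHAR2(20) := 'Sample string';
BEGIN
   -- Start six characters from the end and read to the end of the string.
   DBMS_OUTPUT.PUT_LINE (SUBSTR (sample, -6));      -- string

   -- Start six characters from the end, then read three characters
   -- to the right of that position.
   DBMS_OUTPUT.PUT_LINE (SUBSTR (sample, -6, 3));   -- str
END;
```

Note that the characters are never returned in reverse order; the negative value only changes where the substring begins.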

You will find that in practice SUBSTR is very forgiving. Even if you violate the rules for the values of the starting position and the number of characters to be substringed, SUBSTR will not generate errors. Instead, for the most part, it will return NULL—or the entire string—as its answer.
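The following sketch shows a few of these out-of-range calls. The show procedure is a hypothetical helper added only so that NULL results are visible in the output.

```sql
DECLARE
   sample VARCHAR2(20) := 'Sample string';

   -- Hypothetical display helper: prints a marker in place of NULL.
   PROCEDURE show (label_in IN VARCHAR2, value_in IN VARCHAR2)
   IS
   BEGIN
      DBMS_OUTPUT.PUT_LINE (label_in || ' --> ' || NVL (value_in, '<NULL>'));
   END show;
BEGIN
   show ('start past the end', SUBSTR (sample, 50));       -- <NULL>
   show ('length of zero',     SUBSTR (sample, 1, 0));     -- <NULL>
   show ('negative length',    SUBSTR (sample, 1, -3));    -- <NULL>
   show ('length too large',   SUBSTR (sample, 1, 500));   -- Sample string
END;
```

None of these calls raises an exception, so code that depends on a non-NULL result should check for NULL explicitly.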

Here are some examples of SUBSTR:

• Return the last character in a string:

SUBSTR ('Another sample string', -1) --> 'g'

This is the cleanest way to get the single last character. A more direct, but also more cumbersome approach is this:

SUBSTR ('Sample string', LENGTH ('Sample string'), 1) --> 'g'

In other words, calculate the LENGTH of the string and extract the one character that starts at that last position. Yuck.

• Remove an element from a string list. This is, in a way, the opposite of SUBSTR: I want to remove a portion of a string and leave the rest of it intact. Oddly enough, I will use SUBSTR to perform this task. Suppose that my screen maintains a list of selected temperatures as follows:

|HOT|COLD|LUKEWARM|SCALDING|

The vertical bar delimits the different items on the list. When the user deselects "LUKEWARM", I have to remove it from the list. The best way to accomplish this task is to determine the starting and ending positions of the item to be removed, and then use SUBSTR to take apart the list and put it back together without the specified item. Let's walk through this process a step at a time. For example:

DECLARE
   my_list   VARCHAR2(50) := '|HOT|COLD|LUKEWARM|SCALDING|';
   to_delete VARCHAR2(20);
   start_pos NUMBER;
   end_pos   NUMBER;
BEGIN
   to_delete := 'LUKEWARM';
   start_pos := INSTR(my_list, to_delete);   --first char to delete
   end_pos := start_pos + LENGTH(to_delete); --trailing delimiter, also deleted
   my_list := SUBSTR (my_list, 1, start_pos-1) ||
              SUBSTR (my_list, end_pos+1);
   DBMS_OUTPUT.PUT_LINE(my_list);
END;

The output is:

|HOT|COLD|SCALDING|
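The same technique can be wrapped in a reusable function. This is a sketch rather than code from the original text: remove_item and its parameter names are hypothetical. It assumes every item in the list is enclosed in delimiters, and it searches for the item together with those delimiters so that a partial match (such as WARM inside LUKEWARM) is not removed by mistake.

```sql
DECLARE
   my_list VARCHAR2(50) := '|HOT|COLD|LUKEWARM|SCALDING|';

   -- Hypothetical helper, not part of the original text.
   FUNCTION remove_item (
      list_in    IN VARCHAR2,
      item_in    IN VARCHAR2,
      delim_in   IN VARCHAR2 := '|')
      RETURN VARCHAR2
   IS
      -- Position of the delimiter that precedes the item (0 if not found).
      start_pos PLS_INTEGER
         := INSTR (list_in, delim_in || item_in || delim_in);
   BEGIN
      IF start_pos = 0 OR start_pos IS NULL
      THEN
         RETURN list_in;   -- item not in the list: return it unchanged
      END IF;

      -- Keep everything through the leading delimiter, then skip the
      -- item and its trailing delimiter.
      RETURN    SUBSTR (list_in, 1, start_pos + LENGTH (delim_in) - 1)
             || SUBSTR (list_in,
                        start_pos + 2 * LENGTH (delim_in) + LENGTH (item_in));
   END remove_item;
BEGIN
   DBMS_OUTPUT.PUT_LINE (remove_item (my_list, 'LUKEWARM'));   -- |HOT|COLD|SCALDING|
   DBMS_OUTPUT.PUT_LINE (remove_item (my_list, 'WARM'));       -- list unchanged
END;
```

The guard on start_pos matters: without it, an item that is not in the list would make INSTR return 0 and the two SUBSTR calls would silently return a damaged list.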

• Use SUBSTR to extract the portion of a string between the specified starting and ending points. I run into this requirement all the time. SUBSTR requires a starting position and the number of characters to pull out. Often, however, I have only the starting position and the ending position, and I then have to compute the number of characters in between. Is that just the difference between the end and start positions? Or is it one more or one less than that? Invariably, I get it wrong the first time and have to scribble a little example on scrap paper to prove the formula to myself.

So to save you the trouble, I offer a tiny function called betwnstr (for "BETWeeN STRing"). This function encapsulates the calculation you must perform to come up with the number of characters between the start and end positions, which is end_position - start_position + 1.

/* File on web: betwnstr.sf */
FUNCTION betwnstr
   (string_in IN VARCHAR2, start_in IN INTEGER, end_in IN INTEGER)
   RETURN VARCHAR2
IS
BEGIN
   RETURN SUBSTR (string_in, start_in, end_in - start_in + 1);
END;

While this function does not provide the full flexibility offered by SUBSTR (for example, with negative starting positions), it offers a starting point for the kind of encapsulation you should be performing in these situations.
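A short usage sketch follows. The function is declared locally so the block can run on its own, and the full_name value and comma_pos variable are hypothetical.

```sql
DECLARE
   full_name   VARCHAR2(30) := 'Feuerstein, Steven';   -- hypothetical value
   comma_pos   PLS_INTEGER;

   -- Local copy of betwnstr so that this block is self-contained.
   FUNCTION betwnstr
      (string_in IN VARCHAR2, start_in IN INTEGER, end_in IN INTEGER)
      RETURN VARCHAR2
   IS
   BEGIN
      RETURN SUBSTR (string_in, start_in, end_in - start_in + 1);
   END betwnstr;
BEGIN
   -- Characters 3 through 5, inclusive.
   DBMS_OUTPUT.PUT_LINE (betwnstr ('abcdefgh', 3, 5));              -- cde

   -- Everything before the comma: positions 1 through comma_pos - 1.
   comma_pos := INSTR (full_name, ',');
   DBMS_OUTPUT.PUT_LINE (betwnstr (full_name, 1, comma_pos - 1));   -- Feuerstein
END;
```

Both the start and end positions are included in the result, which is why the length is computed as end minus start plus one.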

Like the INSTR and LENGTH families of functions, SUBSTR offers permutations useful in dealing with multibyte character sets and Unicode. The following PL/SQL block illustrates the difference between character and byte semantics:

DECLARE
   --NVARCHAR2 = UTF-16 in this example.
   x NVARCHAR2(50 CHAR) :=
      UNISTR('The character a\0303 is the same as ã');
BEGIN
   DBMS_OUTPUT.PUT_LINE(SUBSTR(x,25,4));
   DBMS_OUTPUT.PUT_LINE(SUBSTRB(x,49,8));
END;

The output is:

same
same

The word "same" occupies four characters beginning at character position 25. In terms of bytes, though, it occupies eight bytes beginning at byte position 49. Because we're using the UTF-16 character set, each character occupies two bytes. The first 24 characters occupy the first 48 bytes; thus the 25th character begins at the 49th byte, and occupies bytes 49 and 50.
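Rather than working out byte offsets by hand, they can be derived from character positions with LENGTHB. The sketch below is an illustration under assumptions, not code from the original text: the variable names are hypothetical, and the final character is written as the escape \00E3 instead of a literal so the source does not depend on the client character set.

```sql
DECLARE
   x          NVARCHAR2(50)
      := UNISTR ('The character a\0303 is the same as \00E3');
   char_pos   PLS_INTEGER := 25;   -- character position of "same"
   char_len   PLS_INTEGER := 4;    -- length of "same" in characters
   byte_pos   PLS_INTEGER;
   byte_len   PLS_INTEGER;
BEGIN
   -- Bytes used by everything before the substring, plus one.
   byte_pos := NVL (LENGTHB (SUBSTR (x, 1, char_pos - 1)), 0) + 1;

   -- Bytes used by the substring itself.
   byte_len := LENGTHB (SUBSTR (x, char_pos, char_len));

   DBMS_OUTPUT.PUT_LINE ('byte position ' || byte_pos || ', byte length ' || byte_len);
   DBMS_OUTPUT.PUT_LINE (SUBSTRB (x, byte_pos, byte_len));
END;
```

With a UTF-16 national character set, as assumed in the example above, this should report byte position 49 and byte length 8 before printing the word itself. Other character sets will give different offsets, which is the reason for computing them.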
