What's in a Name?
The problem of finding a record by name is familiar to anyone who has set up a system to manage a membership or a customer list. Names are not unique and are often misheard and mis-spelled. Finding the first match and then presenting successive records to allow a choice to be made on the basis of the other details is an easy way to handle common names but the onus is on the keyboard operator to get the name exactly right and to cross-check with other details until completely satisfied that a match has been found. I am sure mine is not the only family with several members having accounts at the same branch of a bank to have found we are paying each others’ standing orders.
A more challenging case is to find a record which is believed to be in the database but an exact match for the supposed name cannot be found. The member or customer may not be on hand, may not speak good English or may not have legible handwriting. In such circumstances the operator could make successive guesses or try to set up a search on another field for which some details are known like address or post code. Even better would be to be given a list of near matches.
I found myself in need of some such retrieval functions recently when setting up a customer database for a shop selling TV and white goods on hire purchase or rental. Many customers do not know their account number and a facility to enter the first few characters of their surname and retrieve a list of near matches allows a quick response. The database holds the list of names as a matrix and uses it for sorting before printing. I decided, therefore, to use the sort matrix to control the sequence in which records would be put to the screen according to the ‘goodness of fit’ of the matrix name to the name input, which might be only the first few characters.
My first attempt was the function match below which compares the key input with the same number of columns down the matrix of names and returns the 2 decode of the resulting booleans. This gives more weight to a match closer to the beginning of the names but, of course, the sequence of the key must be an exact match over those columns selected by the length of the key. It is probably faster to confine the decode to those names having the correct initial letter as shown. One would normally hold the names matrix as upper case to save conversion on retrieval. The result is a numeric vector which is used to re-order the database index before the records are displayed. By this function Smith is more closely matched by Smythe than it is by Smillie but the column positions of characters must be the same for a match to contribute to the weighting, and most spelling variations do not have this property.
∇match[⎕]∇ ∇ r←names match key;boo;s [1] ⍝ returns weighted match [2] names←ucase names [3] s←⍴key←clip ucase key [4] boo←names[;⎕io]=1↑key [5] r←names[boo/⍳⍴boo;⍳s]=((+/boo),s)⍴key [6] r←boo\2⊥⍉r ∇ names[¥names match 'camacho';] Camacho Camaccio Comacho Comanche the remaining names, having a zero weight, are not shown names[¥names match 'mahomet';] Mahon Mahmoud Macleod MacIntyre Mohamet Mohammed McCloud McIntire Moon and after this zero weighted
Then I remembered reading in a Zark Newsletter about a coding system based on the sound rather than the spelling of names. It was devised to resolve problems with early US Census returns where the large immigrant population Anglicised their names with very variable results. It has also proved useful for doing genealogical research where spelling is less standardised and varies over the generations. The principle is to group the letters of the alphabet according to similarity of sound. Those familiar with Pitmans shorthand will recognize this idea. The set of vowels and near-vowels is assigned to Group 0 and are only taken into account as spacers during the assessment of consonants, being omitted from the final code. The grouping is as follows:
0 A E H I O U W Y 1 B F P V 2 C G J K Q S X Z 3 D T 4 L 5 M N 6 R
To obtain the code for a name, case is ignored and spaces and non-alphabetic characters are compressed out. Then the group number is substituted for each character. Contiguous duplicate numbers are compressed out and the first number is dropped and the first character of the name is restored in its place. Finally all zeros are compressed out and the code truncated to the alpha plus three digits with trailing zero fill if necessary. The result is called the Soundex Code for the name. Applying this algorithm to my list of names and using the upgrade of the result to order them should bring names close together which are similar in sound although not in spelling. Thus Macleod will have the same Soundex code as McCloud. However the initial alpha has an overriding effect as shown by the second of the namelists below. If as is seen in the third column the initial digit is retained instead of substituting the first character then Feezy can be matched with Pheasey.
Of course one has to accept that many more possible matches will be thrown up by the use of Soundex and this may be difficult to manage for a large database. Zark suggests that the return from the soundex function should be made numeric by using the index of the initial character in the alphabet as the start of the Soundex code. The Zark article was presented as a competition to find the fastest version which my attempt below will certainly not be. Has anyone seen the winning one?
∇soundex[⎕]∇ ∇ codes←soundex names;vector;alph;nums;s;ab [1] ⍝ see Zark News 1993 Quarter 2 [2] names←ucase leftjustify(1 0 ⌈¯2↑⍴names)⍴names [3] vector←,'*',names [4] alph←'AEHIOUWYBFPVCGJKQSXZDTLMNR*' [5] nums←'00000000111122222222334556*' [6] vector←(vector∊alph)/vector [7] codes←nums[alph⍳vector] [8] codes←(codes≠¯1⌽codes)/codes [9] codes←,'*', 0 1 ↓'0' vim codes [10] s←1↑⍴codes←'0' vim(codes≠'0')/codes [11] codes←names[;⎕io],(s,3)↑codes ∇vim is a slightly amended version of the Sharp vtom
alphabetic Soundex Soundex, no alpha Barclay Barclay B624 Feezy 1200 Berkeley Berkeley B624 Pheasey 1200 Braun Braun B650 Barclay 1624 Brown Brown B650 Berkeley 1624 Camaccio Camaccio C520 Braun 1650 Camacho Camacho C520 Brown 1650 Comacho Comacho C520 Shaksper 2216 Comanche Comanche C552 Shakespeare 2221 Feezy Feezy F200 Camaccio 2520 Graeme Graeme G650 Camacho 2520 Graham Graham G650 Comacho 2520 Macleod Macleod M243 Smith 2530 MacIntyre McCloud M243 Smythe 2530 Mahmoud MacIntyre M253 Smiley 2540 Mahon McIntire M253 Smillie 2540 McCloud Mahon M500 Comanche 2552 McIntire Moon M500 Graeme 2650 Mohamet Mahmoud M530 Graham 2650 Mohammed Mohamet M530 Yates 3200 Moon Mohammed M530 Yeats 3200 Newell Newell N400 Macleod 5243 Newhall Newhall N400 McCloud 5243 Pheasey Pheasey P200 MacIntyre 5253 Reiter Reiter R360 McIntire 5253 Rider Rider R360 Newell 5400 Ryder Ryder R360 Newhall 5400 Shakespeare Shaksper S216 Mahon 5500 Shaksper Shakespeare S221 Moon 5500 Smiley Smith S530 Mahmoud 5530 Smillie Smythe S530 Mohamet 5530 Smith Smiley S540 Mohammed 5530 Smythe Smillie S540 Reiter 6360 Yates Yates Y320 Rider 6360 Yeats Yeats Y320 Ryder 6360
Wildcard Matching
The method of searching against an incomplete key which is included in DOS uses * and ? as wild cards. This technique is available as an ISIAPL toolkit function called wildcard which returns a boolean vector showing the rows of the matrix for which a match is found. The search argument is summarised below.x Select exact match on x. x* Select names starting with the sequence x. *x Select names ending with the sequence x. x*y Select names starting with x and ending with y. *x* Select names containing the sequence x. * Select all names.In summary, '*' matches an arbitrary sequence of any length, including 0. Also, the special character '?' is permitted within <x> or <y>, and has the significance of matching any single non-blank character.
(names wildcard 'C*cho')⌿names Camacho Comacho (names wildcard 'Cam*')⌿names Camaccio Camacho (names wildcard '*Int*')⌿names MacIntyre McIntire
There is a very similar and somewhat more flexible facility in Microsoft Access: see the specification below. Of course this only allows one record at a time to be found so one has to step through the file using the Find Next button. In order to pull out all matches one would have to write some SQL and that involves reading the manual!
The asterisk (*), question mark (?), number sign (#), exclamation point (!), hyphen (-), and brackets ([ ]) are wildcard characters. You can use these characters in queries, commands, and expressions to include all records, file names, or other items that begin with specific characters or match a certain pattern. You can also use wildcard characters and matching characters to further refine a search.* wh* finds what, white, and why. *at finds cat, bat, and what. Like the MS-DOS asterisk (*) wildcard character, this asterisk matches any number of characters. But unlike MS-DOS, it can be used as the first or last character in the character string. ? b?ll finds ball, bell, and bill. Like the MS-DOS ? wildcard character, this symbol matches any single character. # 1#3 finds 103, 113, 123. Matches any single digit. [ ] b[ae]ll finds ball and bell but not bill. Matches any single character within the brackets. ! b[!ae]ll finds bill and bull but not bell. Matches any character not in the list. - b[a-c]d finds bad, bbd, and bcd. Matches any one of a range of characters.
Is anyone prepared to amend the Sharp wildcard function to meet the MS Access specification? Or maybe, of course, to better it? Happy hunting!
(webpage generated: 28 March 2006, 06:49)