Current issue

Vol.26 No.4

Vol.26 No.4

Volumes

© 1984-2024
British APL Association
All rights reserved.

Archive articles posted online on request: ask the archivist.

archive/23/4

Volume 23, No.4

Review

First Experiences with Unicode in Dyalog 12

by Adrian Smith (adrian@apl385.com)

Abstract This is a first summary of the experience I have had in moving the bulk of my daily schedule across to Dyalog 12 (Unicode edition). The little traps this process set you in no way invalidate the decision to move, but they can really spoil your day if you don’t take good care of them. I will take things mostly in order of importance, starting with possible application crashes and moving on to minor presentational annoyances which can be fixed over a longer time.

Things to Test in the Interpreter

In spite of a long beta period, it was impossible to catch all the ‘funnies’ that the interpreter came up with. There may well be some gremlins left, for example I very recently discovered (when updating the source for one of my help files) that when I selected some text and hit Ctrl+B it was correctly wrapped with <b>tags</b>, but when I did the same with Ctrl+I nothing appeared to happen. Hmmm – then when I just tried to type the tag manually into the edit field, the first character I typed magically produced the ‘i’ for me.

Along the same lines, I was teaching a new Dyalog user the basics of CPro, and we got to the point where I needed to type a typical indexing expression such as inx⊃myvec into the dialogue designer. That was when I found that AltGr+C and AltGr+V were firmly wired to Copy/Paste and completely ignored my keyboard definition!

This sort of thing is almost impossible to test for, and I’m sure there will be other similar little annoyances. The moral of this story is to believe what you see (even if you only see it once) and let someone know about it. Dan (Baronet) is the man who can repro pretty well anything if you give him a good few clues where to look.

Things that may Break your Application

Things they tell you about

There are two places code has to change: any ⎕NA calls will need some revision, and any calls to ⎕DR to check for character data will definitely not work any more.

Most ⎕NA calls can be fixed automatically with the new star syntax, for example:

     ∇ ret←GetCurrentDirectory;func;buffer_size
[1]   ⍝ Cover for the windows GetCurrentDirectory function
[2]   ⍝ Returns the current directory name
[3]    buffer_size←10000
[4]    'func'⎕NA'U4 kernel32.dll.C32|GetCurrentDirectory* U4 >0T'
[5]    ret←func buffer_size buffer_size
[6]    :If 0≠⊃ret ⍝ Function OK
[7]      ret←2⊃ret
[8]    :Else
[9]      ret←''
[10]   :EndIf
     ∇

This will call the correct DLL function (GetCurrentDirectoryA for Dyalog 12, or GetCurrentDirectoryW for 12-unicode) and so you can share these functions between interpreter versions. The only place I got snagged was with:

     ∇ {RC}←HANDLE PutString NV;SUBKEY;VALUE;RegSetValueEx
[1]   ⍝ Stores a text value in a Registry SUBKEY
[2]   ⍝ HANDLE is the handle for an existing Registry Key
[3]   ⍝ Function by JS based on  SE.WSDoc.GetRegKeyValue
[4]    (SUBKEY VALUE)←NV
[5]    ⎕NA'I ADVAPI32.dll.C32|RegSetValueEx* U <0T I I <0T I4'
[6]    RC←RegSetValueEx HANDLE SUBKEY 0 1 VALUE(1+2×⍴,VALUE)
     ∇

This is the Unicode version – spot the 2 in the last line! I spent quite a while wondering what was eating my registry settings (e.g. the last saved file in the Rain viewer) as every time I read the settings and re-saved them, the length halved!

Checking for numeric data-type used to be done like this:

      1=0=1↑0⍴2 3 4
1

… with something very similar for character. Maybe old ways are best:

      ⎕dr 'Fat cat'
80
      ⎕dr 'Let’s try ∑x also'
160

I had various bits of code looking for 82 here, and of course things tended to crash some way down the track, as the test ‘works’ but returns the wrong answer. I suggest:

      ' '=1↑0⍴∊'Let’s try ∑x also'
1

This should keep working for ever! It also gives you a good way to test if you are in a Unicode interpreter:

      unc←80=⎕dr ' '

Things you find out the hard way

The only real nasty I have hit so far is with native files – the problem here is that you may have a long-running logfile (saved out as plain text) to which you occasionally append messages.

      'c:\temp\unc1.txt'⎕NCREATE ¯1
      ('Hello',⎕av[4 3]) ⎕nappend ¯1
      ('World',⎕av[4 3]) ⎕nappend ¯1
      ⎕nuntie ¯1

This may have started life several years ago in a previous version of APL, and to start with all seems fine when you move your application:

      'c:\temp\unc1.txt'⎕ntie ¯1
      ('Wow, here we are again',⎕ucs 13 10) ⎕nappend ¯1
      ('How about x²',⎕ucs 13 10) ⎕nappend ¯1
      ('How about ∑x² then?',⎕ucs 13 10) ⎕nappend ¯1
DOMAIN ERROR
      ('How about ∑x² then?',⎕UCS 13 10)⎕NAPPEND ¯1
     ∧

This can strike at any time, if you allow the user to edit the log message, and the user has the ability to type some symbol that falls outside the base 256 characters. Even if you clear the file down, you still have the problem unless you always specify the translation:

       0 ⎕nresize ¯1
      ('Wow, here we are again',⎕ucs 13 10) ⎕nappend ¯1 160
      ('How about ∑x² then?',⎕UCS 13 10)⎕NAPPEND ¯1 160

… and you have to be really careful not to mix translations in the same file:

      0 ⎕nresize ¯1
      ('Wow, here we are again',⎕ucs 13 10) ⎕nappend ¯1
      ('How about ∑x² then?',⎕UCS 13 10)⎕NAPPEND ¯1 160
      ⎕nuntie ¯1

This appears to have ‘worked’ but if you open the file in Notepad, it looks like this:

file in Notepad

Analysing such a file at some point in the future will be a nightmare. So what to do? I have a couple of ‘old faithful’ utilities for writing and reading text files, so these were the obvious place to start:

     ∇ {r}←fi Putcrv txt;fh;cr;lf;msk;pos;⎕TRAP
[1]   ⍝ Put a ⎕TCNL vector to file 
[2]   ⍝ Returns 1 if OK; errors if failed.
[3]    ⎕IO←1 ⋄ r←0 ⋄ (lf cr)←⎕AV[3 4]
[4]
[5]   ⍝ Try for existing file (includes 'prn' etc.) ...
[6]    ⎕TRAP←(22 'E' '→New')(19 'E' '⎕SIGNAL ⎕EN') ⋄ fh←fi ⎕NTIE 0
[7]
[8]   ⍝ We cannot tell if this is a printer, so if the resize fails just carry on!
[9]   Resize:⎕TRAP←22 'E' '→Append' ⋄ 0 ⎕NRESIZE fh ⋄ →Append
[10]
[11]  New:⎕TRAP←22 'E' '⎕SIGNAL ⎕EN' ⋄ fh←fi ⎕NCREATE 0
[12]
[13]  Append:⎕TRAP←22 'E' '⎕SIGNAL ⎕EN'
[14]  ⍝ Pair CR/LF from ⎕TCNL
[15]   msk←txt=cr ⋄ pos←msk/⍳⍴msk
[16]   txt←(1+msk)/txt
[17]   txt[1+pos+0,+\¯1↓(⍴pos)⍴1]←lf
[18]
[19]   :If 160≤⎕DR txt ⍝ Make it UTF8 - and mark it with ef bb bf
[20]     txt←⎕UCS 239 187 191,'UTF-8'⎕UCS txt
[21]   :End
[22]
[23]   txt ⎕NAPPEND fh ⋄ ⎕NUNTIE fh ⋄ r←1
     ∇

This writes UTF-8 files rather than full 2-byte Unicode files which has the advantage that most text-editors can handle them, and the file size is generally much smaller. To read either this or any other common format, we have to examine the header bytes and decode appropriately:

     ∇ txt←Getcrv fi;fh;ptn;⎕IO;⎕TRAP
[1]   ⍝ Read entire text file as vector
[2]    txt←'' ⋄ ⎕IO←1
[3]    ⎕TRAP←22 'E' '⎕SIGNAL ⎕EN'
[4]    fh←fi ⎕NTIE 0
[5]    txt←unpick ⎕NREAD fh 83(⎕NSIZE fh)
[6]    ⎕NUNTIE fh
     ∇

     ∇ txt←unpick bytes
[1]   ⍝ See what we read from file and decode depending on the leading byte(s)
[2]   ⍝ Make sure it is unsigned, and drop the UTF-8 marker bytes if present
[3]    bytes+←256×bytes<0
[4]    :If 239 187 191≡3↑bytes    ⍝ UTF-8 marker
[5]      bytes↓⍨←3
[6]      txt←'UTF-8'⎕UCS bytes
[7]    :ElseIf 255 254≡2↑bytes    ⍝ Unicode (normal byte order)
[8]      bytes↓⍨←2
[9]      txt←⎕UCS 256⊥⍉⌽((0.5×⍴bytes),2)⍴bytes
[10]   :ElseIf 254 255≡2↑bytes    ⍝ Unicode (big end up)
[11]     bytes↓⍨←2
[12]     txt←⎕UCS 256⊥⍉((0.5×⍴bytes),2)⍴bytes
[13]   :Else   ⍝ Don't try UTF-8 as we may have chars >127 in here
[14]     txt←⎕UCS bytes
[15]   :End
     ∇

Obviously, if you want a logfile to persist across a Dyalog-12 upgrade, you should re-create it using something similar, or just check for 160 in the message-type and ‘do something’ to make sure you don’t simply trash the file. I think a good solution would be to check for the 160 case, convert the text to UTF-8 as above, check if the first 3 bytes of the file had the UTF-8 marker and regenerate the entire file if not. Otherwise the only safe option is to substitute a ‘standard’ marker character (‘?’ is normal) for anything above ⎕UCS 256 which will at least keep things running while you think about it.

Mappings that Stop Working

In the ‘non-catastrophic’ section, we may have been exploiting the DIN mapping from ⎕AV to the font position for several of the APL characters. When Chris Lee first developed this Windows font layout, he tried to choose ‘good’ associations, for example the APL comment symbol was mapped to the © character. Dyalog copied the mapping fairly faithfully, and we have all got to know it over the years. For example, I regularly use:

  • delta and del for the ‘typographic’ single quotes. grade up and grade down are the “speechmarks” also.
  • execute is – endash and format gives you a • bullet.
  • reverse and transpose (rather surprisingly) give superscripts²³ which is nice for equations.
  • jot is good for the degree symbol, as in °C in chart captions

I am sure there are others – it took me best part of a day to find and fix all of these in the examples in RainPro. You also need to be aware that some of these fixes are a one-way door to Unicode. If you attempt to )LOAD a workspace with any text outside the normal range, it will simply say TRANSLATION ERROR and refuse at the first fence. Identifying the cause of the problem can be a bit tricky, so here is another little helper I wrote to get me moving:

     ∇ list←Scan4u ns;nl;fn;cr;ok;dodgy;nslist;vn;var;badchars
[1]   ⍝ Report any dodgy functions - chars not in ⎕AV are dangerous
        … as these won't copy into Classic 12
[2]   ⍝ We know Rain12 has a test called Unicode which should report here
[3]   ⍝ Takes a namespace as argument - scans current NS if empty
[4]   ⍝ Returns 2-column matrix with namespace.function name
        …  and distinct uncopyable chars
[5]   ⍝ By ACDS Dec 2007
[6]   ⍝
[7]
[8]    :If 0∊⍴ns
[9]      ns←''⎕CS''
[10]   :End
[11]
[12]   nl←(ns⍎'⎕NL')¯2
[13]   list←0 2⍴''
[14]   nslist←(⊂ns,'.'),¨(ns⍎'⎕NL')¯9.1
[15]   badchars←'‘’“”–—'    ⍝ These will load in classic, but map incorrectly!
[16]
[17]   :For vn :In nl
[18]     var←⍕∊ns⍎vn
[19]     ok←∧/(var∊⎕AV)∧~var∊badchars
[20]     :If ~ok
[21]       dodgy←∪(var~⎕AV),var∩badchars
[22]       list⍪←(ns,'.',vn)dodgy
[23]     :End
[24]   :End
[25]
[26]   nl←(ns⍎'⎕NL')¯3
[27]   :For fn :In nl
[28]     cr←,(ns⍎'⎕CR')fn
[29]     ok←∧/(cr∊⎕AV)∧~cr∊badchars
[30]     :If ~ok
[31]       dodgy←∪(cr~⎕AV),cr∩badchars
[32]       list⍪←(ns,'.',fn)dodgy
[33]     :End
[34]   :End
[35]
[36]   :For ns :In nslist ⍝ Go walkies
[37]     list⍪←Scan4u ns
[38]   :End
     ∇

One other thing that caught me here – I had a global variable in the workspace to check for ‘ascii’ characters which included the ^ (ASCII caret) character. When I moved the workspace across, this became an APL logical and which stopped this particular function working in a rather hard-to-detect way.

A side-effect of losing the mappings, was that I suddenly found that it was really hard to type these very useful characters. For chart captions and the like, you often want “proper” quotes and dashes, as well as expressions like 3x²+2x to label regression lines. The pragmatic solution was to grab a few more keystrokes and set myself up with a little typographic pad on Ctrl+[] and nearby keys:

Keyboard mappings - 1

… with a similar collection on the shifted keys. Most of these were a straight copy from Aldus Pagemaker, and they cover nearly everything you need in day-to-day typing. As you can see from the first character in this paragraph, they also come in quite handy when typing HTML text into Notepad! No degree symbol yet though … poor old ellipsis might be for the chop, although that ® looks a bit vulnerable too!

Keyboard mappings - 2

Either way, being able to build your own keyboard map and use it everywhere in Windows is really great.

Removing Workarounds

Now we hit the downhill part of the ski-run – getting rid of all the nasty stuff we had to do to get around the 256-char limit. A typical example I had was writing SVG (a very picky XML format) files to represent my charts. Any characters above ASCII-128 must be encoded as &#nnnn;, and several common characters in the 145 range (like the quotes) required the pukka Unicode values. I won’t bore you with the messy stuff I had before – all I need now is to use the native character translator which automatically gives me all the right numbers:

⍝ Hex any remaining hi-bit characters (reliable translation now)
 vec←,vec ⋄ asc←128>⎕UCS vec ⋄ hi←(~asc)/⍳⍴vec
 :If 0<⍴hi
   aix←⎕UCS vec[hi]
   hex←,(((⍴aix),2)⍴'&#'),('ZI4'⎕FMT aix),';'
   vec←(1+6×~asc)/vec
   hi←hi++\¯1↓0,(⍴hi)⍴6
   vec[,hi∘.+0 1 2 3 4 5 6]←hex
 :End

…and of course I had to beef up my C# translator to handle this:

      cc.chk 'x←⎕ucs 65 67 3348'
       x     ←     ⎕ucs  65 67 3348
 MARK  NOUN  ASGN  VERB  NOUN
 x = AE_ucs(new int[] {65,67,3348});   // Using ...

static string AE_ucs(int[] iv)  // Int array translate
{
  char[] charvec = new char[iv.Length];
  for(int i=0;i<iv.Length;i++)
    charvec[i]=(char)iv[i];
  return new String(charvec);
}

This spins off a suitable helper function the first time it sees a particular call (I just supported the 4 obvious variants with scalars and vectors either way round) and then calls it in-line.

Wrap-up

There is clearly some pain involved, but I think the gain beats it hands down. I do like throwing away code! Fewer lines always means fewer bug-opportunities, and translate tables are a great place for bugs to hang out.

script began 15:36:11
caching off
debug mode off
cache time 3600 sec
indmtime not found in cache
cached index is fresh
recompiling index.xml
index compiled in 0.1746 secs
read index
read issues/index.xml
identified 26 volumes, 101 issues
array (
  'id' => '10011990',
)
regenerated static HTML
article source is 'HTML'
source file encoding is 'UTF-8'
URL: smith/unc1.png => trad/v234/smith/unc1.png
URL: smith/kbctrl.png => trad/v234/smith/kbctrl.png
URL: smith/kbsctrl.png => trad/v234/smith/kbsctrl.png
completed in 0.2033 secs