Current issue

Vol.26 No.4

Vol.26 No.4

Volumes

© 1984-2017
British APL Association
All rights reserved.

Archive articles posted online on request: ask the archivist.

archive/10/3

Volume 10, No.3

ASLGREG: Predictions with Confidence Limits

by Ellis Morgan

I present a possible approach to adding predictions to the ASLGREG Regression and Generalised Linear Models Workspace. I include ideas to increase robustness and a method that allows STEPWISE to return a linear predictor equation that can be passed to FIT and other ASLGREG functions.

Many of the good ideas behind this are due to Alan Sykes and I am grateful for his help and his tolerance as I change his code and workspace structure. I propose that my extensions should be added to ASLGREG and welcome comments and suggestions for improvement.

ASLGREG

My descriptions of the following examples are sparse where they cover old ground, see: ASLGREG by Alan Sykes (Vector 7.4); the ASLGREG manual (available from BAA Administration); ASLGREG reviewed by Jon Sandles (Vector 9.2).

      HOUSEDATA
      ⍴PRICE
29
      PRICE
60500 47500 46000 110000 148000 65000 79950 54000 120000
      85000 65000 55000 70000 82500 78500 49950 46950
      46950 77000 102000 37500 68000 65000 145000 98000
      75000 96500 53950 75000

The workspace is ASLGREG, HOUSEDATA sets up Alan Sykes data on 29 houses. Other variables are TYPE (1 terraced, 2 semi-detached, 3 detached, 4 bungalow), AGE, BEDS(number of bedrooms) and AREA.

      START
Regression Volume 1 Version 2
A.M.Sykes & E.P.Morgan, February 1993
Copyright(C) British APL Association 1991-1993

      YVAR'PRICE÷1000'
Y variable is PRICE÷1000
Units set to 29
      FACTOR'TYPE,BEDS'
      1 LP'GM+AGE+BEDS+TYPE+AREA+TYPE.AREA+BEDS/TYPE'
      TABLES 0
      LPA←STEPWISE 3 3
      LPA
GM+AREA+(TYPE=1)×AREA+(BEDS=4)×TYPE=3

STEPWISE has selected this model to estimate price and in the new version you can now write functions that call STEPWISE and go on to use the result in FIT and PREDICT. It can be dangerous to proceed from STEPWISE to PREDICT without thought but, provided diagnostics are displayed too, it is good to be able to do so. In the new workspace LP is ambivalent and the left argument, when present and equal to one, makes each item in the linear predictor an APL expression. With any other left argument the result is the same as in the original version.

      TABLES 1
Output in Table form
      FIT LPA

Source              ss          df     ms
------              --          --     --
Due to model        1.861E4      3     6205
Residual            3.826E3     25      153.1
---------------------------------------------
Total Corrected     2.244E4     28      801.5

Percentage Variation Accounted for =  82.95
F-statistic =                         40.54
p-value=                               0.00
      DISPLAY 'ESTP'

                                                                  Var                Est        Std Err    t-stat    p-val                            ---                ---        --- ---    ------    -----                            GM                31.06       6.095       5.096    2.909E¯5                    AREA               0.05541    0.007573    7.317    1.149E¯7                    (TYPE=1)×AREA     ¯0.02236    0.005579   ¯4.007    4.861E¯4    (BEDS=4)×TYPE=3   38.67       8.292       4.664    8.906E¯5              

This model estimates the average house price at £31060 plus £55.4 per square foot, but subtract £22.4 per square foot if it is a terraced house and add on £38670 if it is a four bedroomed detached house.

     +/(BEDS=4)×TYPE=3
3
     +/BEDS=4
6
     +/TYPE=3
5

There are only 5 detached houses (three of which are 4 bedroomed) and 6 four bedroomed houses, obviously you would want to gather more evidence before using this predictor.

      ⍴8⍴≡DIAGNOSTICS 'XFO'
Unit   AREA   (TYPE=1)
×AREA  
(BEDS=4)
×TYPE=3  
Fitted
Value  
Observed  
----   ----   --------   --------   ------   --------  
1   1326   1326   0   74.89   60.5  
2   782   0   0   74.39   47.5  
3   312   0   0   48.35   46  
4   1460   0   0   112   110  
5   1056   0   1   128.2   148

I have shown the first eight lines of the diagnostics, you will see that I am using dyalog APL. Some new right arguments have been added to DIAGNOSTICS, X calls for all the explanatory variables.

PREDICTING from the original data

      ⍴8⍴≡PREDICT 'SYMUL'
Row   AREA   BEDS   TYPE   Observed   Predicted   Upper   Lower  
---   ----   ----   ----   --------   ---------   -----   -----  
1   1326   4   1   60.5   74.89   82.3   67.48  
2   782   3   2   47.5   74.39   78.24   70.55  
3   312   1   2   46   48.35   53.83   42.87  
4   1460   3   3   110   112   120.3   103.6  
5   1056   4   3   148   128.2   137.7   118.8

PREDICT is similar to DIAGNOSTICS in that the right argument specifies the columns to print. In this case I am calling for: S the explanatory variables before interactions, nesting and other equations; Y for the observed Y value; M for the predicted Mean value; U and L for the Upper and Lower confidence limits of the mean. The optional left argument of PREDICT is described below and specifies the confidence level (default is 80%) and the type of limit (default is independent for each prediction).

PREDICTING for a few extra observations

The simplest way is to add a few extra observations to your data, giving the response variable a value of zero and then keep them out of the fitting process by giving them a weight of zero. All real cases will have a weight of one.

      PRICE←0 0,PRICE
      TYPE←3 4,TYPE
      BEDS←4 3,BEDS
      AREA←1200 800,AREA

Two new “cases” at the beginning.

      YVAR'PRICE÷1000'
Y variable is PRICE÷1000
Units set to 31

      W←0 0,29⍴1
      PRIORWEIGHT 'W'
Prior Weight Variable is W

      SINK←FIT LPA
      DISPLAY 'ESTP'
Var   Est   Std Err   t-stat   p-val  
---   ---   --- ---   ------   -----  
GM   31.06   6.095   5.096   2.909E¯5  
AREA   0.05541   0.007573   7.317   1.149E¯7  
(TYPE=1)×AREA   ¯0.02236   0.005579   ¯4.007   4.861E¯4  
(BEDS=4)×TYPE=3   38.67   8.292   4.664   8.906E¯5

The same answer as before.

      ⍴10⍴≡PREDICT 'SYWMUL'
Row   AREA   BEDS   TYPE   Observed   Priorwt   Predicted   Upper   Lower  
---   ----   ----   ----   ------   -----   ------   -----   -----  
1   1200   4   3   0   0   136.2   145.7   126.8  
2   800   3   4   0   0   75.39   79.27   71.51  
3   1326   4   1   60.5   1   74.89   82.3   67.48  
4   782   3   2   47.5   1   74.39   78.24   70.55  
5   312   1   2   46   1   48.35   53.83   42.87  
6   1460   3   3   110   1   112   120.3   103.6  
7   1056   4   3   148   1   128.2   137.7   118.8

Now printing out the first ten lines with an extra column W to show the prior weights we see from row one that the average 1200 square foot, four bedroomed, detached house should sell for between 127 and 145 thousand pounds.

PREDICTING for a table of new observations

Any number of sets of explanatory data can be used for predicting.

      A←800 800 1000 1000 1000 1200 1200
      B←2 3 3 4 5 4 5
      T←1 2 2 3 3 3 4
      SUBSTITUTE'BEDS,B:AREA,A:TYPE,T'

Here I have associated the variables A, B and T with the original explanatory variables using the function SUBSTITUTE.

      REFRESH 7
Prediction New offset Variable Reset to the Default

REFRESH stores the new data in a “predicting matrix”, a global variable called newmat which PREDICT will use in much the same way as DIAGNOSTICS uses the variable desmat set by FIT. The REFRESH right argument is the number of new cases, if any of your new data is the wrong length a message results. Of course in a function you would call it with something like REFRESH ⍴A. Here you see evidence that ASLGREG does generalised as well as ordinary linear regression, the message tells you that the default value is used for the offset associated with this set of explanatory variables (zero in this case as I have set no offset).

      PREDICT'SMUL'
Row   A   B   T   Predicted   Upper   Lower  
---   -   -   -   ---------   -----   -----  
1   800   2   1   57.5   62.38   52.63  
2   800   3   2   75.39   79.27   71.51  
3   1000   3   2   86.47   91.23   81.72  
4   1000   4   3   125.1   134.6   115.7  
5   1000   5   3   86.47   91.23   81.72  
6   1200   4   3   136.2   145.7   126.8  
7   1200   5   4   97.56   103.7   91.38

Now PREDICT detects that newmat is present and uses it to calculate predictions and limits. Note that newmat is expunged by YVAR or by SUBSTITUTE ''. This suggests the following sequence of functions:

To predict with the original data

SUBSTITUTE '', PREDICT

To recalculate the equation and predict

FIT, PREDICT

To change one of the explanatory variables and predict again

A[2]←900 (say), REFRESH, PREDICT

To use a different set of explanatory variables

SUBSTITUTE, REFRESH, PREDICT

Confidence Limit Options

      PREDICT'SMTK'
Row   A   B   T   Predicted   Upper New   Lower New  
---   -   -   -   ---------   ----- ---   ----- ---  
1   800   2   1   57.5   74.51   40.5  
2   800   3   2   75.39   92.13   58.65  
3   1000   3   2   86.47   103.4   69.51  
4   1000   4   3   125.1   144   106.3  
5   1000   5   3   86.47   103.4   69.51  
6   1200   4   3   136.2   155.1   117.4  
7   1200   5   4   97.56   115   80.14

You can calculate confidence limits for new observations (T, K).

      90 PREDICT'SMUL'
Row   A   B   T   Predicted   Upper   Lower  
---   -   -   -   ---------   -----   -----  
1   800   2   1   57.5   63.84   51.17  
2   800   3   2   75.39   80.43   70.36  
3   1000   3   2   86.47   92.64   80.3  
4   1000   4   3   125.1   137.4   112.8  
5   1000   5   3   86.47   92.64   80.3  
6   1200   4   3   136.2   148.5   124  
7   1200   5   4   97.56   105.6   89.54

You can change the confidence level (90% in this case).

      80 2 PREDICT'SMUL'
Row   A   B   T   Predicted   Upper   Lower  
---   -   -   -   ---------   -----   -----  
1   800   2   1   57.5   66.12   48.88  
2   800   3   2   75.39   82.24   68.54  
3   1000   3   2   86.47   94.87   78.07  
4   1000   4   3   125.1   141.9   108.4  
5   1000   5   3   86.47   94.87   78.07  
6   1200   4   3   136.2   152.9   119.6  
7   1200   5   4   97.56   108.5   86.65

You can calculate simultaneous confidence limits, in this case Bonferroni limits (2 in left argument).

      80 3 PREDICT'SMUL'
Row   A   B   T   Predicted   Upper   Lower  
---   -   -   -   ---------   -----   -----  
1   800   2   1   57.5   67.08   47.93  
2   800   3   2   75.39   83   67.78  
3   1000   3   2   86.47   95.8   77.15  
4   1000   4   3   125.1   143.7   106.6  
5   1000   5   3   86.47   95.8   77.15  
6   1200   4   3   136.2   154.8   117.7  
7   1200   5   4   97.56   109.7   85.45

Or use Scheffe simultaneous limits (3 in left argument). As you would expect with so few cases the Bonferroni limits are narrower.

      80 4 PREDICT'SMUL'
Row   A   B   T   Predicted   Upper   Lower  
---   -   -   -   ---------   -----   -----  
1   800   2   1   57.5   66.12   48.88  
2   800   3   2   75.39   82.24   68.54  
3   1000   3   2   86.47   94.87   78.07  
4   1000   4   3   125.1   141.9   108.4  
5   1000   5   3   86.47   94.87   78.07  
6   1200   4   3   136.2   152.9   119.6  
7   1200   5   4   97.56   108.5   86.65

Or get the program to choose the narrowest for you (4 in left argument).

      80 3 PREDICT 'SMTK'
Row   A   B   T   Predicted   Upper New   Lower New  
---   -   -   -   ---------   ----- ---   ----- ---  
1   800   2   1   57.5   100   15  
2   800   3   2   75.39   117.3   33.53  
3   1000   3   2   86.47   128.9   44.05  
4   1000   4   3   125.1   172.3   78.03  
5   1000   5   3   86.47   128.9   44.05  
6   1200   4   3   136.2   183.3   89.16  
7   1200   5   4   97.56   141.1   54.01

Scheffe simultaneous limits for new observations.

      A←28⍴A⋄B←28⍴B⋄T←28⍴T
      REFRESH 28
Prediction New offset Variable Reset to the Default

Increase the number of cases from 7 to 28.

      ⍴12⍴≡80 2 PREDICT'SMUL'
Row   A   B   T   Predicted   Upper   Lower  
---   -   -   -   ---------   -----   -----  
1   800   2   1   57.5   68.39   46.62  
2   800   3   2   75.39   84.05   66.73  
3   1000   3   2   86.47   97.08   75.87  
4   1000   4   3   125.1   146.3   104  
5   1000   5   3   86.47   97.08   75.87  
6   1200   4   3   136.2   157.3   115.2  
7   1200   5   4   97.56   111.3   83.78  
8   800   2   1   57.5   68.39   46.62  
9   800   3   2   75.39   84.05   66.73

The first 12 rows of Bonferroni simultaneous limits for the mean which get wider as the number of cases increase.

      ⍴12⍴≡80 3 PREDICT'SMUL'
Row   A   B   T   Predicted   Upper   Lower  
---   -   -   -   ---------   -----   -----  
1   800   2   1   57.5   67.08   47.93  
2   800   3   2   75.39   83   67.78  
3   1000   3   2   86.47   95.8   77.15  
4   1000   4   3   125.1   143.7   106.6  
5   1000   5   3   86.47   95.8   77.15  
6   1200   4   3   136.2   154.8   117.7  
7   1200   5   4   97.56   109.7   85.45  
8   800   2   1   57.5   67.08   47.93  
9   800   3   2   75.39   83   67.78

Scheffe simultaneous confidence limits for the mean are independent of the number of cases.

      ⍴12⍴≡80 4 PREDICT'SMUL'
Row   A   B   T   Predicted   Upper   Lower  
---   -   -   -   ---------   -----   -----  
1   800   2   1   57.5   67.08   47.93  
2   800   3   2   75.39   83   67.78  
3   1000   3   2   86.47   95.8   77.15  
4   1000   4   3   125.1   143.7   106.6  
5   1000   5   3   86.47   95.8   77.15  
6   1200   4   3   136.2   154.8   117.7  
7   1200   5   4   97.56   109.7   85.45  
8   800   2   1   57.5   67.08   47.93  
9   800   3   2   75.39   83   67.78

For this model with 28 cases the Scheffe limits are narrowest for simultaneous means. Remember that these limits are approximations see, for example, page 152 of “Introduction to Linear Regression Analysis”, DC Montgomery and EA Peck, Second Edition, John Wiley ISBN 0-471-53387-4.

      ⍴12⍴≡80 3 PREDICT'SMTK'
Row   A   B   T   Predicted   Upper New   Lower New  
---   -   -   -   ---------   ----- ---   ----- ---  
1   800   2   1   57.5   138.3   ¯23.31  
2   800   3   2   75.39   155   ¯4.186  
3   1000   3   2   86.47   167.1   5.833  
4   1000   4   3   125.1   214.7   35.59  
5   1000   5   3   86.47   167.1   5.833  
6   1200   4   3   136.2   225.7   46.76  
7   1200   5   4   97.56   180.3   14.77  
8   800   2   1   57.5   138.3   ¯23.31  
9   800   3   2   75.39   155   ¯4.186

For new observations (unlike means) the Scheffe simultaneous limits do depend on the number of observations. On the subject of simultaneous limits for many cases Montgomery and Peck warn “the individual intervals may be so wide that they are practically useless”. What this tells me is that there is so much variation in house prices not explained by this model that it is not practical to draw a boundary below which not even one out of 28 houses could be expected to fall.

A General Model

      KILLED←0 5 15 28 37 49 50
      DOSE←⍳7
      START
Regression Volume 1 Version 2
A.M.Sykes & E.P.Morgan, February 1993
Copyright(C) British APL Association 1991-1993

      YVAR'KILLED'
Y variable is KILLED
Units set to 7

      ERROR 3
Error Chosen is BINOMIAL
Use NVAR to declare the N-variable

Link selected is LOGIT
      NVAR'50+DOSE-DOSE'

Fairly standard up to the last command, Alan Sykes's data on 50 insects exposed to different dose levels of insecticide. Alan would have said N←7⍴50 and NVAR'N'. If I specify an APL expression that returns the right number of 50s based on the length of DOSE then, if I later substitute DOSE with another variable for predicting, I get an appropriate new NVAR automatically. It is not necessary but by taking pains now I can be lazy later on.

      FIT'GM+DOSE'
Cycle     Deviance     df     Scale Parameter
-----     --------     --     ----- ---------
4         8.343        5      1
      DISPLAY 'ESTP'
Var       Est        Std Err     t-stat     p-val
---       ---        --- ---     ------     -----
GM        ¯5.089     0.5356      ¯9.503     0.0002182
DOSE       1.333     0.133       10.03      0.0001687
      DIAGNOSTICS'XNFO'
Unit     DOSE     Out of     Fitted Value     Observed
----     ----     --- --     ------ -----     --------
1        1        50          1.142            0
2        2        50          4.073            5
3        3        50         12.59            15
4        4        50         28.04            28
5        5        50         41.44            37
6        6        50         47.42            49
7        7        50         49.29            50
      SUBSTITUTE'DOSE,NEWDOSE'
      NEWDOSE←¯0.5+⍳8
      REFRESH 8
Prediction New offset Variable Reset to the Default

      newnname
50+NEWDOSE-NEWDOSE

I have set up NEWDOSE to get a table of predicted results for the intervening extra half a dose strengths and REFRESH has inspected nname and made the substitution to create newnname that will provide an appropriate n-variable for use with NEWDOSE.

      PREDICT'XNPUL'
Row   DOSE   Out of   Predicted   Upper   Lower  
---   ----   --- --   ---------   -----   -----  
1   0.5   50   0.593   1.186   0.2947  
2   1.5   50   2.178   3.57   1.313  
3   2.5   50   7.366   9.931   5.374  
4   3.5   50   19.8   22.91   16.85  
5   4.5   50   35.66   38.34   32.65  
6   5.5   50   45.21   46.72   43.1  
7   6.5   50   48.64   49.23   47.62  
8   7.5   50   49.63   49.83   49.22

      PREDICTED
0.593 2.178 7.366 19.8 35.66 45.21 48.64 49.63

The function PREDICTED returns a vector of predictions.

      NEWDOSE←3 4 5
      NEWNVAR'100 200 300'

      REFRESH 3
Prediction New offset Variable Reset to the Default
      PREDICT'XNPUL'
Row   DOSE   Out of   Predicted   Upper   Lower  
---   ----   --- --   ---------   -----   -----  
1   3   100   25.18   31.15   20.02  
2   4   200   112.2   124.1   99.81  
3   5   300   248.7   261.5   232.7

You can take control of the n-variable used for predicting by specifying it explicitly.

Using offsets

      Y←2 1 3 2 5 10
      X←⍳6
      SINK←START
      SINK←YVAR'Y'
      ERROR'BIN'
Error Chosen is BINOMIAL
Use NVAR to declare the N-variable

Link selected is LOGIT
      NVAR'10+X-X'

      OFFSET'.2×X'
Offset Variable is .2×X

      FIT'GM+X'
Cycle,Deviance,df,Scale Parameter
-----,--------,--,----- ---------
4,9.979,4,1
      SUBSTITUTE'X,NEWX'

      NEWX←6+⍳4
      REFRESH 4
Prediction New offset Variable is .2×NEWX

      newoffname
.2×NEWX
      newnname
10+NEWX-NEWX
      PREDICT'XNZMUL'
Row   X   Out of   Offset   Predicted   Upper   Lower  
---   -   --- --   ------   ---------   -----   -----  
1   7   10   1.4   8.797   9.575   7.035  
2   8   10   1.6   9.397   9.849   7.88  
3   9   10   1.8   9.707   9.948   8.523  
4   10   10   2   9.86   9.982   8.991

      NEWOFFSET'2 3 4 5'
Prediction New offset Variable is 2 3 4 5

      NEWNVAR'12 14 16 18'

      REFRESH 4
Prediction New offset Variable is 2 3 4 5

      PREDICT'XNZMUL'
Row   X   Out of   Offset   Predicted   Upper   Lower  
---   -   --- --   ------   ---------   -----   -----  
1   7   12   2   11.16   11.71   9.746  
2   8   14   3   13.78   13.95   13.13  
3   9   16   4   15.95   15.99   15.7  
4   10   18   5   17.99   18   17.9

If you specify your original offset carefully then REFRESH will take care of it for you but you can take control if you need to.

Robustness

A number of places where you used to get APL errors are now trapped. I agree with Jon Sandles that when this happens something is terribly wrong with what you are trying to do, I try to tell you why.

Function             Protect against (and stop with explanation if necessary):
COOKSDIST, PRESS, ST LEVERAGE = 1
HATMATRIX            scp=0
FIT                  The case(s) causing the problem are indicated when the solution goes outside the domain of the link function.
BOXCOXLOGLIKE        Y not positive. Also adjust for alias, gmflg and priorwt>0
Several              Divide by zero

Corrections to version 1

START:               labels” not “lables” and include “forder” when setting up “sysvar” near the top.
YVAR:                labels” not “lables” and / before “alias” when expunging variables near the top.
OFFSET               offset” defaults to zero not one, look for offset←units⍴1 and change to offset←units⍴0
¨MU5                 make RES a local variable

Implementation

I have tried to make sure that user functions written for version 1 will work with version 2. This will be completely true if you use only those functions defined in the ASLGREG manual with a version of APL that allows ambivalent functions. The functions LP, FIT and TESTHYP are now ambivalent, providing a left argument of zero in you current functions is all you need to do. If you are using any subroutines (that is functions whose name begins with ¨) then you may find they have changed, in particular the link and error functions have changed and your own new link functions will need modifying.

Here are DIAGNOSTICS and PREDICT right arguments:

   DIAGNOSTICS   PREDICT  
A      Variance  
C   Cooks Distance   Cumulative  
D   Dev Residual   Cum Variance  
E      Upper Cum  
F   FittedValue   Lower Cum  
G      Upper Cum New  
H      Lower Cum New  
K      Lower New Obs  
L   Leverage   Lower limit  
M      Predicted  
N   Outof   Out of  
O   Observed   Observed  
P      Predicted  
R   Residual     
S   Std Residual   All substitutes  
T   t-Residual   Upper New Obs  
U      Upper limit  
V   Valid Fit   Valid  
W   Priorwt   Priorwt  
X   All explanatory   All explanatory  
Y      Observed  
Z   Offset   Offset  
1   X   X  
?   This list   This list

The V for valid option confirms that these explanatory values lead to a valid prediction or fitted value when the link function is applied. The numbers 1, 2, ... refer to the first, second, ... explanatory variable and when you use argument ? their names are listed.


(webpage generated: 18 February 2006, 02:17)

script began 3:08:38
caching off
debug mode off
cache time 3600 sec
indmtime not found in cache
cached index is fresh
recompiling index.xml
index compiled in 0.2578 secs
read index
read issues/index.xml
identified 26 volumes, 101 issues
array (
  'id' => '10006030',
)
regenerated static HTML
article source is 'HTML'
source file encoding is 'ASCII'
read as 'Windows-1252'
completed in 0.286 secs