Optimization Considerations
(Life in the FastForth Lane)

Forth Dimensions - January / February 1993 - pages 6 - 12

Charles Curley

Smaller is faster, and faster means smaller - except when it comes to subroutine vs. direct threading? The author optimized Forth for the 68000, stating,"Anything one can get by cranking up the clock speed, one can get by both cranking up the clock speed and by using other techniques ..." The 68000's rich instruction set and addressing modes make it ripe for such improvements. The result is "a JSR/BSR threaded Forth interpreter/compiler."

Abstract

This paper describes a 68000-based JSR/BSR threaded Forth interpreter/compiler. The compiler compares a variable and a header field, and either assembles a JSR or BSR to a called word, or copies its code inline. The definition of ; is smart enough to replace a JSR/BSR at the end of a word with a JMP or BRA, as appropriate. Several words which are not traditionally immediate become so, such as >R and constants.

Historical Note

The Forth described herein is FastForth, a full 32-bit BSR/JSR threaded Forth for the 68000. It is a direct modification of a indirect threaded Forth, Real-Forth. This is, in turn, a direct descendent of fig-Forth. (Remember fig-Forth?) Vocabularies, among other things, retain their original flavor.

For those not familiar with 32-bit Forths, the memory operators with the prefix W operate on word, or 16-bit, memory locations.

The Implementation

It is conventional wisdom among Forth gurus that smaller is faster, and faster means smaller. The commonly accepted exception to this has been when it comes to subroutine threading vs. indirect threading. Here, the traditional argument has been that the two bytes per call (say, on a PDP-11) is worth the overhead, compared to four bytes per call. This argument is less attractive on a 8-bit processor, such as the 6502, where a subroutine call is three bytes, and the interpreter for the indirect threading is some 14 instructions.

"But, if we crank up the clock speed..." someone said. Probably, someone at Intel, or with equal imagination.

Anything one can get by cranking up the clock speed, one can get by both cranking up the clock speed and by using other techniques, such as better compilers. Or better coding. The 68000's rich instruction set and plentiful supply of addressing modes make it ripe for such improvements.

The traditional Forth compiler looks rather like this:

: INTERPRET   BEGIN -FIND  IF  ( found) STATE @ <
    IF  CFA ,  ELSE  CFA EXECUTE  THEN
    ELSE  HERE NUMBER  DPL @ 1+
    IF  [COMPILE] DLITERAL  ELSE  DROP [COMPILE] LITERAL  THEN
        THEN  ?STACK  AGAIN     STOP

Paleoforthwrights will no doubt recognize this as the fig-Forth compiler. This system is simple, easy to understand, and fast.

"Forthwright" is a term invented by Al Kreever. The "paleo" prefix is my own perversion. I also use the term "neologist": someone who creates new words. Forth is, after all, language for people who like to play with words.

It runs a lot faster if parts of it are written in code, of course. With a 32-bit data path and 32-bit code fields, optimization by assembly language re-coding can go hog wild on the 68000. For example, the word , (comma) becomes:

CODE ,   OFUSER DP AR0 MOV,
     4 # OFUSER DP ADDQ,   ' ! 2+ *+ BRA, ;C

Even with this scheme, any word called will still occupy four bytes for each call, and have the overhead of NEXT and the return code. But even with this overhead, many words in the nucleus become both smaller and faster.

A major step is taken when one moves from indirect threaded code to subroutine threading. Whole aspects of Forth are affected, often in a very subtle manner. The code interpreter can stay the much the same. However, it now calls another word to assemble its calls:

: <BSR>    2-  HERE -  DUP -80  80 WITHIN
  IF  FF AND  6100 OR  ELSE  6100 W,  THEN  W, ;

: <SUB>          \ addr --  | compile a subroutine to addr
  HERE OVER - -8000  7FFF WITHIN
  IF  <BSR>  ELSE  4EB9 W,  ,  THEN ;

: INTERPRET   BEGIN -FIND  IF  ( found) STATE @ <
    IF  <SUB>  ELSE  EXECUTE  THEN
    ELSE  HERE NUMBER  DPL @ 1+
    IF  [COMPILE] DLITERAL  ELSE  DROP [COMPILE] LITERAL  THEN
        THEN  ?STACK  AGAIN     STOP

<SUB> is now a lot more than , is, and we have added a lot to the dictionary that wasn't already there. <SUB> calculates whether to use a BSR or JSR instruction, and uses the appropriate one. <BSR> is smart enough to use a short or long relative call, as needed.

By making this change, we must also redefine NEXT. Instead of a three-instruction (six-byte) macro, we now have a one-instruction (two-byte) macro. The instruction is, of course, RTS.

Now a reference to a called word may be two bytes, four bytes or six, depending on how far away the call is. We have made NEXT a lot smaller. In the nucleus, there are no six-byte calls, and quite a lot of two-byte calls. We have reduced the size of the nucleus considerably, and gained speed.

Another innovation is to get rid of the words 0BRANCH and BRANCH, which do the work of controlling flow in conditional branches. These, of course, are replaced with processor instruction equivalents. 0BRANCH and BRANCH occupy six bytes per call, four for the CFA and two for the displacement if the branch is taken. The processor instructions occupy four or six bytes each, and run much faster. The relevant code is:

: 0BRAN 201B6700 ( s [+ dr0 .l mov, 2 *+ eq bcc, ) , 0 W,
                        HERE  2- ;

: BRAN  60000000 ( 2 *+ bra, ) ,  HERE 2- ;

: RESOLVE       HERE OVER -  SWAP  W! ;

: IF    ?COMP  0BRAN 2 ;                        IMMEDIATE

: THEN  ?COMP  2 ?PAIRS  RESOLVE ;              IMMEDIATE

: ELSE  2 ?PAIRS  BRAN
        SWAP 2 [COMPILE] THEN  2 ;              IMMEDIATE

: BEGIN ?COMP HERE 1 ;                          IMMEDIATE

| : SBKWD  10000 /  HERE ROT SWAP - 2-   FF AND OR  W, ;

| : LBKWD           HERE ROT SWAP - 2- FFFF AND OR   , ;

| : BKWD  OVER HERE -  2-  -7F 7F WITHIN
        IF SBKWD ELSE LBKWD THEN ;

: UNTIL 1 ?PAIRS  201B W,  67000000 BKWD ;      IMMEDIATE

: AGAIN 1 ?PAIRS           60000000 BKWD ;      IMMEDIATE

: REPEAT        >R >R [COMPILE] AGAIN R> R> 2-
                [COMPILE] THEN  ;               IMMEDIATE

: WHILE [COMPILE] IF 2+ ;                       IMMEDIATE

These words, again, occupy more room in the dictionary than their predecessors did, but the code compiled by them is so much smaller and faster that the overhead is worth it.

Because the 68000 has an efficient instruction set, and calls to a word may be as much as six bytes long, it is possible to have calls to words which occupy more room than the word itself. Why not copy the guts of such words inline, and forget the call? We may or may not save space in the dictionary, but we can get rid of the overhead of the call and the return instructions. This requires a change in the interpreter:

: <BSR>   REL OFF  2-  HERE -  DUP -80  80 WITHIN
  IF  FF AND  6100 OR  ELSE  6100 W,  THEN  W, ;

: <SUB>         \ addr --  | compile a subroutine to addr
  HERE  OVER - -8000  7FFF WITHIN
  IF  <BSR>  ELSE  4EB9 W,  ,  THEN ;

: <COMP>        \ addr --  | subroutine or inline?
  DUP 2- W@  -DUP IF LENGTH  @ 1+ <
  IF  HERE OVER 2-  W@  DUP ALLOT  CMOVE
       ELSE  <SUB>  THEN  ELSE  <SUB>  THEN ;

: INTERPRET   BEGIN -FIND  IF  ( found) STATE @ <
    IF  <COMP>  ELSE  EXECUTE  THEN
    ELSE  HERE NUMBER  DPL @ 1+
    IF  [COMPILE] DLITERAL  ELSE  DROP [COMPILE] LITERAL  THEN
        THEN  ?STACK  AGAIN     STOP

The word <COMP> looks at a field in the word's header, the length field. If the contents of the field is 0, no copy is made - a subroutine is called instead. If the length field is less than or equal to the current contents of the user variable LENGTH, an inline copy of the target word is made, instead of a subroutine call. If the length field is greater than LENGTH, a subroutine call is made in the normal fashion.

The use of a variable to determine the cutoff for copying code allows the programmer to select the best length for such copying. For most uses, this is set to 6, so that dictionary size is the main consideration. However, it can be set to any value up to 32K, if the user wants to really go for speed. (None of this "We're from Microsoft and we know more about your application than you do" stuff here!)

Even if the length value for compiling the nucleus is set to the reasonable minimum, four, we still gain. DUP, for example, has one two byte instruction in it. Is shows up about 70 times in the FastForth nucleus, for 140 bytes of savings. Thus, even though we have overhead in the nucleus to copy inline, we still come out ahead in nuclear size. Other one-instruction two-byte words abound, such as DROP.

The compiler must know the value to which it must set the length field. This value is best calculated after the word is fully compiled, so the logical place to do it is in ;. That code looks as follows:

: NXT   HERE LATEST N>C  -  SETLEN  4E75 W,  ;

The code to be copied (and hence the length of the word for copying purposes) must exclude the return instruction at the end. So we make the calculation before adding the return.

We may have a problem, however. Are there circumstances under which it is inadvisable to make an inline copy of a word? Answer, yes. One circumstance is where the source word contains a relative reference, such as a program counter relative offset. Or a BSR instruction, which is often. That is why the word <BSR> sets the user variable REL (for: relocatable) to the off state. Thus, the code called by the word ; which sets the length field in the head, must examine REL.

: NXT REL @ IF HERE LATEST N>C - SETLEN THEN 4E75 W, ;

This code requires that the length field be set to 0, which it is by CREATE. And, the relocation indicator must be set to its default state:

: ?EXEC !CSP CURRENT @ CONTEXT !  CREATE SMUDGE
  REL ON  LATEST N>C  DP !  ] ;

The phrase LATEST N>C DP ! is there because CREATE sets up a code field pointing to to the code for variables, and this must be overwritten by a : definition. This is a by-product of the decision to modernize the syetm by having CREATE produce variables, instead of the ancient fig-Forth practice of having it produce headers for code definitions.

A gotcha of 68000s is that the 68000 is word-aligned for word and long-word memory accesses. That is, either @ must pick up data a byte at a time and assemble the four bytes, or else it cannot be used on odd address cells. The latter alternative would be incompatible with other Forths running on other processors, so the former was selected for Real-Forth, the immediate ancestor of FastForth. The result is as follows:

CODE @   S [ AR0 .L MOV,       \ avoid byte boundary
   AR0 [+   S  [ .B MOV,       \ problems
   AR0 [+ 1 S &[ .B MOV,  AR0 [+ 2 S &[ .B MOV,
   AR0 [  3 S &[ .B MOV,  NEXT  ;C

Aside from being ugly, the word takes up 16 bytes in memory. It probably will be referred to a lot by subroutine calls. However, why not provide both types of memory access? A version of @ requiring word alignment produces a four-byte word:

CODE F@         \ @ from even address only
  S [ AR0 MOV,  AR0 [ S [ MOV,  NEXT  ;C

This word will invariably be copied inline. Furthermore, it will get used a lot: all variables, user variables, and word or long-word arrays are word aligned. Thus, careful editing of the nucleus produces a much faster nucleus, using F@ where appropriate. Alas, the nucleus grows - but not much. Because this word will be copied inline, application references to it will produce smaller applications, so the cost is well worth it.

Similar logic produces F! from !:

CODE F!         \ store to even address only
  S [+ AR0 MOV,  S [+ AR0 [ MOV,  NEXT  ;C

Can we squeeze more room out of the nucleus and still accelerate things? Well, it seems a bit absurd for the last instruction in a word (before NEXT) to be a subroutine call. Why not force the call to become a jump? Once execution of the called word ends, the return instruction there will force execution back to the word which called the current word. The RTS instruction may then be omitted from the end of the word. This saves us an instruction in the dictionary, and two return stack accesses in execution.

The resulting code gets tricky. There are two circumstances under which this trick is inadvisable: when the last instruction before the return is not a call, and when a forward branch within the word refers to where the RTS would be if there were one. To check for the latter condition, we simply examine a variable maintained by the compiler word THEN. THEN is now defined as:

: THEN  HERE LASTTHEN F!
        ?COMP  2 ?PAIRS  RESOLVE ;              IMMEDIATE

So that we know where the last subroutine call was made, the compiler now maintains a variable, LASTSUB.

: <SUB>         \ addr --  | compile a subroutine to addr
  HERE DUP LASTSUB F!  OVER - -8000  7FFF WITHIN
  IF  <BSR>  ELSE  4EB9 W,  ,  THEN ;

The code to make it work all operates from ;.

| : DOBSR   = IF    60  LASTSUB F@ C!  0  ELSE  1  THEN ;

| : 6SR LASTSUB F@  W@ 4EB9
            = IF  4EF9  LASTSUB F@ W!  0  ELSE  1  THEN ;

| : 2SR LASTSUB F@  W@ FF00 AND 6100  DOBSR ;

| : 4SR LASTSUB F@  W@ 6100           DOBSR ;

\ --- fl | 1 indicates failure to change bsr to bra, etc

| : LAST?  REL F@ 0=  HERE  LATEST N>C -  LENGTH F@ 1- >  OR
  IF  HERE LASTSUB F@ -
    DUP 2 = IF  DROP 2SR  ELSE    DUP 4 = IF  DROP 4SR  ELSE
        6 = IF  6SR  ELSE  1  THEN  THEN  THEN  ELSE 1 THEN ;

: NXT   REL F@ IF  HERE LATEST N>C  -  SETLEN  THEN  4E75 W,  ;

: ;  ?COMP ?CSP  HERE LASTTHEN F@ -
  IF  LAST?  IF  NXT  THEN  ELSE  NXT  THEN
  SMUDGE [COMPILE] [ ;   IMMEDIATE

The three words 6SR, 4SR, and 2SR each handle the three possible subroutine call instructions. They do this by munging the last subroutine call's opcode into a BRA or JMP opcode, as appropriate.

Munging is a PDP-11 hacker slang for stompting on the object code directly. Attributed to Mung the Merciless.

The word LAST? determines whether to call one of these words and, if so, which one. It also returns a flag to indicate whether the final RTS instruction should be added to the word. Also, the whole process is bypassed if there is a forward branch pointing to where the RTS would be.

This has the effect of giving us a free, zero-instruction, return, without having to build a custom processor to do it.

Further Optimizations

One can make further optimizations along the same lines. The fig-Forth word LIT goes away entirely, to be replaced by an inline literal instruction. For large values, the following instruction obtains:

<value> # s [ mov,

For smaller values, the MOVQ instruction proves useful:

<value> # dr0 movq,  dr0 s [ mov,

LITERAL (still figgishly state smart) is redefined as follows:

: LITERAL   STATE F@ IF  DUP -80 7F WITHIN IF
  FF AND 7000 OR W,  2700 W,  ELSE 273C W, ,  THEN  THEN ;  IMMEDIATE

Constants can be reworked in a major way. We can produce a word which is not relocatable, and therefore requires a subroutine call for each reference. Instead, we make constants into immediate words (!) which simply produce literals as needed:

: CONSTANT   CREATE IMMEDIATE ,  DOES> F@ [COMPILE] LITERAL ;

Similar surgery may be committed on variables.

: VARIABLE   CREATE , IMMEDIATE  DOES>              LITERAL ;

User variables require a more complex operation at compile time, as they must compile an opcode and the offset from the user pointer (maintained in a register on the 68000). In addition, execution at interpretation time is more complex. The resulting word has a hybrid code and high-level ;CODE portion.

: USER   CREATE   W, IMMEDIATE   ;CODE
  RP [+ AR0 MOV,  U AC &[  ( ofuser state)
  TST,  NE IF,
    DR0 CLR,  AR0 [ DR0 .W MOV,
    DR0 S -[ MOV,  ]] 41EE W,  W,  2708 W,
    \ ofuser <n> ar0 mov,  ar0 S -[ mov,
    [[  ELSE,   U DR0 MOV,
    AR0 [ DR0  .W ADD,
    DR0 S -[ MOV,   THEN,   NEXT  ;C

The first two lines of the ;CODE portion examines the user variable STATE to determine whether the system is compiling or interpreting. If the system is compiling, the next two lines are executed. The offset of the user variable is brought into a register, sign extended, and pushed onto the data stack Then high-level words are executed to comma in the first opcode, 41ee, the offset (an argument to the first opcode), and then finally the second opcode, 2708. This results in the assembly in line of the following code fragment:

ofuser <n> ar0 mov,  ar0 s -[ mov,

If the system is not compiling, the actual address of the user variable is calculated by adding the offset to the contents of the user register. The results are pushed onto the data stack.

In most other languages, a lot of hand coding would be done to make these compact definitions possible. Fortunately, Real-Forth and FastForth both provide both an assembler and a disassembler, so code definitions can be prototyped and the object values determined rapidly. Such tools are essential for language development.

Another area of optimization is to move the indices and limits of loops into the 68000's data registers. This will produce faster and probably more compact code. (Isn't it nice to have an adequate supply of registers?) Rather arbitrarily, data registers five and six were selected to hold the index and the limit, respectively. (DO) pushes these onto the return stack, and the loop ending operators pop them off.

ASSEMBLER BEGIN,
  2DUP  >R >R  2 # AR0 ADDQ,
  RP [+ DR6 MOV,  RP [+ DR5 MOV,
  AR0 [ JMP,

CODE (LOOP)
  RP [+ AR0 MOV,    1 # DR5 ADDQ,
    LABEL LP2       DR5 DR6 CMP,
                    R> R> GT UNTIL,
    LABEL LP5       AR0 [ AR0 .W ADD,
                    AR0 [ JMP,  ;C

FIXED
>R >R

CODE (+LOOP)
 RP [+ AR0 MOV,
    S [+ DR0 MOV,   DR0 DR5 ADD,
    DR0 TST,        PL LP2 *+ BCC,
    DR5 DR6 CMP,    LT LP5 *+ BCC,
	R> R> AGAIN,   ;C  FIXED

\   dr5: index

dr6: limit
CODE (DO)
RP [ AR0 MOV,       DR5 PP [ MOV,
  DR6 PP -[ MOV,
  S [+ DR5 MOV,     S [+ DR6 MOV,
  AR0 [ JMP, ;C     FIXED

Nuclear gurus are reminded that this is still a fig-Forth nudeus, and there are differences in how the loop operators work between fig-Forth and later standards.

(DO) operates by pushing two items from registers onto the return stack. In order to do this, it first pops the return address into AR0. The loop registers are pushed, and the new index and limit are popped from the data stack. An RTS is emulated by jumping indirect through AR0, which holds the return address.

(LOOP) works by comparing the two data registers. In all cases, the return address is first popped into AR0. If the loop is not exhausted, the offset to return to the beginning of the loop is added to AR0, and a jump indirect through AR0 is executed. If the loop is exhausted, execution branches to the code fragment ahead of (LOOP). There, the return address is adjusted to skip over the offset. The two data registers are popped from the return stack, and execution is resumed with the usual indirect jump through AR0.

Since we have moved loop indices and limits from their traditional places on the return stack, index and limit operators must also change. I must be recoded:

CODE I DR5 S -[ MOV, NEXT ;C

R3 can no longer be aliased to I, and must now be a separate word.

J and other words which access nested loop limits and indices must also be recoded. J now looks like the old I.

We also need a way for the Real-Forth hackers to twiddle the loop index while in a loop. For example, (EXPECT) plays with the loop index when it sees a backspace character. This is handled by writing the new word I!, which allows sufficiently unstructured code to be an eyesore.

CODE I!  S [+ DR5 MOV,  NEXT ;C
\	use to play with index

: (EXPECT)  OVER + OVER
\	add for Atari/IBM PC keyboard
    DO KEY DUP 14 +ORIGIN W@ =
    OVER 16 +ORIGIN W@ = OR
        IF DROP 08 OVER I = DUP I 2- + I! -
        ELSE DUP 0D =
            IF LEAVE DROP BL 0
            ELSE  DUP THEN
        I C!  0 I 1+ C! THEN
    EMIT ( DROP)
    LOOP  DROP ;

The Implications

Such drastic surgery on a nucleus has implications elsewhere in the nucleus, for application coding, and for utility code such as decompilers. Even one's conceptual view of Forth is affected.

The most profound shock, especially for those of us accustomed to fig-Forth styled dictionaries, is that the concepts of the parameter field and code field merge and become one. (This is not, however, an approach toward a Grand Unified field Theory.) The most disconcerting thing for a fig-Forth user is that ' and its relatives can no longer return the parameter field. It may or may not be the same as the code field. However, the code field will always exist. So and its FastForth brethren now return the code field address.

This changes the family of words used to maneuver in the header of a word to the point where they had to be renamed. They now take their names from the field address they expect and the one they return. For example, to navigate from the code field to the name field, one uses C>N. To go the other direction, N>C.

The name change has the benefit of aiding conversion of code from Real-Forth (or other Forths) to FastForth.

A word in this family is C>P, used to get from the code field to the parameter field if there is one. This word must skip over the instruction at the code field, which will be one of three possible subroutine calls. This it does by detecting which instruction is there. It works as follows:

: C>P ( cfa --- pfa | find the parameter field )
  DUP  W@ 6100 = IF   4+  ELSE
  DUP  W@ 4EB9 = IF  6 +  ELSE  2+  THEN THEN ;

Occasionally one has need to go back the other way. That is stranger:

: P>C   \ pfa --- cfa | jump from pseudonfa (pfa) of a voc
                                \ to its code field
  DUP  2- W@  FF00 AND  6100 = IF  2-  ELSE
    DUP  4- W@  6100 =  IF  4-  ELSE
      DUP  6 - W@  4EB9 =  IF  6 -  ELSE  ABORT" bad link"
           THEN THEN THEN ;

P>C makes guesses about which instruction was used, and where it would be if it had been used. This word is not in the nucleus, because it is used so rarely. It was originally constructed to allow vocabulary traversing code to print out the names of the vocabularies in the system as it traversed the linked list of vocabularies.

Another conceptual change will hit the Forth nucleus guru or the person who does much assembly language programming under Forth. This is that the IP and W registers have moved. The Forth instruction pointer is now the processor's instruction pointer - sometimes. W is now the first cell on the processor stack. Usually.

For an example of how this works, let us look at the new definition of VARIABLE. Note that the code field is set by the word CREATE.

: VARIABLE  CREATE , ;

  ASSEMBLER  HERE  *VARIABLE*  !
  RP [+ S -[ MOV,  NEXT

: (CREATE)     FIRST HERE 0A0 + U< 2 ?ERROR
  ?ALIGN -FIND IF DROP C>N ID.   4 MESSAGE SPACE THEN
  HERE  DUP C@ WIDTH F@ MIN 1+ =CELLS ALLOT
  DUP 80 TOGGLE HERE 1- 80 TOGGLE
  LATEST , CURRENT F@ F!  0 W,
  { *VARIABLE* @ } LITERAL <SUB> ;

Since the length field of a variable is never changed from its initial 0, all references to variables are by subroutine. This subroutine call places the return address on the stack. The first instruction in the variable is another subroutine call, to the working code routine for variables. This instruction also places a return address on the return stack. But the second return address points to the variable's allocated storage area, not to code. So all the working code has to do is pop the address off the return stack and push it to the data stack. The next instruction, the RTS, resumes execution at the code which called the variable.

The ability to copy inline code into a word means that the locations of return stack items get rather fuzzy. An item is going to be somewhere on the return stack, but where depends on whether the calling word copied the target word inline or not. For example, a subroutine version of R would have to reach over the return address to get the value on the return stack to be copied. An inline version would not have to skip the return address.

An inline version of is only one instruction, two bytes. It makes sense to copy it inline wherever possible. But it isn't always possible. But it isn't always possible: some of us use the return stack to store things at interpretation time:

BASE @ >R HEX  ...  R> BASE !

The implementor could tell you not to do things like that. Or he could have written another set of state-stupid words for use inside compiled words, and another set of state-stupid words for use outside of compilation, and he could have expected you, the user, to remember the difference.

Instead, we have three words, and which become state smart immediate words! For example:

\ rp [ s [ mov, => 2717
CODE R  OFUSER STATE TST,  NE IF, 2717 #L S -[ MOV,
  'NF W, *+ BRA,  THEN,  4 RP &[  S -[ MOV,
  NEXT  ;C      IMMEDIATE

They all work on the same model. If the system is compiling, then the appropriate opcode is assembled inline with W,. Otherwise, a subroutine version is executed.

This also means that LENGTH may never be set so that is called by subroutine. That is, it may never be less than two.

Execution arrays have also mutated under FastForth. With indirect threading, all references to words in the array were the same length. Thus, indexing into the array was easy: multiply the index by the sized of the reference, add it to the base address of the array, fetch the value there, and execute it. In 32-bit Real-Forth, EXEC is defined as follows (except that it is done in code):

: EXEC   4*  R> + @ EXECUTE ;

The new version is a bit more elaborate. The old EXEC mutates into:

CODE <EXEC>  \ index ---  | index into execution array
  S [+ DR0 MOV,  2 # DR0 ASL,  RP [+ AR0 MOV,  DR0 AR0 ADD,
  AR0 [ AR0 MOV,  AR0 [ JMP,  ;C

And a new compiler directive is added:

: EXEC   COMPILE <EXEC>   BEGIN -FIND  IF  ( found) STATE F@ <
    IF  ,  ELSE  EXECUTE  THEN
    ELSE  0 ?ERROR  THEN  ?STACK  STATE F@ 0= UNTIL  ; IMMEDIATE

EXEC simply compiles a series of cfas inline, until it finds that compilation has been turned off, usually by the word STOP.

Other Improvements

There is much work once can to do to optimize FastForth. Most of these suggestions have been done, at least experimentally. Their implementations and implications will be left as an exercise for the student. There will be a quiz.

Forward referring branches can be made smart enough to make two- or four-byte branches, if one cares to write the code to move the intervening code appropriately at branch resolution time.

Since the processor has a variety of conditional branch instructions, why not make the Forth conditional branches reflect this? The traditional Forth typically compiles two words, one of which performs a test, and the other of which does the branch. Instead, why not make the branch instruction also do the test? For example, the phrase 0=IF might become two inline instructions at compile time, instead of three or more.

We have seen how to move the indices and limits for loops into regiaters. Why not save more time at run time, and force (DO) and (LOOP) (and their ilk) to always be copied inline? This will require changes in the way DO and LOOP operate at compile time.

A major improvement can be made in any Forth by changing the header structure. The traditional fig-Forth header structure places the link field after the name field in memory. This requires dictionary searches to traverse each name field to go to the next word in the dictionary. By placing the link field before the name field, the traverse loop is replaced with a single instruction. Since compilation consists largely of dictionary searches, compilation is greatly speeded up by this improvement.

Interim Results

There are plenty of optimizations yet to make in FastForth. In spite of this one may make some preliminary assessments. The results are not all in yet, but they are definitely promising. For a quick and dirty benchmark, I looked to the Eight Queens problem, as coded by LeVan, Forth Dimensions II #1 p.6, and modified by Wilson M. Federici, (GEnie W.FEDERICI, Compuserve 74756,2716). As I am also using an Atari ST, my results compare directly with Mr. Federici's. However, to speed things up, I made the arrays byte arrays, which eliminate a two place shift and replaced F@ with C@. I also found that the greatest speed for any given version was achieved with FastForth's LENGTH set to 16.

To Federici's results, I add the final five entries:

F32:	8.90 sec.
ForST with CALLS:	7.23 sec.
ForST with MACROS:	3.77 sec.

Real-Forth 1.3 (ITC):
LW cells & 22	7.60 sec.
LW cells & cell*	7.06 sec.
byte cells	5.65 sec.
FastForth 2.0 (J/BTC):
LW cells & 2* 2*	4.87 sec.
byte cells	3.04 sec.

Compilation times improved. For example, compiling the target compiler and then target compiling the FastForth nucleus, approximately 13 kilobytes in size, takes about 120 seconds under Real-Forth. Under FastForth, this improves to under 70 seconds. (As the Atari ST has a real processor, there is enough room to hold the source for all this in Forth's memory, so speed of disk access is excluded from consideration.)

Conclusions

Properly done, conversion of a 68000 32-bit Forth from indirect threaded code to subroutine threaded code will be rewarding in both speed improvements and in application and nucleus size improvements. The speed improvements were expected when the conversion process was started, as was the smaller nucleus. The improved application size was a pleasant surprise.

But the key point is this: however much snappy compilers or other tools may help (or hinder), they are no substitute for competent programming or competent software design. They are especially no substitute for good optimization. And those are all still arts.

Availibility

Persons wishing to experiment with FastForth may implement these techniques on their own target compilers for personal use and experimentation. Those who wish to run the complete FastForth package may obtain a beta site copy for the Atari ST from the author. The author will also discuss ports to other 680x0 machines and ports to other processors with interested parties.

FastForth for the Atari ST, including the above code, may be had in alpha release from the author. Please consult the author for the current state of documentation, etc.

Charles Curley is a long time Forth nuclear guru. When not working on computers he teaches firearms safety and personal self defense. His forthcoming book, Polite Society, covers federal and state firearms legislation in layman terms.

Charles Curley
http://www.charlescurley.com

Optimization Considerations (Life in the FastForth Lane)

Forth Dimensions - January / February 1993 - pages 6 - 12

Charles Curley

Optimization Considerations
(Life in the FastForth Lane)