Optimization Considerations
(Life in the FastForth Lane)
Forth Dimensions - January / February 1993 - pages 6 - 12
Smaller is faster, and faster means smaller - except when it comes to
subroutine vs. direct threading? The author optimized Forth for the 68000,
stating,"Anything one can get by cranking up the clock speed, one can get
by both cranking up the clock speed and by using other techniques ..." The
68000's rich instruction set and addressing modes make it ripe for such
improvements. The result is "a JSR/BSR threaded Forth interpreter/compiler."
Abstract
This paper describes a 68000-based JSR/BSR threaded Forth interpreter/compiler. The compiler
compares a variable and a header field, and either assembles a
JSR or BSR to a
called word, or copies its code inline. The definition of
; is smart enough to replace a
JSR/BSR at the
end of a word with a JMP or
BRA, as appropriate. Several words which are
not traditionally immediate become so, such as >R and constants.
Historical Note
The Forth described herein is FastForth, a full 32-bit BSR/JSR threaded
Forth for the 68000. It is a direct modification of a indirect threaded Forth,
Real-Forth. This is, in turn, a direct descendent of fig-Forth. (Remember
fig-Forth?) Vocabularies, among other things, retain their original flavor.
For those not familiar with 32-bit Forths, the memory operators with the
prefix W operate on word, or 16-bit, memory locations.
The Implementation
It is conventional wisdom among Forth gurus that smaller is faster, and
faster means smaller. The commonly accepted exception to this has been when it
comes to subroutine threading vs. indirect threading. Here, the traditional
argument has been that the two bytes per call (say, on a PDP-11) is worth the
overhead, compared to four bytes per call. This argument is less attractive on
a 8-bit processor, such as the 6502, where a subroutine call is three bytes,
and the interpreter for the indirect threading is some 14 instructions.
"But, if we crank up the clock speed..." someone said. Probably,
someone at Intel, or with equal imagination.
Anything one can get by cranking up the clock speed, one can get by both
cranking up the clock speed and by using other techniques, such as better
compilers. Or better coding. The 68000's rich instruction set and plentiful
supply of addressing modes make it ripe for such improvements.
The traditional Forth compiler looks rather like this:
: INTERPRET BEGIN -FIND IF ( found) STATE @ <
IF CFA , ELSE CFA EXECUTE THEN
ELSE HERE NUMBER DPL @ 1+
IF [COMPILE] DLITERAL ELSE DROP [COMPILE] LITERAL THEN
THEN ?STACK AGAIN STOP
Paleoforthwrights will no doubt recognize this as the fig-Forth compiler.
This system is simple, easy to understand, and fast.
- "Forthwright" is a term invented by Al Kreever. The
"paleo" prefix is my own perversion. I also use the term
"neologist": someone who creates new words. Forth is, after all,
language for people who like to play with words.
It runs a lot faster if parts of it are written in code, of course. With a
32-bit data path and 32-bit code fields, optimization by assembly language
re-coding can go hog wild on the 68000. For example, the word
, (comma) becomes:
CODE , OFUSER DP AR0 MOV,
4 # OFUSER DP ADDQ, ' ! 2+ *+ BRA, ;C
Even with this scheme, any word called will still occupy four bytes for each
call, and have the overhead of NEXT and the
return code. But even with this overhead, many words in the nucleus become both
smaller and faster.
A major step is taken when one moves from indirect threaded code to
subroutine threading. Whole aspects of Forth are affected, often in a very
subtle manner. The code interpreter can stay the much the same. However, it now
calls another word to assemble its calls:
: <BSR> 2- HERE - DUP -80 80 WITHIN
IF FF AND 6100 OR ELSE 6100 W, THEN W, ;
: <SUB> \ addr -- | compile a subroutine to addr
HERE OVER - -8000 7FFF WITHIN
IF <BSR> ELSE 4EB9 W, , THEN ;
: INTERPRET BEGIN -FIND IF ( found) STATE @ <
IF <SUB> ELSE EXECUTE THEN
ELSE HERE NUMBER DPL @ 1+
IF [COMPILE] DLITERAL ELSE DROP [COMPILE] LITERAL THEN
THEN ?STACK AGAIN STOP
<SUB> is now a lot more than
, is, and we have added a lot to the dictionary
that wasn't already there. <SUB>
calculates whether to use a BSR or
JSR instruction, and uses the appropriate one.
<BSR> is smart enough to use a short or
long relative call, as needed.
By making this change, we must also redefine NEXT. Instead of a three-instruction (six-byte)
macro, we now have a one-instruction (two-byte) macro. The instruction is, of
course, RTS.
Now a reference to a called word may be two bytes, four bytes or six,
depending on how far away the call is. We have made NEXT a lot smaller. In the nucleus, there are no
six-byte calls, and quite a lot of two-byte calls. We have reduced the size of
the nucleus considerably, and gained speed.
Another innovation is to get rid of the words 0BRANCH and BRANCH,
which do the work of controlling flow in conditional branches. These, of
course, are replaced with processor instruction equivalents.
0BRANCH and BRANCH occupy six bytes per call, four for the
CFA and two for the displacement if the branch
is taken. The processor instructions occupy four or six bytes each, and run
much faster. The relevant code is:
: 0BRAN 201B6700 ( s [+ dr0 .l mov, 2 *+ eq bcc, ) , 0 W,
HERE 2- ;
: BRAN 60000000 ( 2 *+ bra, ) , HERE 2- ;
: RESOLVE HERE OVER - SWAP W! ;
: IF ?COMP 0BRAN 2 ; IMMEDIATE
: THEN ?COMP 2 ?PAIRS RESOLVE ; IMMEDIATE
: ELSE 2 ?PAIRS BRAN
SWAP 2 [COMPILE] THEN 2 ; IMMEDIATE
: BEGIN ?COMP HERE 1 ; IMMEDIATE
| : SBKWD 10000 / HERE ROT SWAP - 2- FF AND OR W, ;
| : LBKWD HERE ROT SWAP - 2- FFFF AND OR , ;
| : BKWD OVER HERE - 2- -7F 7F WITHIN
IF SBKWD ELSE LBKWD THEN ;
: UNTIL 1 ?PAIRS 201B W, 67000000 BKWD ; IMMEDIATE
: AGAIN 1 ?PAIRS 60000000 BKWD ; IMMEDIATE
: REPEAT >R >R [COMPILE] AGAIN R> R> 2-
[COMPILE] THEN ; IMMEDIATE
: WHILE [COMPILE] IF 2+ ; IMMEDIATE
These words, again, occupy more room in the dictionary than their
predecessors did, but the code compiled by them is so much smaller and faster
that the overhead is worth it.
Because the 68000 has an efficient instruction set, and calls to a word may
be as much as six bytes long, it is possible to have calls to words which
occupy more room than the word itself. Why not copy the guts of such words
inline, and forget the call? We may or may not save space in the dictionary,
but we can get rid of the overhead of the call and the return instructions.
This requires a change in the interpreter:
: <BSR> REL OFF 2- HERE - DUP -80 80 WITHIN
IF FF AND 6100 OR ELSE 6100 W, THEN W, ;
: <SUB> \ addr -- | compile a subroutine to addr
HERE OVER - -8000 7FFF WITHIN
IF <BSR> ELSE 4EB9 W, , THEN ;
: <COMP> \ addr -- | subroutine or inline?
DUP 2- W@ -DUP IF LENGTH @ 1+ <
IF HERE OVER 2- W@ DUP ALLOT CMOVE
ELSE <SUB> THEN ELSE <SUB> THEN ;
: INTERPRET BEGIN -FIND IF ( found) STATE @ <
IF <COMP> ELSE EXECUTE THEN
ELSE HERE NUMBER DPL @ 1+
IF [COMPILE] DLITERAL ELSE DROP [COMPILE] LITERAL THEN
THEN ?STACK AGAIN STOP
The word <COMP> looks at a field in
the word's header, the length field. If the contents of the field is 0, no copy
is made - a subroutine is called instead. If the length field is less than or
equal to the current contents of the user variable LENGTH, an inline copy of the target word is made,
instead of a subroutine call. If the length field is greater than
LENGTH, a subroutine call is made in the normal
fashion.
The use of a variable to determine the cutoff for copying code allows the
programmer to select the best length for such copying. For most uses, this is
set to 6, so that dictionary size is the main consideration. However, it can be
set to any value up to 32K, if the user wants to really go for speed. (None of
this "We're from Microsoft and we know more about your application than
you do" stuff here!)
Even if the length value for compiling the nucleus is set to the reasonable
minimum, four, we still gain. DUP, for example,
has one two byte instruction in it. Is shows up about 70 times in the FastForth
nucleus, for 140 bytes of savings. Thus, even though we have overhead in the
nucleus to copy inline, we still come out ahead in nuclear size. Other
one-instruction two-byte words abound, such as DROP.
The compiler must know the value to which it must set the length field. This
value is best calculated after the word is fully compiled, so the logical place
to do it is in ;. That code looks as follows:
: NXT HERE LATEST N>C - SETLEN 4E75 W, ;
The code to be copied (and hence the length of the word for copying
purposes) must exclude the return instruction at the end. So we make the
calculation before adding the return.
We may have a problem, however. Are there circumstances under which it is
inadvisable to make an inline copy of a word? Answer, yes. One circumstance is
where the source word contains a relative reference, such as a program counter
relative offset. Or a BSR instruction, which is
often. That is why the word <BSR> sets
the user variable REL (for: relocatable) to the
off state. Thus, the code called by the word ; which sets the length field in
the head, must examine REL.
: NXT REL @ IF HERE LATEST N>C - SETLEN THEN 4E75 W, ;
This code requires that the length field be set to 0, which it is by CREATE. And, the relocation indicator must be set to
its default state:
: ?EXEC !CSP CURRENT @ CONTEXT ! CREATE SMUDGE
REL ON LATEST N>C DP ! ] ;
The phrase LATEST N>C DP ! is there
because CREATE sets up a code field pointing to
to the code for variables, and this must be overwritten by a
: definition. This is a by-product of the
decision to modernize the syetm by having CREATE produce variables, instead of the ancient
fig-Forth practice of having it produce headers for code definitions.
A gotcha of 68000s is that the 68000 is word-aligned for word and long-word
memory accesses. That is, either @ must pick up
data a byte at a time and assemble the four bytes, or else it cannot be used on
odd address cells. The latter alternative would be incompatible with other
Forths running on other processors, so the former was selected for Real-Forth,
the immediate ancestor of FastForth. The result is as follows:
CODE @ S [ AR0 .L MOV, \ avoid byte boundary
AR0 [+ S [ .B MOV, \ problems
AR0 [+ 1 S &[ .B MOV, AR0 [+ 2 S &[ .B MOV,
AR0 [ 3 S &[ .B MOV, NEXT ;C
Aside from being ugly, the word takes up 16 bytes in memory. It probably
will be referred to a lot by subroutine calls. However, why not provide both
types of memory access? A version of @
requiring word alignment produces a four-byte word:
CODE F@ \ @ from even address only
S [ AR0 MOV, AR0 [ S [ MOV, NEXT ;C
This word will invariably be copied inline. Furthermore, it will get used a
lot: all variables, user variables, and word or long-word arrays are word
aligned. Thus, careful editing of the nucleus produces a much faster nucleus,
using F@ where appropriate. Alas, the nucleus
grows - but not much. Because this word will be copied inline, application
references to it will produce smaller applications, so the cost is well worth
it.
Similar logic produces F! from
!:
CODE F! \ store to even address only
S [+ AR0 MOV, S [+ AR0 [ MOV, NEXT ;C
Can we squeeze more room out of the nucleus and still accelerate things?
Well, it seems a bit absurd for the last instruction in a word (before
NEXT) to be a subroutine call. Why not force
the call to become a jump? Once execution of the called word ends, the return
instruction there will force execution back to the word which called the
current word. The RTS instruction may then be
omitted from the end of the word. This saves us an instruction in the
dictionary, and two return stack accesses in execution.
The resulting code gets tricky. There are two circumstances under which this
trick is inadvisable: when the last instruction before the return is not a
call, and when a forward branch within the word refers to where the
RTS would be if there were one. To check for
the latter condition, we simply examine a variable maintained by the compiler
word THEN. THEN
is now defined as:
: THEN HERE LASTTHEN F!
?COMP 2 ?PAIRS RESOLVE ; IMMEDIATE
So that we know where the last subroutine call was made, the compiler now
maintains a variable, LASTSUB.
: <SUB> \ addr -- | compile a subroutine to addr
HERE DUP LASTSUB F! OVER - -8000 7FFF WITHIN
IF <BSR> ELSE 4EB9 W, , THEN ;
The code to make it work all operates from ;.
| : DOBSR = IF 60 LASTSUB F@ C! 0 ELSE 1 THEN ;
| : 6SR LASTSUB F@ W@ 4EB9
= IF 4EF9 LASTSUB F@ W! 0 ELSE 1 THEN ;
| : 2SR LASTSUB F@ W@ FF00 AND 6100 DOBSR ;
| : 4SR LASTSUB F@ W@ 6100 DOBSR ;
\ --- fl | 1 indicates failure to change bsr to bra, etc
| : LAST? REL F@ 0= HERE LATEST N>C - LENGTH F@ 1- > OR
IF HERE LASTSUB F@ -
DUP 2 = IF DROP 2SR ELSE DUP 4 = IF DROP 4SR ELSE
6 = IF 6SR ELSE 1 THEN THEN THEN ELSE 1 THEN ;
: NXT REL F@ IF HERE LATEST N>C - SETLEN THEN 4E75 W, ;
: ; ?COMP ?CSP HERE LASTTHEN F@ -
IF LAST? IF NXT THEN ELSE NXT THEN
SMUDGE [COMPILE] [ ; IMMEDIATE
The three words 6SR,
4SR, and 2SR
each handle the three possible subroutine call instructions. They do this by
munging the last subroutine call's opcode into a BRA or JMP opcode, as
appropriate.
- Munging is a PDP-11 hacker slang for stompting on the object code directly.
Attributed to Mung the Merciless.
The word LAST? determines whether to call
one of these words and, if so, which one. It also returns a flag to indicate
whether the final RTS instruction should be
added to the word. Also, the whole process is bypassed if there is a forward
branch pointing to where the RTS would be.
This has the effect of giving us a free, zero-instruction, return, without
having to build a custom processor to do it.
Further Optimizations
One can make further optimizations along the same lines. The fig-Forth word
LIT goes away entirely, to be replaced by an
inline literal instruction. For large values, the following instruction
obtains:
<value> # s [ mov,
For smaller values, the MOVQ instruction
proves useful:
<value> # dr0 movq, dr0 s [ mov,
LITERAL (still figgishly state smart) is
redefined as follows:
: LITERAL STATE F@ IF DUP -80 7F WITHIN IF
FF AND 7000 OR W, 2700 W, ELSE 273C W, , THEN THEN ; IMMEDIATE
Constants can be reworked in a major way. We can produce a word which is not
relocatable, and therefore requires a subroutine call for each reference.
Instead, we make constants into immediate words (!) which simply produce
literals as needed:
: CONSTANT CREATE IMMEDIATE , DOES> F@ [COMPILE] LITERAL ;
Similar surgery may be committed on variables.
: VARIABLE CREATE , IMMEDIATE DOES> LITERAL ;
User variables require a more complex operation at compile time, as they
must compile an opcode and the offset from the user pointer (maintained in a
register on the 68000). In addition, execution at interpretation time is more
complex. The resulting word has a hybrid code and high-level
;CODE portion.
: USER CREATE W, IMMEDIATE ;CODE
RP [+ AR0 MOV, U AC &[ ( ofuser state)
TST, NE IF,
DR0 CLR, AR0 [ DR0 .W MOV,
DR0 S -[ MOV, ]] 41EE W, W, 2708 W,
\ ofuser <n> ar0 mov, ar0 S -[ mov,
[[ ELSE, U DR0 MOV,
AR0 [ DR0 .W ADD,
DR0 S -[ MOV, THEN, NEXT ;C
The first two lines of the ;CODE portion
examines the user variable STATE to determine
whether the system is compiling or interpreting. If the system is compiling,
the next two lines are executed. The offset of the user variable is brought
into a register, sign extended, and pushed onto the data stack Then high-level
words are executed to comma in the first opcode, 41ee, the offset (an argument to the first opcode),
and then finally the second opcode, 2708. This
results in the assembly in line of the following code fragment:
ofuser <n> ar0 mov, ar0 s -[ mov,
If the system is not compiling, the actual address of the user variable is
calculated by adding the offset to the contents of the user register. The
results are pushed onto the data stack.
In most other languages, a lot of hand coding would be done to make these
compact definitions possible. Fortunately, Real-Forth and FastForth both
provide both an assembler and a disassembler, so code definitions can be
prototyped and the object values determined rapidly. Such tools are essential
for language development.
Another area of optimization is to move the indices and limits of loops into
the 68000's data registers. This will produce faster and probably more compact
code. (Isn't it nice to have an adequate supply of registers?) Rather
arbitrarily, data registers five and six were selected to hold the index and
the limit, respectively. (DO) pushes these onto
the return stack, and the loop ending operators pop them off.
ASSEMBLER BEGIN,
2DUP >R >R 2 # AR0 ADDQ,
RP [+ DR6 MOV, RP [+ DR5 MOV,
AR0 [ JMP,
CODE (LOOP)
RP [+ AR0 MOV, 1 # DR5 ADDQ,
LABEL LP2 DR5 DR6 CMP,
R> R> GT UNTIL,
LABEL LP5 AR0 [ AR0 .W ADD,
AR0 [ JMP, ;C
FIXED
>R >R
CODE (+LOOP)
RP [+ AR0 MOV,
S [+ DR0 MOV, DR0 DR5 ADD,
DR0 TST, PL LP2 *+ BCC,
DR5 DR6 CMP, LT LP5 *+ BCC,
R> R> AGAIN, ;C FIXED
\ dr5: index
dr6: limit
CODE (DO)
RP [ AR0 MOV, DR5 PP [ MOV,
DR6 PP -[ MOV,
S [+ DR5 MOV, S [+ DR6 MOV,
AR0 [ JMP, ;C FIXED
Nuclear gurus are reminded that this is still a fig-Forth nudeus, and there
are differences in how the loop operators work between fig-Forth and later
standards.
(DO) operates by pushing two items from
registers onto the return stack. In order to do this, it first pops the return
address into AR0. The loop registers are
pushed, and the new index and limit are popped from the data stack. An
RTS is emulated by jumping indirect through
AR0, which holds the return address.
(LOOP) works by comparing the two data
registers. In all cases, the return address is first popped into
AR0. If the loop is not exhausted, the offset
to return to the beginning of the loop is added to AR0, and a jump indirect through
AR0 is executed. If the loop is exhausted,
execution branches to the code fragment ahead of (LOOP). There, the return address is adjusted to skip
over the offset. The two data registers are popped from the return stack, and
execution is resumed with the usual indirect jump through
AR0.
Since we have moved loop indices and limits from their traditional places on
the return stack, index and limit operators must also change.
I must be recoded:
CODE I DR5 S -[ MOV, NEXT ;C
R3 can no longer be aliased to
I, and must now be a separate word.
J and other words which access nested loop
limits and indices must also be recoded. J now
looks like the old I.
We also need a way for the Real-Forth hackers to twiddle the loop index
while in a loop. For example, (EXPECT) plays
with the loop index when it sees a backspace character. This is handled by
writing the new word I!, which allows
sufficiently unstructured code to be an eyesore.
CODE I! S [+ DR5 MOV, NEXT ;C
\ use to play with index
: (EXPECT) OVER + OVER
\ add for Atari/IBM PC keyboard
DO KEY DUP 14 +ORIGIN W@ =
OVER 16 +ORIGIN W@ = OR
IF DROP 08 OVER I = DUP I 2- + I! -
ELSE DUP 0D =
IF LEAVE DROP BL 0
ELSE DUP THEN
I C! 0 I 1+ C! THEN
EMIT ( DROP)
LOOP DROP ;
The Implications
Such drastic surgery on a nucleus has implications elsewhere in the nucleus,
for application coding, and for utility code such as decompilers. Even one's
conceptual view of Forth is affected.
The most profound shock, especially for those of us accustomed to fig-Forth
styled dictionaries, is that the concepts of the parameter field and code field
merge and become one. (This is not, however, an approach toward a Grand Unified
field Theory.) The most disconcerting thing for a fig-Forth user is that ' and
its relatives can no longer return the parameter field. It may or may not be
the same as the code field. However, the code field will always exist. So and
its FastForth brethren now return the code field address.
This changes the family of words used to maneuver in the header of a word to
the point where they had to be renamed. They now take their names from the
field address they expect and the one they return. For example, to navigate
from the code field to the name field, one uses C>N. To go the other direction,
N>C.
The name change has the benefit of aiding conversion of code from Real-Forth
(or other Forths) to FastForth.
A word in this family is C>P, used to get
from the code field to the parameter field if there is one. This word must skip
over the instruction at the code field, which will be one of three possible
subroutine calls. This it does by detecting which instruction is there. It
works as follows:
: C>P ( cfa --- pfa | find the parameter field )
DUP W@ 6100 = IF 4+ ELSE
DUP W@ 4EB9 = IF 6 + ELSE 2+ THEN THEN ;
Occasionally one has need to go back the other way. That is stranger:
: P>C \ pfa --- cfa | jump from pseudonfa (pfa) of a voc
\ to its code field
DUP 2- W@ FF00 AND 6100 = IF 2- ELSE
DUP 4- W@ 6100 = IF 4- ELSE
DUP 6 - W@ 4EB9 = IF 6 - ELSE ABORT" bad link"
THEN THEN THEN ;
P>C makes guesses about which instruction
was used, and where it would be if it had been used. This word is not in the
nucleus, because it is used so rarely. It was originally constructed to allow
vocabulary traversing code to print out the names of the vocabularies in the
system as it traversed the linked list of vocabularies.
Another conceptual change will hit the Forth nucleus guru or the person who
does much assembly language programming under Forth. This is that the
IP and W
registers have moved. The Forth instruction pointer is now the processor's
instruction pointer - sometimes. W is now the first cell on the processor
stack. Usually.
For an example of how this works, let us look at the new definition of
VARIABLE. Note that the code field is set by
the word CREATE.
: VARIABLE CREATE , ;
ASSEMBLER HERE *VARIABLE* !
RP [+ S -[ MOV, NEXT
: (CREATE) FIRST HERE 0A0 + U< 2 ?ERROR
?ALIGN -FIND IF DROP C>N ID. 4 MESSAGE SPACE THEN
HERE DUP C@ WIDTH F@ MIN 1+ =CELLS ALLOT
DUP 80 TOGGLE HERE 1- 80 TOGGLE
LATEST , CURRENT F@ F! 0 W,
{ *VARIABLE* @ } LITERAL <SUB> ;
Since the length field of a variable is never changed from its initial
0, all references to variables are by
subroutine. This subroutine call places the return address on the stack. The
first instruction in the variable is another subroutine call, to the working
code routine for variables. This instruction also places a return address on
the return stack. But the second return address points to the variable's
allocated storage area, not to code. So all the working code has to do is pop
the address off the return stack and push it to the data stack. The next
instruction, the RTS, resumes execution at the
code which called the variable.
The ability to copy inline code into a word means that the locations of
return stack items get rather fuzzy. An item is going to be somewhere on the
return stack, but where depends on whether the calling word copied the target
word inline or not. For example, a subroutine version of R would have to reach
over the return address to get the value on the return stack to be copied. An
inline version would not have to skip the return address.
An inline version of is only one instruction, two bytes. It makes sense to
copy it inline wherever possible. But it isn't always possible. But it isn't
always possible: some of us use the return stack to store things at
interpretation time:
BASE @ >R HEX ... R> BASE !
The implementor could tell you not to do things like that. Or he could have
written another set of state-stupid words for use inside compiled words, and
another set of state-stupid words for use outside of compilation, and he could
have expected you, the user, to remember the difference.
Instead, we have three words, and which become state smart immediate words!
For example:
\ rp [ s [ mov, => 2717
CODE R OFUSER STATE TST, NE IF, 2717 #L S -[ MOV,
'NF W, *+ BRA, THEN, 4 RP &[ S -[ MOV,
NEXT ;C IMMEDIATE
They all work on the same model. If the system is compiling, then the
appropriate opcode is assembled inline with W,. Otherwise, a subroutine version
is executed.
This also means that LENGTH may never be set
so that is called by subroutine. That is, it may never be less than two.
Execution arrays have also mutated under FastForth. With indirect
threading, all references to words in the array were the same length. Thus,
indexing into the array was easy: multiply the index by the sized of the
reference, add it to the base address of the array, fetch the value there, and
execute it. In 32-bit Real-Forth, EXEC is
defined as follows (except that it is done in code):
: EXEC 4* R> + @ EXECUTE ;
The new version is a bit more elaborate. The old EXEC mutates into:
CODE <EXEC> \ index --- | index into execution array
S [+ DR0 MOV, 2 # DR0 ASL, RP [+ AR0 MOV, DR0 AR0 ADD,
AR0 [ AR0 MOV, AR0 [ JMP, ;C
And a new compiler directive is added:
: EXEC COMPILE <EXEC> BEGIN -FIND IF ( found) STATE F@ <
IF , ELSE EXECUTE THEN
ELSE 0 ?ERROR THEN ?STACK STATE F@ 0= UNTIL ; IMMEDIATE
EXEC simply compiles a series of cfas
inline, until it finds that compilation has been turned off, usually by the
word STOP.
Other Improvements
There is much work once can to do to optimize FastForth. Most of these
suggestions have been done, at least experimentally. Their implementations and
implications will be left as an exercise for the student. There will be a quiz.
Forward referring branches can be made smart enough to make two- or
four-byte branches, if one cares to write the code to move the intervening code
appropriately at branch resolution time.
Since the processor has a variety of conditional branch instructions, why
not make the Forth conditional branches reflect this? The traditional Forth
typically compiles two words, one of which performs a test, and the other of
which does the branch. Instead, why not make the branch instruction also do the
test? For example, the phrase 0=IF might become
two inline instructions at compile time, instead of three or more.
We have seen how to move the indices and limits for loops into regiaters.
Why not save more time at run time, and force (DO) and (LOOP) (and
their ilk) to always be copied inline? This will require changes in the way
DO and LOOP
operate at compile time.
A major improvement can be made in any Forth by changing the header
structure. The traditional fig-Forth header structure places the link field
after the name field in memory. This requires dictionary searches to traverse
each name field to go to the next word in the dictionary. By placing the link
field before the name field, the traverse loop is replaced with a single
instruction. Since compilation consists largely of dictionary searches,
compilation is greatly speeded up by this improvement.
Interim Results
There are plenty of optimizations yet to make in FastForth. In spite of this
one may make some preliminary assessments. The results are not all in yet, but
they are definitely promising. For a quick and dirty benchmark, I looked to the
Eight Queens problem, as coded by LeVan, Forth Dimensions II #1 p.6, and
modified by Wilson M. Federici, (GEnie W.FEDERICI, Compuserve 74756,2716). As I
am also using an Atari ST, my results compare directly with Mr. Federici's.
However, to speed things up, I made the arrays byte arrays, which eliminate a
two place shift and replaced F@ with
C@. I also found that the greatest speed for
any given version was achieved with FastForth's LENGTH set to 16.
To Federici's results, I add the final five entries:
F32: |
8.90 sec. |
ForST with CALLS: |
7.23 sec. |
ForST with MACROS: |
3.77 sec. |
|
|
Real-Forth 1.3 (ITC): |
|
LW cells & 2*2* |
7.60 sec. |
LW cells & cell* |
7.06 sec. |
byte cells |
5.65 sec. |
FastForth 2.0 (J/BTC): |
|
LW cells & 2* 2* |
4.87 sec. |
byte cells |
3.04 sec. |
Compilation times improved. For example, compiling the target compiler and
then target compiling the FastForth nucleus, approximately 13 kilobytes in
size, takes about 120 seconds under Real-Forth. Under FastForth, this improves
to under 70 seconds. (As the Atari ST has a real processor, there is enough
room to hold the source for all this in Forth's memory, so speed of disk access
is excluded from consideration.)
Conclusions
Properly done, conversion of a 68000 32-bit Forth from indirect threaded
code to subroutine threaded code will be rewarding in both speed improvements
and in application and nucleus size improvements. The speed improvements were
expected when the conversion process was started, as was the smaller nucleus.
The improved application size was a pleasant surprise.
But the key point is this: however much snappy compilers or other tools may
help (or hinder), they are no substitute for competent programming or competent
software design. They are especially no substitute for good optimization. And
those are all still arts.
Availibility
Persons wishing to experiment with FastForth may implement these techniques
on their own target compilers for personal use and experimentation. Those who
wish to run the complete FastForth package may obtain a beta site copy for the
Atari ST from the author. The author will also discuss ports to other 680x0
machines and ports to other processors with interested parties.
FastForth for the Atari ST, including the above code, may be had in alpha
release from the author. Please consult the author for the current state of
documentation, etc.
Charles Curley is a long time Forth nuclear guru. When not working on
computers he teaches firearms safety and personal self defense. His forthcoming
book, Polite Society, covers federal and state firearms legislation in
layman terms.
Charles Curley
http://www.charlescurley.com
|