( * * LANGUAGE : ANS Forth * PROJECT : Forth Environments * DESCRIPTION : Matrix Multiplication * CATEGORY : Benchmark * AUTHOR : Mark Smotherman * LAST CHANGE : June 12, 2000, Marcel Hendrix * ) 1 [IF] true constant ndp? \ -- flag ; true if NDP stack version S" ../../lib/ndp387.fth" INCLUDED char . dp-char ! \ select ANS number conversion char . fp-char ! S" VfxUtil" INCLUDED \ FSL harness for ProForth VFX 3.0 [ELSE] true constant ndp? \ -- flag ; true if NDP stack version c" c:\products\PfwVfx" setmacro VfxDir c" c:\products\PfwVfx\lib" setmacro LibDir c" c:\products\VfSfp\FSL\library" setmacro FslDir c" c:\products\VfSfp\Hfp" setmacro NdpDir ndp? [if] S" %NdpDir%\Ndp387" INCLUDED [else] S" %NdpDir%\Hfp387" INCLUDED [then] char . dp-char ! \ select ANS number conversion char . fp-char ! -short-branches \ disable short forward branches S" %FslDir%\VfxUtil" INCLUDED \ FSL harness for ProForth VFX 3.x S" %FslDir%\DynMem" INCLUDED \ Dynamic memory [THEN] 0 [IF] ============================================================================================================= matrix multiply tests -- C language, version 1.0, May 1993 compile with -DN= I usually run a script file time.script 500 >500.times where the script file contains cc -O -DN=$1 mm.c a.out -n (I suggest at least two runs per method to a.out -n alert you to variations. Five or ten runs a.out -t each, giving avg. and std dev. of times is a.out -t best.) ... Contact Mark Smotherman (mark@cs.clemson.edu) for questions, comments, and to report results showing wide variations. E.g., a wide variation appeared on an IBM RS/6000 Model 320 with "cc -O -DN=500 mm.c" (xlc compiler): 500x500 mm - normal algorithm utime 230.81 secs 500x500 mm - normal algorithm utime 230.72 secs 500x500 mm - temporary variable in loop utime 231.00 secs 500x500 mm - temporary variable in loop utime 230.79 secs 500x500 mm - unrolled inner loop, factor of 8 utime 232.09 secs 500x500 mm - unrolled inner loop, factor of 8 utime 231.84 secs 500x500 mm - pointers used to access matrices utime 230.74 secs 500x500 mm - pointers used to access matrices utime 230.45 secs 500x500 mm - blocking, factor of 32 utime 60.40 secs 500x500 mm - blocking, factor of 32 utime 60.57 secs 500x500 mm - interchanged inner loops utime 27.36 secs 500x500 mm - interchanged inner loops utime 27.40 secs 500x500 mm - 20x20 subarray (from D. Warner) utime 9.49 secs 500x500 mm - 20x20 subarray (from D. Warner) utime 9.50 secs 500x500 mm - 20x20 subarray (from T. Maeno) utime 9.10 secs 500x500 mm - 20x20 subarray (from T. Maeno) utime 9.05 secs The algorithms can also be sensitive to TLB thrashing. On a 600x600 test an IBM RS/6000 Model 30 showed variations depending on relative location of the matrices. (The model 30 has 64 TLB entries organized as 2-way set associative.) 600x600 mm - 20x20 subarray (from T. Maeno) utime 19.12 secs 600x600 mm - 20x20 subarray (from T. Maeno) utime 19.23 secs 600x600 mm - 20x20 subarray (from D. Warner) utime 18.87 secs 600x600 mm - 20x20 subarray (from D. Warner) utime 18.64 secs 600x600 mm - 20x20 btranspose (Warner/Smotherman) utime 17.70 secs 600x600 mm - 20x20 btranspose (Warner/Smotherman) utime 17.76 secs Changing the declaration to include 10000 dummy entries between the b and c matrices (suggested by T. Maeno), i.e., double a[N][N],b[N][N],dummy[10000],c[N][N],d[N][N],bt[N][N]; 600x600 mm - 20x20 subarray (from T. Maeno) utime 16.41 secs 600x600 mm - 20x20 subarray (from T. Maeno) utime 16.40 secs 600x600 mm - 20x20 subarray (from D. Warner) utime 16.68 secs 600x600 mm - 20x20 subarray (from D. Warner) utime 16.67 secs 600x600 mm - 20x20 btranspose (Warner/Smotherman) utime 16.97 secs 600x600 mm - 20x20 btranspose (Warner/Smotherman) utime 16.98 secs I hope to add other algorithms (e.g., Strassen-Winograd) in the near future. P5-166 MHz, 48 MB, iForth 1.11, double-precision, Linux CLK 166 MHz 500x500 mm - normal algorithm 5.69 MFlops, 28.97 ticks/flop, 43.906 s 500x500 mm - blocking, factor of 20 19.21 MFlops, 8.58 ticks/flop, 13.007 s 500x500 mm - transposed B matrix 16.17 MFlops, 10.19 ticks/flop, 15.453 s 500x500 mm - Robert's algorithm 16.60 MFlops, 9.93 ticks/flop, 15.053 s 500x500 mm - T. Maeno's algorithm, subarray 20x20 6.42 MFlops, 25.67 ticks/flop, 38.906 s 500x500 mm - D. Warner's algorithm, subarray 20x20 7.66 MFlops, 21.52 ticks/flop, 32.615 s 500x500 mm - iForth MAT* 61.48 MFlops, 2.68 ticks/flop, 4.065 s CLK 165 MHz 120x120 mm - normal algorithm 19.19 MFlops, 8.59 ticks/flop, 0.180 s 120x120 mm - blocking, factor of 20 20.19 MFlops, 8.16 ticks/flop, 0.171 s 120x120 mm - transposed B matrix 17.49 MFlops, 9.43 ticks/flop, 0.197 s 120x120 mm - Robert's algorithm 18.85 MFlops, 8.75 ticks/flop, 0.183 s 120x120 mm - T. Maeno's algorithm, subarray 20x20 6.47 MFlops, 25.50 ticks/flop, 0.534 s 120x120 mm - D. Warner's algorithm, subarray 20x20 7.79 MFlops, 21.17 ticks/flop, 0.443 s 120x120 mm - iForth MAT* 59.80 MFlops, 2.75 ticks/flop, 0.057 s P5-166 MHz, 48 MB, ProForth VFX for WIN32, Version: 3.000.RC5.0018, Build date: 17 December 1999, double-precision cd examples/benchmrk ok include mm Including mm.FTH Including ../../lib/ndp387.fth (FLITERAL) is redefined Including VfxUtil.FTH VFXUTIL.FTH v1.00 6 October 1999 Try: 'n' mm -- normal 'b' n mm -- using blocking by n, 4 < n < 120 't' mm -- with transposed b matrix 'r' mm -- using Robert's algorithm 'm' n mm -- using Maeno's algorithm with blocking factor n 'w' n mm -- using Warner's algorithm with blocking factor n ALL-TESTS -- test all algorithms ( x ) MEGAFLOPS -- find optimum size for this machine, algorithm 'x' Compile time = 1587 ms, assuming clockspeed = 166 MHz. ok CLK 164 MHz 500x500 mm - normal algorithm 4.37 MFlops, 37.49 ticks/flop, 57.157 s 500x500 mm - blocking, factor of 20 7.61 MFlops, 21.54 ticks/flop, 32.838 s 500x500 mm - transposed B matrix 9.22 MFlops, 17.77 ticks/flop, 27.098 s 500x500 mm - Robert's algorithm 9.32 MFlops, 17.58 ticks/flop, 26.802 s 500x500 mm - T. Maeno's algorithm, subarray 20x20 5.71 MFlops, 28.70 ticks/flop, 43.751 s 500x500 mm - D. Warner's algorithm, subarray 20x20 7.69 MFlops, 21.30 ticks/flop, 32.480 s ok CLK 165 MHz 120x120 mm - normal algorithm 8.97 MFlops, 18.39 ticks/flop, 0.385 s 120x120 mm - blocking, factor of 20 8.10 MFlops, 20.35 ticks/flop, 0.426 s 120x120 mm - transposed B matrix 9.85 MFlops, 16.74 ticks/flop, 0.350 s 120x120 mm - Robert's algorithm 10.09 MFlops, 16.33 ticks/flop, 0.342 s 120x120 mm - T. Maeno's algorithm, subarray 20x20 6.32 MFlops, 26.08 ticks/flop, 0.546 s 120x120 mm - D. Warner's algorithm, subarray 20x20 8.06 MFlops, 20.45 ticks/flop, 0.428 s ok ============================================================================================================= [THEN] \ TOOLS ===================================================================================================== 0 VALUE [S] 0 VALUE [T] 0 CONSTANT U>D : HTAB ( n -- ) out @ - spaces ; \ step to position n : DEC. ( n -- ) BASE @ >R DECIMAL . R> BASE ! ; : DFLOAT[] ( addr ix -- addr' ) DFLOATS + ; : DFVARIABLE CREATE 0e F, ; [UNDEFINED] DF@+ [IF] : DF@+ ( addr -- addr' ) ( F: -- r ) DUP DF@ DFLOAT+ ; [THEN] [UNDEFINED] DF+! [IF] : DF+! ( addr -- ) ( F: r -- ) DUP DF@ F+ DF! ; [THEN] [UNDEFINED] DF!+ [IF] : DF!+ ( addr -- addr' ) ( F: r -- ) DUP DF! DFLOAT+ ; [THEN] [UNDEFINED] DF+!+ [IF] : DF+!+ ( addr -- addr' ) ( F: r -- ) DUP DF@ F+ DF!+ ; [THEN] [UNDEFINED] DDOT [IF] : DDOT ( addr1 inc1 addr2 inc2 count -- ) ( F: -- n ) SWAP DFLOATS >R ROT DFLOATS R> LOCALS| inc2 inc1 | 0e 0 ?DO SWAP DUP DF@ inc1 + SWAP DUP DF@ inc2 + F* F+ LOOP 2DROP ; [THEN] [UNDEFINED] DAXPY [IF] : DAXPY ( addr1 inc1 addr2 inc2 count -- ) ( F: a -- ) SWAP DFLOATS >R ROT DFLOATS R> LOCALS| inc2 inc1 | 0 ?DO FDUP SWAP DUP DF@ F* inc1 + SWAP DUP DF+! inc2 + LOOP 2DROP FDROP ; [THEN] : 2^x ( x -- 2^x ) 1 SWAP 0 ?DO 1 LSHIFT LOOP ; : MS ." MS unknown! " ; : EKEY? KEY? ; : EKEY KEY ; CHAR x CONSTANT 'x' CHAR n CONSTANT 'n' CHAR v CONSTANT 'v' CHAR u CONSTANT 'u' CHAR p CONSTANT 'p' CHAR t CONSTANT 't' CHAR i CONSTANT 'i' CHAR b CONSTANT 'b' CHAR m CONSTANT 'm' CHAR r CONSTANT 'r' CHAR w CONSTANT 'w' CHAR s CONSTANT 's' CHAR . CONSTANT '.' CHAR , CONSTANT ',' CHAR : CONSTANT ':' \ ===================================================================================== FALSE VALUE SHHT? 0 VALUE FLOPS 0 VALUE t/flops 0 VALUE msecs #120 VALUE N DOUBLE DMATRIX a{{ DOUBLE DMATRIX b{{ DOUBLE DMATRIX c{{ DOUBLE DMATRIX d{{ DOUBLE DMATRIX bt{{ 1 DFLOATS CONSTANT DFLOAT1 4 DFLOATS CONSTANT DFLOAT4 8 DFLOATS CONSTANT DFLOAT8 #166 VALUE PROCESSOR-CLOCK 2VARIABLE _ticks_ ( counts clock ticks ) CODE TICKS-GET ( -- d ) SUB EBP, 8 MOV 4 [EBP], EBX rdtsc MOV 0 [EBP], EAX MOV EBX, EDX RET END-CODE CODE K ( -- k ) POP EDX MOV ECX, $18 [ESP] ADD ECX, $1C [ESP] PUSH EDX LEA EBP, -4 [EBP] MOV 0 [EBP], EBX MOV EBX, ECX RET END-CODE : TICKS-RESET ( -- ) TICKS-GET _ticks_ 2! ; TICKS-RESET : TICKS>US ( d -- u ) PROCESSOR-CLOCK UM/MOD NIP ; : TICKS? ( -- u ) TICKS-GET _ticks_ 2@ D- ; : US? ( -- us ) TICKS? TICKS>US ; : CALIBRATE ( -- ) \ ." Calibrate doesn't work! " ; TICKS-RESET #1000 Sleep TICKS? #1000000 UM/MOD NIP TO PROCESSOR-CLOCK ; : ?# ( d -- d ) 2DUP OR 0= IF BL HOLD ELSE # ENDIF ; : .FLOPS ( n -- ) U>D <# BL HOLD # # '.' HOLD # ?# ?# ?# #> TYPE ." MFlops" ; : .TICKS ( n -- ) U>D <# BL HOLD # # '.' HOLD # ?# ?# BL HOLD ',' HOLD #> TYPE ." ticks/flop" ; : .SECS ( n -- ) U>D <# 's' HOLD BL HOLD # # # '.' HOLD # ?# ?# BL HOLD ',' HOLD #> TYPE ; : INIT-RESULT S" TICKS-RESET 0 TO [T] BEGIN " EVALUATE ; IMMEDIATE : (.RES) ( n -- ) DUP N * N * N * 2* 1 OR TICKS? ( -- n fl dti ) 3DUP TICKS>US 1 OR DUP >R #100 SWAP */ TO FLOPS #100. D* ROT UM/MOD NIP TO t/flops R> SWAP #1000 * / TO msecs SHHT? IF EXIT ENDIF FLOPS .FLOPS t/flops .TICKS msecs .SECS ; : .RESULT S" [T] 1+ TO [T] US? #2000000 > UNTIL [T] (.RES) " EVALUATE ; IMMEDIATE \ Set coefficients so that result matrix should have row entries equal to (1/2)*n*(n-1)*i in row i : SET-COEFFICIENTS ( -- ) N 0 ?DO N 0 ?DO J S>F FDUP b{{ J I }} DF! a{{ J I }} DF! LOOP LOOP ; : FLUSH-CACHE ( -- ) N 0 ?DO N 0 ?DO 0e d{{ J I }} DF! LOOP LOOP ; FVARIABLE row_sum FVARIABLE sum : CHECK-RESULT ( -- ) FLOPS 0= IF SHHT? 0= IF CR ." algorithm aborted" ENDIF EXIT ENDIF 0e row_sum F! N N 1- * 2/ S>F sum F! N 0 ?DO I S>F sum F@ F* row_sum F! N 0 ?DO c{{ J I }} DF@ row_sum F@ F<> IF CR ." error in result entry c{{ " J DEC. I DEC. ." }}: " c{{ J I }} DF@ F. ." <> " row_sum F@ F. UNLOOP UNLOOP EXIT ENDIF a{{ J I }} DF@ J S>F F<> IF CR ." error in result entry a{{ " J DEC. I DEC. ." }}: " a{{ J I }} DF@ F. ." <> " J S>F F. UNLOOP UNLOOP EXIT ENDIF b{{ J I }} DF@ J S>F F<> IF CR ." error in result entry b{{ " J DEC. I DEC. ." }}: " b{{ J I }} DF@ F. ." <> " J S>F F. UNLOOP UNLOOP EXIT ENDIF LOOP LOOP ; : NORMAL() ( -- ) SHHT? 0= IF CR N 0 .R 'x' EMIT N 0 .R ." mm - normal algorithm" 54 HTAB ENDIF INIT-RESULT N 0 ?DO c{{ I 0 }} a{{ I 0 }} TO [S] b{{ 0 0 }} N DFLOATS BOUNDS ?DO [S] 1 I N N DDOT DF!+ DFLOAT1 +LOOP DROP LOOP .RESULT ; : TRANSPOSE() ( -- ) SHHT? 0= IF CR N 0 .R 'x' EMIT N 0 .R ." mm - transposed B matrix" 54 HTAB ENDIF INIT-RESULT N 0 ?DO N 0 ?DO b{{ J I }} DF@ bt{{ I J }} DF! LOOP LOOP N 0 ?DO c{{ I 0 }} N 0 ?DO a{{ J 0 }} 1 bt{{ I 0 }} 1 N DDOT DF!+ LOOP DROP LOOP .RESULT ; \ from Monica Lam ASPLOS-IV paper : TILING() ( step -- ) DUP 4 N 1+ WITHIN 0= IF SHHT? 0= IF CR ." mm - blocking step size of " DUP DEC. ." is unreasonable" ENDIF DROP EXIT ENDIF SHHT? 0= IF CR N 0 .R 'x' EMIT N 0 .R ." mm - blocking, factor of " DUP DEC. 54 HTAB ENDIF 0 0 LOCALS| kk jj step | INIT-RESULT N 0 ?DO N 0 ?DO 0e c{{ J I }} DF! LOOP LOOP N 0 ?DO I TO kk N 0 ?DO I TO jj N 0 ?DO a{{ I kk }} kk step + N MIN kk ?DO DF@+ b{{ I jj }} 1 c{{ J jj }} 1 jj step + N MIN jj - DAXPY LOOP DROP LOOP step +LOOP step +LOOP .RESULT ; \ ******************************************** \ * Contributed by Robert Debath 26 Nov 1995 * \ * rdebath@cix.compulink.co.uk * \ ******************************************** : ROBERT() ( -- ) SHHT? 0= IF CR N 0 .R 'x' EMIT N 0 .R ." mm - Robert's algorithm" 54 HTAB ENDIF INIT-RESULT N 0 ?DO N 0 ?DO b{{ J I }} DF@ bt{{ I J }} DF! LOOP LOOP a{{ 0 0 }} TO [S] N 0 ?DO bt{{ 0 0 }} c{{ I 0 }} N 0 ?DO [S] 1 3 PICK 1 N DDOT DF!+ SWAP N DFLOAT[] SWAP LOOP 2DROP N DFLOATS [S] + TO [S] LOOP .RESULT ; 0 [IF] =========================================================================== * Matrix Multiply by Dan Warner, Dept. of Mathematics, Clemson University * * mmbu2.f multiplies matrices a and b * a and b are n by n matrices * nb is the blocking parameter. * the tuning guide indicates nb = 50 is reasonable for the * ibm model 530 hence 25 should be reasonable for the 320 * since the 320 has 32k rather than 64k of cache. * Inner loops unrolled to depth of 2 * The loop functions without clean up code at the end only * if the unrolling occurs to a depth k which divides into n * in this case n must be divisible by 2. * The blocking parameter nb must divide into n if the * multiply is to succeed without clean up code at the end. * * converted to c by Mark Smotherman * note that nb must also be divisible by 2 => cannot use 25, so use 20 =========================================================================== [THEN] DFVARIABLE s10 DFVARIABLE s00 DFVARIABLE s01 DFVARIABLE s11 : WARNER() ( nb -- ) 0 0 0 0 0 LOCALS| 'a 'b ii jj kk nb | SHHT? 0= IF CR N 0 .R 'x' EMIT N 0 .R ENDIF N nb MOD N 2 MOD OR IF SHHT? 0= IF ." mm - Warner's algorithm, the matrix size " N DEC. ." must be divisible both by the block size " nb DEC. ." and 2." ENDIF EXIT ENDIF nb 2 MOD IF SHHT? 0= IF ." mm - block size for Warner method must be evenly divisible by 2" ENDIF EXIT ENDIF SHHT? 0= IF ." mm - D. Warner's algorithm, subarray " nb 0 .R 'x' EMIT nb 0 .R SPACE 54 HTAB ENDIF INIT-RESULT N 0 ?DO I TO ii N 0 ?DO I TO jj nb ii + ii ?DO nb jj + jj ?DO 0e c{{ J I }} DF! LOOP LOOP N 0 ?DO I TO kk nb ii + ii ?DO nb jj + jj ?DO c{{ J I }} DUP DF@+ s00 DF! DUP DF@ s01 DF! c{{ J 1+ I }} DUP DF@+ s10 DF! DUP DF@ s11 DF! a{{ J kk }} TO 'a b{{ kk I }} TO 'b nb kk + kk ?DO 'a DUP DF@ 'b DF@+ F* s00 DF+! SWAP DF@ DF@ F* s01 DF+! 'a N DFLOAT[] DUP DF@ 'b DF@+ F* s10 DF+! SWAP DF@ DF@ F* s11 DF+! DFLOAT1 'a + TO 'a N DFLOATS 'b + TO 'b LOOP s11 DF@ DF! s10 DF@ DF! s01 DF@ DF! s00 DF@ DF! 2 +LOOP 2 +LOOP nb +LOOP nb +LOOP nb +LOOP .RESULT ; 0 [IF] =========================================================================== Matrix Multiply tuned for SS-10/30; * Maeno Toshinori * Tokyo Institute of Technology * * Using gcc-2.4.1 (-O2), this program ends in 12 seconds on SS-10/30. * * in original algorithm - sub-area for cache tiling * #define L 20 * #define L2 20 * three 20x20 matrices reside in cache; two may be enough =========================================================================== [THEN] DFVARIABLE t0 DFVARIABLE t1 DFVARIABLE t2 DFVARIABLE t3 DFVARIABLE t4 DFVARIABLE t5 DFVARIABLE t6 DFVARIABLE t7 : MAENO() ( nb -- ) 0 0 0 0 LOCALS| it kt i2 kk lparm | SHHT? 0= IF CR N 0 .R 'x' EMIT N 0 .R ENDIF N lparm MOD N 4 MOD OR IF SHHT? 0= IF ." mm - Maeno's algorithm, the matrix size " N DEC. ." must be divisible both by the block size " lparm DEC. ." and 4." ENDIF EXIT ENDIF lparm 4 MOD IF SHHT? 0= IF ." mm - block size for Maeno's method must be evenly divisible by 4" ENDIF EXIT ENDIF SHHT? 0= IF ." mm - T. Maeno's algorithm, subarray " lparm 0 .R 'x' EMIT lparm 0 .R SPACE 54 HTAB ENDIF INIT-RESULT N 0 ?DO N 0 ?DO 0e c{{ J I }} DF! LOOP LOOP N 0 ?DO I TO i2 N 0 ?DO I TO kk i2 lparm + TO it kk lparm + TO kt N 0 ?DO it i2 ?DO 0e t0 DF! 0e t1 DF! 0e t2 DF! 0e t3 DF! 0e t4 DF! 0e t5 DF! 0e t6 DF! 0e t7 DF! kt kk ?DO a{{ J I }} DF@ FDUP b{{ I K }} DUP DF@+ F* t0 DF+! FDUP DF@+ F* t1 DF+! FDUP DF@+ F* t2 DF+! DF@ F* t3 DF+! a{{ J 1+ I }} DF@ FDUP DF@+ F* t4 DF+! FDUP DF@+ F* t5 DF+! FDUP DF@+ F* t6 DF+! DF@ F* t7 DF+! LOOP t0 DF@ c{{ I J }} DF+!+ t1 DF@ DF+!+ t2 DF@ DF+!+ t3 DF@ DF+! t4 DF@ c{{ I 1+ J }} DF+!+ t5 DF@ DF+!+ t6 DF@ DF+!+ t7 DF@ DF+! 2 +LOOP 4 +LOOP lparm +LOOP lparm +LOOP .RESULT ; : MM ( char n -- ) DEPTH 0= ABORT" no algorithm chosen" DEPTH 2 < IF 0 ENDIF LOCALS| ur | & a{{ N N }}malloc malloc-fail? & b{{ N N }}malloc malloc-fail? OR & bt{{ N N }}malloc malloc-fail? OR & c{{ N N }}malloc malloc-fail? OR & d{{ N N }}malloc malloc-fail? OR ABORT" MM :: out of core" SET-COEFFICIENTS FLUSH-CACHE CASE 'n' OF NORMAL() ENDOF 't' OF TRANSPOSE() ENDOF 'b' OF ur TILING() ENDOF 'r' OF ROBERT() ENDOF 'm' OF ur MAENO() ENDOF 'w' OF ur WARNER() ENDOF CR ." `" DUP EMIT ." ' is an invalid algorithm" ENDCASE CHECK-RESULT & d{{ }}free & c{{ }}free & bt{{ }}free & b{{ }}free & a{{ }}free ; : ALL-TESTS ( -- ) CR ." CLK " CALIBRATE PROCESSOR-CLOCK DEC. ." MHz" 'n' mm EKEY? IF EKEY DROP EXIT ELSE 'b' 20 mm 't' mm ENDIF EKEY? IF EKEY DROP EXIT ELSE 'r' mm ENDIF EKEY? IF EKEY DROP EXIT ELSE 'm' 20 mm 'w' 20 mm ENDIF ; : NEXT-N ( -- ) N #1200 #1000 */ TO N #17 1 DO N I 2^x DUP 9 #10 */ SWAP #11 #10 */ WITHIN IF I 2^x TO N LEAVE ENDIF LOOP ; : MEGAFLOPS ( sel -- ) DEPTH 0= ABORT" no algorithm chosen" DEPTH 2 < IF 0 ENDIF 0 0 SHHT? N LOCALS| old-N silence? flp ix ur algo | TRUE TO SHHT? CR ." Algorithm = '" algo EMIT [CHAR] ' EMIT ur IF ." , parameter is " ur 0 .R ENDIF ." , clock = " CALIBRATE PROCESSOR-CLOCK DEC. ." MHz" #32 TO N #17 0 DO CR ." testing data size " N 3 .R ." x " N 3 .R ':' EMIT algo ur mm FLOPS flp > IF FLOPS TO flp N TO ix ENDIF FLOPS .FLOPS NEXT-N 0 TO FLOPS EKEY? IF EKEY DROP LEAVE ENDIF LOOP ix IF CR CR ." Maximum: " flp .FLOPS ." at N = " ix DEC. ENDIF silence? TO SHHT? old-N TO N ; : .ABOUT CR ." Try: 'n' mm -- normal" CR ." 'b' n mm -- using blocking by n, 4 < n < " N DEC. CR ." 't' mm -- with transposed b matrix" CR ." 'r' mm -- using Robert's algorithm" CR ." 'm' n mm -- using Maeno's algorithm with blocking factor n" CR ." 'w' n mm -- using Warner's algorithm with blocking factor n" CR CR ." ALL-TESTS -- test all algorithms" CR ." ( x ) MEGAFLOPS -- find optimum size for this machine, algorithm 'x'" ; .ABOUT CR .( Compile time = ) US? #1000 / DEC. .( ms, assuming clockspeed = ) PROCESSOR-CLOCK DEC. .( MHz.) ( * End of Source * )