Assignment 4 - Answers

Consider the following 80X86 code segment. Assume that when the code is run, CX is non-zero.
```
     MOV    AX, 0
  L: ADD    AX, BX
     ADC    DX, 0
     LOOP   L
```
1. What is the result of running the code?
  The code does an unsigned multiply DX:AX = BX*CX
2. Re-write the above code segment for better efficiency.
  The following code does the same operation:
```
      MUL    CX
```
Consider the following 80X86 code segment.
```
        MOV    AX, 0
     L: SHR    BX, 1
        ADC    AX, 0
	ADD    BX, 0
	JNZ    L
```
We wish to write this code without using any form of LOOP, and with no conditional jumps (we can use other forms of JMP, however). Assume that CS=SS=DS=SS for simplicity, and write a code segment that provides the same results for AX (but may use or destroy other registers). You may NOT "unfold" the above loop! Note: the basic idea suffices here. Bonus for the full implementation.
There are at least 3 ways to do it: (1) jump table (2) interrupt on overflow (3) self-modifying code. These work as follows:

Write 80X86 instructions to zero out the AL register on an 80X86:

List 5 different ways to zero the AL register with ONE instruction.
MOV AL,0 XOR AL, AL SUB AL, AL AND AL, 0 SHL AL, 8 MOV AX, 0 ; As well as anything else that zeroes AX

List 1 way to zero the AL register with 2 instructions, each different from all the instructions used in part 1.
OR AL, 0FFH INC AL

List the SHORTEST POSSIBLE CODE for adding 5 to Y and subtracting 2 from X which are defined as follows (consecutive locations):
```
    X    DB     1
    Y    DB     7
```
ADD Y, 5 SUB X, 2

If we know that subtracting 2 from X causes no overflow, is there a shorter way to do that? How?
Yes, we can do it with one instruction on a 16-bit WORD operand. The requirement for no overflow is to assure correctness, because we do NOT want a carry from Y into X! Code is an addition, as follows (can also be done with subtraction - a different immediate operand):
```
    ADD    WORD PTR X, 0FE05    ; Add 5 to low operand (Y), and
                                ; add 0FE = -2 to high operand (X)
```
Consider the following code for Motorola 68000 (comments state what each instruction does, according to the 68000 instruction manual (no, we did not teach you that machine, you should learn the relevant details on your own from the exercise)). Note that the 68000 has 32-bit registers D0-D7 and A0-A7, where A7 is also the STACK POINTER.
```
 F: MOVE.L   D0, -(A7)     ; D0 to memory - predecrement mode
    SUBQ.L   #1, D0        ; Subtract immediate - long (32 bit) operand
    BMI      N             ; Branch (jump) if result was negative
    JSR      F             ; CALL subroutine F (push PC then jump to F)
    MULS.L   (A7)+, D1:D2  ; Signed multiply memory (postincrement) 
                           ; with D2, result in D1:D2
    RTS                    ; Return from procedure/subroutine
 N: MOVE.L   (A7)+, D0     ; Move memory to D0 - postincrement mode
    MOVEQ.L  #0, D1        ; Move immediate to D1
    MOVEQ.L  #1, D2
    RTS
```
What happens if we execute an instruction JSR F, with D0=0? D0=1? Other values of D0 (call that value k)?
The code pushes its argument, D0, onto the stack, and restores it before returning. Also, We get to N if D0 was 0 - and return a 64-bit quantity, 1 in this case. This is the base case of the recursive call. In all other cases, we multiply the argument D0 (popped from stack) by whatever value was returned by the same function, but with the argument, k=D0, decremented by 1. That is, the function computes: IF k=0, return 1. Else, return F(k-1)*k. This is exactly the definition of factorial, k!

On Intel 80486, suppose that AX = 0000000010101111, BX = 0000000001001110, CX = 0000000010000101, and Carry Flag = 1. What is the result of the following operations (also describe the state of the carry and overflow flags after execution of each)

a.  ADD    CL, AL


Here and elsewhere, the extra result bit on the left denotes the carry.
No overflow unless stated...

    10000101
  + 10101111
  -----------
  1 00110100



b.  ADC    AL, 0E5H


           1
  + 10101111
  + 11100101
  -----------
  1 10010101


c.  ADD    AX, 0045H


      0000000010101111
    + 0000000001000101
   --------------------
    0 0000000011110100


d.  SUB	   BL, 69H


     01001110
   - 01101001
  ------------
   1 11100101


e.  RCR	   AX, 1

   1  1000000001010111

   OF = 1

Write instruction sequences to perform some common set operations, for 80X86.

Each set is a subset of [1..16] is represented by corresponding bits in the register (e.g., AX=1000000000000101 represents {1,3,16}).

Use the following table. Each entry contains a single bit. The index into the table selects which bit is set (e.g., the value at index zero has bit zero set).

 
BitTbl          dw   1, 2, 4, 8
                dw   10h, 20h, 40h, 80h
                dw   100h, 200h, 400h, 800h
                dw   1000h, 2000h, 4000h, 8000h

a. Remove - removes the specified item from the set.

   BX contains a value in the range 0..15.
   AX contains a set.


      ADD   BX, BX
      MOV   BX, [BitTbl+BX]    ; Get 1 in position specified by element
      XOR   AX, BX             ; remove from set


b. Odd - Even sets the zero flag if AX contains a set of numbers
   that are all odd.


      AND   AX, 5555H          ; Mask (get rid) of odd bits, zero flag
                               ; is set if there are no evens
      ADD   AX, 0FFFFH         ; create carry if AX was non-0
      MOV   AX, 0FFFFH
      ADC   AX, 0              ; Get 0 only if AX was non-0



c. Member - Member clears the zero flag if BX is an element
            of the AX set, it sets the zero flag otherwise.

   BX contains a value in the range 0..15.
   AX contains a set.


      ADD   BX, BX
      MOV   BX, [BitTbl+BX]    ; Get 1 in position specified by element
      AND   BX, AX             ; Mask to get only relevant bit (affects ZF)


d. UnionSets - Union computes AX := AX union BX.
          
   AX and BX contain the sets.


     OR   AX, BX


e. Intersection - Intersection computes AX := AX intersect BX.
          
   AX and BX contain the sets.



     AND  AX, BX


f. Complement - Complement computes AX := - AX, that is, all elements in
   the set are removed, and all elements not in the set are added.


     NOT AX

Repeat the above exercise for MIPS, where sets now have elements [1..32], values are provided in $a0 and $a1 instead of AX and BX, respectively, and returned in $v0 (for the member function, set $v0 to 1 if true, and to 0 if false). Write the new version of BitTbl necessary to answer this question.


The table is:
BitTbl:         .word   1, 2, 4, 8
                .word   0x10, 0x20, 0x40, 0x80
                .word   0x100, 0x200, 0x400, 0x800
                .word   0x1000, 0x2000, 0x4000, 0x8000
                .word   0x10000, 0x20000, 0x40000, 0x80000
                .word   0x100000, 0x200000, 0x400000, 0x800000
                .word   0x1000000, 0x2000000, 0x4000000, 0x8000000
                .word   0x10000000, 0x20000000, 0x40000000, 0x80000000

a) Remove:

        la      $t0, BitTbl        ; Address of table to $t0
        sll     $t1, $a1, 2        ; Multiply $a1 by 4
        add     $t1, $t1, $t0      ; Compute address in table
        lw      $t2, 0($t1)        ; Get word from table
        and     $v0, $a0, $a1

b) Odd:

        and     $t0, $a0, 0x55555555
        seq     $v0, $zero, $t0       ; Set $v0 to 1 if result was 0

c) Member:

        la      $t0, BitTbl        ; Address of table to $t0
        sll     $t1, $a1, 2        ; Multiply $a1 by 4
        add     $t1, $t1, $t0      ; Compute address in table
        lw      $t2, 0($t1)        ; Get word from table
        and     $v0, $t2, $a0
        sne     $v0, $zero, $v0    ; Set $v0 to 1 if $v0 not 0.

d) Union:

        or      $v0, $a0, $a1

e) Intersection:

        and      $v0, $a0, $a1

f) Complement:

        not      $v0, $a0

From the Patterson and Hennessy textbook: exercise 3.1 on page 197 (copy available at student office).
Initially, $t0 (which is the output value) is initialized to 0, and $t1 to 1. Now, the program loops as long as $t1 is less than $a0 (or n). Each time, the accumulator $t0 is incremented by $t1, which is then incremented by 2, in turn. Thus, the sequence of values of $t1 is: 1, 3, 5, 7, 9 and that of $t0 is 0, 1, 4, 9, 16 ...
To sum up, the code computes the square of n.
GraphicsPower is a machine that runs in 500MHz with three groups of instructions: Int, Float, and Graphics. The favorite flight simulation program spends 20% of its runtime in Int instructions, 30% in Float instructions, and 50% in Graphics instructions. Any imporovment on the Graphics instructions worsen the Float instructions by 25% of the same improvement. What is the maximum improvement that can by achieved by improving the Graphics instructions only?
Note that the clock rate is irrelevant. Now, the answer depends on the semantics of the statement: "Any imporovment on the Graphics instructions worsen the Float instructions by 25% of the same improvement." which is ambiguous. Assuming we read this: improving Graphics by DECREASING the runtime of each instruction by s factor of x, we INCREASE the runtime of each Float instruction by 0.25x of its timing. In this case, the average time per instruction is LINEAR in x, and the result must be at some hard limit - in this case when Graphics instructions take ZERO time. Consider a typical time unit in the original processor. With the new processor the timing will be: 0.2 + 0.5 (1-x) + 0.3 (1+0.25)x. With x=1 (Graphic instructions infinitely fast) we have, the time for the same run will be 0.2+0+0.375 = 0.575, or saving 42.5% of the execution time. (Note that some people will call this: "74% faster"..., because 1/0.575=1.74...)
Power-A is an 800 MHz machine with three groups of instructions G1, G2 , and G3. Instructions in G1 run in 3 clock cycles, in G2 run in 4 clock cycles, and in G3 run in 6 clock cycles. Programs Pa and Pb have the same number of instructions but different distributions among the groups G1, G2, and G3. Program Pa has 35% G1 instructions, 40% G2, and 25% G3. Program Pb has 40% G1, 30% G2, and 40% G3 instructions. Which program runs faster, Pa or Pb?
Again, clock rate is irrelevant. An average instruction in program Pa takes 3*0.35+4*0.4+6*0.25 = 4.15 cycles. An average instruction in program Pb takes 3*0.4+4*0.3+6*0.4 = 4.8 cycles. So, program Pa is faster.
We need to add the jal (jump and link) instruction to our basic MIPS architecture (figure 5.33). Modify the datapath as necessary, and add the microinstructions that are needed to do this.
We need to modify the hardware in 2 ways:
With the above changes, and assuming that there is already an entry in the DISPTCH 1 ROM for the jal instruction, we need to add just 1 micro instruction, for the third cycle of the instruction (i.e. fetch and decode are the same as all other istructions). This new microinstruction will have the fields:
Register control: RegDst = 2, MemtoReg = 2, RegWrite = 1, Sequencing = FETCH

Note that there are many other possible solutions as well.

Given the following MIPS assembly language code segment (assuming that the program starts at address 1000H),

        addi     $t0, $zero, 2
        addi     $t1, $zero, 0
  loop: lw       $t2, 50($t1)
        slt      $t3, $t2, $s3
        sw       $t3, 300($t1)
        addi     $t1, $t1, 1
        bne      $t1, $t0, loop

Generate the resulting machine code (manually).
Address Instruction Mnemonic Assembler 0x00001000 0x20080002 addi $8, $0, 2 ; 1: addi $t0, $zero, 2 0x00001004 0x20090000 addi $9, $0, 0 ; 2: addi $t1, $zero, 0 0x00001008 0x8d2a0032 lw $10, 50($9) ; 3: lw $t2, 50($t1) 0x0000100c 0x0153582a slt $11, $10, $19 ; 4: slt $t3, $t2, $s3 0x00001010 0xad2b012c sw $11, 300($9) ; 5: sw $t3, 300($t1) 0x00001014 0x21290001 addi $9, $9, 1 ; 6: addi $t1, $t1, 1 0x00001018 0x15283ffc bne $9, $8, -16 [loop-0x1018] ; 7: bne $t1, $t0, loop

Using the basic architecture of figure 5.33, show a timing diagram of the execution of this program - how many clock cycles does it take to execute?
First, the individual instructions. addi should take the same as R-type, 4 cycles. Branch takes 3. sw takes 4, and lw takes 5. slt takes 4, as it is an R-type instruction. The loop is executed twice, as $t1 starts at 0 and gets incremented each pass, until it equals $t0, which is 2. Timing diagram is thus:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 | addi | addi | lw | slt | sw | addi | bne | F D A R F D A R F D A M R F D A R F D A M F D A R F D J

And the entire string is repeated for 28 more cycles, requiring all in all 56 cycles to execute. Above, F stands for Fetch, D for Decode, A for ALU operation, M for memory access, and R for writing to register.
Using a pipelined architecture, show a timing diagram of the execution of the program - how many clock cycles are saved?
Sorry, this is hard to do in HTML, so come and ask...

Fibbonacci numbers are defined recursively as:

f(0) = f(1) = 1;
f(n) = f(n-1)+f(n-2)

Write a recursive procedure for 80X86 that gets n in EAX and returns f(n) in EAX, without destroying values in any other register. You may assume that f(n) will fit in 32 bits. CLUE: if your program is more than 20 instructions, you have done something wrong...

We will assume that the value in EAX is unsigned, and NOT TOO LARGE
(i.e. no overflow).

F:     PROC NEAR
       CMP  EAX, 2
       JC   DONE

       PUSH EBX              ; General case. Save EBX (will be used as temp)
       DEC  EAX
       MOV  EBX, EAX          ; Copy n-1 to EBX as temp
       CALL F                 ; Computes f(n-1)
       XCHG EAX, EBX          ; Now EAX=n-1,  EBX=f(n-1)
       DEC  EAX
       CALL F                 ; Computes f(n-2)
       ADD  EAX, EBX          ; Adds them
       POP  EBX               ; Restore EBX and return
       RET  NEAR

DONE:  MOV  EAX, 1            ; Base case - EAX less than 2
       RET  NEAR              ; I never count on defaults...
       ENDP

Write a code segment that gets a pointer in BX and a count in CX, and then outputs CX bytes starting at address [BX], outputs them to port 10H, computes their 2-byte checksum and outputs that at the end as well. How would your program be different if the output were to memory-mapped IO? How many errors can the checksome detect, in the worst case (i.e. what is the distance of this code?)
Assuming that CX strictly greater than 0:
```
OutDev  EQU  10H

       MOV   DX, 0         ; DX will keep checksum
       MOV   AH, 0         ; Use as high-order byte of AX operand later on...
L:     MOV   AL, [BX]
       OUT   OutDev, AL
       ADD   DX, AX        ; Update checksum
       INC   BX
       LOOP  L
       MOV   AL, DL        ; Now output checksum, least-significant first
       OUT   OutDev, AL
       MOV   AL, DH
       OUT   OutDev, AL
```
For memory mapped IO, if memory address of device register is OutDev, replace all OUT instructions by
MOV BYTE PTR OutDev, AL

One bit error leads to a wrong checksum. Thus, distance is at least 2. But 2 "complementary" errors in different bytes in the same bit position do not change the checksum, and the distance is also AT MOST 2. Therefore, the distance of this code is exactly 2, i.e. it can be used to detect, but not fix, one error.

Due date: Thursday, January 31, 2002

Note: work and submission in this exercise is SOLO.