1. Consider the following 80X86 code segment. Assume that when the code is run, CX is non-zero.
```     MOV    AX, 0
LOOP   L
```
1. What is the result of running the code?

The code does an unsigned multiply DX:AX = BX*CX

2. Re-write the above code segment for better efficiency.

The following code does the same operation:

```      MUL    CX
```
2. Consider the following 80X86 code segment.
```        MOV    AX, 0
L: SHR    BX, 1
JNZ    L
```

We wish to write this code without using any form of LOOP, and with no conditional jumps (we can use other forms of JMP, however). Assume that CS=SS=DS=SS for simplicity, and write a code segment that provides the same results for AX (but may use or destroy other registers). You may NOT "unfold" the above loop! Note: the basic idea suffices here. Bonus for the full implementation.

There are at least 3 ways to do it: (1) jump table (2) interrupt on overflow (3) self-modifying code. These work as follows:

• Instead of JNZ L we use JMP [DX], where DX is set to TBL or TBL+2, and with TBL containing pointers to L and the instruction after the jump, respectively. But first, we need to set DX to 0 if BX is 0, and to 1 otherwise, and this can be done by setting the carry flag and shifting it conditionally. So, what we get instead of the JNZ L is:
```        MOV    DX, BX
MOV    DX, 0
RCL    DX            ; DX is 1 if BX was not 0, and is 0 otherwise
RCL    DX            ; and is now 2 or 0, respectively
JMP    DX
L1:
```

The jump table should be:

```    TBL  DW    OFFSET L1,  OFFSET L
```
• The second scheme should modify the interrupt vector for interrupt on overflow, in order to exit the loop. We need to generate an overflow if BX is 0. This can be done as follows:
```        MOV    DX, BX
MOV    DX, 0
CMC                 ; Invert carry flag
RCL    DX
ADD    DX, 07FFF    ; Generates an oveflow unless DX is 0
INTO                ; Interrupt on overflow
L0:  JMP    L
L1:
```
Several more things need to be done - (1) setting up the interrupt vector, (saving its previous value elsewhere) and (2) writing the interrupt routine. Now (1) is simple (and assumes that we are ALLOWED to do it!), but the interrupt routine should do the following: (1) Restore the interrupt vector to its original value, (2) Change the return address on the stack to L1 (it was L0 during the interrupt), and (3) return from the interrupt.
• Self-modifying code (again, assumes we do not have write-protection on this memory by hardware!) - use just JMP L and change the jump address. This works as follows:
```        MOV    DX, BX
CMC
MOV    DL, 0
RCL    DL, 5           ; Get 32 in DX if we want to exit the loop
L0:  JMP    SHORT L         ; Short is essential - so address is a byte!
NOP
...                    ; A sufficient number of NOPs so that L+32
NOP                    ; is in the NOPs...
SUB    L0+1, DL        ; Restore code to original
```
3. Write 80X86 instructions to zero out the AL register on an 80X86:
1. List 5 different ways to zero the AL register with ONE instruction.
```            MOV       AL,0
XOR       AL, AL
SUB       AL, AL
AND       AL, 0
SHL       AL, 8
MOV       AX, 0       ; As well as anything else that zeroes AX
```
2. List 1 way to zero the AL register with 2 instructions, each different from all the instructions used in part 1.
```            OR        AL, 0FFH
INC       AL
```

4. List the SHORTEST POSSIBLE CODE for adding 5 to Y and subtracting 2 from X which are defined as follows (consecutive locations):
```    X    DB     1
Y    DB     7
```
```    ADD   Y, 5
SUB   X, 2
```

If we know that subtracting 2 from X causes no overflow, is there a shorter way to do that? How?

Yes, we can do it with one instruction on a 16-bit WORD operand. The requirement for no overflow is to assure correctness, because we do NOT want a carry from Y into X! Code is an addition, as follows (can also be done with subtraction - a different immediate operand):

```    ADD    WORD PTR X, 0FE05    ; Add 5 to low operand (Y), and
; add 0FE = -2 to high operand (X)
```

5. Consider the following code for Motorola 68000 (comments state what each instruction does, according to the 68000 instruction manual (no, we did not teach you that machine, you should learn the relevant details on your own from the exercise)). Note that the 68000 has 32-bit registers D0-D7 and A0-A7, where A7 is also the STACK POINTER.
``` F: MOVE.L   D0, -(A7)     ; D0 to memory - predecrement mode
SUBQ.L   #1, D0        ; Subtract immediate - long (32 bit) operand
BMI      N             ; Branch (jump) if result was negative
JSR      F             ; CALL subroutine F (push PC then jump to F)
MULS.L   (A7)+, D1:D2  ; Signed multiply memory (postincrement)
; with D2, result in D1:D2
RTS                    ; Return from procedure/subroutine
N: MOVE.L   (A7)+, D0     ; Move memory to D0 - postincrement mode
MOVEQ.L  #0, D1        ; Move immediate to D1
MOVEQ.L  #1, D2
RTS
```

What happens if we execute an instruction JSR F, with D0=0? D0=1? Other values of D0 (call that value k)?

The code pushes its argument, D0, onto the stack, and restores it before returning. Also, We get to N if D0 was 0 - and return a 64-bit quantity, 1 in this case. This is the base case of the recursive call. In all other cases, we multiply the argument D0 (popped from stack) by whatever value was returned by the same function, but with the argument, k=D0, decremented by 1. That is, the function computes: IF k=0, return 1. Else, return F(k-1)*k. This is exactly the definition of factorial, k!

6. On Intel 80486, suppose that AX = 0000000010101111, BX = 0000000001001110, CX = 0000000010000101, and Carry Flag = 1. What is the result of the following operations (also describe the state of the carry and overflow flags after execution of each)
```a.  ADD    CL, AL

Here and elsewhere, the extra result bit on the left denotes the carry.
No overflow unless stated...

10000101
+ 10101111
-----------
1 00110100

1
+ 10101111
+ 11100101
-----------
1 10010101

0000000010101111
+ 0000000001000101
--------------------
0 0000000011110100

d.  SUB	   BL, 69H

01001110
- 01101001
------------
1 11100101

e.  RCR	   AX, 1

1  1000000001010111

OF = 1

```

7. Write instruction sequences to perform some common set operations, for 80X86.

Each set is a subset of [1..16] is represented by corresponding bits in the register (e.g., AX=1000000000000101 represents {1,3,16}).

Use the following table. Each entry contains a single bit. The index into the table selects which bit is set (e.g., the value at index zero has bit zero set).

```
BitTbl          dw   1, 2, 4, 8
dw   10h, 20h, 40h, 80h
dw   100h, 200h, 400h, 800h
dw   1000h, 2000h, 4000h, 8000h

a. Remove - removes the specified item from the set.

BX contains a value in the range 0..15.
AX contains a set.

MOV   BX, [BitTbl+BX]    ; Get 1 in position specified by element
XOR   AX, BX             ; remove from set

b. Odd - Even sets the zero flag if AX contains a set of numbers
that are all odd.

AND   AX, 5555H          ; Mask (get rid) of odd bits, zero flag
; is set if there are no evens
ADD   AX, 0FFFFH         ; create carry if AX was non-0
MOV   AX, 0FFFFH
ADC   AX, 0              ; Get 0 only if AX was non-0

c. Member - Member clears the zero flag if BX is an element
of the AX set, it sets the zero flag otherwise.

BX contains a value in the range 0..15.
AX contains a set.

MOV   BX, [BitTbl+BX]    ; Get 1 in position specified by element
AND   BX, AX             ; Mask to get only relevant bit (affects ZF)

d. UnionSets - Union computes AX := AX union BX.

AX and BX contain the sets.

OR   AX, BX

e. Intersection - Intersection computes AX := AX intersect BX.

AX and BX contain the sets.

AND  AX, BX

f. Complement - Complement computes AX := - AX, that is, all elements in
the set are removed, and all elements not in the set are added.

NOT AX

```

8. Repeat the above exercise for MIPS, where sets now have elements [1..32], values are provided in \$a0 and \$a1 instead of AX and BX, respectively, and returned in \$v0 (for the member function, set \$v0 to 1 if true, and to 0 if false). Write the new version of BitTbl necessary to answer this question.
```
The table is:
BitTbl:         .word   1, 2, 4, 8
.word   0x10, 0x20, 0x40, 0x80
.word   0x100, 0x200, 0x400, 0x800
.word   0x1000, 0x2000, 0x4000, 0x8000
.word   0x10000, 0x20000, 0x40000, 0x80000
.word   0x100000, 0x200000, 0x400000, 0x800000
.word   0x1000000, 0x2000000, 0x4000000, 0x8000000
.word   0x10000000, 0x20000000, 0x40000000, 0x80000000

a) Remove:

la      \$t0, BitTbl        ; Address of table to \$t0
sll     \$t1, \$a1, 2        ; Multiply \$a1 by 4
lw      \$t2, 0(\$t1)        ; Get word from table
and     \$v0, \$a0, \$a1

b) Odd:

and     \$t0, \$a0, 0x55555555
seq     \$v0, \$zero, \$t0       ; Set \$v0 to 1 if result was 0

c) Member:

la      \$t0, BitTbl        ; Address of table to \$t0
sll     \$t1, \$a1, 2        ; Multiply \$a1 by 4
lw      \$t2, 0(\$t1)        ; Get word from table
and     \$v0, \$t2, \$a0
sne     \$v0, \$zero, \$v0    ; Set \$v0 to 1 if \$v0 not 0.

d) Union:

or      \$v0, \$a0, \$a1

e) Intersection:

and      \$v0, \$a0, \$a1

f) Complement:

not      \$v0, \$a0

```

9. From the Patterson and Hennessy textbook: exercise 3.1 on page 197 (copy available at student office).

Initially, \$t0 (which is the output value) is initialized to 0, and \$t1 to 1. Now, the program loops as long as \$t1 is less than \$a0 (or n). Each time, the accumulator \$t0 is incremented by \$t1, which is then incremented by 2, in turn. Thus, the sequence of values of \$t1 is: 1, 3, 5, 7, 9 and that of \$t0 is 0, 1, 4, 9, 16 ...

To sum up, the code computes the square of n.

10. GraphicsPower is a machine that runs in 500MHz with three groups of instructions: Int, Float, and Graphics. The favorite flight simulation program spends 20% of its runtime in Int instructions, 30% in Float instructions, and 50% in Graphics instructions. Any imporovment on the Graphics instructions worsen the Float instructions by 25% of the same improvement. What is the maximum improvement that can by achieved by improving the Graphics instructions only?

Note that the clock rate is irrelevant. Now, the answer depends on the semantics of the statement: "Any imporovment on the Graphics instructions worsen the Float instructions by 25% of the same improvement." which is ambiguous. Assuming we read this: improving Graphics by DECREASING the runtime of each instruction by s factor of x, we INCREASE the runtime of each Float instruction by 0.25x of its timing. In this case, the average time per instruction is LINEAR in x, and the result must be at some hard limit - in this case when Graphics instructions take ZERO time. Consider a typical time unit in the original processor. With the new processor the timing will be: 0.2 + 0.5 (1-x) + 0.3 (1+0.25)x. With x=1 (Graphic instructions infinitely fast) we have, the time for the same run will be 0.2+0+0.375 = 0.575, or saving 42.5% of the execution time. (Note that some people will call this: "74% faster"..., because 1/0.575=1.74...)

11. Power-A is an 800 MHz machine with three groups of instructions G1, G2 , and G3. Instructions in G1 run in 3 clock cycles, in G2 run in 4 clock cycles, and in G3 run in 6 clock cycles. Programs Pa and Pb have the same number of instructions but different distributions among the groups G1, G2, and G3. Program Pa has 35% G1 instructions, 40% G2, and 25% G3. Program Pb has 40% G1, 30% G2, and 40% G3 instructions. Which program runs faster, Pa or Pb?

Again, clock rate is irrelevant. An average instruction in program Pa takes 3*0.35+4*0.4+6*0.25 = 4.15 cycles. An average instruction in program Pb takes 3*0.4+4*0.3+6*0.4 = 4.8 cycles. So, program Pa is faster.

12. We need to add the jal (jump and link) instruction to our basic MIPS architecture (figure 5.33). Modify the datapath as necessary, and add the microinstructions that are needed to do this.

We need to modify the hardware in 2 ways:

• Add a way to force the written register address to be that of \$ra. Unfortunately, we cannot get that from the instruction - all bits are taken by the jump address. So, we need to expand the DstReg multiplexer with a third input (call it "2") - which is hard-wired to the number of the \$ra register in binary.
• We need to have the PC+4 at the ALUout register output at the time after the decode. Unfortunately, it arrives there AT the decode cycle, and disappears later. One way to handle this is to have the Write data multiplexer to the register bank expanded with a third entry (also called "2") to have the PC as an input as well.

With the above changes, and assuming that there is already an entry in the DISPTCH 1 ROM for the jal instruction, we need to add just 1 micro instruction, for the third cycle of the instruction (i.e. fetch and decode are the same as all other istructions). This new microinstruction will have the fields:

```   Register control: RegDst = 2, MemtoReg = 2, RegWrite = 1, Sequencing = FETCH
```

Note that there are many other possible solutions as well.

13. Given the following MIPS assembly language code segment (assuming that the program starts at address 1000H),
```        addi     \$t0, \$zero, 2
loop: lw       \$t2, 50(\$t1)
slt      \$t3, \$t2, \$s3
sw       \$t3, 300(\$t1)
bne      \$t1, \$t0, loop
```
1. Generate the resulting machine code (manually).
```Address        Instruction   Mnemonic                       Assembler

0x00001000	0x20080002  addi \$8, \$0, 2                  ; 1: addi     \$t0, \$zero, 2
0x00001004	0x20090000  addi \$9, \$0, 0                  ; 2: addi     \$t1, \$zero, 0
0x00001008	0x8d2a0032  lw \$10, 50(\$9)                  ; 3: lw       \$t2, 50(\$t1)
0x0000100c	0x0153582a  slt \$11, \$10, \$19               ; 4: slt      \$t3, \$t2, \$s3
0x00001010	0xad2b012c  sw \$11, 300(\$9)                 ; 5: sw       \$t3, 300(\$t1)
0x00001014	0x21290001  addi \$9, \$9, 1                  ; 6: addi     \$t1, \$t1, 1
0x00001018	0x15283ffc  bne \$9, \$8, -16 [loop-0x1018]   ; 7: bne      \$t1, \$t0, loop

```
2. Using the basic architecture of figure 5.33, show a timing diagram of the execution of this program - how many clock cycles does it take to execute?

First, the individual instructions. addi should take the same as R-type, 4 cycles. Branch takes 3. sw takes 4, and lw takes 5. slt takes 4, as it is an R-type instruction. The loop is executed twice, as \$t1 starts at 0 and gets incremented each pass, until it equals \$t0, which is 2. Timing diagram is thus:

```
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28

F D A R F D A R F  D  A  M  R  F  D  A  R  F  D  A  M  F  D  A  R  F  D  J
```

And the entire string is repeated for 28 more cycles, requiring all in all 56 cycles to execute. Above, F stands for Fetch, D for Decode, A for ALU operation, M for memory access, and R for writing to register.

3. Using a pipelined architecture, show a timing diagram of the execution of the program - how many clock cycles are saved?

Sorry, this is hard to do in HTML, so come and ask...

14. Fibbonacci numbers are defined recursively as:
```f(0) = f(1) = 1;
f(n) = f(n-1)+f(n-2)
```

Write a recursive procedure for 80X86 that gets n in EAX and returns f(n) in EAX, without destroying values in any other register. You may assume that f(n) will fit in 32 bits. CLUE: if your program is more than 20 instructions, you have done something wrong...

```We will assume that the value in EAX is unsigned, and NOT TOO LARGE
(i.e. no overflow).

F:     PROC NEAR
CMP  EAX, 2
JC   DONE

PUSH EBX              ; General case. Save EBX (will be used as temp)
DEC  EAX
MOV  EBX, EAX          ; Copy n-1 to EBX as temp
CALL F                 ; Computes f(n-1)
XCHG EAX, EBX          ; Now EAX=n-1,  EBX=f(n-1)
DEC  EAX
CALL F                 ; Computes f(n-2)
POP  EBX               ; Restore EBX and return
RET  NEAR

DONE:  MOV  EAX, 1            ; Base case - EAX less than 2
RET  NEAR              ; I never count on defaults...
ENDP

```

15. Write a code segment that gets a pointer in BX and a count in CX, and then outputs CX bytes starting at address [BX], outputs them to port 10H, computes their 2-byte checksum and outputs that at the end as well. How would your program be different if the output were to memory-mapped IO? How many errors can the checksome detect, in the worst case (i.e. what is the distance of this code?)

Assuming that CX strictly greater than 0:

```OutDev  EQU  10H

MOV   DX, 0         ; DX will keep checksum
MOV   AH, 0         ; Use as high-order byte of AX operand later on...
L:     MOV   AL, [BX]
OUT   OutDev, AL
ADD   DX, AX        ; Update checksum
INC   BX
LOOP  L
MOV   AL, DL        ; Now output checksum, least-significant first
OUT   OutDev, AL
MOV   AL, DH
OUT   OutDev, AL
```
For memory mapped IO, if memory address of device register is OutDev, replace all OUT instructions by
```       MOV   BYTE PTR OutDev, AL
```

One bit error leads to a wrong checksum. Thus, distance is at least 2. But 2 "complementary" errors in different bytes in the same bit position do not change the checksum, and the distance is also AT MOST 2. Therefore, the distance of this code is exactly 2, i.e. it can be used to detect, but not fix, one error.

Due date: Thursday, January 31, 2002

Note: work and submission in this exercise is SOLO.