Assignment 4 - Answers

  1. Consider the following 80X86 code segment. Assume that when the code is run, CX is non-zero.
         MOV    AX, 0
      L: ADD    AX, BX
         ADC    DX, 0
         LOOP   L
    1. What is the result of running the code?

      The code does an unsigned multiply DX:AX = BX*CX

    2. Re-write the above code segment for better efficiency.

      The following code does the same operation:

            MUL    CX
  2. Consider the following 80X86 code segment.
            MOV    AX, 0
         L: SHR    BX, 1
            ADC    AX, 0
    	ADD    BX, 0
    	JNZ    L

    We wish to write this code without using any form of LOOP, and with no conditional jumps (we can use other forms of JMP, however). Assume that CS=SS=DS=SS for simplicity, and write a code segment that provides the same results for AX (but may use or destroy other registers). You may NOT "unfold" the above loop! Note: the basic idea suffices here. Bonus for the full implementation.

    There are at least 3 ways to do it: (1) jump table (2) interrupt on overflow (3) self-modifying code. These work as follows:

  3. Write 80X86 instructions to zero out the AL register on an 80X86:
    1. List 5 different ways to zero the AL register with ONE instruction.
                  MOV       AL,0
                  XOR       AL, AL
                  SUB       AL, AL
                  AND       AL, 0
                  SHL       AL, 8
      	    MOV       AX, 0       ; As well as anything else that zeroes AX
    2. List 1 way to zero the AL register with 2 instructions, each different from all the instructions used in part 1.
                  OR        AL, 0FFH
                  INC       AL

  4. List the SHORTEST POSSIBLE CODE for adding 5 to Y and subtracting 2 from X which are defined as follows (consecutive locations):
        X    DB     1
        Y    DB     7
        ADD   Y, 5
        SUB   X, 2

    If we know that subtracting 2 from X causes no overflow, is there a shorter way to do that? How?

    Yes, we can do it with one instruction on a 16-bit WORD operand. The requirement for no overflow is to assure correctness, because we do NOT want a carry from Y into X! Code is an addition, as follows (can also be done with subtraction - a different immediate operand):

        ADD    WORD PTR X, 0FE05    ; Add 5 to low operand (Y), and
                                    ; add 0FE = -2 to high operand (X)

  5. Consider the following code for Motorola 68000 (comments state what each instruction does, according to the 68000 instruction manual (no, we did not teach you that machine, you should learn the relevant details on your own from the exercise)). Note that the 68000 has 32-bit registers D0-D7 and A0-A7, where A7 is also the STACK POINTER.
     F: MOVE.L   D0, -(A7)     ; D0 to memory - predecrement mode
        SUBQ.L   #1, D0        ; Subtract immediate - long (32 bit) operand
        BMI      N             ; Branch (jump) if result was negative
        JSR      F             ; CALL subroutine F (push PC then jump to F)
        MULS.L   (A7)+, D1:D2  ; Signed multiply memory (postincrement) 
                               ; with D2, result in D1:D2
        RTS                    ; Return from procedure/subroutine
     N: MOVE.L   (A7)+, D0     ; Move memory to D0 - postincrement mode
        MOVEQ.L  #0, D1        ; Move immediate to D1
        MOVEQ.L  #1, D2

    What happens if we execute an instruction JSR F, with D0=0? D0=1? Other values of D0 (call that value k)?

    The code pushes its argument, D0, onto the stack, and restores it before returning. Also, We get to N if D0 was 0 - and return a 64-bit quantity, 1 in this case. This is the base case of the recursive call. In all other cases, we multiply the argument D0 (popped from stack) by whatever value was returned by the same function, but with the argument, k=D0, decremented by 1. That is, the function computes: IF k=0, return 1. Else, return F(k-1)*k. This is exactly the definition of factorial, k!

  6. On Intel 80486, suppose that AX = 0000000010101111, BX = 0000000001001110, CX = 0000000010000101, and Carry Flag = 1. What is the result of the following operations (also describe the state of the carry and overflow flags after execution of each)
    a.  ADD    CL, AL
    Here and elsewhere, the extra result bit on the left denotes the carry.
    No overflow unless stated...
      + 10101111
      1 00110100
    b.  ADC    AL, 0E5H
      + 10101111
      + 11100101
      1 10010101
    c.  ADD    AX, 0045H
        + 0000000001000101
        0 0000000011110100
    d.  SUB	   BL, 69H
       - 01101001
       1 11100101
    e.  RCR	   AX, 1
       1  1000000001010111
       OF = 1

  7. Write instruction sequences to perform some common set operations, for 80X86.

    Each set is a subset of [1..16] is represented by corresponding bits in the register (e.g., AX=1000000000000101 represents {1,3,16}).

    Use the following table. Each entry contains a single bit. The index into the table selects which bit is set (e.g., the value at index zero has bit zero set).

    BitTbl          dw   1, 2, 4, 8
                    dw   10h, 20h, 40h, 80h
                    dw   100h, 200h, 400h, 800h
                    dw   1000h, 2000h, 4000h, 8000h
    a. Remove - removes the specified item from the set.
       BX contains a value in the range 0..15.
       AX contains a set.
          ADD   BX, BX
          MOV   BX, [BitTbl+BX]    ; Get 1 in position specified by element
          XOR   AX, BX             ; remove from set
    b. Odd - Even sets the zero flag if AX contains a set of numbers
       that are all odd.
          AND   AX, 5555H          ; Mask (get rid) of odd bits, zero flag
                                   ; is set if there are no evens
          ADD   AX, 0FFFFH         ; create carry if AX was non-0
          MOV   AX, 0FFFFH
          ADC   AX, 0              ; Get 0 only if AX was non-0
    c. Member - Member clears the zero flag if BX is an element
                of the AX set, it sets the zero flag otherwise.
       BX contains a value in the range 0..15.
       AX contains a set.
          ADD   BX, BX
          MOV   BX, [BitTbl+BX]    ; Get 1 in position specified by element
          AND   BX, AX             ; Mask to get only relevant bit (affects ZF)
    d. UnionSets - Union computes AX := AX union BX.
       AX and BX contain the sets.
         OR   AX, BX
    e. Intersection - Intersection computes AX := AX intersect BX.
       AX and BX contain the sets.
         AND  AX, BX
    f. Complement - Complement computes AX := - AX, that is, all elements in
       the set are removed, and all elements not in the set are added.
         NOT AX

  8. Repeat the above exercise for MIPS, where sets now have elements [1..32], values are provided in $a0 and $a1 instead of AX and BX, respectively, and returned in $v0 (for the member function, set $v0 to 1 if true, and to 0 if false). Write the new version of BitTbl necessary to answer this question.
    The table is:
    BitTbl:         .word   1, 2, 4, 8
                    .word   0x10, 0x20, 0x40, 0x80
                    .word   0x100, 0x200, 0x400, 0x800
                    .word   0x1000, 0x2000, 0x4000, 0x8000
                    .word   0x10000, 0x20000, 0x40000, 0x80000
                    .word   0x100000, 0x200000, 0x400000, 0x800000
                    .word   0x1000000, 0x2000000, 0x4000000, 0x8000000
                    .word   0x10000000, 0x20000000, 0x40000000, 0x80000000
    a) Remove:
            la      $t0, BitTbl        ; Address of table to $t0
            sll     $t1, $a1, 2        ; Multiply $a1 by 4
            add     $t1, $t1, $t0      ; Compute address in table
            lw      $t2, 0($t1)        ; Get word from table
            and     $v0, $a0, $a1
    b) Odd:
            and     $t0, $a0, 0x55555555
            seq     $v0, $zero, $t0       ; Set $v0 to 1 if result was 0
    c) Member:
            la      $t0, BitTbl        ; Address of table to $t0
            sll     $t1, $a1, 2        ; Multiply $a1 by 4
            add     $t1, $t1, $t0      ; Compute address in table
            lw      $t2, 0($t1)        ; Get word from table
            and     $v0, $t2, $a0
            sne     $v0, $zero, $v0    ; Set $v0 to 1 if $v0 not 0.
    d) Union:
            or      $v0, $a0, $a1
    e) Intersection:
            and      $v0, $a0, $a1
    f) Complement:
            not      $v0, $a0

  9. From the Patterson and Hennessy textbook: exercise 3.1 on page 197 (copy available at student office).

    Initially, $t0 (which is the output value) is initialized to 0, and $t1 to 1. Now, the program loops as long as $t1 is less than $a0 (or n). Each time, the accumulator $t0 is incremented by $t1, which is then incremented by 2, in turn. Thus, the sequence of values of $t1 is: 1, 3, 5, 7, 9 and that of $t0 is 0, 1, 4, 9, 16 ...

    To sum up, the code computes the square of n.

  10. GraphicsPower is a machine that runs in 500MHz with three groups of instructions: Int, Float, and Graphics. The favorite flight simulation program spends 20% of its runtime in Int instructions, 30% in Float instructions, and 50% in Graphics instructions. Any imporovment on the Graphics instructions worsen the Float instructions by 25% of the same improvement. What is the maximum improvement that can by achieved by improving the Graphics instructions only?

    Note that the clock rate is irrelevant. Now, the answer depends on the semantics of the statement: "Any imporovment on the Graphics instructions worsen the Float instructions by 25% of the same improvement." which is ambiguous. Assuming we read this: improving Graphics by DECREASING the runtime of each instruction by s factor of x, we INCREASE the runtime of each Float instruction by 0.25x of its timing. In this case, the average time per instruction is LINEAR in x, and the result must be at some hard limit - in this case when Graphics instructions take ZERO time. Consider a typical time unit in the original processor. With the new processor the timing will be: 0.2 + 0.5 (1-x) + 0.3 (1+0.25)x. With x=1 (Graphic instructions infinitely fast) we have, the time for the same run will be 0.2+0+0.375 = 0.575, or saving 42.5% of the execution time. (Note that some people will call this: "74% faster"..., because 1/0.575=1.74...)

  11. Power-A is an 800 MHz machine with three groups of instructions G1, G2 , and G3. Instructions in G1 run in 3 clock cycles, in G2 run in 4 clock cycles, and in G3 run in 6 clock cycles. Programs Pa and Pb have the same number of instructions but different distributions among the groups G1, G2, and G3. Program Pa has 35% G1 instructions, 40% G2, and 25% G3. Program Pb has 40% G1, 30% G2, and 40% G3 instructions. Which program runs faster, Pa or Pb?

    Again, clock rate is irrelevant. An average instruction in program Pa takes 3*0.35+4*0.4+6*0.25 = 4.15 cycles. An average instruction in program Pb takes 3*0.4+4*0.3+6*0.4 = 4.8 cycles. So, program Pa is faster.

  12. We need to add the jal (jump and link) instruction to our basic MIPS architecture (figure 5.33). Modify the datapath as necessary, and add the microinstructions that are needed to do this.

    We need to modify the hardware in 2 ways:

    With the above changes, and assuming that there is already an entry in the DISPTCH 1 ROM for the jal instruction, we need to add just 1 micro instruction, for the third cycle of the instruction (i.e. fetch and decode are the same as all other istructions). This new microinstruction will have the fields:

       Register control: RegDst = 2, MemtoReg = 2, RegWrite = 1, Sequencing = FETCH

    Note that there are many other possible solutions as well.

  13. Given the following MIPS assembly language code segment (assuming that the program starts at address 1000H),
            addi     $t0, $zero, 2
            addi     $t1, $zero, 0
      loop: lw       $t2, 50($t1)
            slt      $t3, $t2, $s3
            sw       $t3, 300($t1)
            addi     $t1, $t1, 1
            bne      $t1, $t0, loop
    1. Generate the resulting machine code (manually).
      Address        Instruction   Mnemonic                       Assembler
      0x00001000	0x20080002  addi $8, $0, 2                  ; 1: addi     $t0, $zero, 2
      0x00001004	0x20090000  addi $9, $0, 0                  ; 2: addi     $t1, $zero, 0
      0x00001008	0x8d2a0032  lw $10, 50($9)                  ; 3: lw       $t2, 50($t1)
      0x0000100c	0x0153582a  slt $11, $10, $19               ; 4: slt      $t3, $t2, $s3
      0x00001010	0xad2b012c  sw $11, 300($9)                 ; 5: sw       $t3, 300($t1)
      0x00001014	0x21290001  addi $9, $9, 1                  ; 6: addi     $t1, $t1, 1
      0x00001018	0x15283ffc  bne $9, $8, -16 [loop-0x1018]   ; 7: bne      $t1, $t0, loop
    2. Using the basic architecture of figure 5.33, show a timing diagram of the execution of this program - how many clock cycles does it take to execute?

      First, the individual instructions. addi should take the same as R-type, 4 cycles. Branch takes 3. sw takes 4, and lw takes 5. slt takes 4, as it is an R-type instruction. The loop is executed twice, as $t1 starts at 0 and gets incremented each pass, until it equals $t0, which is 2. Timing diagram is thus:

       1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
      | addi  | addi  |   lw        |    slt    |   sw      |   addi    |  bne   |
       F D A R F D A R F  D  A  M  R  F  D  A  R  F  D  A  M  F  D  A  R  F  D  J  

      And the entire string is repeated for 28 more cycles, requiring all in all 56 cycles to execute. Above, F stands for Fetch, D for Decode, A for ALU operation, M for memory access, and R for writing to register.

    3. Using a pipelined architecture, show a timing diagram of the execution of the program - how many clock cycles are saved?

      Sorry, this is hard to do in HTML, so come and ask...

  14. Fibbonacci numbers are defined recursively as:
    f(0) = f(1) = 1;
    f(n) = f(n-1)+f(n-2)

    Write a recursive procedure for 80X86 that gets n in EAX and returns f(n) in EAX, without destroying values in any other register. You may assume that f(n) will fit in 32 bits. CLUE: if your program is more than 20 instructions, you have done something wrong...

    We will assume that the value in EAX is unsigned, and NOT TOO LARGE
    (i.e. no overflow).
    F:     PROC NEAR
           CMP  EAX, 2
           JC   DONE
           PUSH EBX              ; General case. Save EBX (will be used as temp)
           DEC  EAX
           MOV  EBX, EAX          ; Copy n-1 to EBX as temp
           CALL F                 ; Computes f(n-1)
           XCHG EAX, EBX          ; Now EAX=n-1,  EBX=f(n-1)
           DEC  EAX
           CALL F                 ; Computes f(n-2)
           ADD  EAX, EBX          ; Adds them
           POP  EBX               ; Restore EBX and return
           RET  NEAR
    DONE:  MOV  EAX, 1            ; Base case - EAX less than 2
           RET  NEAR              ; I never count on defaults...

  15. Write a code segment that gets a pointer in BX and a count in CX, and then outputs CX bytes starting at address [BX], outputs them to port 10H, computes their 2-byte checksum and outputs that at the end as well. How would your program be different if the output were to memory-mapped IO? How many errors can the checksome detect, in the worst case (i.e. what is the distance of this code?)

    Assuming that CX strictly greater than 0:

    OutDev  EQU  10H
           MOV   DX, 0         ; DX will keep checksum
           MOV   AH, 0         ; Use as high-order byte of AX operand later on...
    L:     MOV   AL, [BX]
           OUT   OutDev, AL
           ADD   DX, AX        ; Update checksum
           INC   BX
           LOOP  L
           MOV   AL, DL        ; Now output checksum, least-significant first
           OUT   OutDev, AL
           MOV   AL, DH
           OUT   OutDev, AL
    For memory mapped IO, if memory address of device register is OutDev, replace all OUT instructions by
           MOV   BYTE PTR OutDev, AL

    One bit error leads to a wrong checksum. Thus, distance is at least 2. But 2 "complementary" errors in different bytes in the same bit position do not change the checksum, and the distance is also AT MOST 2. Therefore, the distance of this code is exactly 2, i.e. it can be used to detect, but not fix, one error.

Due date: Thursday, January 31, 2002

Note: work and submission in this exercise is SOLO.