Introduction to Artificial Intelligence

Assignment 6


Written assignment: probabilistic inference and learning

Justify all answers!

1) We have the following distribution over the binary variables A, B, C, given as a full table:

   A     B     C     probability
  -------------------------------
   F     F     F       0.81
   F     F     T       0
   F     T     F       0
   F     T     T       0.09
   T     F     F       0
   T     F     T       0.09
   T     T     F       0.01
   T     T     T       0

a)   Construct the Bayes network that most compactly describes the distribution (i.e. one that has
     the smallest number of arcs). 


b) Is the answer unique?

We answer (a) and (b) together. First, we need to check dependencies.
We have (by summing up the numbers in the respective rows):

P(A) = 0.1,  P(~A) = 0.9
P(B) = 0.1,  P(~B) = 0.9
P(C) = 0.18, P(~C) = 0.82

P(AB) = 0.01
P(~AB) = 0.09
P(A~B) = 0.09
P(~A~B) = 0.81

Since P(AB) = P(A)P(B) = 0.01 (and likewise for the other three
combinations above), we know that I(A, B | {}) holds, and thus A and B
are not immediate neighbors in the BN.

Now, we also have:
P(AC) = 0.09 which is NOT equal to P(A)P(C) = 0.1*0.18 = 0.018
so there must be a path from A to C. Also:
P(BC) = 0.09 which is NOT equal to P(B)P(C) = 0.1*0.18 = 0.018
so there must be a path from B to C.
Thus the graph is connected, and since there is no arc between A and B,
the only possible shape is A-C-B;
it remains to be seen whether we can determine the directions of the
arcs. In fact, since A and B are independent given no evidence, it must
be the case that C (which is on the path from A to B) is a converging
node on the path A-C-B. Thus both arcs are directed towards C (A -> C <- B),
and this solution is unique.
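
The checks above can be reproduced mechanically. The following is a small
sketch in plain Python (the table layout and helper names are ours) that
recomputes the marginals and the (in)dependence tests directly from the
joint table:

from itertools import product

JOINT = {  # (A, B, C) -> probability
    (False, False, False): 0.81, (False, False, True): 0.00,
    (False, True,  False): 0.00, (False, True,  True): 0.09,
    (True,  False, False): 0.00, (True,  False, True): 0.09,
    (True,  True,  False): 0.01, (True,  True,  True): 0.00,
}

def prob(**fixed):
    """Marginal probability that the named variables take the given values."""
    idx = {'A': 0, 'B': 1, 'C': 2}
    return sum(p for assignment, p in JOINT.items()
               if all(assignment[idx[v]] == val for v, val in fixed.items()))

print(prob(A=True), prob(B=True), prob(C=True))   # approx 0.1, 0.1, 0.18
# A and B are marginally independent (P(A,B) factorizes for all 4 combinations):
print(all(abs(prob(A=a, B=b) - prob(A=a) * prob(B=b)) < 1e-12
          for a, b in product([True, False], repeat=2)))
# A and C are dependent (and, by symmetry, so are B and C):
print(prob(A=True, C=True), "!=", prob(A=True) * prob(C=True))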



2) Consider the following variables and their immediate causes (parents):

     variable         parents
-------------------------------
        A               none
        B               A
        C               B E
        D               A
        E               B D
        F               D

    a) Is this network a poly-tree?

       No: the underlying undirected graph contains the cycle A-B-E-D-A
       (arcs A->B, B->E, A->D, D->E), so the network is not a poly-tree.
    b) Determine the truth of the following independence
       statements (using d-separation):
       1)  I({B}, {D} | {})
      
       False, path B-A-D is not blocked (A is diverging non-evidence)
 
       2)  I({B}, {D} | {A})

       True: path B-A-D is blocked since A is diverging evidence
             path B-C-E-D is blocked since C is converging non-evidence
             path B-E-D is blocked since E is converging non-evidence

       3)  I({B}, {D} | {A, C})

       False: path B-C-E-D is 
                not blocked by C (converging evidence node)
                not blocked by E (pass-through non-evidence node)

       4)  I({B}, {F} | {A})

       True: path B-A-D-F is blocked by A (diverging evidence node)
             path B-E-D-F is blocked by E (converging non-evidence)
             path B-C-E-D-F is blocked by C (converging non-evidence)

       Observe that this last path is NOT blocked by E, 
          which is pass-through non-evidence! Blocking is thus path-dependent!

       5)  I({B}, {F} | {A, C})

       False: path B-E-D-F 
                not blocked by D (diverging non-evidence)
                not blocked by E (converging, evidence at child C)
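
       The five answers above can be double-checked mechanically. Below is a
       small self-contained sketch (plain Python, no external libraries; the
       function and variable names are ours) that tests d-separation by the
       standard "moralize the ancestral graph and check separation" method,
       using the parent lists of this question:

from itertools import combinations

PARENTS = {'A': [], 'B': ['A'], 'C': ['B', 'E'],
           'D': ['A'], 'E': ['B', 'D'], 'F': ['D']}

def ancestral_set(nodes):
    """All given nodes together with their ancestors."""
    result, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in result:
            result.add(n)
            stack.extend(PARENTS[n])
    return result

def d_separated(xs, ys, zs):
    """True iff xs is d-separated from ys given zs."""
    keep = ancestral_set(set(xs) | set(ys) | set(zs))
    # Moral graph of the ancestral subgraph: keep parent-child edges, "marry"
    # co-parents of a common child, and drop edge directions.
    adj = {n: set() for n in keep}
    for child in keep:
        ps = [p for p in PARENTS[child] if p in keep]
        for p in ps:
            adj[child].add(p); adj[p].add(child)
        for p, q in combinations(ps, 2):
            adj[p].add(q); adj[q].add(p)
    # Remove the evidence nodes and test whether ys is reachable from xs.
    blocked = set(zs)
    frontier, seen = [x for x in xs if x not in blocked], set()
    while frontier:
        n = frontier.pop()
        if n in ys:
            return False
        if n in seen:
            continue
        seen.add(n)
        frontier.extend(adj[n] - seen - blocked)
    return True

for xs, ys, zs in [({'B'}, {'D'}, set()), ({'B'}, {'D'}, {'A'}),
                   ({'B'}, {'D'}, {'A', 'C'}), ({'B'}, {'F'}, {'A'}),
                   ({'B'}, {'F'}, {'A', 'C'})]:
    print(sorted(xs), sorted(ys), sorted(zs), d_separated(xs, ys, zs))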

    c) Assume that the nodes are binary-valued.
       Suppose that P(A=true) = 0.4, P(D=true|A=true) = 0.5, P(D=true|A=false) = 0.1,
       P(E=true|D=true) = 0.5, P(E=true|D=false) = 0.2

       Find P(A=true | D=true, F=true)

       Hint: you may want to preprocess to get rid of nodes before doing any computation.


       We begin by removing non-evidence child-less node C - barren node.
       Now remove E (is now non-evidence child-less node - barren node).
       Now remove B (is now non-evidence child-less node - barren
       node).

       In the remaining graph, note that A is d-separated from F given
       D, that is: I(A,F|{D}), and thus we have P(A|DF) = P(A|D).

       To compute the latter we just apply Bayes rule:

       P(A=t|D=t) =
         =  P(D=t|A=t)P(A=t)/(P(D=t|A=t)P(A=t)+P(D=t|A=f)P(A=f))
         =  0.5*0.4 / (0.5*0.4 + 0.1*0.6) = 0.2 / 0.26 = 10/13     
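
       As a quick sanity check of the arithmetic (plain Python):

p_a, p_d_given_a, p_d_given_not_a = 0.4, 0.5, 0.1
posterior = (p_d_given_a * p_a) / (p_d_given_a * p_a + p_d_given_not_a * (1 - p_a))
print(posterior, 10 / 13)   # both are approx 0.7692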
        

3) You need to construct a 5-input neural network that has value 1 in the 
   output just when the number of its inputs that are on is either 0
   or 3. (assume inputs are in {0, 1}).
   a) Can this be done without hidden units?

   No. The answer should be 1 for the input vector 00000 and 0 for input
   vectors of the form 10000, where the single 1 can be in any position. A
   hyperplane which separates 00000 from the 10000 vectors cannot intersect
   any other edges of the 2^5 hypercube. But the point 10000 must also be
   separated from 11100 (the latter should have output 1), requiring
   at LEAST one more hyperplane. In fact, it is easy to see that we
   need THREE hyperplanes to separate all the points that need
   to be separated.

   b) Show a network using threshold elements that performs
      the required computation (specify the weights, too!)

   There are numerous possibilities. The simplest is to have hidden
   units acting as feature detectors such as: "1 or more", "3 or more"
   (we will call it "3+"), and "more than 3" (we will call it "3++");
   in the construction below only the last two are needed. Then we have
   an output unit that is 1 whenever:
      A) None of the circuit inputs is on, or
      B) The "3+" detector is on but the "3++" detector is not.

   So we have 3 units (ignoring "input units").

   The "3+" detector has as inputs all 5 circuit inputs, each with a
   weight of 1 and with a threshold of 2.5. Thus its output will be 1
   just when at least 3 of its inputs are 1.

   The "3++" detector also has as inputs all 5 circuit inputs, each with a
   weight of 1 and with a threshold of 3.5. Thus its output will be 1
   just when at least 4 of its inputs are 1.

   The output unit has as inputs all 5 circuit inputs, each with a
   weight of -1. Also the output of the "3+" detector, with a weight
   of 5, and the output of the "3++" detector with a weight of -10. Its
   threshold is -0.5.

   Thus, when no circuit input is 1, the weighted sum is 0 (above the
   -0.5 threshold) and the output is thus 1. Once one or two of the
   inputs are 1, but without turning on any of the detectors, the
   weighted sum is -1 or -2, and the output of the network is 0. When
   exactly 3 of the inputs are 1, the "3+" detector turns on (while the
   "3++" detector is still off), and the weighted sum is -3 + 5 = 2,
   above the threshold, so the network output is again 1. Finally, when
   more than 3 inputs are 1, the "3++" detector turns on, and its large
   negative weight overrides all the other inputs and again makes the
   network output zero, as required.
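
   As a sanity check, the following short sketch (plain Python; the unit and
   variable names are ours) implements exactly the weights and thresholds
   given above and verifies the output against the "0 or 3 inputs on" target
   over all 32 input vectors:

from itertools import product

def step(weighted_sum, threshold):
    """Threshold element: 1 iff the weighted sum reaches the threshold."""
    return 1 if weighted_sum >= threshold else 0

def network(x):                        # x is a tuple of five 0/1 inputs
    s = sum(x)
    h3_plus  = step(1.0 * s, 2.5)      # "3+"  : at least 3 inputs are on
    h3_pplus = step(1.0 * s, 3.5)      # "3++" : at least 4 inputs are on
    return step(-1.0 * s + 5.0 * h3_plus - 10.0 * h3_pplus, -0.5)

assert all(network(x) == (1 if sum(x) in (0, 3) else 0)
           for x in product((0, 1), repeat=5))
print("the network matches the target on all 32 inputs")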


4) a) Using the perceptron learning rule (the "delta" rule), show the steps
      of learning 2-input NAND function with a single perceptron
      and a step activation function with initial threshold 1 (note
      that the threshold can also be learned as a weight). Initially all
      other weights are minus 0.9, and the learning rate is 0.5.
      Note: in this case use the constant 1 instead of the derivative
      of the g() function.

      To make the initial threshold 1, we have a 3rd input connected
      constantly to a "1" signal, with a weight of -1 (we will call it W0).
      We also make
      the assumption that if the weighted sum is EXACTLY equal to
      0, the perceptron output is 0 (we need to make SOME assumption,
      as with a step function the value at 0 can be otherwise
      undefined).

      Starting with weights: W0 = -1, W1 = -0.9, W2 = -0.9,
        00:  weighted sum is -1, result 0, should be 1, error 1.
             Change by adding 0.5 to the "threshold" input to get
             W0 = -0.5.
        01: weighted sum is -1.4, result 0, should be 1, error 1.
             change by adding 0.5 to W0, W1, to get W0=0, W1=-0.4
        10: weighted sum is -0.9, result 0, should be 1, error 1.
             change by adding 0.5 to W0, W2, to get W0=0.5, W2=-0.4
        11: weighted sum is -0.3, result 0, correct.

        W0 = 0.5, W1 = -0.4, W2 = -0.4


        Now start epoch 2:

        00: weighted sum is 0.5, result 1, correct.
        01: weighted sum is 0.1, result 1, correct.
        10: weighted sum is 0.1, result 1, correct.
        11: weighted sum is -0.3, result 0, correct.

      And training is done.
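
      The run above can be reproduced with a minimal sketch of the training
      loop (plain Python; the function name is ours), using the same
      conventions: a constant bias input with weight W0, learning rate 0.5,
      and output 0 when the weighted sum is exactly 0:

def train(w, examples, lr=0.5, max_epochs=50):
    """Delta-rule training; stops after an epoch that changes no weight."""
    for epoch in range(1, max_epochs + 1):
        changed = False
        for x1, x2, target in examples:
            s = w[0] * 1 + w[1] * x1 + w[2] * x2
            output = 1 if s > 0 else 0            # a sum of exactly 0 gives 0
            error = target - output
            if error != 0:
                changed = True
                for i, xi in enumerate((1, x1, x2)):
                    w[i] += lr * error * xi       # delta rule update
        if not changed:
            return epoch, w
    return None, w

NAND = [(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
print(train([-1.0, -0.9, -0.9], NAND))   # no errors in epoch 2, as above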


   b) Show (by example) that the order of presenting the examples can
      change the number of training episodes.


      Starting with weights: W0 = -1, W1 = 0.9, W2 = 0.9,
      Suppose that the order of example presentation is 00, 01, 10,
      11. We get:

        00:  weighted sum is -1, result 0, should be 1, error 1.
             Change by adding 0.5 to the "threshold" input to get
             W0 = -0.5.
        01: weighted sum is 0.4, result 1, correct.
        10: weighted sum is 0.4, result 1, correct.
        11: weighted sum is 1.3, result 1, should be 0 error -1.
             Change by subtracting 0.5 from all weights, to get:
             W0 = -1, W1 = 0.4, W2 = 0.4.

        (Note how this last "fix" spoils the improvement of W0
        achieved in the first correction).
        This ends the first epoch. Now a 2nd epoch:

        00:  weighted sum is -1, result 0, should be 1, error 1.
             Change by adding 0.5 to the "threshold" input to get
             W0 = -0.5.
        01: weighted sum is -0.1, result 0, should be 1, error 1.
             change by adding 0.5 to W0, W1, to get W0=0, W1=0.9
        10: weighted sum is 0.4, result 1, correct.
        11: weighted sum is 1.3, result 1, should be 0 error -1.
             Change by subtracting 0.5 from all weights, to get:
             W0 = -0.5, W1 = 0.4, W2 = -0.1.

        Now start epoch 3:

        00:  weighted sum is -0.5, result 0, should be 1, error 1.
             Change by adding 0.5 to the "threshold" input to get
             W0 = 0.
        01: weighted sum is 0.4, result 1, correct.
        10: weighted sum is -0.1, result 0, should be 1, error 1.
             Change by adding 0.5 to W0, W2 to get: W0 = 0.5, W2=0.4 
        11: weighted sum is 1.3, result 1, should be 0, error -1.
             Change by subtracting 0.5 from all weights, to get:
             W0 = 0, W1 = -0.1, W2 = -0.1.


        Now start epoch 4:

        00:  weighted sum is 0, result 0, should be 1, error 1.
             Change by adding 0.5 to the "threshold" input to get
             W0 = 0.5.
        01: weighted sum is 0.4, result 1, correct.
        10: weighted sum is 0.4, result 1, correct.
        11: weighted sum is 0.3, result 1, should be 0 error -1.
             Change by subtracting 0.5 from all weights, to get:
             W0 = 0, W1 = -0.6, W2 = -0.6.

        Now start epoch 5:

        00:  weighted sum is 0, result 0, should be 1, error 1.
             Change by adding 0.5 to the "threshold" input to get
             W0 = 0.5.
        01: weighted sum is -0.1, result 0, should be 1 error 1.
             Change by adding 0.5 to W0, W1 to get: W0 = 1, W1 = -0.1.
        10: weighted sum is 0.4, result 1, correct.
        11: weighted sum is 0.3, result 1, should be 0 error -1.
             Change by subtracting 0.5 from all weights, to get:
             W0 = 0.5, W1 = -0.6, W2 = -1.1.

        Now start epoch 6:

        00:  weighted sum is 0.5, result 1, correct.
        01: weighted sum is -0.1, result 0, should be 1 error 1.
             Change by adding 0.5 to W0, W1 to get: W0 = 1, W1 = -0.1.
        10: weighted sum is -0.1, result 0, should be 1, error 1.
             Change by adding 0.5 to W0, W2 to get: W0 = 1.5, W2 = -0.6.
        11: weighted sum is 0.8, result 1, should be 0 error -1.
             Change by subtracting 0.5 from all weights, to get:
             W0 = 1, W1 = -0.6, W2 = -1.1.

        Now start epoch 7:

        00: weighted sum is 1, result 1, correct.
        01: weighted sum is 0.4, result 1, correct.
        10: weighted sum is -0.1, result 0, should be 1, error 1.
             Change by adding 0.5 to W0, W2 to get: W0 = 1.5, W2 = -0.6.
        11: weighted sum is 0.3, result 1, should be 0 error -1.
             Change by subtracting 0.5 from all weights, to get:
             W0 = 1, W1 = -1.1, W2 = -1.1.

        Now start epoch 8:

        00: weighted sum is 1, result 1, correct.
        01: weighted sum is -0.1, result 0, should be 1, error 1.
             Change by adding 0.5 to W0, W1 to get: W0 = 1.5, W1 = -0.6.
        10: weighted sum is 0.4, result 1, correct.
        11: weighted sum is -0.2, result 0, correct.

            W0 = 1.5, W1 = -0.6, W2 = -1.1.

        Now start epoch 9:

        00: weighted sum is 1.5, result 1, correct.
        01: weighted sum is 0.9, result 1, correct.
        10: weighted sum is 0.4, result 1, correct.
        11: weighted sum is -0.2, result 0, correct.

       Finally we are done... Note how many times fixes to one weight
       spoiled the fixes done previously to another weight!!


      Starting with weights: W0 = -1, W1 = 0.9, W2 = 0.9,
      suppose the order of the examples at epoch 1 
      were: 11, 01, 10, 00, to get:


        11: weighted sum is 0.8, result 1, should be 0, error -1.
            Change by adding -0.5 to all to get: W0=-1.5, W1=0.4, W2=0.4 
        01: weighted sum is -1.1, result 0, should be 1, error 1.
            Change by adding 0.5 to W0, W1, to get: W0 = -1, W1 = 0.9
        10: weighted sum is -0.5, result 0, should be 1, error 1.
            Change by adding 0.5 to W0, W2, to get: W0 = 0, W2 = 0.9
        00: weighted sum 0, result 0, should be 1, error 1.
            Change by adding 0.5 to W0 to get:

           W0 = 0.5, W1 = 0.9, W2 = 0.9

        Now, in epoch 2:

        11: weighted sum is 2.3, result 1, should be 0, error -1.
            Change by adding -0.5 to all to get: W0=0, W1=0.4, W2=0.4 
        01: weighted sum is 0.5, result 1, correct.          
        10: weighted sum is 0.5, result 1, correct.
        00: weighted sum 0, result 0, should be 1, error 1.
            Change by adding 0.5 to W0 to get:


           W0 = 0.5, W1 = 0.4, W2 = 0.4

        Now, in epoch 3, suppose order is: 11, 00, 01, 10

        11: weighted sum is 1.3, result 1, should be 0, error -1.
            Change by adding -0.5 to all to get: W0=0, W1=-0.1, W2=-0.1 
        00: weighted sum 0, result 0, should be 1, error 1.
            Change by adding 0.5 to W0 to get W0 = 0.5
        01: weighted sum is 0.4, result 1, correct.
        10: weighted sum 0.4, result 1, correct

           W0 = 0.5, W1 = -0.1, W2 = -0.1

        Now, in epoch 4, suppose order is: 11, 00, 01, 10

        11: weighted sum is 0.3, result 1, should be 0, error -1.
            Change by adding -0.5 to all to get: W0=0, W1=-0.6, W2=-0.6 
        00: weighted sum 0, result 0, should be 1, error 1.
            Change by adding 0.5 to W0 to get W0 = 0.5
        01: weighted sum is -0.1, result 0, should be 1, error 1.
            Change by adding 0.5 to W0, W1, to get: W0 = 1, W1 = -0.1
        10: weighted sum is 0.4, correct.

           W0 = 1, W1 = -0.1, W2 = -0.6

        Now, in epoch 5, suppose order is 11, 10, 01, 00

        11: weighted sum is 0.3, result 1, should be 0, error -1.
            Change by adding -0.5 to all to get: W0=0.5, W1=-0.6, W2=-1.1 
        10: weighted sum is -0.6, result 0, should be 1, error 1.
            Change by adding 0.5 to W0, W2, to get: W0 = 1, W2 = -0.6
        01: weighted sum is 0.4, correct.
        00: weighted sum 1, result 1, correct

           W0 = 1, W1 = -0.6, W2 = -0.6

        Now, in epoch 6 (in any order):

        11: weighted sum is -0.2, result 0, correct.
        01: weighted sum is 0.4, result 1, correct.
        10: weighted sum is 0.4, result 1, correct.
        00: weighted sum is 1, result 1, correct.


        And we are done. It took 3 fewer epochs in this case!



5)  Consider the following taxation scenario, occurring in a certain middle-eastern country.*
    We will assume that you are risk-neutral, i.e. that
    your utility function is linear in the amount of money that your assets are worth.
    Assume also that moral considerations are not part of your utility function.
 
    Your accountant has notified you that your taxes for FY 2006-2007 amount to 1,000,000 NIS.
    You have the following options:
       1) Declare the correct taxes verbatim, and pay the full amount.
       2) Claim that you have used much of the money to fund
          scientific experiments on monkeys, in which case you will only need to
          pay 200,000 NIS. However, a senior tax collection evaluator will then assess
          your forms, and some of them have militant friends in the ASPCM* which
          will trash your house, costing you an additional 1,500,000.

    There are 3 possible evaluators, equally likely to be on your case:

    Evaluator A has no friends in the ASPCM.

    Evaluator B has a militant friend in the ASPCM, but likes the
       colour green, and if he sees a sufficient amount of it (say
       $30,000 = 100,000 NIS), with  probability 0.5, will
       forget that monkeys have anything to do with animals (and
       otherwise will take the money and still tell his friend).

    Evaluator C will always report to his militant ASPCM friends.

     a) Which of the above taxation options should you prefer as a
     rational agent?

The decision diagram contains 2 random variables: Evaluator (=E) and
Default on bribe (=D).
It contains 2 decision variables: Pay full (=P) and
Pay bribe (=B).
The utility node is a child of all these variables.
The random variables can be seen as independent, and there
is an information arc from the E node to the B decision node.



If you pay the full amount, your (certain, as well as expected)
utility is U(pay full) = (-1000000)

If you try the funding experiments claim, we have an uncertain
outcome, and need to evaluate an expectation:

 E(U(claim experiments)) = 
      (-200000) 
          + 1/3 * 0                  ; case A
          + 1/3 * ((-100000)+(0.5*0+0.5*(-1500000))) ; case B
          + 1/3 (-1500000)  ; case C
        = (-200000) + 1/3*0 + 1/3*(-850000) + 1/3*(-1500000) = (-983333)

 Therefore we prefer to claim usage for the experiments.

 Note that the expectation above assumes that in case B we decide to
 show the colour green (i.e. pay the bribe); otherwise that case is
 replaced by (-1500000), which is worse.
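
 As a quick sanity check of the arithmetic (plain Python; amounts in NIS,
 utilities are simply negated costs since the agent is risk-neutral):

u_pay_full = -1_000_000
u_case_a = -200_000                                            # evaluator A
u_case_b = -200_000 - 100_000 + 0.5 * 0 + 0.5 * (-1_500_000)   # evaluator B, bribe offered
u_case_c = -200_000 - 1_500_000                                # evaluator C
u_claim = (u_case_a + u_case_b + u_case_c) / 3
print(u_claim, ">", u_pay_full)    # approx -983333 > -1000000, so we claim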
                              



     b) Suppose that a "friend" of the accountant working at the treasury department can give you a "hint"
        about which evaluator will be on your case, if you grease his palm.
        How much is this information worth, given that your "friend"
        is perfectly reliable in this respect?

 Decision diagram contains all the variables of the previous
 diagram, but has one additional decision node F (whether to
 obtain information from the "friend") and one additional
 random variable I (the information provided). The I node has
 as parents the E node and the F decision node.
 There is now an information arc from the E node to the P
 decision node. The utility node has all nodes as parents.

 Suppose you had the information already. Then your action would
 change only in cases B and C. That is because in case B you would
 lose 50000 on average by not paying the full amount, and in
 case C you would lose 700000 by not paying the full amount.
 These are the expected gains from having the information in these
 cases, and we must take the expectation over the (equally likely)
 possibilities, so:
   VPI(knowing evaluator) = 1/3*0 + 1/3 * 50000 + 1/3 * 700000 = 250000
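
 The same numbers, rearranged to check the VPI computation (plain Python):
 with perfect information we act optimally per evaluator, without it we take
 the best fixed action (claiming, at approx -983333):

eu_claim_given = {'A': -200_000, 'B': -1_050_000, 'C': -1_700_000}
eu_pay_full = -1_000_000
best_without_info = max(-983_333.33, eu_pay_full)
best_with_info = sum(max(eu_claim_given[e], eu_pay_full) for e in 'ABC') / 3
print(round(best_with_info - best_without_info))    # approx 250000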

     Draw the decision diagram for each case, and solve it.



6) Consider the following belief-state MDP, where scoring is total value of rewards obtained in 3 steps.
   The world has 4 locations in linear sequence, as follows:  F1 I F2 P
   Actions are: R attempts to move right, and L attempts to move left,
   and S senses location F2. Location F1 contains a  flag, which has a
   value of 1. Location F2 contains a flag, with value either 10 or
   100, with probability 0.5 each. 

   A robot visiting a location with a flag gets a reward equal to the
   flag value, and then the flag disappears. P is an absorbing state,
   with a reward of -100.
   Transition probabilities are based on: motion succeeds with probability 0.8, but with probability
   0.2 the robot either stays in the same location, or moves in the opposite direction (if possible),
   with probability 0.1 each (0.2 to stay in place if at end of sequence).
   Sensing, if performed, tells the robot the actual value of the flag
   at F2. 

   b1) Formalize the above problem as a (belief-state) MDP, i.e. what are the states, the transition probability
       function, and the reward function?

First, we define the underlying MDP. The state consists of the location
L (one of 1,2,3,4), and the states of the flags F1 (0 if collected, + if not)
and F2 (0 if collected, + if value 10 and not collected, ++
if value 100 and not collected). Additionally there is
time...

If we assume that transition into a location of a flag collects it
immediately, then our reward function must depend on the
entire transition (both pre- and post- action state, as well as
possibly the action).

Transition probabilities are decomposable - the location
does not depend on the flag states, so we start off with that part.

For move right actions:
P(L'=i+1 | L=i, A=R) = 0.8        ; for i from 1 to 3
P(L'=i | L=i, A=R) = 0.1        ; for i from 2 to 3
P(L'=1 | L=1, A=R) = 0.2
P(L'=i-1 | L=i, A=R) = 0.1        ; for i from 2 to 3

P(L'=4 | L=4, A) = 1              ; for any action A

For move left actions:
P(L'=i-1 | L=i, A=L) = 0.8        ; for i from 2 to 3
P(L'=i | L=i, A=L) = 0.1        ; for i from 2 to 3
P(L'=1 | L=1, A=L) = 0.9
P(L'=i+1 | L=i, A=L) = 0.1        ; for i from 1 to 3

For sense actions:
P(L'=i | L=i, A=S) = 1        ; for i from 1 to 4

Now the flag collections can be represented deterministically
if depending just on the pre- and post- action locations:

P(F1' = 0 | F1 = 0) = 1
P(F1' = 0 | L' = 1) = 1
P(F1' = + | F1 = +, L' > 1) = 1

P(F2' = 0 | F2 = 0) = 1
P(F2' = 0 | L' = 3) = 1
P(F2' = + | F2 = +, L' not = 3) = 1
P(F2' = ++ | F2 = ++, L' not = 3) = 1

Now, the complete MDP transition probability,

P(L',F1',F2'|L,F1,F2,A) = P(L'|L,A)*P(F1'|F1,L')*P(F2'|F2,L')


The reward function is zero, except for:
R({L,F1,F2}, A, {L'=4,F1',F2'}) = -100          ; for all A not = S, L, F1, F2, F1', F2' 
R({L,F1=+,F2}, A, {L',F1'=0,F2'}) = 1           ; for all A not = S, L, L', F2, F2' 
R({L,F1,F2=+}, A, {L',F1',F2'=0}) = 10          ; for all A not = S, L, L', F1, F1' 
R({L,F1,F2=++}, A, {L',F1',F2'=0}) = 100        ; for all A not = S, L, L', F1, F1' 
R({L,F1,F2}, A=S, {L',F1',F2'}) = sensing cost  ; for all L, L', F1, F1', F2, F2' 

Note that some of the above combinations are impossible, thus any
definition of the reward for these cases is OK. Also, we
did not define a sensing cost in the exercise, so we will assume
it is 0, although it does waste a move...

This completes the definition of the MDP. But to define the
belief-state MDP one must also account for the observation history.
There are only 3 cases which make a difference: no observation,
observed F2=+, and observed F2=++. We can model this by adding
an additional variable O, with values {U, +, ++, 0} (the value 0
means that F2 has already been collected, so observing it is
meaningless, as we know it is no longer there). Note that since the
observation is perfect, we could do as in the example in class,
without the extra variable, and just add the value U to the domain
of the F2 variable. Here, however, we will use the additional O
variable. The transitions for this variable are:

P(O'=+|O=U, A=S)  = 0.5              ; These are based on the probabilities
P(O'=++|O=U, A=S) = 0.5              ; for the value of F2
P(O'=U|O=U, A not =S, L' not =3) =1 
                      ; F2 value stays unknown unless sensed or picked
P(O'=++|O=++, L' not = 3) = 1 ; Known value of F2 does not change
P(O'=+|O=+, L' not = 3) = 1   ; 
P(O'=0|L' = 3) = 1            ; agent knows that F2 picked at L=3
P(O'=0|O=0) = 1               ; agent knows that a picked flag stays picked

Must also change the reward function for F2:
R({L,F1,O=++}, A, {L',F1',O=0}) = 100        ; for all A not = S, L, L', F1, F1' 
R({L,F1,O=+}, A, {L',F1',O=0}) = 10          ; for all A not = S, L, L', F1, F1' 
R({L,F1,O=U}, A, {L',F1',O=0}) = 55          ; for all A not = S, L, L', F1, F1' 
The latter is for picking up F2 before its value is known: we must average
the rewards weighted by the probabilities (0.5*10 + 0.5*100 = 55). Note that
now variable O in fact plays the role of F2 for the agent.


   b2) Find the optimal policy. You may assume that the robot starts
       at location I in order to save some computation.

First, we compute V1, the value function for having only one
transition until the end of the world. We will denote in the table
both the value and the action leading to that value. Impossible
states we will ignore by marking them X.


               0          |            +           (F1 values)
   -+---------------------+--------------------
    |   0 | +  | ++ |  U  |   0 | +  | ++ |  U     (O values)
L==============================================
 1  |  0  | 0  | 0  | 0   |     |    |    |  
    | any | any| any| any |  X  |  X | X  |  X
------------------------------------------------
 2  |   0 |  8 | 80 |  44 | 0.8 | 8.1|80.1|44.1  
    |  any|  R |  R |  R  |  L  |  R |  R | R
------------------------------------------------
 3  |  S  |    |    |     |  S  |    |    |  
    |  0  | X  | X  |  X  |  0  |  X | X  |  X
------------------------------------------------
 4              -100 (for any action)

Observe how at location 3 we prefer the sensing action,
PRECISELY because it does nothing (no chance to fall
into the pit). If we are not allowed to do S,
then we will need to do L, but this will have a value of -10...
(still better than R, which will be -80).
Note that for L=2 we need to average over the possible outcomes,
and the probability that we reach F2 is only 0.8. But on the
right-hand side of the table there is also some bonus from the
fact that we might end up picking F1 when we try to pick up F2.

Now we construct the value function and policy for V2:


               0          |            +           (F1 values)
   -+---------------------+--------------------
    |   0 | +  | ++ |  U  |   0 | +  | ++ |  U     (O values)
L==============================================
 1  |  0  | 6.4| 64 |35.2 |     |    |    |  
    | any | R  | R  | R   |  X  |  X | X  |  X
------------------------------------------------
 2  |   0 | 8.8| 88 | 48.4| 0.88|8.91|88.11|48.51  
    |  any|  R |  R |  R  |  L  |  R |  R | R
------------------------------------------------
 3  |  S  |    |    |     |  S  |    |    |  
    |  0  | X  | X  |  X  |  0  |  X | X  |  X
------------------------------------------------
 4              -200 (for any action)

Note how in location 3 you STILL do S, even if
reachable F1 is still available, because trying to
do L has too high a chance of dropping the agent in
the pit! Note also that the L=2 entries are now larger than the
corresponding V1 entries: if the first attempt to move onto F2
fails, there is still a second step in which to try again
(e.g. 8.8 = 0.8*10 + 0.1*8 for the F1=0, O=+ entry).


Finally, the value function and policy for V3. But
here we need only do this for the initial state, which is
L=2, F1=+, and O=U. In fact, we could have avoided
computing values for (e.g.) L=1, F1=0, O=0 even earlier,
since it is unreachable from the initial state in 2 moves...

In the initial state, action R results in:
  E(U(R)) =   0.8 * (55+0)      ; success in moving right
            + 0.1 * (0+48.51)   ; stay in same place
            + 0.1 * (1+35.2)    ; moved left instead
          = 44 + 4.851 + 3.62 = approx 52.5

(note that above, in parentheses, the left numbers are immediate
rewards, the right numbers are from the respective position
in the table for V2).

For action L:
  E(U(L)) =   0.8 * (1+35.2)    ; success in moving left
            + 0.1 * (0+48.51)   ; stay in same place
            + 0.1 * (55+0)      ; moved right instead
          = 28.96 + 4.851 + 5.5 = approx 39.3

For action S:
  E(U(S)) =   0.5 * 8.91 + 0.5 * 88.11 = 48.51

So the best initial action is to move R, for an expected utility of about 52.5.
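
As a cross-check of the tables and of the final step, here is a compact
finite-horizon value-iteration sketch (plain Python; the state encoding and
helper names are ours). It follows the conventions used above: states are
(L, F1, O) with L in 1..4, F1 in {'+','0'}, O in {'U','+','++','0'}, the pit
costs -100 per step, sensing is free, and an uncollected F2 of unknown value
is worth 55 in expectation:

from itertools import product

ACTIONS = ('R', 'L', 'S')

def location_moves(loc, action):
    """(next_location, probability) pairs for the location component."""
    if loc == 4:                                   # the pit is absorbing
        return [(4, 1.0)]
    if action == 'S':
        return [(loc, 1.0)]
    step = 1 if action == 'R' else -1
    intended, opposite = loc + step, loc - step
    moves = [(intended if 1 <= intended <= 4 else loc, 0.8),
             (loc, 0.1),
             (opposite if 1 <= opposite <= 4 else loc, 0.1)]
    merged = {}
    for l2, p in moves:                            # merge duplicate destinations
        merged[l2] = merged.get(l2, 0.0) + p
    return list(merged.items())

def successors(state, action):
    """(probability, reward, next_state) triples for a belief-state transition."""
    loc, f1, o = state
    result = []
    for loc2, p_loc in location_moves(loc, action):
        if action == 'S' and o == 'U':             # sensing reveals the value of F2
            branches = [('+', 0.5), ('++', 0.5)]
        else:
            branches = [(o, 1.0)]
        for o_mid, p_obs in branches:
            reward, f1_next, o_next = 0.0, f1, o_mid
            if loc2 == 1 and f1 == '+':
                reward += 1.0                      # collect flag F1
                f1_next = '0'
            if loc2 == 3 and o_mid != '0':
                reward += {'+': 10.0, '++': 100.0, 'U': 55.0}[o_mid]
                o_next = '0'                       # collect flag F2
            if loc2 == 4:
                reward += -100.0                   # fell into the pit
            result.append((p_loc * p_obs, reward, (loc2, f1_next, o_next)))
    return result

STATES = list(product((1, 2, 3, 4), ('+', '0'), ('U', '+', '++', '0')))

V = {s: 0.0 for s in STATES}                       # V0 = 0 everywhere
for _ in range(2):                                 # compute V1, then V2
    V = {s: max(sum(p * (r + V[nxt]) for p, r, nxt in successors(s, a))
                for a in ACTIONS)
         for s in STATES}

start = (2, '+', 'U')                              # at I, F1 uncollected, F2 unknown
q = {a: sum(p * (r + V[nxt]) for p, r, nxt in successors(start, a))
     for a in ACTIONS}
print({a: round(v, 2) for a, v in q.items()})      # R comes out best, approx 52.5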





Deadline: Wednesday, April 9, 2008, at 12 noon.

Note: submission (as well as WORK) on this exercise is SOLO (that is, NOT in pairs or other groups of size > 1)

* Disclaimer: any seeming identification with existing countries, institutions, or people, is unintentional.