Introduction to Artificial Intelligence

Assignment 6 - solutions


Written assignment: probabilistic inference and learning

Justify all answers!

1) We have the following distribution over the binary variables A, B, C, given as a full table:

   A     B     C     probability
  -------------------------------
   F     F     F       0.125
   F     F     T       0.125
   F     T     F       0.125
   F     T     T       0.125
   T     F     F       0.02
   T     F     T       0.08
   T     T     F       0.08
   T     T     T       0.32

a)   Construct the Bayes network that most compactly describes the distribution (i.e. one that has
     the smallest number of arcs). 

   The algorithm presented in class assumed an ordering. We will be a bit more general and first find
   all the dependencies. From the table we get (denoting A=true by a, A=false by ~a, etc.)

   P(a) = P(~a) = 0.5
   P(b) = P(c) = 0.65
   P(~b) = P(~c) = 0.35

   P(a,b) = 0.4
   P(~a,b) = 0.25
   P(a,~b) = 0.1
   P(~a,~b) = 0.25

   P(a,c) = 0.4
   P(~a,c) = 0.25
   P(a,~c) = 0.1
   P(~a,~c) = 0.25

   P(b,c) = 0.445
   P(~b,c) = 0.205
   P(b,~c) = 0.205
   P(~b,~c) = 0.145

Using the above, we can see that: 
  P(b,c |a) = P(a,b,c) / P(a) = 0.32 / 0.5 = 0.64
  P(b|a) = P(a,b)/P(a) = 0.4 / 0.5 = 0.8
  P(c|a) = P(a, c)/P(a) = 0.4 / 0.5 = 0.8
So we have P(b,c |a) = P(b|a)*P(c|a)
Likewise, we will get P(B,C | A) = P(B|A) * P(C|A)
for all possible assignments of A, B, C. Therefore, we have independence of
B and C given A, and in the BN we should have d-sep({B},{C} | {A}).

What about other d-sep statements? No other such statements are true. For example,
consider d-sep({A},{B} | {}). This does not hold
since P(a,b) = 0.4 which is NOT equal to P(a)*P(b)=0.5*0.65=0.325
The other possible independence statements (or d-separation statements):
d-sep({A},{C} |{})
d-sep({B},{C} |{})
d-sep({A},{C} | {B})
d-sep({A},{B} | {C})
also do NOT hold.

Now, consider the construction ordering A, B, C.
Obviously, in this ordering, A has no parents. 
Now consider B - since we do not have d-sep({A},{B} |{}) we need to make A a parent of B.
Now consider C - since we do not have d-sep({A},{C} |{}) we need to make A a parent of C.
However, since we DO have  d-sep({B},{C} | {A}) we do NOT need to make B a parent of C.

The resulting network is a tree with A as a root node, and each of B and C are its children,
a total of 2 arcs.

b) Is the answer unique?

The answer is not unique. If we had the ordering B, A, C we would have
B as a parent of A, and A as a parent of C (no need to make B a parent of C
because d-sep({B},{C} | {A}) holds). We get a chain with 2 arcs.

Likewise, an ordering C, A, B would result in a chain in the reverse direction.

Observe that using an ordering B, C, A would result in a graph with 3 arcs,
and this graph would not satisfy d-sep({B},{C} | {A}), (and also would not be
minimal).
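
To double-check the numbers used above, here is a small Python sketch (not part of the
original solution) that recomputes the marginals from the joint table and verifies the
conditional independence of B and C given A (and that A and B are NOT marginally independent):

  # Sketch: verify the marginals and the independence of B and C given A
  # directly from the joint table of question 1.
  from itertools import product

  joint = {  # P(A, B, C); 1 = true, 0 = false
      (0, 0, 0): 0.125, (0, 0, 1): 0.125, (0, 1, 0): 0.125, (0, 1, 1): 0.125,
      (1, 0, 0): 0.02,  (1, 0, 1): 0.08,  (1, 1, 0): 0.08,  (1, 1, 1): 0.32,
  }

  def p(**fixed):
      """Marginal probability of a partial assignment, e.g. p(A=1, B=1)."""
      idx = {'A': 0, 'B': 1, 'C': 2}
      return sum(pr for assign, pr in joint.items()
                 if all(assign[idx[v]] == val for v, val in fixed.items()))

  print(p(A=1), p(B=1), p(C=1))                     # 0.5, 0.65, 0.65
  for a, b, c in product([0, 1], repeat=3):
      lhs = p(A=a, B=b, C=c) / p(A=a)               # P(B=b, C=c | A=a)
      rhs = (p(A=a, B=b) / p(A=a)) * (p(A=a, C=c) / p(A=a))
      assert abs(lhs - rhs) < 1e-9                  # B and C are independent given A
  print(abs(p(A=1, B=1) - p(A=1) * p(B=1)))         # ~0.075, not 0: A and B are dependent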

2) Consider the following variables and their immediate causes (parents):

     variable         parents
-------------------------------
        A               none
        B               A
        C               B D E
        D               A
        E               D
        F               E

    a) Is this network a poly-tree?
        No: a poly-tree has at most one undirected path between any two nodes, but here
        A-B-C and A-D-C are two distinct undirected paths between A and C (likewise
        D-C and D-E-C between D and C), so this network is not a poly-tree.
    b) Determine the truth of the following independence
       statements (using d-separation):
       1)  I({A}, {E} | {})
            No, path A-D-E is not blocked (D is a non-evidence passthrough node)
       2)  I({A}, {E} | {C})
           No, for the same reason as above.
       3)  I({A}, {E} | {D})
            Yes. Path A-D-E is blocked (D is an evidence passthrough node)
                Paths A-B-C-E and A-B-C-D-E are blocked by C (converging non-evidence)
       4)  I({D}, {B} | {})
            No. Path B-A-D is not blocked (A is a non-evidence diverging node)
        5)  I({B}, {D} | {A})
            Yes. Path B-A-D is blocked by A (an evidence diverging node)
                Paths B-C-D and B-C-E-D are blocked by C (converging non-evidence)
       6)  I({B}, {D} | {A, C})
           No. Path B-C-D is not blocked (C is a converging evidence node)
       7)  I({A}, {F} | {E})
           Yes. Path A-B-C-E-F is blocked by E (a diverging evidence node)
                Paths A-B-C-D-E-F and A-D-E-F are blocked by E (a passthrough evidence node)
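
       These d-separation claims can also be checked mechanically. Below is a small Python
       sketch (my own addition, not from the course material) using the standard moralized
       ancestral graph criterion: X and Y are d-separated given Z iff they are disconnected
       after restricting the graph to X, Y, Z and their ancestors, "marrying" all co-parents,
       dropping edge directions, and deleting the nodes in Z.

          # Sketch: d-separation via the moralized ancestral graph.
          parents = {'A': [], 'B': ['A'], 'C': ['B', 'D', 'E'],
                     'D': ['A'], 'E': ['D'], 'F': ['E']}

          def ancestors(nodes):
              """The given nodes together with all of their ancestors."""
              result, stack = set(), list(nodes)
              while stack:
                  n = stack.pop()
                  if n not in result:
                      result.add(n)
                      stack.extend(parents[n])
              return result

          def d_separated(xs, ys, zs):
              keep = ancestors(set(xs) | set(ys) | set(zs))
              edges = set()
              for child in keep:
                  ps = [p for p in parents[child] if p in keep]
                  edges.update((p, child) for p in ps)                      # parent-child
                  edges.update((p, q) for p in ps for q in ps if p != q)    # marry co-parents
              # undirected reachability from xs, never entering evidence nodes zs
              seen = set(xs) - set(zs)
              frontier = set(seen)
              while frontier:
                  n = frontier.pop()
                  for u, v in edges:
                      nxt = v if u == n else u if v == n else None
                      if nxt is not None and nxt not in seen and nxt not in zs:
                          seen.add(nxt)
                          frontier.add(nxt)
              return not (seen & set(ys))

          queries = [({'A'}, {'E'}, set()),  ({'A'}, {'E'}, {'C'}), ({'A'}, {'E'}, {'D'}),
                     ({'D'}, {'B'}, set()),  ({'B'}, {'D'}, {'A'}), ({'B'}, {'D'}, {'A', 'C'}),
                     ({'A'}, {'F'}, {'E'})]
          for xs, ys, zs in queries:
              print(xs, ys, zs, d_separated(xs, ys, zs))
          # Expected: False, False, True, False, True, False, True (matching 1-7 above)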
    c) Assume that the nodes are binary-valued.
       Suppose that P(A=true) = 0.8, P(D=true|A=true) = 0.5, P(D=true|A=false) = 0.1,
       P(E=true|D=true) = 0.5, P(E=true|D=false) = 0.2

       Find P(A=true | E=true, F=true)

       Hint: you may want to preprocess to get rid of nodes before doing any computation.

  Answer: Node C has no children, is not evidence and not query, and is thus barren and
       removed. After that, node B is also seen as barren, and removed. Now, we have
       d-sep({A}, {F} | {E}) so we know:
       P(A | E, F) = P(A|E)
       So this is the same as if the query did not have F as evidence. If so, then node F
       would also be barren, and can be removed. We remain with the network: A-D-E,
       and need to compute (using "a" to denote "A=true", etc.)
         P(a | e) = P(a, e) / P(e)
                            = P(a, e) / (P(a, e) + P(~a, e))

       But:
         P(a, e) = P(a, d, e) + P(a, ~d, e)
                  = P(a)*P(d|a)*P(e|d) + P(a)*P(~d|a)*P(e|~d)
                 = 0.8*0.5*0.5 + 0.8*(1-0.5)*0.2 = 0.4*(0.5+0.2) = 0.28

         P(~a, e) = P(~a, d, e) + P(~a, ~d, e)
                   = P(~a)*P(d|~a)*P(e|d) + P(~a)*P(~d|~a)*P(e|~d)
                  = (1-0.8)*0.1*0.5 + (1-0.8)*(1-0.1)*0.2 = 0.2*(0.05+0.18) = 0.2*0.23 = 0.046

       Finally: 
           P(a | e)  = P(a, e) / (P(a, e) + P(~a, e)) = 0.28 / (0.28 + 0.046) = (approx) 0.86
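
        As a sanity check, here is a short Python sketch (not part of the original solution)
        that enumerates the reduced A-D-E chain with the CPT entries given in the question
        and recovers the same posterior:

          # Sketch: brute-force P(A=true | E=true) on the reduced chain A -> D -> E.
          P_A = 0.8                               # P(A=true)
          P_D = {True: 0.5, False: 0.1}           # P(D=true | A)
          P_E = {True: 0.5, False: 0.2}           # P(E=true | D)

          def joint(a, d, e):
              pa = P_A if a else 1 - P_A
              pd = P_D[a] if d else 1 - P_D[a]
              pe = P_E[d] if e else 1 - P_E[d]
              return pa * pd * pe

          p_a_e  = sum(joint(True, d, True) for d in (True, False))    # 0.28
          p_na_e = sum(joint(False, d, True) for d in (True, False))   # 0.046
          print(p_a_e / (p_a_e + p_na_e))                              # ~0.8589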
                   

3) You need to construct a 4-input neural network that has value 1 in the 
   output just when an even number of its inputs are 1, and 0 otherwise
   (assume inputs are in {0, 1}).
   a) Can this be done without hidden units?

      No. Consider the case where the first 2 inputs are 0 and 1, respectively,
      and consider the output as a function of the remaining inputs
      (this is a projection of the 4-dimensional problem onto a 2-dimensional space).
       The number of 1's in the input will be even exactly when 1 of the two remaining
       inputs is 1. Thus, the desired output is the XOR of the remaining inputs. We
       already KNOW that XOR is not linearly separable, so this cannot be done without
       hidden units.

   b) Show a network using threshold elements that performs
      the required computation (specify the weights, too!)

      Answer: there are many possibilities, using hidden units to count the
      number of 1's (actually, to specify whether it is over a threshold).
      We will use one unit (A) to indicate 0 inputs active, a second (B) to
      indicate at least 2 inputs active, a third (C) to indicate at most 2 inputs active, and
      a fourth (D) to indicate 4 inputs active. The output O should be 1 if either
      A or D are active, or if both units B and C are active (in the latter case this means
      exactly 2 inputs active). For simplicity, we will let all thresholds be 1, except for
      the thresholds of A and C, which are -1.

      The rest of the weights are as follows:
        For unit A, all weights are -1.5 (as a result, A will be active only if
          none of the inputs are active).
        For unit B, all weights are 0.6 (as a result, any 2 active inputs will activate B)
        For unit C, all weights are -0.4 (as a result, any 3 active inputs will DEactivate C)
        Finally, for unit D all weights are 0.3 (thus, D will activate only if all inputs are active)

      Now, in order to achieve the correct output, make the weights from the respective units
      to the output unit as follows:
       From A and D: 2
       From B and C: 0.6
      As a result O will be active if either A or D is active, or if both B and C are active.

      Obviously, there are other correct answers as well.
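
      As a sanity check, the following Python sketch (assuming the convention that a
      threshold unit outputs 1 when its weighted input sum reaches its threshold) verifies
      the network above on all 16 possible inputs:

        # Sketch: exhaustive check of the 4-input "even parity" threshold network.
        from itertools import product

        def unit(weights, threshold, inputs):
            return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

        for x in product([0, 1], repeat=4):
            A = unit([-1.5] * 4, -1, x)   # active iff no inputs are 1
            B = unit([0.6] * 4, 1, x)     # active iff at least 2 inputs are 1
            C = unit([-0.4] * 4, -1, x)   # active iff at most 2 inputs are 1
            D = unit([0.3] * 4, 1, x)     # active iff all 4 inputs are 1
            O = unit([2, 0.6, 0.6, 2], 1, [A, B, C, D])
            assert O == (1 if sum(x) % 2 == 0 else 0)
        print("output is 1 exactly when an even number of inputs are 1")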

4) a) Using the perceptron learning rule (the "delta" rule), show the steps
      of learning 3-input AND function with a single perceptron
      and a step activation function with threshold 1. Initially all
      weights are 0.9, and the learning rate is 0.5.

      Assume the following presentation order (and results):
         011: error signal is -1 (need 0, got 1)  so updating weights 2 and 3 by -0.5
              (weights now 0.9, 0.4, 0.4)
         110: error signal is now -1 (need 0, got 1), so updating weights 1 and 2 by -0.5
              (weights now 0.4, -0.1, 0.4)
         111: error signal is now 1 (need 1, got 0),  so updating all weights by 0.5
              (weights now 0.9, 0.4, 0.9)
         101: error signal is now -1 (need 0, got 1), so updating weights 1 and 3 by -0.5
              (weights now 0.4, 0.4, 0.4)

      We now do the rest of the examples, but no errors, and thus no changes, occur (the epoch ends).
      On the next epoch, ALL examples pass with no errors (including the above 4), so
      the algorithm terminates with the correct set of weights.

   b) Show (by example) that the order of presenting the examples can
      change the number of training episodes.

      Suppose we now start (weights (0.9, 0.9, 0.9)) with the examples: 000, 001, 010, 100, 111.
      No errors occur in any of these examples, so no changes. Now we continue (similarly to last time):
         011: error signal is -1 (need 0, got 1)  so updating weights 2 and 3 by -0.5
              (weights now 0.9, 0.4, 0.4)
         110: error signal is now -1 (need 0, got 1), so updating weights 1 and 2 by -0.5
              (weights now 0.4, -0.1, 0.4)
         101: no error, so no change.

       This ends the epoch, with an INCORRECT set of weights: (0.4, -0.1, 0.4)
        Suppose that in the next epoch we again present 000, 001, 010, 100 first (no errors,
        no changes), and then continue with:
         111: error signal is now 1 (need 1, got 0),  so updating all weights by 0.5
              (weights now 0.9, 0.4, 0.9)
         101: error signal is now -1 (need 0, got 1), so updating weights 1 and 3 by -0.5
              (weights now 0.4, 0.4, 0.4) 
         011: no error, no change 
         110: no error, no change 

     At the end of this epoch we have the weights (0.4, 0.4, 0.4) - the same weights that
     part (a) reached after its first epoch, and which classify all the examples correctly.
     So one more (error-free) epoch is needed before the algorithm terminates: this
     presentation order requires 3 epochs instead of the 2 epochs of part (a), showing that
     the order of presentation changes the number of training epochs.
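
      Both traces can be replayed mechanically. The sketch below (my own, assuming the same
      step unit with threshold 1 and learning rate 0.5) runs one epoch at a time with the
      presentation orders used above:

        # Sketch: replaying the perceptron ("delta" rule) traces of question 4.
        RATE, THRESHOLD = 0.5, 1.0

        def run_epoch(weights, examples):
            """One pass over the examples; returns (new weights, number of errors)."""
            w, errors = list(weights), 0
            for x in examples:
                target = 1 if all(x) else 0                       # 3-input AND
                out = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= THRESHOLD else 0
                err = target - out
                if err:
                    errors += 1
                    w = [wi + RATE * err * xi for wi, xi in zip(w, x)]
            return w, errors

        rest = [(0,0,0), (0,0,1), (0,1,0), (1,0,0)]

        # Part (a): 011, 110, 111, 101 first, then the remaining examples.
        w, e = run_epoch([0.9, 0.9, 0.9], [(0,1,1), (1,1,0), (1,1,1), (1,0,1)] + rest)
        print(w, e)   # weights ~(0.4, 0.4, 0.4), 4 errors
        print(run_epoch(w, rest + [(1,1,1), (0,1,1), (1,1,0), (1,0,1)]))  # 0 errors: done after 2 epochs

        # Part (b): the "unlucky" first epoch, then the second epoch traced above.
        w, e = run_epoch([0.9, 0.9, 0.9], rest + [(1,1,1), (0,1,1), (1,1,0), (1,0,1)])
        print(w, e)   # weights ~(0.4, -0.1, 0.4), 2 errors -- not yet consistent
        w, e = run_epoch(w, rest + [(1,1,1), (1,0,1), (0,1,1), (1,1,0)])
        print(w, e)   # weights ~(0.4, 0.4, 0.4), 2 errors -- a third epoch is still needed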


5)  A project leader in the DT Prosero project would like to construct a
    decision procedure based on results from hiring employees in the IMG4
    consortium. The decision procedure will be based on a decision tree, using information
    in the case database. Attributes for each employee are:
    degree level (A),  Java programming ability (B), self-motivation (C), grade in AI (D).
    The target attribute is the "decision", describing the amount of money to
    pay the employee.

  a) Construct by hand (using the greedy search using the best information
     gain heuristic) a consistent decision tree for the following table:

     input variables               decision
   A      B      C      D
--------------------------------------------
   1      F      H      95             8
   1      T      L      70             4
   1      F      L      70             4
   1      T      M      95             6
   1      T      H      80             6
   2      T      H      95             8
   2      T      H      80             8
   2      F      H      95             8

We consider a greedy assessment of each of the possible attributes.
Trying A, we have:
  A=1 gives us an (unnormalized) distribution (2, 2, 1) for the cases (4, 6, 8) resp.
  A=2 results in distribution (0,0,3), i.e. a full classification.
  Remaining information required to classify all examples is thus:
    Inf(2,2,1)+0

Trying B, we have:
  B=F gives distribution (1,0,2)
  B=T gives distribution (1,2,2)
  Remaining information required to classify all examples is thus:
    Inf(1,2,2)+Inf(1,0,2), which is clearly greater (thus worse) than for A, since
    Inf(1,2,2) = Inf(2,2,1) (Inf is invariant to permutations) and Inf(1,0,2) > 0.
    So B is clearly not the best root node... (a shortcut that saves us computing the
    value of B exactly).

Trying C, we have:
  C=L gives distribution (2, 0, 0), i.e. a full classification,
  C=M gives distribution (0, 1, 0), i.e. a full classification
  C=H gives distribution (0, 1, 4)
  Remaining information required to classify all examples is thus:
    0+0+Inf(0, 1, 4), which is BETTER than for attribute A
    (because the distribution (0, 1, 4) is STRICTLY more peaked than (2, 2, 1) - a
     permutation of (1, 2, 2) - with the same number of examples).
     (By "strictly more peaked" I mean that (0, 1, 4) can be obtained by moving the examples
      of some item INTO another item, with all the rest staying the same; such moves can only
      decrease Inf(). Also, permutations make no difference to the value of Inf().)

Trying D we have:
  D=70 gives distribution (2, 0, 0), a full classification,
  D=80 gives distribution (0, 1, 1)
  D=95 gives distribution (0, 1, 3)
  Remaining information required to classify all examples is thus:
   Inf(0,1,1) + Inf(0,1,3) = -(0+log(0.5)+log(0.5)) - (0+log(0.25)+3*log(0.75))
                           = 4-3*log(0.75)     (all logs in base 2)
                           = approx 5.245
  (It is not IMMEDIATELY obvious which is better between C and D, so we need to compute.)
  For attribute C we had:
   Inf(0,1,4) = -(0+log(0.2)+4*log(0.8)) = approx 2.32 + 1.29 = 3.61
  Thus, remaining information needed to classify all examples after getting the
  value of attribute C is smaller, so C is the best attribute and is selected as the root node.
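
  The information computations above are easy to automate. A small Python sketch (not part
  of the original solution), using the unnormalized Inf(n1,...,nk) = -sum_i ni*log2(ni/N)
  as above, reproduces the per-attribute remaining information and confirms that C is the
  best root:

    # Sketch: remaining information after splitting on each attribute (question 5a).
    from collections import Counter
    from math import log2

    rows = [  # (A, B, C, D, decision) from the table in the question
        (1, 'F', 'H', 95, 8), (1, 'T', 'L', 70, 4), (1, 'F', 'L', 70, 4),
        (1, 'T', 'M', 95, 6), (1, 'T', 'H', 80, 6), (2, 'T', 'H', 95, 8),
        (2, 'T', 'H', 80, 8), (2, 'F', 'H', 95, 8),
    ]

    def inf(counts):
        n = sum(counts)
        return -sum(c * log2(c / n) for c in counts if c > 0)

    def remaining_info(attr_index):
        """Sum of Inf over the branches produced by splitting on one attribute."""
        branches = {}
        for row in rows:
            branches.setdefault(row[attr_index], []).append(row[-1])
        return sum(inf(list(Counter(decisions).values()))
                   for decisions in branches.values())

    for name, idx in [('A', 0), ('B', 1), ('C', 2), ('D', 3)]:
        print(name, round(remaining_info(idx), 3))
    # A ~7.61, B ~10.36, C ~3.61, D ~5.245  -> C is the best root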

  In recursive calls to the algorithm, for branches L and M no further branching is needed,
  but for C=H additional attributes are needed, so we look at each of the remaining attributes:

Trying A we have (only for the examples where C=H):
  A=1 gives distribution (0,1,1)
  A=2 gives distribution (0,0,3), a full classification
  Remaining information needed is Inf(0,1,1)+0 = 2 bits.

Trying B we have (only for the examples where C=H):
  B=F gives distribution (0,0,2), a full classification
  B=T gives distribution (0,1,2)
  Remaining information needed is Inf(0,1,2) = -log(1/3)-2*log(2/3) = approx 1.58 + 1.17 = 2.75

Trying D we have (only for the examples where C=H):
  D=70 gives distribution (0,0,0), an empty branch (no C=H example has D=70),
  D=80 gives distribution (0,1,1)
  D=95 gives distribution (0,0,3), a full classification
  Remaining information needed is Inf(0,1,1)+0+0 = 2 bits.

So both A and D are better than B, and we can pick either according to the information
criterion. Let us pick A because it has fewer children... In the recursive calls,
the branch A=2 is a full classification, so we need more attributes only for the
case A=1 (and, of course, C=H as before). Trying the remaining attributes we get:

Trying B we have:
 B=F gives distribution (0,0,1), a full classification
 B=T gives distribution (0,1,0), a full classification

So we are done. But still let us try D as well:
 D=70 gives distribution (0,0,0), an empty branch (no examples)
 D=80 gives distribution (0,1,0), a full classification
 D=95 gives distribution (0,0,1), a full classification
so D seems just as good, but has more children, so we would still prefer B.

The resulting tree (with 3 internal nodes) is:

variable : value      variable : value      variable : value

   C  ---- L  --- 4
      \ -- M  --- 6
        -- H  ------------ A ---- 1 ----------- B ----- F --- 8
                             \                    ----- T --- 6
                               -- 2 ---- 8


  b) Is a more compact (fewer internal nodes) decision tree possible for the above table?
     If no - prove it. If yes, which tree?

     We can have a tree with fewer than 3 internal nodes only if we can find a tree
with 1 or 2 internal nodes that classifies the examples exactly. A single internal node is
impossible, since (as seen in step 1) every attribute leaves at least one branch with mixed
decisions. So we can simply check all possible trees with 2 internal nodes. Now, we have
already seen that no such tree is possible
with root C (otherwise it would have been found above). Looking at the other results in
step 1, we find:

With root B, we have 2 branches that are not fully classified thus requiring at least 2 more
internal nodes, so no-go.

With root D, again we have 2 branches that are not fully classified, so again no-go.

We remain with possible root A, where branch A=2 is a full classification, so need to
check branch A=1, which has distribution (2,2,1). Note that there is no
way to fully classify this distribution with a BINARY attribute, so we need to check
whether attribute C or D can result in a complete classification.

For C=H we get distribution (0,1,1) which is not a full classification so C is no good.

For D=95, we get distribution (0,1,1) which AGAIN is not a full classification, so
D is ALSO no good.

Therefore, there is no decision tree with fewer than 3 nodes consistent with this training set.
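
This case analysis can also be confirmed by brute force. The sketch below (my own addition)
checks, for every choice of root attribute, whether some consistent tree with at most 2
internal nodes exists, and finds that none does:

  # Sketch: no decision tree with at most 2 internal nodes fits the 8 examples exactly.
  rows = [  # (A, B, C, D, decision), same table as above
      (1, 'F', 'H', 95, 8), (1, 'T', 'L', 70, 4), (1, 'F', 'L', 70, 4),
      (1, 'T', 'M', 95, 6), (1, 'T', 'H', 80, 6), (2, 'T', 'H', 95, 8),
      (2, 'T', 'H', 80, 8), (2, 'F', 'H', 95, 8),
  ]
  ATTRS = range(4)  # indices of attributes A, B, C, D

  def split(examples, attr):
      branches = {}
      for row in examples:
          branches.setdefault(row[attr], []).append(row)
      return list(branches.values())

  def pure(examples):
      return len({row[-1] for row in examples}) <= 1

  def classifiable_with_two_nodes(root):
      impure = [b for b in split(rows, root) if not pure(b)]
      if len(impure) > 1:   # each impure branch would need its own internal node
          return False
      if not impure:        # the root alone already classifies everything
          return True
      branch = impure[0]    # try every attribute as the second internal node
      return any(all(pure(sub) for sub in split(branch, a))
                 for a in ATTRS if a != root)

  print(any(classifiable_with_two_nodes(root) for root in ATTRS))   # False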



6)  Consider the following taxation scenario, occurring in a certain middle-eastern country.
    We will assume that you are risk-neutral, i.e. that
    your utility function is linear in the amount of money that your assets are worth.
    Assume also that moral considerations are not part of your utility function.
 
    Your accountant has notified you that your taxes for FY 2005-2006 amount to 1,000,000 NIS.
    You have the following options:
       1) Declare the correct taxes verbatim, and pay the full amount.
       2) Attempt to use a tax shelter, in which case if you succeed you will only need to
          pay 100,000 NIS. However, a senior tax collection evaluator will then assess
          your forms, and can either accept it, or refuse, in which case you will also pay
          fines, and your total taxes will be 2,000,000 NIS.

    There are 4 possible evaluators, equally likely to be on your case:

    Evaluator A will accept your claim with probability 0.5

    Evaluator B will accept your claim with probability 0.2, but in the rest of 80% of the
       cases can be convinced that your claim is justifiable with probability 0.9, if
       100,000 NIS (from your money) can be redirected to better causes
       (such as helping out with a certain project of evaluator B's cousin).

    Evaluators C and D will never accept your claim.

     a) Which of the above taxation options should you prefer as a rational agent?

        Answer: You have 2 possible actions. 

             Action 1 costs 1,000,000 always, so E[U(be honest)] = -1,000,000

             Action 2 depends on the evaluator assigned, so we have:
             E[U(cheat)] =   0.25*E[U(A|cheat)] 
                           + 0.25*E[U(B|cheat)] 
                           + 0.25*E[U(C|cheat)] 
                           + 0.25*E[U(D|cheat)]
                         =   0.25 * [ 
                              [(0.5 * -100,000) + (0.5 * -2,000,000)]              // case A
                           +  [(0.2 * -100,000) +                                  // case B
                                      0.8 * (-100,000 + (0.1 * -2,000,000 + 0.9 * -100,000))]
                            + 2 * -2,000,000]                                       // cases C, D
                          =   0.25 * (-1,050,000 + -20,000 + 0.8 * (-100,000 + -290,000) + -4,000,000)
                          =   0.25 * -5,382,000  =  -1,345,500
                          < -1,000,000

             So cheating is worse! Note that there is a shortcut here, because we know that
             with probability 50% either C or D will be chosen, in which case we pay double,
             which already costs 1,000,000. Since cases A and B also incur some cost, we can conclude
             quickly that cheating does NOT work, WITHOUT doing the computations (but not so in general). 

     b) Suppose that a "friend" of the accountant working at the treasury department can give you a "hint"
        about which evaluator will be on your case, if you grease his palm.
        How much is this information worth, given that your "friend" is perfectly reliable in this respect?

      Answer: The "friend" provides perfect information on the evaluator, so we need to compute
              the VPI for this item of information. Now, since the "friend" is always right, the
              distribution over his answers will be equal to the distribution over evaluators. So
              we need to take the (weighted) average of the gains for each possible answer, which are
              as follows:
                If you know the evaluator to be A, and you cheat, you pay on the average
                (0.5 * -100,000) + (0.5 * -2,000,000) = -1,050,000
                 So here it does not pay to cheat, so if you know that A will be chosen you will
                STILL not cheat. So in this case the information is worthless to you.

                Note that if you know the evaluator to be C or D, obviously it does NOT pay to
                cheat, so in these cases the information is ALSO worthless.

                However, if the evaluator is B and you cheat, you pay only:
                 E[U(B|cheat)] =  (0.2 * -100,000) + 0.8 * (-100,000 + (0.1 * -2,000,000 + 0.9 * -100,000))
                              = -20,000 + 0.8 * -390,000 = -332,000
                So by cheating here you gain 668,000.

                Therefore, VPI(which evaluator) = 0.25*0 + 0.25*668,000 + 0.25*0 + 0.25*0 = 167,000

            So, you should be willing to pay up to 167,000 NIS for the "friend"'s information.
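
            The expected-utility and VPI arithmetic above can be summarized in a few lines of
            Python (a sketch, not part of the original solution):

              # Sketch: expected utilities and the value of perfect information (question 6).
              honest = -1_000_000

              def eu_cheat(evaluator):
                  """Expected utility of attempting the shelter, given the evaluator."""
                  if evaluator == 'A':
                      return 0.5 * -100_000 + 0.5 * -2_000_000     # -1,050,000
                  if evaluator == 'B':   # includes the 100,000 "redirected" to the cousin's project
                      return 0.2 * -100_000 + 0.8 * (-100_000 + 0.9 * -100_000 + 0.1 * -2_000_000)
                  return -2_000_000                                 # C and D always refuse

              evaluators = ['A', 'B', 'C', 'D']
              eu_cheat_all = sum(0.25 * eu_cheat(e) for e in evaluators)
              print(eu_cheat_all)                            # ~ -1,345,500 < -1,000,000: cheating is worse

              # With perfect information about the evaluator, cheat only when it helps:
              eu_informed = sum(0.25 * max(honest, eu_cheat(e)) for e in evaluators)
              print(eu_informed - max(honest, eu_cheat_all))  # VPI ~ 167,000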


7) Consider the following MDP, where scoring is total value of rewards obtained in 3 steps.
   The world has 4 locations in linear sequence, as follows:  F1 I F2 P
   Actions are: R attempts to move right, and L attempts to move left. Locations F1 and F2 contain
   flags, which have value of 1 and 10, respectively. A robot visiting a location with a flag
   gets a reward equal to the flag value, and then the flag disappears. P is an absorbing state,
   with a reward of -100.

   Transition probabilities are based on: motion succeeds with probability 0.8, but with probability
   0.2 the robot either stays in the same location, or moves in the opposite direction (if possible),
   with probability 0.1 each (0.2 to stay in place if at end of sequence).

   b1) Formalize the above problem as an MDP, i.e. what are the states, the transition probability
       function, and the reward function?

   Answer: to define an MDP, we need to define the state space, the action space,
      the reward function, and the transition distribution.

      The states (excluding time) are triples: (Location, Flag1, Flag2)
      where the first is a location (one of F1, I, F2, P), and the last 2
      are binary (flag exists true/false). Total number of states is thus 4*2*2=16.

      The action space contains two actions: L, R.

      Reward: we will assume reward is received based only on the current state,
      before the action is taken, and is also received in the "last" step, when no
      actions are executed. (Thus, 3 steps above means 3 actions and 3 transitions,
      but 4 rewards). The reward function is defined as follows (x and y below
      stand for "don't care", or "any value"):

        R(F1, 1, x) = 1
        R(F2, x, 1) = 10
        R(P, x, y)  = -100

      All other rewards are zero.

      Transition probabilities: stating this explicitly requires an array with
      16*2*16 probability entries (state, action, resulting state). However,
      the array is sparse and well structured, so we state just the non-zero
      transition probabilities. We will assume that the flags "disappear" only
      in the transition AFTER reaching them, and thus after receiving the reward.

      Pit is absorbing, you stay there no matter what action is taken, and flags
      never change either.
        P((P, x, y) | (P, x, y), a) = 1   
 
      Starting at location I with action L (flags do not change): 
        P((F1, x, y) | (I, x, y), L) = 0.8
        P((I, x, y) | (I, x, y), L) = 0.1
        P((F2, x, y) | (I, x, y), L) = 0.1

      Starting at location I with action R (flags do not change):
        P((F2, x, y) | (I, x, y), R) = 0.8
        P((I, x, y) | (I, x, y), R) = 0.1
        P((F1, x, y) | (I, x, y), R) = 0.1

      Starting at location F1 with action L (flag 1 disappears):
        P((F1, 0, y) | (F1, x, y), L) = 0.9
        P((I, 0, y) | (F1, x, y), L) = 0.1

      Starting at location F1 with action R (flag 1 still disappears):
        P((I, 0, y) | (F1, x, y), R) = 0.8
        P((F1, 0, y) | (F1, x, y), R) = 0.2

      Starting at location F2 with action L (flag 2 disappears):
        P((I, x, 0) | (F2, x, y), L) = 0.8
        P((F2, x, 0) | (F2, x, y), L) = 0.1
        P((P, x, 0) | (F2, x, y), L) = 0.1

      Starting at location F2 with action R (flag 2 still disappears):
        P((P, x, 0) | (F2, x, y), R) = 0.8
        P((F2, x, 0) | (F2, x, y), R) = 0.1
        P((I, x, 0) | (F2, x, y), R) = 0.1

   b2) Find the optimal policy.

       Finding the optimal policy with finite horizon is done using the value
        function. But since the horizon is finite, the value function depends
       on time, so we denote Vt(state). We will count t by the number of actions
       remaining, so V0(s) will thus be the value of being at state s when no
       actions remain to be executed, and thus clearly we just get the
       reward at s. That is, we have:

          V0(s) = R(s)

          Specifically, we have:
             V0(F1, 1, x) = 1
             V0(F2, x, 1) = 10
             V0(P, x, y) = -100
           with V0 = 0 for all other states.

       To compute Vi(s) we use the recursive equations:

          Vi(s) = R(s) + max (a over actions) sum (s' over states) P(s'|s,a) Vi-1(s')

        We can now compute V1(s); observe that we can ignore all zero terms in the
        summation. Also, we will take the liberty of NOT computing exactly the
       VERY BAD cases, and just use bounds, for conciseness. Note that during
       the computation, we also designate the best action for each state, this
       is the needed optimal policy! If no actions appear below, this means that all
       actions are equally good (or bad!) for the state.

          V1(P,x,y) = -200 (all actions equally bad!)

          V1(F1,1,x) = 1   (get flag 1 and cannot get any other reward in 1 step)
          V1(F1,0,x) = 0   (same as above, but no flag)

          V1(I,0,0) = 0    (no flags/rewards available, all actions indifferent)
          V1(I,0,1) = 8    (optimal action R, get flag 2 reward next time with
                              probability 0.8). 
          V1(I,1,0) = 0.8  (optimal action L, get flag 1 reward next time, with
                              probability 0.8)
          V1(I,1,1) = 8.1  (optimal action R, get flag 2 reward next time, with
                              probability 0.8, or flag 1 reward, prob. 0.1)

          V1(F2,x,0) = -10 (best action L, still land in pit with prob. 0.1)
          V1(F2,x,1) = 0   (best action L, reward 10 but  still land in pit 
                            with prob. 0.1, exactly cancelling the flag reward)

       Now we can use the recursive equation to compute V2(s).

          V2(P,x,y) = -300 (all actions equally bad!)

           V2(F1,1,0) = 1   (get flag 1 and cannot get any other reward in 2 steps)
          V2(F1,1,1) = 7.4 (get flag 1, best action R to get to (I,0,1) with prob 0.8)
          V2(F1,0,1) = 6.4 (as above, best action R, but no flag 1)
          V2(F1,0,0) = 0   (no best action)

          V2(I,0,0) = -1   (Optimal action L, but can land in (F2,0,0) with prob. 0.1)
          V2(I,0,1) = 0.8  (Optimal action L, can land in (I,0,1) with prob. 0.1) 
           V2(I,1,0) = -0.12 (optimal action L: 0.8*V1(F1,1,0) + 0.1*V1(I,1,0) + 0.1*V1(F2,1,0)
                               = 0.8 + 0.08 - 1 = -0.12; we get the flag 1 reward with
                               probability 0.8 but could hit (F2,1,0) with prob. 0.1)
          V2(I,1,1) = 1.61 (optimal action L, get flag 1 reward next time, with
                              probability 0.8, or to (I,1,1) prob. 0.1)

          V2(F2,0,0) = -21 (best action L, still land in pit with prob. 0.1,
                            and in (F2,0,0) with prob. 0.1)
          V2(F2,0,1) = -11 (best action L, reward 10 but  still land in pit 
                            with prob. 0.1, and in (F2,0,0) with prob. 0.1)
          V2(F2,1,0) = -20.36
                           (best action L, still land in pit with prob. 0.1,
                             and in (F2,1,0) with prob. 0.1, but can reach (I,1,0) with prob. 0.8)
          V2(F2,1,1) = -10.36
                           (best action L, reward 10 but still land in pit 
                            with prob. 0.1, (F2,1,0) with prob. 0.1,
                            reach (I,1,0) with probability 0.8)

       Now we can compute V3(s) using V2(s), in a similar manner. Observe that
       with two transitions remaining, the best action was always L, due to the
       possible "catastrophe" of the pit, except for location F1 where the
       pit is unreachable in 2 steps! With THREE transitions remaining, the optimal
       policy will be L from ANY location, independent of the flag states!
       (Well, except for location P where the agent is equally dead no matter
       what...)

       Note that the above is in fact simple expectimax with memoization!
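
        For completeness, here is a Python sketch (my own, not part of the original solution)
        that encodes the 16-state MDP formalized in b1 and runs the 3-step finite-horizon
        value iteration Vi(s) = R(s) + max_a sum_s' P(s'|s,a) * Vi-1(s'):

          # Sketch: finite-horizon value iteration for the flag/pit MDP of question 7.
          LOCS = ['F1', 'I', 'F2', 'P']
          STATES = [(loc, f1, f2) for loc in LOCS for f1 in (0, 1) for f2 in (0, 1)]

          def reward(state):
              loc, f1, f2 = state
              if loc == 'P':  return -100
              if loc == 'F1': return 1 if f1 else 0
              if loc == 'F2': return 10 if f2 else 0
              return 0

          def transitions(state, action):
              """Successor distribution {state': prob}; a flag is dropped on leaving F1/F2."""
              loc, f1, f2 = state
              if loc == 'P':                       # absorbing
                  return {state: 1.0}
              if loc == 'F1':
                  f1 = 0
                  return ({('F1', f1, f2): 0.9, ('I', f1, f2): 0.1} if action == 'L' else
                          {('I', f1, f2): 0.8, ('F1', f1, f2): 0.2})
              if loc == 'F2':
                  f2 = 0
                  return ({('I', f1, f2): 0.8, ('F2', f1, f2): 0.1, ('P', f1, f2): 0.1} if action == 'L' else
                          {('P', f1, f2): 0.8, ('F2', f1, f2): 0.1, ('I', f1, f2): 0.1})
              # location I: flags unchanged
              return ({('F1', f1, f2): 0.8, ('I', f1, f2): 0.1, ('F2', f1, f2): 0.1} if action == 'L' else
                      {('F2', f1, f2): 0.8, ('I', f1, f2): 0.1, ('F1', f1, f2): 0.1})

          V = {s: reward(s) for s in STATES}       # V0(s) = R(s)
          policy = {}
          for step in (1, 2, 3):                   # compute V1, V2, V3
              newV = {}
              for s in STATES:
                  q = {a: sum(p * V[t] for t, p in transitions(s, a).items()) for a in 'LR'}
                  best = max(q, key=q.get)         # ties resolve to L
                  policy[s], newV[s] = best, reward(s) + q[best]
              if step == 2:                        # matches the hand-computed V2 values above
                  assert abs(newV[('I', 1, 0)] - (-0.12)) < 1e-9
                  assert abs(newV[('I', 1, 1)] - 1.61) < 1e-9
              V = newV

          print(V[('I', 1, 1)], policy[('I', 1, 1)])    # ~5.045, 'L': 3 actions left, both flags present
          print(all(policy[s] == 'L' for s in STATES))  # True: with 3 actions left, L everywhere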