Introduction to Artificial Intelligence

Assignment 6 - Solution


Written assignment: probabilistic inference and learning

Justify all answers!

1) We have the following distribution over the binary variables A, B, C, given as a full table:

   A     B     C     probability
  -------------------------------
   F     F     F       0.105
   F     F     T       0.315
   F     T     F       0.07
   F     T     T       0.21
   T     F     F       0.135
   T     F     T       0.045
   T     T     F       0.09
   T     T     T       0.03

a)   Construct the Bayes network that most compactly describes the distribution (i.e. one that has
     the smallest number of arcs). 

 Answer: Assume that the order of construction is A, C, B. We will denote A=T by a, and A=F by ~a, and
    likewise for B and C.

   Node A requires no action, except to find P(a), which is the sum of 4 table entries where A is T.
   We thus have: P(a) = 0.135+0.045+0.09+0.03 = 0.3, and therefore P(~a) = 1-0.3 = 0.7

   Now, check node C w.r.t. all possible predecessors, i.e. w.r.t. A. Is C independent of A?
   Well, P(c) = 0.315 + 0.045 + 0.21 + 0.03 = 0.6
   But P(c | a) = P(c,a)/P(a) = (0.045+0.03)/0.3 = 0.075 / 0.3 = 0.25, NOT equal to P(c), thus C depends
   on A, and thus A is a parent of C.
   We also need to fill out the CPT for C:
   We have P(c | ~a) = P(c, ~a)/P(~a) = (0.315+0.21)/0.7 = 0.525 / 0.7 = 0.75
   The rest of the CPT follows because each conditional distribution sums to 1:
   P(~c | a) = 1-P(c | a) = 0.75
   P(~c | ~a) = 1-P(c | ~a) = 0.25

   Finally, we check B. We have P(b) = 0.07+0.21+0.09+0.03 = 0.4

   P(b | ~a, ~c) = P(~a,b,~c)/P(~a,~c) = 0.07/(0.07+0.105) = 0.4
   P(b | a, ~c) = P(a,b,~c)/P(a,~c) = 0.09/(0.09+0.135) = 0.4
   P(b | ~a, c) = P(~a,b,c)/P(~a,c) = 0.21/(0.21+0.315) = 0.4
   P(b | a, c) = P(a,b,c)/P(a,c) = 0.03/(0.03+0.045) = 0.4
  
   Therefore B does not depend on A or C, and B has no parents.
   The resulting network has only one arc, from A to C.
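   As a sanity check, the marginals and conditionals above can be recomputed directly from the joint
   table with a small Python sketch (names and encoding here are our own choice):

     # Joint distribution over (A, B, C), keyed by (A, B, C) truth values.
     joint = {
         (False, False, False): 0.105, (False, False, True): 0.315,
         (False, True,  False): 0.07,  (False, True,  True): 0.21,
         (True,  False, False): 0.135, (True,  False, True): 0.045,
         (True,  True,  False): 0.09,  (True,  True,  True): 0.03,
     }

     def marginal(**fixed):
         """Sum the joint over all entries consistent with the fixed variable values."""
         idx = {'A': 0, 'B': 1, 'C': 2}
         return sum(p for assign, p in joint.items()
                    if all(assign[idx[v]] == val for v, val in fixed.items()))

     print(marginal(A=True))                               # P(a)    = 0.3
     print(marginal(C=True))                               # P(c)    = 0.6
     print(marginal(A=True, C=True) / marginal(A=True))    # P(c|a)  = 0.25, so C depends on A
     print(marginal(A=False, C=True) / marginal(A=False))  # P(c|~a) = 0.75

     # B is independent of A and C: P(b | A, C) = P(b) = 0.4 in all four cases.
     for a in (True, False):
         for c in (True, False):
             assert abs(marginal(A=a, B=True, C=c) / marginal(A=a, C=c) - 0.4) < 1e-9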
    
b) Is the answer unique?

   The answer is not unique, but the undirected topology IS unique. Using the ordering C, A, B,
   we would get a network with one arc, from C to A. All other orderings result in one of these two
   networks.

2) Consider the following variables and their immediate causes (parents):

     variable         parents
-------------------------------
        A               none
        B               none
        C               A B
        D               A B
        E               C
        F               D

    a) Is this network a poly-tree?

      No, there is a cycle ACBDA in the underlying undirected graph.

    b) Is this network directed-path singly connected?

       Yes, because the only nodes with more than 1 parent are C and D, and their parents have
       no ancestors (hence only one way to get to C or D from each of their parents).

    c) Determine the truth of the following independence
       statements (using d-separation):
       1)  I({A}, {E} | {})

           Not d-separated, there is an unblocked path ACE.

       2)  I({A}, {E} | {C})

            d-separated: C is an evidence pass-through node on both paths from A to E (ACE and ADBCE),
            and thus both paths are blocked by C.

       3)  I({A}, {B} | {})

            d-separated: path ADB is blocked by D, a converging node (no evidence!), and path ACB is
            blocked by the converging node C.

       4)  I({A}, {B} | {F})

            Not d-separated. Path ADB becomes unblocked, because converging node D now has an evidence
            descendant, F.

       5)  I({A}, {E, F} | {C})

            Not d-separated: path ADF is unblocked (D is a pass-through node and not in the evidence).

       6)  I({A}, {F} | {C, D})

            d-separated: path ADF is blocked by D, an evidence pass-through node, and path ACBDF is
            likewise blocked at D (the evidence at the converging node C unblocks that path at C, but
            not at D).

    d) Suppose that P(A) = 0.8, P(E|C) = 0.2, P(E|~C) = 0.4, P(B) = 0.1, P(C|A,B) = 0.3, P(C|~A,B) = 0.9

       Find P(A|B, C, E)

      Answer: First, we remove the barren nodes F and then D. Now, we have that I({A}, {E} | {B, C})
      because C blocks the path ACE. Thus, P(A|B, C, E) = P(A|B,C). In this last query, E becomes
      a barren node and can be removed, so we need to consider just the three-node network composed of
      A, B, C.

     By definition, P(A|B,C) = P(A, B, C) / P(B,C) = P(A,B,C) / (P(A,B,C)+P(~A,B,C))
     Using the decomposition according to the BN, we have: 
       P(A,B,C)=P(C|A,B)P(A)P(B) = 0.3 * 0.8 * 0.1 = 0.024

     Likewise: P(~A,B,C)=P(C|~A,B)P(~A)P(B) = 0.9 * (1-0.8) * 0.1 = 0.018

     Thus, P(A|B, C, E) = P(A|B,C) = 0.024/(0.024+0.018) = 4/7
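      The same number can be checked by a tiny enumeration in Python. Only the CPT entries given above
      are needed: the factors for B and E (and the removed barren nodes D, F) are identical for A and ~A
      and cancel out. A sketch:

        P_A = 0.8
        P_c_given = {True: 0.3, False: 0.9}     # P(c | A, b) and P(c | ~A, b)

        def weight(a):
            """Unnormalized P(A=a, b, c, e), up to factors that do not depend on A."""
            return (P_A if a else 1 - P_A) * P_c_given[a]

        print(weight(True) / (weight(True) + weight(False)))   # 0.5714... = 4/7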

3) You need to construct a 4-input neural network that has value 1 in the 
   output just when an odd number of its inputs are 1, and 0 otherwise
   (assume inputs are in {0, 1}).
   a) Can this be done without hidden units?

      No - parity is not linearly separable. A formal proof by reduction: set two of the inputs to 0.
      The output is now XOR of the 2 remaining inputs, and we have already shown that XOR is not
      linearly separable.

   b) Show a network using threshold elements that performs
      the required computation (specify the weights, too!)

      Answer: this can be done with hidden units, similar to the way XOR is done with hidden units. We
      can easily do it with 4 hidden units, with unit Hi encoding the fact that "there were at least i
      input units with input value 1". Each unit Hi is connected to all the input units, with input weight 1.
      The thresholds of the hidden units are t(Hi) = i-0.5. Check that this achieves the required semantics;
      for example, t(H2) = 1.5, so the output of this unit is 1 just when at least 2 of the input units are 1.

      Finally, the hidden units are connected to an output unit that has a threshold of 1, and the weights
      Wi on the connections from Hi to the output unit are as follows:

      W1 = 1.5    (Thus if only one input is active, H1 is the only active hidden unit, and it
                   activates the output unit)
     W2 = -5     (Thus if 2 inputs are active, this overrides H1)
     W3 = 10     (Overriding both W1 and W2)
     W4 = -20    (Overriding W1, W2, W3)
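      A short Python sketch of this construction (same thresholds and weights, with a unit firing when
      its weighted input reaches its threshold) checks all 16 input vectors:

        from itertools import product

        def step(x, threshold):
            return 1 if x >= threshold else 0

        def parity_net(inputs):
            # Hidden unit Hi fires iff at least i of the inputs are 1 (threshold i - 0.5).
            h = [step(sum(inputs), i - 0.5) for i in range(1, 5)]
            # Output unit: threshold 1, weights W1..W4 as above.
            return step(sum(w * hi for w, hi in zip([1.5, -5, 10, -20], h)), 1)

        for inputs in product([0, 1], repeat=4):
            assert parity_net(inputs) == sum(inputs) % 2
        print("all 16 input vectors classified correctly")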
     
4) a) Using the perceptron learning rule (the "delta" rule), show the steps
      of learning 3-input function MAJORITY with a single perceptron
      and a step activation function with threshold 1. Initially all
      weights are 0.7, and the learning rate is 0.5.

      Answer: the question turns out to be too easy, because the network as given ALREADY computes
      the majority function. Thus, the error on each of the examples is ZERO and there is
      NO TRAINING at all. After one pass over the examples, the learning algorithm terminates
      with success.

   b) Show (by example) that the order of presenting the examples can
      change the number of training episodes.

       Answer: In this case the order of the examples is irrelevant. To make the question more
       interesting, we will give an example, but assume that the initial weights for the first
       and second inputs are 0.1 and the third is 0.7.

      In this case, if the first example is: 110, the weighted sum is 0.2 and the answer is 0, an error.
      The first 2 weights are increased by 0.5*1 = 0.5 and are now 0.6. The result is a network that is
      correct, no further errors, and after a second pass this is detected and the algorithm completes
      successfully.

       However, if the first example is 001, followed by 111, things go less smoothly. The first example
       results in a weighted sum of 0.7 and an output of 0, which is correct, so no update. The second
       example results in a weighted sum of 0.9 and an output of 0, which is incorrect, so ALL weights are
       now increased by 0.5, to (0.6, 0.6, 1.2). The remaining examples in this pass produce no error. But on
       the second pass, the example 001 results in a weighted sum of 1.2 and an output of 1, an error;
       the third weight is decreased by 0.5 back to 0.7, and this time the resulting network is correct
       (as a third, error-free, pass confirms).
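       The two orderings can be simulated with a small Python sketch (assuming the training set is all
       8 input vectors, the delta rule w_i <- w_i + 0.5*(target - output)*x_i, and output 1 when the
       weighted sum reaches the threshold 1); it reports 2 passes for the first ordering and 3 for the
       second:

         from itertools import product

         def train(order, weights, rate=0.5, threshold=1.0):
             """Run delta-rule passes until one full pass produces no errors."""
             weights = list(weights)
             for passes in range(1, 100):
                 errors = 0
                 for x in order:
                     target = 1 if sum(x) >= 2 else 0          # 3-input MAJORITY
                     output = 1 if sum(w * xi for w, xi in zip(weights, x)) >= threshold else 0
                     if output != target:
                         errors += 1
                         weights = [w + rate * (target - output) * xi
                                    for w, xi in zip(weights, x)]
                 if errors == 0:
                     return passes, weights

         examples = sorted(product([0, 1], repeat=3))
         order1 = [(1, 1, 0)] + [x for x in examples if x != (1, 1, 0)]
         order2 = [(0, 0, 1), (1, 1, 1)] + [x for x in examples if x not in ((0, 0, 1), (1, 1, 1))]

         print(train(order1, (0.1, 0.1, 0.7)))   # 2 passes
         print(train(order2, (0.1, 0.1, 0.7)))   # 3 passes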


5)  A project leader in the IMG4 consortium would like to construct a
    decision procedure based on results from hiring employees in the earlier KITE
    consortium. The decision procedure will be based on a decision tree, using information
    in the case database. Attributes for each employee are:
    degree level (A),  industrial work experience (B), self-motivation (C), grade in AI (D).
    The target attribute is the "decision", describing how good the employee is for the
    project.


  a) Construct by hand (using greedy search with the best information
     gain heuristic) a consistent decision tree for the following table:

     input variables               decision
   A      B      C      D
--------------------------------------------
   2      T      H      95             good
   2      T      H      80             good
   2      F      H      95             good
   1      F      H      95             good
   1      T      L      70             bad
   1      F      L      70             bad
   1      T      M      95             average
   1      T      H      80             average

Answer: check all attributes for information gain (equivalently, for the information still missing AFTER
      getting the answer to a query based on the attribute).

After a query based on A, if A is 2 we can say "good". 
If A is 1 we have probabilities (0.2, 0.4, 0.4) for (good, bad, average) respectively for 5 exemplars.
(This is 1.5 bits for these examples, or 0.95 bits averaged over 8 examples).

After a query based on C, if C is L we can say "bad".
If C is M we can say "average".
If C is H we have probabilities (0.8, 0.2) for (good, average) respectively, for 5 exemplars.
(This is approx. 0.72 bits for these examples, or 0.45 bits averaged over 8 examples.)

After a query based on D, if D is 70 we can say "bad".
If D is 80 we have probabilities (0.5, 0.5) for (good, average) respectively for 2 exemplars.
 (1 bit per example for these 2 exemplars)
If D is 95 we have probabilities (0.75, 0.25) for (good, average) respectively for 4 exemplars.
(This is approx 0.81 bits for these 4 examples)
Total number of bits needed after the query is 2 + 4*0.81 = 5.24, or 0.655 per example averaged over all 8 exemplars.

After a query based on B:
If B is F we have probabilities (2/3, 1/3) for (good, bad) respectively for 3 exemplars,
and if B is T, probabilities (0.4, 0.2, 0.4) for (good, bad, average) respectively for 5 exemplars.
Without bothering to compute: the information missing for the B=T branch alone is already as much as for
the result of A, so B provides less information than A.

Thus, the query after which least additional information is needed is C. Then, we need to build a
tree recursively for the case where C is H. We need to consider A, B, D.

Using A here, for A=2 we can say "good", but for A=1 we have 2 exemplars, distributed (0.5, 0.5) among
(good, average).

Using B here, for B=F we can say "good", but for B=T we have distribution (2/3, 1/3) among
(good, average), which is worse than for A.

Using D, for D=95 we can say "good", but for D=80 we have distribution (0.5, 0.5) among
(good, average), same information content as A.

Thus, for this branch we choose either A or D. Assume we choose D.
Finally, in the remaining branch (C=H, D=80), a query on A separates the final 2 examples.
We thus get a tree with 3 internal nodes:

C:  L  (say "bad")
    M  (say "average")
    H  (query on D:     70 (no such exemplars - say "good")
                        95 (say "good")
                        80 (query on A:     2  (say "good")
                                            1  (say "average")
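The "remaining information" numbers used above can be recomputed mechanically; a small Python sketch
(entropy of the class distribution in each branch, weighted by the branch size):

    from collections import Counter
    from math import log2

    # (A, B, C, D, decision) rows from the table above.
    rows = [
        (2, 'T', 'H', 95, 'good'),    (2, 'T', 'H', 80, 'good'),
        (2, 'F', 'H', 95, 'good'),    (1, 'F', 'H', 95, 'good'),
        (1, 'T', 'L', 70, 'bad'),     (1, 'F', 'L', 70, 'bad'),
        (1, 'T', 'M', 95, 'average'), (1, 'T', 'H', 80, 'average'),
    ]

    def entropy(labels):
        counts = Counter(labels)
        return -sum(c / len(labels) * log2(c / len(labels)) for c in counts.values())

    def remaining_info(attr, subset):
        """Bits still needed per example after splitting the subset on the attribute."""
        return sum(len(branch) / len(subset) * entropy([r[-1] for r in branch])
                   for v in {r[attr] for r in subset}
                   for branch in [[r for r in subset if r[attr] == v]])

    for name, i in [('A', 0), ('B', 1), ('C', 2), ('D', 3)]:
        print(name, round(remaining_info(i, rows), 3))
    # prints approx: A 0.951, B 1.296, C 0.451, D 0.656 -> C needs the least additional information.

    c_h = [r for r in rows if r[2] == 'H']                 # recurse into the C=H branch
    for name, i in [('A', 0), ('B', 1), ('D', 3)]:
        print(name, round(remaining_info(i, c_h), 3))
    # prints approx: A 0.4, B 0.551, D 0.4 -> A and D tie, as argued above.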

  b) Is a more compact (fewer internal nodes) decision tree possible for the above table?
     If no - prove it. If yes, which tree?

     A single-node tree is impossible, since no single attribute classifies all 8 examples correctly.
     Trying all trees with 1 node at the top, we could only get a 2-node tree by starting with A or C
     (any other root leaves more than one impure branch). Starting with C, we have seen above that one
     additional node is insufficient. Likewise, starting with A, in the branch A=1 there is no single
     query that allows perfect classification. Thus, the tree above is optimal in the number of internal nodes.



6)  Consider the following investment scenario, occurring in a certain middle-eastern country.
    We will assume that you are risk-neutral, i.e. that
    your utility function is linear in the amount of money that your assets are worth.
 
    You have 1,000,000 NIS you wish to invest, with the following options: 
       1) Buy a house and rent it out for net gain of 40,000 NIS per year (after taxes, etc.)
       2) Buy government bonds for a 5% yearly interest, out of which you need to pay 15% capital
          gains tax. However, there is talk that the capitals gains tax will be raised to 30% very
          soon, and suppose that you credit this rumor with a 0.5 probability of being accurate
          (and that if false, the capital gains tax will remain unchanged).

     a) Which of the above investments should you make as a rational agent?


        Action 1 results in a net gain of 40000 NIS, so assuming utility is linear in money
        (risk neutrality) we will just measure utility in NIS.

        For action 2, there are two outcomes:

         1) With probability 0.5, you gain 50000 NIS but pay taxes of
            15% of that (that is, 7500 NIS), for a net gain of 42500 NIS.

         2) With probability 0.5, you gain 50000 NIS but pay taxes of
            30% of that (that is, 15000 NIS), for a net gain of 35000 NIS.

        Your expected utility is thus: 0.5 * 42500 + 0.5 * 35000 = 38750
        which is STRICTLY LESS than buying the house, so you should buy the house...  
 

     b) Suppose that a "friend" working at the treasury department can give you a "hint"
        on whether the capital gains tax will actually be raised. Given that your intended investment
        is for exactly one year, how much (maximum) should you as an intelligent agent be willing to pay this
        employee, given that 1) he is perfectly reliable, and 2) moral considerations are not
        part of your utility function.

         Suppose you got the hint for free: how much would you gain on average?
         Since you believe that the probability of increased taxation is 0.5, and that your "friend"
         is perfectly accurate, you believe that there is a probability of 0.5 that the hint will
         be "taxes will increase", and probability 0.5 that the hint will be "taxes will not increase".

         After receiving the hint, if the hint is "taxes will increase" then your action afterwards would
         definitely be to buy the house and gain 40000 rather than the 35000 for the bonds, after taxes.
        In this case you will NOT gain from the hint, because your action would have been to buy the house
        even WITHOUT the hint.

        If the hint is "taxes will not increase", then you should go for the bonds, for a gain of 42500,
        a decision you would make only after receiving the hint, gaining you 2500 more than the action
        you would have made otherwise.

         Thus, the expected value of the information is 0.5 * 0 + 0.5 * 2500 = 1250 NIS, and you should be
         willing to pay anything below that amount for the hint, ignoring other considerations...
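         Both computations (the expected utilities and the value of the perfect hint) can be checked
         with a few lines of Python:

           # Net yearly gains (in NIS) for each action under each tax outcome.
           house = 40_000
           bonds = {'tax_15': 50_000 * (1 - 0.15),    # 42,500
                    'tax_30': 50_000 * (1 - 0.30)}    # 35,000
           p_raise = 0.5

           # a) Expected utilities without the hint.
           eu_bonds = p_raise * bonds['tax_30'] + (1 - p_raise) * bonds['tax_15']   # 38,750
           best_without_hint = max(house, eu_bonds)                                 # 40,000 (the house)

           # b) With a perfect hint: pick the better action in each branch.
           eu_with_hint = (p_raise * max(house, bonds['tax_30'])
                           + (1 - p_raise) * max(house, bonds['tax_15']))           # 41,250
           print(eu_with_hint - best_without_hint)                                  # 1,250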


Deadline: Tuesday, June 14, 2005, at 10AM.

Note: submission (as well as WORK) on this exercise is SOLO (that is, NOT in pairs or other groups of size > 1)

Below are additional exercises on the required material. These exercises are optional, but recommended.

a) Consider a world with the following operators:

   Op(Action: pickup(x)
      Precond: Location(Robot, y) and Location(x, y) and not Holding(x)      /* Exists y such that... */
      Effect:  Holding(x))

   Op(Action: go(from, to)
      Precond: Location(Robot, from)
      Effect:  Location(Robot, to))

The semantics of the Location predicate is such that an object can only be at one location.

a1) The initial state is: Location(Robot, Room1) and not Holding(Box1) and Location(Box1, Room2)

 The final state should have: Holding(Box1)

 Show the steps for solving this planning problem using partial-order planning (POP).

    Initially, POP sets up the following: a dummy operator for the initial state,
    and a dummy operator for the goal state, as follows:

       Initial dummy operator (has only an effect):
         Effect: Location(Robot, Room1) and not Holding(Box1) and Location(Box1, Room2)

       Goal dummy operator (only preconditions):
         Precond: Holding(Box1)

    Now, POP finds an unsatisfied precondition, Holding(Box1), and an operator that
    achieves it, pickup; it adds the operator, instantiated with x=Box1:

       pickup(Box1)
         Precond: Location(Robot, y) and Location(Box1, y) and not Holding(Box1)
         Effect: Holding(Box1)

    POP links the effect of pickup(Box1) to the Holding(Box1) precondition of the goal dummy operator.
    Also, the precondition not Holding(Box1) is an effect of the initial state,
    and Location(Box1, Room2) is an effect of the initial state (thus y=Room2).

    This leaves one unsatisfied precondition: Location(Robot, Room2), and POP
    finds an operator go(from, Room2) to achieve it, and adds the operator:
      go(from, Room2)
        Precond: Location(Robot, from)
        Effect:  Location(Robot, Room2)
    
    POP binds from=Room1, and then the precondition Location(Robot, from) is an
    effect of the initial state, so the plan is complete.

    Note that in this solution, and in the next, we ignored threats, i.e. operators
    that could violate established preconditions (none arose in this case).
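    For concreteness, the finished partial-order plan can be written down as data: steps, causal
    links (producer --condition--> consumer), and the induced ordering. The Python sketch below uses
    an illustrative representation (it is not a POP implementation) and just checks that every
    precondition is supported by an effect of its producer:

      from collections import namedtuple

      Step = namedtuple('Step', ['name', 'preconds', 'effects'])

      start  = Step('START', [], ['Location(Robot, Room1)', 'not Holding(Box1)',
                                  'Location(Box1, Room2)'])
      go     = Step('go(Room1, Room2)', ['Location(Robot, Room1)'],
                    ['Location(Robot, Room2)'])
      pickup = Step('pickup(Box1)',
                    ['Location(Robot, Room2)', 'Location(Box1, Room2)', 'not Holding(Box1)'],
                    ['Holding(Box1)'])
      finish = Step('FINISH', ['Holding(Box1)'], [])

      # Causal links: each precondition of each step is supported by an earlier effect.
      links = [
          (start,  'Location(Robot, Room1)', go),
          (start,  'Location(Box1, Room2)',  pickup),
          (start,  'not Holding(Box1)',      pickup),
          (go,     'Location(Robot, Room2)', pickup),
          (pickup, 'Holding(Box1)',          finish),
      ]
      for producer, cond, consumer in links:
          assert cond in producer.effects and cond in consumer.preconds
      print('a1 plan:', ' < '.join(s.name for s in (start, go, pickup, finish)))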

a2) Add an operator that supports the robot carrying the box, and show how from the initial state
    how a plan is developed for achieving:  Location(Box1, Room1) using POP.

    We will add the operator as the need arises, in order to see where it is missing!
    As before, POP sets up the dummy operator for the initial state,
    and a dummy operator for the goal state, as follows:

       Initial dummy operator (has only an effect):
         Effect: Location(Robot, Room1) and not Holding(Box1) and Location(Box1, Room2)

       Goal dummy operator (only preconditions):
         Precond: Location(Box1, Room1)

    Now, POP finds an unsatisfied precondition, Location(Box1, Room1), but there is
    no operator that has such an effect! Actually, the robot moving to another room while
    holding the box would work, but this is not known to the program (this is the
    ramification problem). So, assume that there is a specific operator for
    carrying the box, e.g.:

    Op(Action: Carry(x, from, to)
       Precond: Location(Robot, from) and Holding(x)
       Effect:  Location(Robot, to) and Location(x, to))

    With the binding to=Room1 and x=Box1, this operator can satisfy the precondition,
    so POP adds the instantiated operator:

      Carry(Box1, from, Room1)
       Precond: Location(Robot, from) and Holding(Box1)
        Effect:  Location(Robot, Room1) and Location(Box1, Room1)

    The rest proceeds almost as before.
    POP finds an unsatisfied precondition, Holding(Box1), and an operator that
    achieves it, pickup; it adds the operator, instantiated with x=Box1:

       pickup(Box1)
         Precond: Location(Robot, y) and Location(Box1, y) and not Holding(Box1)
         Effect: Holding(Box1)

    POP links the effect of pickup(Box1) to the Holding(Box1) precondition of Carry(Box1, from, Room1).
    Also, the precondition not Holding(Box1) is an effect of the initial state,
    and Location(Box1, Room2) is an effect of the initial state (thus y=Room2).

    This leaves the unsatisfied preconditions Location(Robot, Room2) (for pickup) and
    Location(Robot, from) (for Carry); binding Carry's from=Room2 turns the latter into
    another instance of Location(Robot, Room2). POP finds an operator go(from, Room2)
    to achieve it, and adds the operator:
      go(from, Room2)
        Precond: Location(Robot, from)
        Effect:  Location(Robot, Room2)

    POP binds go's from=Room1, and then the precondition Location(Robot, Room1) is an
    effect of the initial state, so the plan is complete.

    Actually, observe that there were TWO unsatisfied instances of
    the precondition Location(Robot, Room2) (one for pickup and one for Carry),
    but the single go(Room1, Room2) step can satisfy BOTH of them (because the
    "pickup" operator does not change this condition!).
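    As in a1, the finished plan can be written down as data; the two causal links leaving
    go(Room1, Room2) show how a single effect supports the Location(Robot, Room2) precondition of
    both pickup and Carry (again an illustrative Python sketch):

      from collections import namedtuple

      Step = namedtuple('Step', ['name', 'preconds', 'effects'])

      start  = Step('START', [], ['Location(Robot, Room1)', 'not Holding(Box1)',
                                  'Location(Box1, Room2)'])
      go     = Step('go(Room1, Room2)', ['Location(Robot, Room1)'],
                    ['Location(Robot, Room2)'])
      pickup = Step('pickup(Box1)',
                    ['Location(Robot, Room2)', 'Location(Box1, Room2)', 'not Holding(Box1)'],
                    ['Holding(Box1)'])
      carry  = Step('Carry(Box1, Room2, Room1)',
                    ['Location(Robot, Room2)', 'Holding(Box1)'],
                    ['Location(Robot, Room1)', 'Location(Box1, Room1)'])
      finish = Step('FINISH', ['Location(Box1, Room1)'], [])

      links = [
          (start,  'Location(Robot, Room1)', go),
          (start,  'Location(Box1, Room2)',  pickup),
          (start,  'not Holding(Box1)',      pickup),
          (go,     'Location(Robot, Room2)', pickup),   # one effect of go supports ...
          (go,     'Location(Robot, Room2)', carry),    # ... BOTH of these preconditions
          (pickup, 'Holding(Box1)',          carry),
          (carry,  'Location(Box1, Room1)',  finish),
      ]
      for producer, cond, consumer in links:
          assert cond in producer.effects and cond in consumer.preconds
      print('a2 plan:', ' < '.join(s.name for s in (start, go, pickup, carry, finish)))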

b) Consider the following MDP, where scoring is total value of rewards obtained in 3 steps.
   The world has 4 locations in linear sequence, as follows:  F1 I F2 P
   Actions are: R attempts to move right, and L attempts to move left. Locations F1 and F2 contain
   flags, which have values of 1 and 10, respectively. A robot visiting a location with a flag
   gets a reward equal to the flag value, and then the flag disappears. P is an absorbing state,
   with a reward of -100.

   Transition probabilities are based on: motion succeeds with probability 0.8, but with probability
   0.2 the robot either stays in the same location, or moves in the opposite direction (if possible),
   with probability 0.1 each (0.2 to stay in place if at end of sequence).

   b1) Formalize the above problem as an MDP, i.e. what are the states, the transition probability
       function, and the reward function?

   Answer: to define an MDP, we need to define the state space, the action space,
      the reward function, and the transition distribution.

      The states (excluding time) are triples: (Location, Flag1, Flag2),
      where the first component is a location (one of F1, I, F2, P), and the last two
      are binary (flag present: true/false). The total number of states is thus 4*2*2 = 16.

      The action space contains two actions: L, R.

      Reward: we will assume reward is received based only on the current state,
      before the action is made, and is also received in "last" step, when no
      actions are executed. (Thus, 3 steps above means 3 actions and 3 transitions,
      but 4 rewards). The reward function is defined as follows (x and y below
      stand for "don't care", or "any value"):

        R(F1, 1, x) = 1
        R(F2, x, 1) = 10
        R(P, x, y)  = -100

      All other rewards are zero.

      Transition probabilities: stating this explicitly requires an array with
      16*2*16 probability entries (state, action, resulting state). However,
      the array is sparse and well structured, so we state just the non-zero
      transition probabilities. We will assume that the flags "disappear" only
      in the transition AFTER reaching them, and thus after receiving the reward.

      Pit is absorbing, you stay there no matter what action is taken, and flags
      never change either.
        P((P, x, y) | (P, x, y), a) = 1   
 
      Starting at location I with action L (flags do not change): 
        P((F1, x, y) | (I, x, y), L) = 0.8
        P((I, x, y) | (I, x, y), L) = 0.1
        P((F2, x, y) | (I, x, y), L) = 0.1

      Starting at location I with action R (flags do not change):
        P((F2, x, y) | (I, x, y), R) = 0.8
        P((I, x, y) | (I, x, y), R) = 0.1
        P((F1, x, y) | (I, x, y), R) = 0.1

      Starting at location F1 with action L (flag 1 disappears):
        P((F1, 0, y) | (F1, x, y), L) = 0.9
        P((I, 0, y) | (F1, x, y), L) = 0.1

      Starting at location F1 with action R (flag 1 still disappears):
        P((I, 0, y) | (F1, x, y), R) = 0.8
        P((F1, 0, y) | (F1, x, y), R) = 0.2

      Starting at location F2 with action L (flag 2 disappears):
        P((I, x, 0) | (F2, x, y), L) = 0.8
        P((F2, x, 0) | (F2, x, y), L) = 0.1
        P((P, x, 0) | (F2, x, y), L) = 0.1

      Starting at location F2 with action R (flag 2 still disappears):
        P((P, x, 0) | (F2, x, y), R) = 0.8
        P((F2, x, 0) | (F2, x, y), R) = 0.1
        P((I, x, 0) | (F2, x, y), R) = 0.1
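      The whole transition model can be written compactly as one function. The Python sketch below
      (one possible encoding, with a state represented as a (location, flag1, flag2) tuple) prints the
      rows for F2 in the notation used above and checks that every (state, action) row sums to 1:

        from itertools import product

        LOCS, ACTIONS = ['F1', 'I', 'F2', 'P'], ['L', 'R']

        def transitions(state, action):
            """Return {next_state: probability}; flags drop once their location is visited."""
            loc, f1, f2 = state
            if loc == 'P':                                  # absorbing state
                return {state: 1.0}
            f1 = 0 if loc == 'F1' else f1
            f2 = 0 if loc == 'F2' else f2
            i = LOCS.index(loc)
            left, right = max(i - 1, 0), min(i + 1, 3)
            intended, other = (left, right) if action == 'L' else (right, left)
            probs = {}
            for j, p in [(intended, 0.8), (i, 0.1), (other, 0.1)]:
                s = (LOCS[j], f1, f2)
                probs[s] = probs.get(s, 0.0) + p            # merges "stay" cases at the ends
            return probs

        for a in ACTIONS:
            for s2, p in sorted(transitions(('F2', 1, 1), a).items()):
                print(f"P({s2} | ('F2', 1, 1), {a}) = {p}")

        for state in product(LOCS, [0, 1], [0, 1]):
            for a in ACTIONS:
                assert abs(sum(transitions(state, a).values()) - 1.0) < 1e-9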

   b2) Find the optimal policy.

        Finding the optimal policy with a finite horizon is done using the value
        function. But since the horizon is finite, the value function depends
        on time, so we denote it Vt(state). We will count t by the number of actions
       remaining, so V0(s) will thus be the value of being at state s when no
       actions remain to be executed, and thus clearly we just get the
       reward at s. That is, we have:

          V0(s) = R(s)

          Specifically, we have:
            V0(F1, 1, x) = 1
            V0(F2, x, 1) = 10
            V0(P, x, y) = -100
          with V0 = 0 for all other states.

       To compute Vi(s) we use the recursive equations:

          Vi(s) = R(s) + max (a over actions) sum (s' over states) P(s'|s,a) Vi-1(s')

        We can now compute V1(s); observe that we can ignore all zero terms in the
        summation. Also, we will take the liberty of NOT computing exactly the
        VERY BAD cases, and just use bounds, for conciseness. Note that during
        the computation we also designate the best action for each state; this
        is the required optimal policy! If no action appears below, this means that all
        actions are equally good (or bad!) for the state.

          V1(P,x,y) = -200 (all actions equally bad!)

          V1(F1,1,x) = 1   (get flag 1 and cannot get any other reward in 1 step)
          V1(F1,0,x) = 0   (same as above, but no flag)

          V1(I,0,0) = 0    (no flags/rewards available, all actions indifferent)
          V1(I,0,1) = 8    (optimal action R, get flag 2 reward next time with
                              probability 0.8). 
          V1(I,1,0) = 0.8  (optimal action L, get flag 1 reward next time, with
                              probability 0.8)
          V1(I,1,1) = 8.1  (optimal action R, get flag 2 reward next time, with
                              probability 0.8, or flag 1 reward, prob. 0.1)

          V1(F2,x,0) = -10 (best action L, still land in pit with prob. 0.1)
          V1(F2,x,1) = 0   (best action L, reward 10 but  still land in pit 
                            with prob. 0.1, exactly cancelling the flag reward)

       Now we can use the recursive equation to compute V2(s).

          V2(P,x,y) = -300 (all actions equally bad!)

          V2(F1,1,0) = 1   (get flag 1; no other reward is reachable, since flag 2 is absent)
          V2(F1,1,1) = 7.4 (get flag 1, best action R to get to (I,0,1) with prob 0.8)
          V2(F1,0,1) = 6.4 (as above, best action R, but no flag 1)
          V2(F1,0,0) = 0   (no best action)

          V2(I,0,0) = -1   (Optimal action L, but can land in (F2,0,0) with prob. 0.1)
          V2(I,0,1) = 0.8  (Optimal action L, can land in (I,0,1) with prob. 0.1) 
          V2(I,1,0) = -0.12 (optimal action L: 0.8*V1(F1,1,0) + 0.1*V1(I,1,0)
                              + 0.1*V1(F2,1,0) = 0.8 + 0.08 - 1 = -0.12)
          V2(I,1,1) = 1.61 (optimal action L, get flag 1 reward next time, with
                              probability 0.8, or to (I,1,1) prob. 0.1)

          V2(F2,0,0) = -21 (best action L, still land in pit with prob. 0.1,
                            and in (F2,0,0) with prob. 0.1)
          V2(F2,0,1) = -11 (best action L, reward 10 but  still land in pit 
                            with prob. 0.1, and in (F2,0,0) with prob. 0.1)
          V2(F2,1,0) = -20.36
                           (best action L, still land in pit with prob. 0.1,
                            and in (F2,1,0) with prob. 0.1, but can reach (I,1,0)
          V2(F2,1,1) = -10.36
                           (best action L, reward 10 but still land in pit 
                            with prob. 0.1, (F2,1,0) with prob. 0.1,
                            reach (I,1,0) with probability 0.8)

       Now we can compute V3(s) using V2(s), in a similar manner. Observe that
       with two transitions remaining, the best action was always L, due to the
       possible "catastrophe" of the pit, except for location F1 where the
       pit is unreachable in 2 steps! With THREE transitions remaining, the optimal
       policy will be L from ANY location, independent of the flag states!
       (Well, except for location P where the agent is equally dead no matter
       what...)

       Note that the above is in fact simple expectimax with memoization!
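        The hand computation above can be checked by running finite-horizon value iteration. The
        Python sketch below includes a compact copy of the transition model from b1 so that it runs
        on its own, and prints V3 and a greedy 3-steps-to-go action for every non-pit state:

          from itertools import product

          LOCS, ACTIONS = ['F1', 'I', 'F2', 'P'], ['L', 'R']
          STATES = list(product(LOCS, [0, 1], [0, 1]))

          def R(s):
              loc, f1, f2 = s
              if loc == 'F1' and f1: return 1
              if loc == 'F2' and f2: return 10
              return -100 if loc == 'P' else 0

          def T(s, a):
              """{next_state: prob}; same model as in the b1 sketch."""
              loc, f1, f2 = s
              if loc == 'P':
                  return {s: 1.0}
              f1, f2 = (0 if loc == 'F1' else f1), (0 if loc == 'F2' else f2)
              i = LOCS.index(loc)
              left, right = max(i - 1, 0), min(i + 1, 3)
              intended, other = (left, right) if a == 'L' else (right, left)
              out = {}
              for j, p in [(intended, 0.8), (i, 0.1), (other, 0.1)]:
                  out[(LOCS[j], f1, f2)] = out.get((LOCS[j], f1, f2), 0.0) + p
              return out

          V = {s: R(s) for s in STATES}                      # V0(s) = R(s)
          for t in (1, 2, 3):                                # t = actions remaining
              Q = {s: {a: R(s) + sum(p * V[s2] for s2, p in T(s, a).items())
                       for a in ACTIONS} for s in STATES}
              V = {s: max(Q[s].values()) for s in STATES}
              policy = {s: max(Q[s], key=Q[s].get) for s in STATES}

          for s in sorted(STATES):
              if s[0] != 'P':
                  print(s, round(V[s], 2), policy[s])        # L is (weakly) best everywhere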