Introduction to Artificial Inteligence

Assignment 6

Written assignment: planning, probabilistic inference, and learning - Solution

1) We have the following distribution over the binary variables A, B, C, D, given as a full table:

   A     B     C     D   probability
   F     F     F     F   0.0
   F     F     F     T   0.05
   F     F     T     F   0.025
   F     F     T     T   0.05
   F     T     F     F   0.0
   F     T     F     T   0.05
   F     T     T     F   0.025
   F     T     T     T   0.05
   T     F     F     F   0.0
   T     F     F     T   0.15
   T     F     T     F   0.075
   T     F     T     T   0.15
   T     T     F     F   0.0
   T     T     F     T   0.15
   T     T     T     F   0.075
   T     T     T     T   0.15

a) Construct the Bayes network that most compactly describes the 
   distribution (i.e. one that has the smallest number of arcs). 

ANS: By glancing at the probability table, it is obvious that numbers 
     in lines 1-4 (B=F) are repeated in lines 5-8 (B=T), and likewise
     lines 9-12 are repeted in lines 13-16. So clearly these numbers do not
     depend on B, and all variables are thus independent of B (therefore B is
     completely disconnected from the rest of the variables).
     We only need to consider nodes A, C, D now. Suppose we start with A, then C,
     Then D. Checking whether C depends on A, compare P(C) and P(C|A).
     We have P(C=True) = 2 * (0.025+0.05+0.075+0.15) = 0.6
             P(A=True) = 2 * (0.15+0.075+0.15) = 0.75
             P(A=True, C=True) = 2 * (0.15+0.075) = 0.45
     So we have P(A=True, C=True) = P(C=True)*P(A=True)
       We need to check the other possible instantiations:
          P(A=True, C=False) = 2 * (0.15) = 0.3 = P(A=True)*P(C=False) = 0.75 * 0.4 
          P(A=False, C=False) = 2*0.05 = 0.1 P(A=false)*P(C=False) = 0.25 * 0.4 
          P(A=False, C=True) = 2*(0.05+0.025) = 0.15 =  P(A=false)*P(C=True) = 0.25 * 0.6
     So C does not depend on A either, and thus also has no parents.

     Finally, check D. We have:  P(D=False) = 2 * (0.025+0.075) = 0.2
     and therefore P(D=True) = 0.8.
         We check all the cases of D and A and find that
           P(A, D) = P(A) * P(D), and thus D does not depend on A either.

     It remains to be seen whether D depends on C, and in fact it does:
     P(C=False, D=False) = 0  as can be seen from the respective 0 entries.
     However, P(C=False) = 0.4 and P(D=False) = 0.2 and multiplying them
     does NOT give 0. So D depends (only) on C.

     The resulting Bayes network has only one arc - from C to D.

b) Is the answer above unique? If not, what is the set of possible BNs 
   that describe the distribution compactly?

ANS: Not unique. The Bayes nework containing only one arc from D to C is
     also possible. But there are no other options beyond these two.

2) Consider the following variables and their immediate causes (parents):

     variable         parents
        A               none
        B               none
        C               none
        D               A C
        E               A B
        F               B
        G               D F
        H               C E

    a) Is this network a poly-tree?

ANS: No, because e.g. A-E-H-C-A is a cycle in the underlying undirected graph.

    b) Is this network (directed-path) singly connected?

ANS: Yes, there are no cases with more than 1 directed path between any pair of

    c) Determine the truth of the following independence
       statements (using d-separation):
       1)  I({D}, {E} | {})
ANS: No, because in path D-A-E node A is a non-evidence diverging node, and thus
     this path is not blocked by the (null) evidence.
       2)  I({D}, {E} | {A})

ANS: Yes, because in path D-A-E node A is an evidence diverging node, and thus
     this path is blocked by the evidence. All other paths contain a converging
     non-evidence node with no children (a "barren" node), that blocks the path.

       3)  I({D}, {E} | {A, G})

ANS: No. Now path D-G-F-B-E is unblocked, because G is a converging evidence node,
     F is a passthrough non-evidence node, and B is a diverging non-evidence node.

       4)  I({B}, {G} | {D, F})

ANS: Yes, as every simple path from B to G must traverse eithe D or F as passthrough
     nodes, and both are evidence nodes, blocking the respective paths.

       5)  I({B}, {C} | {})

ANS: Yes, successively removing barren nodes leaves just B and C disconnected.
     Alternately (longer), we can show directly that every path from B to C
     has a non-evidence converging node (and cannot have evidence descendants,
     as the evidence is null).
       6)  I({B}, {C} | {H})

ANS: No. Path C-H-E-B is unblocked, since H is a converging evidence node and
     E is a passthrough non-evidence node.

    d) Assume that the nodes are binary-valued.
       Suppose that P(A=true) = 0.4, P(B=true)= 0.2, P(E=true|A=B=True) = 0.5,
       P(E=true|A=false,B=true) = 0.1, P(E=true|A=true,B=false) = 0.5, P(E=true|A=B=false) = 0.5
       P(H=true|E=true) = 0.1, P(H=true|E=false) = 0.4, 

       Find P(A=true | B=true, E=true, H=true)

       Hint: you may want to preprocess to get rid of nodes before doing any computation.

ANS: In pre-processing, remove barren node G, then D and F. Now, we can see that
     I({A}, {H}| {B, E}) in the remaining graph (E blocks the only path A-E-H
     as it is a passthrough evidence node), so it is sufficient to compute the
     equivalent query:  P(A=true | B=true, E=true)
     But for the latter query, H is barren, and so is C. So we remain only
     with a BN with the 3 nodes A, B, E. We now use the definition of conditionals:
        P(A=true | B=true, E=true) = P(A=true, B=true, E=true)/P(B=true, E=true)
        = P(A=true, B=true, E=true)/(P(A=true, B=true, E=true)+ P(A=false, B=true, E=true)) 
     We now use factoring according to the definition of BN distributions, i.e.:
        P(A, B, E) = P(A)*P(B)*P(E|A, B) to get:
     P(A=true | B=true, E=true) = 0.4*0.2*0.5 / (0.4*0.2*0.5 + 0.6*0.2*0.1) 
     = 0.2/(0.2+0.06) = 10/13

3) You need to construct a 6-input neural network that has value 1 in the 
   output just when the number of its inputs that are "on" is either 1
   or 4 (assume inputs are in {0, 1}).
   a) Can this be done without hidden units?

ANS: No. In order to separate the case of no inputs active from one input active, a hyperplane
    is needed that cannot possibly also separate the case of one input active from 2 inputs
    active. Another way to see it is projecting the space onto any 2 inputs x1, x2,
    when all other inputs are 0. In this case, the output needs to be a XOR(x1, x2),
    which is known not to be implementable without hidden units.

   b) Show a network using threshold elements that performs
      the required computation (specify the weights, too!)

ANS: Many ways to do this. The simplest would have 3 hidden units, one for encoding
     "more than 1", another for encoding "4 or more" and one for encoding "more than 4".
     We will call these unit outputs a2, a4, and a5, respectively. An additional unit
     is the output unit o. All units are connected to all inputs, with a weight of 1.
     In addition, the outputs a2, a4, a5 are also connected as inputs of o.
     The theresholds and still unspecified weights are as follows.

     The threshold of unit o is 0.5, so that any input unit being "on" can
     activate it. The weight on the connection from a2 to o is -10, so that 
     it will override the normal inputs and deactivate o if more than one input is "on".
     The weight from a4 to o is 10, so that it will override the override
     generated by a2 if a4 is active. The weight from a5 to o is again -10, so
     as to override the override provided by a4, if more than 4 inputs are "on".

     Finally, the threshold of the aj units are:
     Threshold of a2: 1.5
     Threshold of a4: 3.5
     Threshold of a5: 4.5

Additional challenge: can this function be implemented with 
 a) One hidden unit?
 b) Two hidden units?  

4) a) For the blocks world as defined in class, show how POP constructs the plan
      (that is, show all partial plans from the initial plan until a correct
      plan is reached) for initial state:
    C  D

      and the goal state cointaining: On(A,B) and On(B,C)

ANS:  Initial plan contains dummy step:
          PRECONDITIONS: null
          POSTCONDITIONS: On(A,B), On(B,D), On(C,Table), On(D,Table), Clear(A), Clear(C)

      and a dummy step:
          PRECONDITIONS: On(A,B) On(B,C)
          POSTCONDITIONS: null 
      and constraint INIT < END.

      Say POP picks On(B,C) of END as the unsatisfied precondition to handle.
      The only way to make it work is to add a step S1: 
         PRECONDITIONS: Clear(B), Clear(C), On(B,x)
         POSTCONDITIONS: On(B,C), not Clear(C), Clear(x)
      and constraints  INIT < PutOn(B,C) < END
      This causes no potential violations.

      Now, POP might pick unsatisfied precondition Clear(C) of S1,
      which can be satisfied by INIT.

      Next, can pick unsatisfied precondition On(B, x) of S1, which can be
      satisfied by INIT with x=D.

      Next, can pick unsatisfied precondition Clear(B) of S1, which only
      be satisfied by addin a step Puton(A, y) for some y. We will use y=Table
      (or alternately a specialized step PutOnTable(A)), as otherwise we might
      run into trouble, e.g. PutOn(A,C) will cause a violation of Clear(C) which
      cannot be fixed. So we add step S2:
         PRECONDITIONS: Clear(A), On(A, x)
         POSTCONDITIONS: On(A,Table), Clear(x)
      With x=B, the postcondition Clear(x) satisfies this precondition of S1, so
      we add a constraint S2 < S1. Also INIT < S2.

      No we look at the Clear(A) precondition of S2, which can be satisfied by INIT.
      Likewise, precondition On(A,B) is satisfied by INIT.
      Finally, there is the precondition On(A,B) of END. Although it is satisfied by INIT,
      this would not work because then PutOn(A,Table) would violate it in a way that cannot
      be fixed. So instead we add step S3:
         PRECONDITIONS: Clear(A), Clear(B), On(A,x)
         POSTCONDITIONS: On(A,B), not Clear(B), Clear(x)
      Note that S2 violates the precondition On(A,B) so S3 must be after S2,
      so we need to promote S2 and add the constraint S2 < S3. Also, note that
      S3 is a potential threat to precondition Clear(B) of S1, so we must promote S3,
      and require S1 < S3 as well. Overall, 
      we have the constraints INIT < S2 < S1 < S3 < END, which are consistent.
      Finally, Clear(A) of S3 is satisfired by INIT, and On(A,x) is satisfied by S2
      with x=Table. Thus, we have a complete plan.

   b) Suppose that is not known whether On(B,D) or On(D,B) in the initial state,
      and that there is an operator "Check(On(x,y))" which results in knowing whether
      this is true or not. Show how a conditional plan is constructed for the above

ANS: Same as above, except that after adding S1, we run into trouble when trying to satisfy
     precondition Clear(B) of S1, as we do not know if we have On(B,D) or On(D,B)
     (we will assume that we do know that A is wither on B or on D, depending on the
     above information, otherwise we will need to check that as well.

     So we add an SC (conditional) step:
         PRECONDITION: null
         POSTCONDITION: Know(On(B,D))
     In the "true" branch we continue as before. Now we have the "false" branch which
     has "On(D,B)". In this case we must add a step S4:
         PRECONDITIONS: Clear(D), On(D, x)
         POSTCONDITIONS: On(D,Table), Clear(x)

     And now we must duplicate the step PutOn(A,Table) for this contingency,
     (call this S5) so as to get Clear(D). 
     So in fact the "Check" must be the first action, and we get the plan.

     INIT --  SC -- (true) --- S2 -- S1 -- S3 -- END
                    (false) -- S5 -- S4 -- S1 -- S3 -- END

     (also note that the planner may not be smart enough to re-use S1 and S3,
      in which case it must duplicate those steps as well).

5)  Consider the following scenario, occuring in a certain middle-eastern country.*
    We will assume that you are a risk-neutral, i.e. that
    your utility function is linear in value unit discused below (morts).
    Assume also that moral considerations are not part of your utility function.
    A well-known band of criminals has stolen your FOO, is 
    threatening to destroy it immediately (a damage of 1 mort), and will only return it to you if you
    give them your mojo (C)** instead. Surrendering your mojo is not of immediate
    material damage, but may cause you a loss in the future. 
    With probability 0.5 this will cause a loss
    of 1 mort next fortnight, and additionally with 
    probability 0.4, independent of the above, it will cause a damage of 10 morts
    the one fortnight after the next. Your options are as follows:

    a) Do nothing, and then the criminals destroy your FOO.
    b) Surrender your mojo, and possibly encur the respective future losses.
    c) Request the police to try a raid to recover your FOO. Since
       you do not know where it is, the results are as follows:
       With probability 0.1 they will succeed, and recover your FOO.
       With probability 0.1, this will cause a loss of 1 mort, regardless of
       whether they succeed in recovering your FOO or not.

    5.1) What is your optimal course of action? Consider two reward accumulation functions:
       a) Simple sum of rewards.
ANS:  Do nothing: reward -1 mort.
      Surrender mojo: expected reward 0.5*(-1)+0.4*(-10) = -4.5
      Police action:  0.9*(-1)+0.1*0+0.1*(-1) = -1

      So either doing nothing or police action are optimal here (though the latter is more

       b) Discounted sum of rewards with a discounting factor of 0.2 (per fortnight).
ANS: No change of utility for actions a and c since the rewards are immediate.
     But for action b the rewards are in the future, so we modify the computation to:
     Surrender mojo, expected discounted reward: 0.5*(-1)*0.2 + 0.4*(-10)*0.2*0.2 = -0.26

     So for the discouted reward case the "surrender mojo" action is optimal.

    5.2) Would the answer be different, if in surrendering your mojo, there is
         in addition a probability of 0.5 that the band of criminals will steal
         your FOO again after 2 fortnights (and then leave you with the same options?)

ANS: No. Because in the undiscounted rewards case, the "surrender mojo" action is already
     bad, and this possibility only makes it worse. 

     In the discounted rewards case,
     we can bound the damages by 1 mort 2 fortnights ahead, and multiplied by 0.2 squared
     this is 0.04, so deducting this from -0.26 still leaves the "surrender mojo"
     action as optimal in this case.

    5.3) In 5.1, suppose that an informer can provide information that
         will increase the probability of success of a police raid to 0.5. How much
         should you be willing to pay for such information (in both cases a and b)?

ANS:Assuming you choose "police action" (optimal anyway), receiving the information will
    not make you change the decision, but will improve the expected utility of this action.
    The new expected utility for this action will be: 0.5*(-1)+0.5*0+0.1*(-1) = -0.6
    So you should be willing to pay up to the difference, 0.4 mort, for the information.

    In the discounted rewards case, the improvement in expected discounted reward will not
    make police action optimal anyway, so this action will not be chosen. Therefore,
    the information here is worth 0.

6) Consider the following belief-state MDP: the Canadian traveller problem.

   The (undirected) graph has 4 vertices, as follows:
     (I, V1, V2, G)
   All vertices are directly connected by edges.
   The edges are as follows:
     (I, V1), weight 1, no ice.  ; e1
     (I, V2), weight 2, no ice.  ; e2
     (I, G), weight 10, no ice.  ; e3
     (V1, V2), weight 3, probability of ice 0.5  ; e4
     (V1, G), weight 1, probability of ice 0.8   ; e5
     (V2, G), weight 2, probability of ice 0.5   ; e6
   b1) Formalize the above problem as a (belief-state) MDP, i.e. 
       what are the states, the transition probability
       function, and the reward function?

ANS: In general, the states of the belief-space MDP can be factored into
    the (known) agent vertex, and what is known about each edge. Since edge
    state in the underlying process cannot change, edges known initially to be
    traversible need not be represented. Thus, the belief space is:

     B = V x {0, 1, U} ^ |E'|

    Where V is the set of vertices, and E' the set of unknown edges.
    For each edge e, "0" denotes blocked, "1" denotes unblocked, and "U"
    denotes unknown (probabilty p(e) of being blocked).
    Actions are of the form move(v, u), i.e. traverse the edge from v to u.

    The transition probabilities for actions not acievable in a belief state
    (trying to traverse an edge not adjacent to the current location, or
    a blocked edge) are 1 for staying in the same state, 0 for any other
    state.  For achievable actions, a= move(v, u) we can define the distribution over
    post-action states in a decomposable manner:
       P(b' | a, b) = P(v' | a,b) * P(e' | a,b) * P(e'' | b) * P(e''' | a,b)

    Where P(e'''| v, a) is a distribution  over the edges unknown before the action
    that are not adjacent to u, which remain unknown with probability 1 after
    the transition, P(v' | v, a) = 1 just when v' = u, and 0 otherwise, and
    P(e'' | v, e) is a distribution over edges that are already known, and
    their state remains the same with probability 1 (note that this does not depend on a).
    Finally, the distribution P(e' | v,a) is over previously unknown edges
    e' that are adjacent to u, and is a product over the respective edge states:

        P(e' | v,a) = (prod (e over blocked edges)  p(e)) * (prod (e over unblocked edges) 1-p(e))

    We will generate the numbers for the example on-the-fly below as needed.

    Rewards for an achievable action are R(move(v,u)) = -w(v,u)

   b2) Find the optimal policy. You may assume that the robot starts
       at vertex I in order to save some computation.

     In the current example, there are 3 unknown edges, so the belief states are of the form:
      [v; e4, e5, e6]
     Where v indicates a vertex, and the ei are the state of the respective edge, out of {0,1,U}.
     The initial state might be: [I; U, U, U]

     This can be done by value iteration, where all belief state utilities are initialized
     to (-infinity), except for (X stands for "any possible configuration of the edge"):
       U([G;X,X,X]) = 0              ; Being at the goal, no actions needed, utility 0.

     Since we need to do it manually, we will do this efficiently by beginning with cases
     that need only be updated once (we hope). First, the ones which have only certain
       U([V1;X,1,X]) = U([G;X,X,X]) - w(V1,G) = 0 - 1 = -1      ; at V1, edge e5 to goal clear
       U([V2;X,X,1]) = U([G;X,X,X]) - w(V2,G) = 0 - 2 = -2      ; at V2, edge e6 to goal clear
       U([I;X,X,1])  = U([V2;X,X,1]) - w(I,V2) = -2 -2 = -4     ; At I, knowing e6 clear
       U([I;X,1,X])  = U([V1;X,1,X]) - w(I,V1) = -1 -1 = -2     ; At I, knowing e5 clear
                  ; note that these are ASSIGMENTS, we have e.g. U([I;X,1,1]) improved to -2 here.
       U([V1;X,0,1]) = U([I;X,X,1]) - w(I,V1) = -4 - 1 = -5     ; At V1, knowing e5 blocked, e6 clear
       U([V2;X,1,0]) = U([I;X,1,X]) - w(I,V2) = -2 - 2 = -4     ; at V2, knowing e6 blocked, e5 clear
                  ; best actions in these last 2 cases are "return to I".

       U([I;X,0,0])  = U([G;X,X,X]) - w(I,G) = 0 - 10 = -10
       U([V1;X,0,0]) = U([I;X,0,0]) - w(I,V1) = -10 - 1 = -11
       U([V2;X,0,0]) = U([I;X,0,0]) - w(I,V2) = -10 -2 = -12

       Now we need to handle a few uncertain transitions:
        U([I;X,0,U]) = max { (U([G;X,X,X]) - w(I,G))                           ; = -10
                             (U([V1;X,0,U) - w(I,V1))                          ; = ? (will not help!)
                             0.5*U([V2;X,0,0]) + 0.5*U([V2;X,0,1])- w(I,V2))   ; = 0.5(-12-2)-2 = -9
                       } = -9 
             ; with optimal action: move(I,V2)

        U([I;X,U,0]) = max { (U([G;X,X,X]) - w(I,G))                         ; = -10
                             (U([V2;X,U,0) - w(I,V2))                        ; = ? (will not help!)
                             0.8*U([V1;X,0,0]) + 0.2*U([V1;X,0,1])- w(I,V1)) ; = 0.8*(-11)+0.2*(-1)-1
                       } = -10 
             ; with optimal action either  move(I,V1) or move(I,G), since they cause the same value,
               so we will keep move(I,G) as the optimal (simpler).

        U([V1;X,0,U]) = U([I;X,0,U]) - w(I,V1) = -9 - 1 = -10
        U([V2;X,U,0]) = U([I;X,U,0]) - w(I,V2) = -10 - 2 = -12

        And finally we get to update the "initial" state:
        U([I;U,U,U]) = max { (U([G;X,X,X]) - w(I,G))                     ; = -10
                             0.5*U([V2;X,U,0])+0.5*U([V2;X,U,1])-w(I,V2) ; = 0.5*(-12-2)-2 = -9
                             0.8*U([V1;X,0,U])+0.2*U([V1;X,1,U])-w(I,V1) ; = 0.8*(-10)+0.2*(-1)-1=-9.2
                       } = -9
             ; With optimal action move(I,V2)

Final optimal policy is intocated in the moves above. Stated as a tree starting in the
"initial" position it looks like:

move(I,V2)  --- (e6 unblocked) --- move(V2,G)
            --- (e6 blocked) --- move(V2,I) --- move(I,G)

Expected cost is 9.2, since  U([I;U,U,U])= -9.2

Note that we "cheated" since it never paid to use e4 even if it is unblocked,
so we "bundled" all respective cases of the state of e4 into "don't care"s.
We will get the same result if we don't do this, but the solution
will be at least twice as long!

7)  A project leader in the IMG4 consortium would like to construct a
    decision procedure based on results from hiring employees in the KITE
    consortium. The decision procedure will be based on a decision tree, using information
    in the case database. Attributes for each employee are:
    degree level (A),  Java programming ability (B), self-motivation (C), grade in AI (D).
    The target attribute is the "decision", describing the amount of money to
    pay the employee.

  a) Construct by hand (using the greedy search using the best information
     gain heuristic) a consistent decision tree for the following table:

     input variables               decision
   A      B      C      D
   1      F      M      95             8
   1      T      L      70             4
   1      F      L      70             4
   2      T      M      95             6
   2      T      M      80             6
   2      T      H      95             8
   2      T      H      80             8
   2      F      H      95             8

ANS: We need to try all possible attributes (input variables). We will try a qualitative comparison,
and if not clear we will actually need to compute entropy.

Trying A, we get for A=1: {8, 4, 4}
                     A=2: {8, 8, 8, 6, 6}

Trying B, we get for B=F: {8, 6, 6}
                     B=T: {8, 8, 6, 6, 4}
The case B=F has the same distribution type as A=1, and just as much information.
The case B=T is strictly less informative than that of A=2. So attribute B is worse than A.

Trying C, we get for C=L: {4, 4}           ; fully classified
                     C=M: {8, 6, 6}
                     C=H: {8, 8, 8}        ; fully classified
Since the case C=M is like the case A=1, but C=M and C=H are clearly better than the case A=2,
then C is better than A (and better than B). The case C=M requires a bit less than 3
bits to fully classify the 3 examples (distribution not fifty-fifty), to get the exact
number do: -(log(1/3)+2*log(2/3))
Base of logs is 2 throughout.

Trying D, we get for D=70: {4, 4}
                     D=80: {8, 6}
                     D=95: {8, 8, 8, 6}
So it looks like C is also better than D, but it is not very obvious. The case D=80
requires 2 bits to classify the 2 examples as  distribution is (0.5, 0.5).
The case D=95 requires more than 2 more bits total for the 4 examples, to get the exact number
do: -(log(1/4)+3*log(3/4))
So we can see that indeed D is worse than C.

So we pick C for the root of the tree. We now need to handle just the branch C=M.
Here we can pick either A or B (but not D), for a complete classification.
For example, pick A. We have: A=1: {8}
                              A=2: {6, 6}
So we are done (zero additional bits needed for final classification.
So the resulting decision tree is:

C? --- C=L : 4
   --- C=H : 8
       C=M : A? --- A=1 : 8
                --- A=2 : 6 

  b) Is a more compact (fewer internal nodes) decision tree possible for the above table?
     If no - prove it. If yes, which tree?

ANS: Not possible. 
     Proof: The resulting tree has 2 internal nodes. A more compact tree would
     have only 1 internal node. We have enumerated all possible such trees, and did not
     get a tree consistent with the training data, so a 2-node tree is the most compact
     consistent tree.

Deadline: Wednesday, January 13, 2010, at 12 noon.

Note: submission (as well as WORK) on this exercise is SOLO (that is, NOT in pairs or other groups of size > 1)

* Disclaimer: any seeming identification with existing countries, institutions, political parties, or people, is unintentional.

** Mojo: copyright some film studio we do not recall.