AI MidTerm 2, solutions
----------------------------

Question 1:
-----------
1) FALSE. The whole idea of PARTIAL order planning is least commitment, and thus in general a PARTIALLY ordered set of operators is returned.

2) TRUE. There may be more than one operator (or operator instantiation) and more than one ordering constraint that work at any point; both are decision points (branches) in the search.

3) FALSE. Sensors reveal only PARTIAL (and possibly noisy) information about the state in a POMDP. The whole point of this model is INaccessibility of the state.

4) TRUE. With no evidence, directed-path singly connected Bayes networks are easy (polynomial time in the size of the input) for the probability-updating problem. The same also holds for polytrees (in case you were confused between the two notions).

Question 2:
-----------
a) Simply compute the expected utility for each choice. Note that the outcome is slightly complicated by the fact that getting the choice right guarantees only 10 points, and only IF the choice is right do the 10 points for the argumentation become available, with a certain probability. Note that making a choice makes the utility of the incompatible outcomes ZERO, and hence these cases are omitted from the computation below.

   Answering A utility: 0.1 * (10 + 0.4 * 10) = 1.4
   Answering B utility: 0.5 * (10 + 0.4 * 10) = 7
   Answering C utility: 0.4 * (10 + 0.8 * 10) = 7.2

Therefore, the student should answer C despite the fact that it is less likely to be true.

b) We will compute the value of information for the clue. Suppose the student has received the clue. In this case, the student will get the argumentation right in answers A and C if these are indeed the answers, and there is no effect in case B is the answer. After the clue, we get:

   Answering A utility: 0.1 * 20 = 2
   Answering B utility: 0.5 * (10 + 0.4 * 10) = 7
   Answering C utility: 0.4 * 20 = 8

In this case, the student should also answer C, and will get 8 points in expectation, rather than 7.2. Therefore, the value of information for the clue is 0.8: the student should be willing to pay up to, but no more than, 0.8 points for the clue (and thus should refuse to pay 5 points for this clue).

Question 3:
-----------
Clue (costs 5 points, but worth MORE than that in some cases; NOT the same behaviour as in the model of question 2 above!): Consider computing the PARITY function with multi-layer perceptron networks (from exercise 6), and what it would cost in terms of a decision tree implementation.

Answer: Multi-layer perceptron networks are strictly more expressive than decision trees for binary functions over binary variables.

Proof: Consider the parity function over n inputs. Using the solution to exercise 6, we can see that one can easily implement this function using n+1 perceptrons, assuming that the number of inputs for each perceptron and the size of the weights are not limited (though it is easy to overcome the "number of inputs" requirement if we allow a factor of n more units). The implementation is: each hidden unit Hi has threshold i-0.5, and receives all inputs with weight 1. Each such unit thus has output 1 just when the number of active inputs is at least i. The output unit has threshold 0.5, and is connected to the Hi with weights Wi = (-2)^(i-1). Clearly, this setup computes the parity function.
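As a sanity check (not part of the original exam answer), here is a minimal Python sketch of the construction just described: it builds the n+1 unit network with hidden thresholds i-0.5 and output weights (-2)^(i-1), and verifies it against the true parity on all input vectors for a small n.

    from itertools import product

    def perceptron_parity(x):
        """Parity of the 0/1 tuple x via the n+1 unit construction above."""
        n = len(x)
        # Hidden unit Hi (i = 1..n): threshold i - 0.5, all inputs with weight 1,
        # so Hi = 1 exactly when at least i inputs are active.
        h = [1 if sum(x) > i - 0.5 else 0 for i in range(1, n + 1)]
        # Output unit: threshold 0.5, weight Wi = (-2)^(i-1) on Hi.
        s = sum(((-2) ** (i - 1)) * h[i - 1] for i in range(1, n + 1))
        return 1 if s > 0.5 else 0

    # Check against the true parity for every input vector of length n = 5.
    n = 5
    assert all(perceptron_parity(x) == sum(x) % 2
               for x in product((0, 1), repeat=n))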
In a decision tree, in order to implement the parity function, it is easy to see that the entire tree with 2^n leaf nodes must be expanded, regardless of the way the tree is set up, making the tree representation exponentially larger than the perceptron network for this problem.

Conversely, consider any binary function, and let T be a (non-redundant, consistent) decision tree that implements this function. We will show that T can be simulated very efficiently by a multi-layer perceptron network. This works simply as follows. Let Ni be a decision tree node that splits on variable Var(Ni). Assume without loss of generality that the left branch is for FALSE and the right branch is for TRUE, and assume the standard representation: TRUE is 1 and FALSE is 0. We will use "and-gate" perceptron units to simulate each of the output branches of Ni. The LEFT unit has threshold 0.5 and 2 inputs: one from the output of the unit representing the branch leading into the tree node Ni, if any, with weight 1, and the other from the input variable Var(Ni), with weight -1. (For the root node, which has no incoming branch, simply drop that input and lower both thresholds by 1.) Clearly (this can be proved formally by induction along the path from the root) the output of this unit is 1 just when the respective tree node is relevant and Var(Ni) is FALSE. Similarly, the RIGHT unit has threshold 1.5, with the same connections, except that the weight on the connection to Var(Ni) is 1. Clearly the output of this unit is 1 just when the respective tree node is relevant and Var(Ni) is TRUE. By construction, this setup simulates the decision tree. Still, only ONE result is required, so we need to add one output unit that is an "or" of all the "TRUE" leaves of the (simulated) tree. This is easily done with weights of 1 and a threshold of 0.5. This simple construction computes the same function as the decision tree, with at most one unit more than the number of nodes (internal and external) in the tree, and is thus at least as efficient as the decision tree, for ANY given binary function! This completes the proof.

Question 4:
-----------
a) I({A},{B}|{}) is false, due to the unblocked path ACB, where C is a non-evidence pass-through node.

b) I({A},{B}|{C,D}) is false, due to the unblocked path AEDB, where E is a non-evidence pass-through node and D is a converging evidence node.

c) I({A},{B}|{C,D,E}) is true, since the path ACB is blocked by the pass-through evidence node C, the path AECB is blocked by the diverging evidence node C, and the path AEDB must use E as a pass-through evidence node.

d) We need to compute P(A|B,C). Observe that for this query D and E are barren nodes, and can be removed. We are left with just the nodes A, C, B. Now, we have I({A},{B}|{C}), since C is a pass-through evidence node on the only path from A to B. So P(A|B,C) = P(A|C). The latter is very simple to compute:

   P(A|C) = P(A,C)/P(C)
          = P(A,C) / (P(A,C)+P(~A,C))
          = P(A)P(C|A) / (P(A)P(C|A)+P(~A)P(C|~A))
          = 0.5*0.2 / (0.5*0.2+0.5*0.5)
          = 0.1/(0.1+0.25) = 2/7
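For completeness (not part of the original solution), a tiny Python sketch of the Question 4d arithmetic using exact fractions; the prior and CPT values are the ones quoted above.

    from fractions import Fraction

    # Values quoted in the solution above.
    p_a = Fraction(1, 2)              # P(A)    = 0.5
    p_c_given_a = Fraction(1, 5)      # P(C|A)  = 0.2
    p_c_given_not_a = Fraction(1, 2)  # P(C|~A) = 0.5

    # P(A|C) = P(A)P(C|A) / (P(A)P(C|A) + P(~A)P(C|~A))
    numerator = p_a * p_c_given_a
    posterior = numerator / (numerator + (1 - p_a) * p_c_given_not_a)
    print(posterior)  # 2/7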
Question 5:
-----------
We need to minimize the average residual absolute error per example at every step. For simplicity, we will minimize the TOTAL residual error, which is proportional to the average, just to make the computation simpler. Starting at the top of the tree, we attempt to use each of the 3 attributes.

For the AI attribute, we have the split ({3,7,7,10},{1,2,2,3}) with averages (6.75,2). The total error is thus:
   RE(AI) = Sum(|(3,7,7,10,1,2,2,3)-(6.75,6.75,6.75,6.75,2,2,2,2)|)
          = Sum(3.75,0.25,0.25,3.25,1,0,0,1) = 9.5

For the Java attribute, we have the split ({1,7,7,10},{3,2,2,3}) with averages (6.25,2.5). The total error is thus:
   RE(Java) = Sum(|(1,7,7,10,3,2,2,3)-(6.25,6.25,6.25,6.25,2.5,2.5,2.5,2.5)|)
            = Sum(5.25,0.75,0.75,3.75,0.5,0.5,0.5,0.5) = 12.5

For the Proj attribute, we have the split ({10,7,3,1},{7,3,2,2}) with averages (5.25,3.5). The total error is thus:
   RE(Proj) = Sum(|(10,7,3,1,7,3,2,2)-(5.25,5.25,5.25,5.25,3.5,3.5,3.5,3.5)|)
            = Sum(4.75,1.75,2.25,4.25,...) > 10
(no need to complete the computation, since this attribute will not be chosen ANYWAY).

Therefore, the AI attribute is the first node in the tree. We recursively apply the method to each of the branches.

Consider the AI=high branch first. Trying the Java attribute, the split is ({10,7,7},{3}), with averages (8,3) and residual error Sum(2,1,1,0) = 4. Trying the Proj attribute, the split is ({10,7},{7,3}), with averages (8.5,5) and residual error Sum(1.5,1.5,2,2) = 7. Clearly the Java attribute is better here. Since Java=low has only one example, there is no need to expand that branch, and we need to consider just Java=high. For that branch, the Proj attribute can reduce the residual error from 4 to 3, and should thus be used there.

Going back to the AI=low branch: trying the Java attribute, we get the split ({1},{3,2,2}) with averages (1, 2+1/3) and residual error 4/3. Trying the Proj attribute, we get the split ({3,1},{2,2}) with averages (2,2) and residual error 2. So the Java attribute is better in this case as well. Recursing into the sub-trees, the case Java=high has only one example, so we consider just Java=low, where using the Proj attribute can reduce the residual error to 0.

To sum up, we get the tree:

   AI = high:
       Java = high:
           Proj = high: decide 8.5 (note that we still have some error here)
           Proj = low:  decide 7
       Java = low:  decide 3
   AI = low:
       Java = high: decide 1
       Java = low:
           Proj = high: decide 3
           Proj = low:  decide 2
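As an arithmetic check (again, not part of the original solution), a short Python sketch of the total-absolute-residual criterion used above, applied to the three candidate root splits exactly as listed.

    def total_residual_error(groups):
        """Total |grade - group mean| over all groups of a candidate split."""
        return sum(abs(g - sum(group) / len(group))
                   for group in groups for g in group)

    # Root-level candidate splits, grouped as in the solution above.
    print(total_residual_error([[3, 7, 7, 10], [1, 2, 2, 3]]))   # AI   -> 9.5
    print(total_residual_error([[1, 7, 7, 10], [3, 2, 2, 3]]))   # Java -> 12.5
    print(total_residual_error([[10, 7, 3, 1], [7, 3, 2, 2]]))   # Proj -> 20.0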