Introduction to Artificial Inteligence

Assignment 6-Solution

Written assignment: planning, probabilistic inference, and learning

Justify all answers!

1) We are given the following independence statements about random variables
   A, B, C, D, E, F:

      not I({A},{B}|{C})
      not I({D},{E}|{})

a) Construct a Bayes network that is consistent with the 
   above set of independence statements.

b) Is the answer above unique? If not, what is the set of possible BNs 
   that is consistent with all the statements?

ANSWER (a and b):
The first 2 independence statements mean no arcs between {A,B,C} and {D,E} or {F}, so
{A,B,C} is a separate subgraph. The 3rt statement means no arcs between {D,E} and {F} so
these are also separate subgraphs. That is, F is a node with no arcs.

The statement  not I({D},{E}|{}) means that there is a path between D and E, and since they are
on their own component, there is an arc between D and E. This arc can be in either direction,
so answer is not unique here.

The statement I({A},{B}|{}) means that there is no arc between A and B. But not I({A},{B}|{C})
means that there is a path between A and B, and since A,B,C are the entire connected component
then the path must be through C, and in addition it must be converging at C, otherwise
C would block the path A-C-B. So in this part the answer is unique.
However, overall there for the entire BN there are 2 possibilities, as there are 2 possibilities
for the edge between D and E.

2) Consider the following variables and their immediate causes (parents):

     variable         parents
        A               C
        B               none
        C               none
        D               A
        E               A
        G               D B
        H               E

    a) Is this network a poly-tree?
ANSWER: Yes, the underlying undirected graph has no cycles. Quick check: the underlying undirected graph
      is connected, has 7 nodes and 6 arcs, and is thus a tree (theorem of graph theory).

    b) Is this network (directed-path) singly connected?
ANSWER: Yes, all polytrees are directed-path singly connected.

    c) Determine the truth of the following independence
       statements (using d-separation):
       1)  I({D}, {E} | {})
ANSWER: no, path D-A-E is not blocked (A is a diverging non-evidence node)
       2)  I({D}, {E} | {A})
ANSWER: yes, only path D-A-E is blocked (A is a diverging evidence node)
       3)  I({D}, {E} | {A, G})
ANSWER: yes, as above.
       4)  I({B}, {E} | {D})
ANSWER: yes, only path B-G-D-A-E is blocked by D, a passthrough evidence node.
       5)  I({B}, {C} | {})
ANSWER: yes, only path C-A-D-G-B is blocked by G, a converging non-evidence node with no children.
       6)  I({B}, {C} | {G})
ANSWER: no, path C-A-D-G-B is not blocked by G (a converging evidence node), and not by A, D (passthrough non-evidence nodes)

    d) Assume that the nodes are binary-valued.
       Suppose that P(C=true) = 0.2, P(B=true)= 0.1, P(E=true|A=false) = 0.5,
       P(E=true|A=true) = 0.5, 
       P(H=true|E=true) = 0.2, P(H=true|E=false) = 0.4,
       P(A=true|C=true) = 0.9, P(A=true|C=false) = 0.4

       Find P(C=true | A=true, H=true)

       Hint: you may want to preprocess to get rid of nodes before doing any computation.
   G is a barren node (not query or evidence, and has no children), and can be removed.
   Now D and B become barren, and can be removed.
   In the remaining graph, C is d-separated from H given A, so the evidence at H can be ignored,
   and we need to compute P(C=true | A=true), which we do simply by using Bayed formula
   (or we can drop H followed by E which now become barren before using a belief updating algorithm,
   but here we are computing by hand so no need to do that).

   P(C=true | A=true) = P(A=true|C=True)*P(C=true) / (P(A=true|C=true)*P(C=true) + P(A=true|C=false)*P(C=false)) =
                   0.9*0.2 / (0.9*0.2 + 0.4*0.8) = 0.18 / (0.18 + 0.32) = 0.36

3) You need to construct a 4-input neural network that has 2 outputs: the first
   has value 1 in the output just when the number of its inputs that are "on" is even,
   and the other has a value of 1 in the ouput just  when the number of its inputs that are "on" is odd.
   (assume inputs are in {0, 1}).
   a) Can this be done without hidden units?
ANSWER: No, because the projection of this function (say for the 2nd output) and the first 2 inputs being 0 is
    a XOR function, which already cannot be done without hidden units.

   b) Show a network using threshold elements that performs
      the required computation (specify the weights, too!)

ANSWER: Can be done with 2 hidden units "counting" the number of "1" inputs. 
        Each hidden units is connected to all inputs, e.g. with a weight of 0.9
        Hidden unit A will compute "more than 1", e.g. by a threshold of 1.
        Hidden unit B will compute "more than 3", e.g. by a threshold of 3.

        The output units are each connected to all input, as well as to the output of both hidden units.
        The rest of the weights are as follows (note that this is a bit delicate, it is possible
        to make the answer less tricky but then we need additional hidden units).

        The output unit for the first function ("even") needs to output "1" if all or no inputs are "1"
        or if "more than 1" not more than 2 inputs are on. This can be done if (for the output unit)
        its input weights for the circuit inputs are -1, and the threshold is -0.5. 
        So with all inputs 0 we are over the threshold, but
        when any input becomes 1 we are below the threshold. When 2 inputs are on, this activates the
        "more than 1" (unit A) and with a weight of 2 on this connection this cancels the negative weights and
        the output is 1 again. When 3 inputs are 1 the total weighed input becomes -1 again, as needed, and
        the output goes to 0.
        Then using the "more than 3" (unit B, with weight 2) cancels out all the negative weights and swings the
        output to 1 again when all four inputs are 1.

        The output unit for the second function ("odd") can work by inverting the weights and thresholds.
        Thus it will have weights from the circuit input connections all 1, and a threshold of 0.5.
        The connection from unit A ("more than 1") will have weight -2, canceling out the positive inputs
        when 2 inputs are on. Then again when the 3rd units is on, the circuit inputs overcome the negative
        weights of A and the output becomes 1. The connection from unit B ("more than 3") will have a weight -2,
        helping to overcome the circuit inputs when al inputs are on.

        (Alternately, the 2nd output unit is connected only to the output of the first output unit,
        and does "negation", say by having input weight of -1 and a threshold of -0.5)

4) a) For the blocks world as defined in class, show how POP constructs the plan
      (that is, show all partial plans from the initial plan until a correct
      plan is reached) for initial state:
    C  B

      and the goal state containing: On(A,B) and On(B,C)

   Initial state consists of the dummy steps START and END, with constraint START before END,
   where the steps are:

   START: Preconditions: none
          Effects: On(C,Table), On(B,Table), On(A,B), On(D,A), Clear(D), Clear(C)

   END:   Preconditions: On(A,B), On(B,C)
          Effects: none

   Now we select open preconditions, suppose we pick On(B,C). This requires adding:

   STEP1: Action: Puton(B,C)
          Preconditions: Clear(B), Clear(C), On(B,x)
          Effects: On(B,C), Clear(x), not Clear(C), not On(B,x)

   With constraints START before STEP1 before END. The link from STEP1 to END is set accordingly.
   Clobbering not possible.
   Suppose we now select precondition Clear(C) of STEP1. It is possible to create the link from START to STEP1
   to satisfy it, and again clobbering is not possible.
   Likewise, we can select On(B,x) of STEP1, and can create a link from start to STEP1 to
   satisfy it, with x=Table. Again no clobbering.

   Now we might select precondition Clear(B) of STEP1. Satisfying this requires a new step, for example:

   STEP2:  Action: PutOnTable(A)
           Preconditions: Clear(A), On(A,B)
           Effects: Clear(B), On(A,Table), not On(A,B)
   And we add the link from STEP2 to STEP1 and constraints START before STEP2 before STEP1.
   Selecting precondition On(A,B) of STEP2, we can reslove it by adding a link from START. There are no
   Selecting precondition Clear(A) of STEP2, we need to add a step:

   STEP3:  Action: PutOnTable(D)
           Preconditions: Clear(D), On(D,A)
           Effects: Clear(A), On(D,Table), not On(D,A)
   And we add the link from STEP3 to STEP2 and constraints START before STEP3 before STEP2.
   Selecting precondition On(D,A) of STEP3, we can reslove it by adding a link from START. There are no
   Selecting precondition Clear(D) of STEP3, we can resolve it by adding a link from START. There are no

   The only remaining open precondition is On(A,B) of END. We MIGHT resolve it by adding a link from START
   but then this is clobbered by PutOnTable(A) in STEP 2. In this case both promotion and demotion will not
   work, and we will need to backtrack. Another possibility is adding another step:

   STEP4: Action: Puton(A,B)
          Preconditions: Clear(A), Clear(B), On(A,x)
          Effects: On(A,B), Clear(x), not Clear(B), not On(A,x)

   And the link from STEP 4 to END. Note that this step's "not Clear(B)" clobbers the link from
   STEP2 to STEP1. Therefore STEP4 must be either before STEP2 (will not work), or after STEP1. So we
   add the constraint STEP1 before STEP4.
   Now all preconditions of STEP4 can be resolved: Clear(B) by the effects of STEP2. Clear(A) by the
   effects of STEP3. On(A,x) by the effects of STEP2 with x=Table. The links must be added appropriately.
   We get the total order:  START before STEP3 before STEP2 before STEP1 before STEP4 before END
   with no open preconditions and no cloberers, and we are done.

   b) Suppose that is not known whether On(A,B) or On(B,A) in the initial state,
      and that there is an operator "Check(On(x,y))" which results in knowing whether
      this is true or not. Show how a conditional plan is constructed for the above

ANSWER: This is a bit difficult, as if On(B,A) then in fact On(D,B) and so we need THREE
check actions: Check(On(A,B)) and Check(On(D,A)) and Check(On(D,B)) and we may have to generate
EIGHT plans (some for impossible cases). So for simplicity suppose that Check(On(A,B))
gives us ALL the needed information w.r.t. D as well, and now generate only TWO plans:
One for the case On(A,B) On(D,A)
 and one for On(B,A), On(D,B). The initial state is thus:

   START: Preconditions: none
          Effects: On(C,Table), Clear(D), Clear(C)

   END:   Preconditions: On(A,B), On(B,C)
          Effects: none

 After adding STEP1, we find that we cannot satisfy its preconditions and add the CHECK action which gives us
 (note that it has 2 sets of possible effects):

  CHECK:  Action: Check(On(A,B)
          EffectsTRUE: On(A,B) On(D,A)
          EffectsFALSE: On(B,A),On(D,B)

  Now on the TRUE branch we can proceed as before, where some of the liks aref satisfied by the EffectsTRUE of CHECK.
  The END step must be duplicated, generating END' to handle the FALSE branch as well. Then we proceed similar to
  the case before, generating a STEP1', a STEP3' and a STEP4' and the appropriate links.
  (STEP2', the counterpart of STEP2 is not needed here, because Clear(B) after STEP3')

5)  Consider the following voting scenario, occuring in a certain middle-eastern country.*
    We will assume that you are a risk-neutral, i.e. that
    your utility function is linear in the amount of money that your assets are worth.
    Assume also that moral considerations are not part of your utility function.
    There are 3 candidates for whom you could vote: Yahoo, Future, and White,
    all equally likely to win, unless your vote happens to swing the election.
    Assume that there is a 5% chance that your vote will actually swing the election,
    and a 0.5% chance that "The Man" (now officially endorsed by Yahoo) will somehow find out what your vote was.
    If that happens, and you voted for someone OTHER than Yahoo, your business
    will be trashed for a loss of 100,000 NIS and also lose every potential side-benefit
    of serving with Yahoo (see below).

    Your accountant has notified you that your taxes for FY 2012 amount to 1,000,000 NIS
    on your cottage-cheese-encapsulation business.
    Also, you happen to be a member of Yahoo's party, and if this party wins
    you have a 10% chance of being in the legislature, in which case you will be able to
    pass a law that declares all cottage-cheese-related businesses tax-exempt. But there is also a
    10% chance that in this case a corruption charge against you will stick and you will
    go to jail and lose your business, a total additional cost of 5,000,000 NIS.

    Additional information is as follows: Yahoo claims he will reduce your taxes by 50%,
    and you think he will actually deliver with probability 0.5 given that he is elected.
    If White is elected there will be mediocre-size bonuses which will gain you 100,000 NIS.
    Future getting elected will improve police forces, which will decrease your security
    costs by 150,000 NIS.

    Assume that event combinations not mentioned above have a probability of 0, e.g.
    the probability that you will be a member of the legislature given that Yahoo
    does not win is 0.

    You have the following options:
       1) Vote for Yahoo
       2) Vote for Future
       3) Vote for White
       4) Not vote at all

     Draw the decision diagram for each case below, and solve it.

    a) What is your rational choice in voting?

    b) As above, but you have the choice to pay 10,000 NIS to "filch-tech polling inc."
       for a poll which tells you with certainty which 1 of the above 3 will surely NOT win
       (leaving the other 2 equally likely). If you do make the payment, you get this
       information before you need to decide about your vote. 
       Do you pay for the poll?
       What do you vote after getting the poll information, assuming you do?

  We will define the diagram for case b, for case a this is the same except for removing the POLL
decision node. There are various versions that are acceptable, compact of expanded. We will provide
an expanded version.

There are 3 decision variables: 
  POLL? (binary decision on whether to buy the poll)
  VOTE (a 4-valued vote decision)
  LAW (binary decision on whether to pass the law).

There are random variables for: 
  POLL_DAT (answer from polling company, values: Y, F, W, Nix
           one for each non-winner, plus "Nix" meaning: you did not pay, so go fish...)
  WINNER (ternary variable, who wins the elections: Y, F, W)
  MP     (binary, will I be an MP - member of parliament)
  J      (binary, will I be convicted of corruption)
  D      (binary, will the fact that I did not vote for Yahoo be discovered)
  TR     (binary, will my taxes be reduced)

  We can model it as if:
     POLL_DAT is a child of POLL?
     WINNER is a child of POLL_DAT and VOTE (note that we are modeling it AS IF the poll determines the winner!)
     MP is a child of WINNER and VOTE
     D is a child of VOTE
     MP is a child of VOTE and D
     TR is a child of VOTE and D
     J is a child of MP and LAW

The distributions are according to the values defined in the problem data.

Finally, we have a utility node U which is a child of everything except POLL_DAT and MP.
(And in case a we can drop the node POLL? and assume that POL_DAT is Nix). The utility
node function simply sums up the monetry value of the rewards.

Tos solve this, we will simply examine all the cases. But we will begin with the last decision, whether to
pass the law making the business tax-exempt.
If you pass the law (when in a position to do so), you gain either half (if Yahoo keeps his promise) or all your taxes
(if not) each with probability 0.5 (assuming independence), so you gain: 0.5 * 500K + 0.5 *1000K = 750K.
But from this you lose 5000K with probability 0.1, losing 500K. So on average you gain 250K in expeced value by
passing the law when in a position to do so. Therefore we can simply fix a utility of 250K instead of these nodes,
whenever MP is true.

We are now ready to find the utilities for the various global outcomes.
If future wins, you gain 150K, unless D is true, in which case you lose 100K from that and gain only 50K.
If White wins, you gain 100K, unless D is true, in which case you lose 100K from that and gain 0.
If Yahoo wins, and D is false, you gain 0 if TR is false and 500K if TR is true (each probability 0.5, so
  expected value is 250K). In addition you gain 250K if MP is true (probability 0.1) and 0 if not, an average of 25K,
  and thus total gain 275K.
If Yahoo wins and D is true, you still gain 500K with probability 0.5, but no other gains and also lose 100K,
  for a total expeced gain of 150K.

Now for determining the actula expected value of the vote actions.
If you VOTE YAHOO, then D will always be false, and also the chance that Yahoo wins increases by 5% of 67%,
i.e. will win with probability 0.3665 and the other parties each with probability 0.3167.
  U(VOTE=Y) = 0.3665 *275K + 0.3167*(150K + 100K) = 180K

If you VOTE FUTURE, then D will be true with probability 0.005 and false with probability 0.995
The winner probabilities swing according to the above, so we get:
  U(VOTE=F) = U(VOTE=F,D)P(D) + U(VOTE=F, not D)P(not D) =
            = 0.995* (0.3167 *275 + 0.3665*150 + 0.3167*100)K + 0.005* (0.3167 *150 + 0.3665*150 + 0.3167*100) K =
            = 172.86 + 0.6707 K = 173.53 K

If you VOTE WHITE, then D will be true with probability 0.005 and false with probability 0.995
The winner probabilities swing according to the above, so we get:
  U(VOTE=W) = U(VOTE=W,D)P(D) + U(VOTE=W, not D)P(not D) =
            = 0.995* (0.3167 *275 + 0.3665*100 + 0.3167*150)K + 0.005* (0.3167 *150 + 0.3665*100 + 0.3167*150) K =
            = 170.39 + 0.658 K = 171.048 K

Therefore the optimal action is to vote Yahoo.

When you have a POLL? option, you need to consider what to do in all possible answers, each has probability 1/3.
Each of the cases will be done similar to the above. Doing the math you will see that if the poll
says that White or Future wil not win, then you still should gain by voting Yahoo, so in these cases the
information does not help you. However, if the poll says that Yahoo will not win, then you should switch
to VOTE FUTURE, which will give you (considering the value of D):
  U(VOTE=F, POLL_DATA=Y) = 0.995 * (0.55 *150 + 0.45 *100) + 0.005*(0.55 *50) K = 127 K
That is because U(VOTE=W, POLL_DATA=Y) = 0.995 * (0.45 *150 + 0.55 *100) + 0.005 * (0.45 *50) K = 122 K
and because U(VOTE=Y, POLL_DATA=Y) = (0.5 *150 + 0.5 *100) = 125 K
where the decision to vote Yahoo would have been correct without this information.

Therefore you gain 127 K - 125 K = 2K in this case, and total expected gain from this information is 2K / 3.
As this is less than the cost of the poll, 10K, then you should not pay for the poll.

6) Consider the following Canadian traveller problem instance.

   The (undirected) graph has 4 vertices, as follows:
     (I, V1, V2, G)
   The edges are as follows:
     (I, V1), weight 10, no ice.
     (I, V2), weight 20, no ice.
     (V1, V2), weight 30, probability of ice 0.5
     (V1, G), weight 10, probability of ice 0.8
     (V2, G), weight 20, probability of ice 0.5
     (I, G), weight 1000, no ice.
   b1) Formalize the above problem as a (belief-state) MDP, i.e. 
       what are the states, the transition probability
       function, and the reward function?
   The belief state should contain the location of the agent, as well as the state of knowledge
   about each the the 3 unknown edges. Thus the state varaible vector is:

     (L, E(V1,G), E(V2,G), E(V1,V2))

   Where the domain of the L variable is {I,V1,V2,V3} and the domain of the edge variables is (Ice, No Ice, Unknown)
   where "Unknown" actually means "ice with probability as in the problem description for this edge".
   We will denote the latter values by (I, N, U) for conceseness.
   The MDP transitions are deterministic, but in the belief-state MDP the transitions are stochastic.
   For example, if the currrent state is (I,U,U,N) and the action is "traverse (I, V1)" then
   we get to (I,I,U,N) with probability 0.8 (i.e. the probability that edge (V1,G) is icy in the problem
   description) and to (I,N,U,N) with probability 0.2  and transition to any othe state is not possible.

   The REWARD function is best described as a function R(s,a) of the current state and the action,
   and is minus the weight of the edge, e.g. R((I,I,U,N), (I,V)) = -10     etc.
   Assuming we need to reach G, then any state (G, x,y,x) is a "terminal" state.
   This is an "additive reward" optimization problem (no discounting).

   b2) Find the optimal policy. You may assume that the robot starts
       at vertex I in order to save some computation.
   The dumb method of using value determination (same as value iteration, except setting the value
   of all terminal states as 0, and all other states to minus infinity in the initialization will work.
   However, we will use value determination together with intelligent bounding and expectimax to
   prune unreachable and uninteresting states and actions, and save a LOT of work.

   First, we note that edge (V1, V2) has weight 30, and we also have (I,V1) and (I,V2) which are known
   not to be icy and the SUM of their weights is 30. This means that for EVERY policy that ever traverses
   (V1,V2) there is a policy with equal or lower cost which instead takes the detour through I. This means
   we can effectively DROP the edge (V1,V2) from our problem description, as well as the respective
   (last) state variable!

   Now, when we do value determination, we will proceed to update state values from the goal backwards, as follows:
     Utility(V1, N, x) = -10      (because we can get to the goal for a cost of 10, independent of the state of (V2,G)
                                   AND there is no cheaper path in any such case).
     Utility(V2, x, N) = -20      (similar to above argument).
     Utility(I, I, I) = -1000     (both paths blocked, must take expensive edge (I,G))

     Utility(I, N, x)  = -10 + Utility(V1, N, x) = -20
     Utility(I, I, N)  = -20 + Utility(V1, I, N) = -40
     Utility(V1, I, N) = -10 + Utility(I, I, N) = -50
     Utility(V2, I, I) = -20 + Utility(I, I, I) = -1020
     Utility(V1, I, I) = -10 + Utility(I, I, I) = -1010

Now to handle the cases where the uncertainty makes a difference:
     Utility(I, I, U) = -20 + (0.5*Utility(V2,I,N) + 0.5*Utility(V2,I,I)) = -20 + (0.5*(-20) + 0.5*(-1020)) = -540
     Utility(I, U, I) = -10 + (0.8*Utility(V1,N,I) + 0.2*Utility(V1,I,I)) = -10 + (0.2*(-10) + 0.8*(-1010)) = -820

Now we can compute:
     Utility(V1, I, U) = -10 + Utility(I, I, U) = -550
     Utility(V2, U, I) = -20 + Utility(I, U, I) = -840

     Finally, we check the optimal action from the initial state (I, U, U):
       Trying (I,V1) we get expected reward: 
         -10 + (0.2*Utility(V1,N,U) + 0.8*Utility(V1,I,U)) = -10 + (0.2*(-10) + 0.8*(-550)) = -452
       Trying (I,V2) we get expected reward:
         -20 + (0.5*Utility(V2,U,N) + 0.5*Utility(V2,U,I)) = -20 + (0.5*(-20) + 0.5*(-840)) = -450
     So in the initial state it is better to try (I,V2) first, and we have:
       Utilty(I,U,U) = -450   which is the expected value of the optimal policy.
     The optimal policy can be retrieved from the above computation steps, namely:
     1.  Traverse (I,V2).
     2.  If (V2,G) is not icy, then traverse (V2,G) and stop.
     3.  Traverse (I, V2)    (thereby returning to I)
     4.  Traverse (I, V1)
     5.  If (V1,G) is not icy, then traverse (V1,G) and stop.
     6.  Traverse (I, V1)    (thereby returnning tp I)
     7.  Traverse (I, G)  and stop.

   b3)In addition, from node I only, and before it starts moving,
      the agent can perform a SENSING action
      for a cost of 1, which results in the agent knowing the state of
      the edge (V2, G). Re-compute the optimal policy for this case.
    By sensing, we can reach belief state (I,U,N) or (I,U,I) each with probability 0.5.

    We know that:
       Utility(I,U,I) = -820
       Utility(I,U,N) = -20 - Utility(V2,U,N) = -40
    So in this case after sensing we get expected utility:
       0.5*Utility(I,U,I) + 0.5*Utility(I,U,N) = -430

    This is 20 better than before sensing, (thus 20 is the expected value of information for knowning
    whether (V2,G) is icy when in the initial state). Since the value of information is better than the cost
    of sensing, then we should begin by sensing, and then, if (V2,G) not icy then traverse (I,V2) followd by (V2,G).
    Otherwise, traverse (I,V1) and if (V1,G) is not icy then traverse it and stop. Finally, all else failing,
    go back to I and traverse (I,G).
    The expected reward of this entire policy is -431.

7)  A principal researcher in the Intelligent Smart Grid (ISG) consortium
    would like to construct a decision procedure based on results from hiring
    employees in the Canadian Traveler Problem research project.
    The decision procedure will be based on a decision tree, using information
    in the case database. Attributes for each employee are:
    degree level (A),  C programming ability (B) Self motivation (C), number of
    papers published in AI workshops (D).
    The target attribute is the "decision", whether to hire or not.

  a) Construct by hand (using the greedy search using the best information
     gain heuristic) a consistent decision tree for the following table:

     input variables               decision
   A      B      C      D
   2      F      L      0             N
   2      F      M      1             Y
   2      T      L      0             N
   3      F      H      0             Y
   3      T      M      2             N
   3      T      M      0             N
   3      T      H      1             Y

  b) Is a more compact (fewer internal nodes) decision tree possible for the above table?
     If no - prove it. If yes, which tree?

   Consider each of the 4 attributes and the outcomes on each of their branches.
We will see if we can save time and not have to compute the information gain precisely.
Starting with A, we get distributions:
   A=2:  1Y, 2N
   A=3:  2Y, 2N

Starting with B, we get distributions:
   B=F:  2Y, 1N
   B=T:  1Y, 3N
Note that the distribution in A=2 has the same counts (ignoring labels) as for B=F, so
same information, but in A=3 the distribution is even for 4 cases, and in B=T 4 cases with uneven
distribution. So B is more informative than A (we will thus ignore A from now on).

Starting with C, we get distributions:
   C=L:  2N
   C=M:  1Y, 2N
   C=H:  2Y
The case C=M is has the same counts as for B=F, but the other cases are a perfect classification, so
C is better than B.

Starting with D, we get distributions:
   D=0:  2Y, 2N
   D=1:  2Y
   D=2:  1N
Here D=0 is even distribution, and more examples than for X=M, so D is NOT better than C.
Therefore we select C, and need a recursive call on the branch C=M, with 3 examples remaining.
Checking A, we see that we have:
  A=2:  1Y
  A=3:  2N
Which is perfectly classified, so no need to look further: we might get something equally good, but nothing
will be better. So the overall returned tree is:

C?  L: N
    H: Y
    M: A?  2: Y
           3: N

This tree has 2 internal nodes, and is optimal in this respect because we already saw
by exhaustive search that there is no tre with 1 node that is consistent.

Deadline: Wednesday, January 23, 2013, at 12 noon.

Note: submission (as well as WORK) on this exercise is SOLO (that is, NOT in pairs or other groups of size > 1)

* Disclaimer: any seeming identification with existing countries, institutions, political parties, or people, is unintentional.