1) We have the following distribution over the binary variables A, B, C, D, given as a full table: A B C D probability ------------------------------- F F F F 0.0 F F F T 0.05 F F T F 0.025 F F T T 0.05 F T F F 0.0 F T F T 0.05 F T T F 0.025 F T T T 0.05 T F F F 0.0 T F F T 0.15 T F T F 0.075 T F T T 0.15 T T F F 0.0 T T F T 0.15 T T T F 0.075 T T T T 0.15 a) Construct the Bayes network that most compactly describes the distribution (i.e. one that has the smallest number of arcs). ANS: By glancing at the probability table, it is obvious that numbers in lines 1-4 (B=F) are repeated in lines 5-8 (B=T), and likewise lines 9-12 are repeted in lines 13-16. So clearly these numbers do not depend on B, and all variables are thus independent of B (therefore B is completely disconnected from the rest of the variables). We only need to consider nodes A, C, D now. Suppose we start with A, then C, Then D. Checking whether C depends on A, compare P(C) and P(C|A). We have P(C=True) = 2 * (0.025+0.05+0.075+0.15) = 0.6 P(A=True) = 2 * (0.15+0.075+0.15) = 0.75 P(A=True, C=True) = 2 * (0.15+0.075) = 0.45 So we have P(A=True, C=True) = P(C=True)*P(A=True) We need to check the other possible instantiations: P(A=True, C=False) = 2 * (0.15) = 0.3 = P(A=True)*P(C=False) = 0.75 * 0.4 P(A=False, C=False) = 2*0.05 = 0.1 P(A=false)*P(C=False) = 0.25 * 0.4 P(A=False, C=True) = 2*(0.05+0.025) = 0.15 = P(A=false)*P(C=True) = 0.25 * 0.6 So C does not depend on A either, and thus also has no parents. Finally, check D. We have: P(D=False) = 2 * (0.025+0.075) = 0.2 and therefore P(D=True) = 0.8. We check all the cases of D and A and find that P(A, D) = P(A) * P(D), and thus D does not depend on A either. It remains to be seen whether D depends on C, and in fact it does: P(C=False, D=False) = 0 as can be seen from the respective 0 entries. However, P(C=False) = 0.4 and P(D=False) = 0.2 and multiplying them does NOT give 0. So D depends (only) on C. The resulting Bayes network has only one arc - from C to D. b) Is the answer above unique? If not, what is the set of possible BNs that describe the distribution compactly? ANS: Not unique. The Bayes nework containing only one arc from D to C is also possible. But there are no other options beyond these two. 2) Consider the following variables and their immediate causes (parents): variable parents ------------------------------- A none B none C none D A C E A B F B G D F H C E a) Is this network a poly-tree? ANS: No, because e.g. A-E-H-C-A is a cycle in the underlying undirected graph. b) Is this network (directed-path) singly connected? ANS: Yes, there are no cases with more than 1 directed path between any pair of nodes. c) Determine the truth of the following independence statements (using d-separation): 1) I({D}, {E} | {}) ANS: No, because in path D-A-E node A is a non-evidence diverging node, and thus this path is not blocked by the (null) evidence. 2) I({D}, {E} | {A}) ANS: Yes, because in path D-A-E node A is an evidence diverging node, and thus this path is blocked by the evidence. All other paths contain a converging non-evidence node with no children (a "barren" node), that blocks the path. 3) I({D}, {E} | {A, G}) ANS: No. Now path D-G-F-B-E is unblocked, because G is a converging evidence node, F is a passthrough non-evidence node, and B is a diverging non-evidence node. 4) I({B}, {G} | {D, F}) ANS: Yes, as every simple path from B to G must traverse eithe D or F as passthrough nodes, and both are evidence nodes, blocking the respective paths. 5) I({B}, {C} | {}) ANS: Yes, successively removing barren nodes leaves just B and C disconnected. Alternately (longer), we can show directly that every path from B to C has a non-evidence converging node (and cannot have evidence descendants, as the evidence is null). 6) I({B}, {C} | {H}) ANS: No. Path C-H-E-B is unblocked, since H is a converging evidence node and E is a passthrough non-evidence node. d) Assume that the nodes are binary-valued. Suppose that P(A=true) = 0.4, P(B=true)= 0.2, P(E=true|A=B=True) = 0.5, P(E=true|A=false,B=true) = 0.1, P(E=true|A=true,B=false) = 0.5, P(E=true|A=B=false) = 0.5 P(H=true|E=true) = 0.1, P(H=true|E=false) = 0.4, Find P(A=true | B=true, E=true, H=true) Hint: you may want to preprocess to get rid of nodes before doing any computation. ANS: In pre-processing, remove barren node G, then D and F. Now, we can see that I({A}, {H}| {B, E}) in the remaining graph (E blocks the only path A-E-H as it is a passthrough evidence node), so it is sufficient to compute the equivalent query: P(A=true | B=true, E=true) But for the latter query, H is barren, and so is C. So we remain only with a BN with the 3 nodes A, B, E. We now use the definition of conditionals: P(A=true | B=true, E=true) = P(A=true, B=true, E=true)/P(B=true, E=true) = P(A=true, B=true, E=true)/(P(A=true, B=true, E=true)+ P(A=false, B=true, E=true)) We now use factoring according to the definition of BN distributions, i.e.: P(A, B, E) = P(A)*P(B)*P(E|A, B) to get: P(A=true | B=true, E=true) = 0.4*0.2*0.5 / (0.4*0.2*0.5 + 0.6*0.2*0.1) = 0.2/(0.2+0.06) = 10/13 3) You need to construct a 6-input neural network that has value 1 in the output just when the number of its inputs that are "on" is either 1 or 4 (assume inputs are in {0, 1}). a) Can this be done without hidden units? ANS: No. In order to separate the case of no inputs active from one input active, a hyperplane is needed that cannot possibly also separate the case of one input active from 2 inputs active. Another way to see it is projecting the space onto any 2 inputs x1, x2, when all other inputs are 0. In this case, the output needs to be a XOR(x1, x2), which is known not to be implementable without hidden units. b) Show a network using threshold elements that performs the required computation (specify the weights, too!) ANS: Many ways to do this. The simplest would have 3 hidden units, one for encoding "more than 1", another for encoding "4 or more" and one for encoding "more than 4". We will call these unit outputs a2, a4, and a5, respectively. An additional unit is the output unit o. All units are connected to all inputs, with a weight of 1. In addition, the outputs a2, a4, a5 are also connected as inputs of o. The theresholds and still unspecified weights are as follows. The threshold of unit o is 0.5, so that any input unit being "on" can activate it. The weight on the connection from a2 to o is -10, so that it will override the normal inputs and deactivate o if more than one input is "on". The weight from a4 to o is 10, so that it will override the override generated by a2 if a4 is active. The weight from a5 to o is again -10, so as to override the override provided by a4, if more than 4 inputs are "on". Finally, the threshold of the aj units are: Threshold of a2: 1.5 Threshold of a4: 3.5 Threshold of a5: 4.5 Additional challenge: can this function be implemented with a) One hidden unit? b) Two hidden units? 4) a) For the blocks world as defined in class, show how POP constructs the plan (that is, show all partial plans from the initial plan until a correct plan is reached) for initial state: A B C D ------------ and the goal state cointaining: On(A,B) and On(B,C) ANS: Initial plan contains dummy step: INIT PRECONDITIONS: null POSTCONDITIONS: On(A,B), On(B,D), On(C,Table), On(D,Table), Clear(A), Clear(C) and a dummy step: END PRECONDITIONS: On(A,B) On(B,C) POSTCONDITIONS: null and constraint INIT < END. Say POP picks On(B,C) of END as the unsatisfied precondition to handle. The only way to make it work is to add a step S1: PutOn(B,C) PRECONDITIONS: Clear(B), Clear(C), On(B,x) POSTCONDITIONS: On(B,C), not Clear(C), Clear(x) and constraints INIT < PutOn(B,C) < END This causes no potential violations. Now, POP might pick unsatisfied precondition Clear(C) of S1, which can be satisfied by INIT. Next, can pick unsatisfied precondition On(B, x) of S1, which can be satisfied by INIT with x=D. Next, can pick unsatisfied precondition Clear(B) of S1, which only be satisfied by addin a step Puton(A, y) for some y. We will use y=Table (or alternately a specialized step PutOnTable(A)), as otherwise we might run into trouble, e.g. PutOn(A,C) will cause a violation of Clear(C) which cannot be fixed. So we add step S2: PutOn(A,Table) PRECONDITIONS: Clear(A), On(A, x) POSTCONDITIONS: On(A,Table), Clear(x) With x=B, the postcondition Clear(x) satisfies this precondition of S1, so we add a constraint S2 < S1. Also INIT < S2. No we look at the Clear(A) precondition of S2, which can be satisfied by INIT. Likewise, precondition On(A,B) is satisfied by INIT. Finally, there is the precondition On(A,B) of END. Although it is satisfied by INIT, this would not work because then PutOn(A,Table) would violate it in a way that cannot be fixed. So instead we add step S3: PutOn(A,B) PRECONDITIONS: Clear(A), Clear(B), On(A,x) POSTCONDITIONS: On(A,B), not Clear(B), Clear(x) Note that S2 violates the precondition On(A,B) so S3 must be after S2, so we need to promote S2 and add the constraint S2 < S3. Also, note that S3 is a potential threat to precondition Clear(B) of S1, so we must promote S3, and require S1 < S3 as well. Overall, we have the constraints INIT < S2 < S1 < S3 < END, which are consistent. Finally, Clear(A) of S3 is satisfired by INIT, and On(A,x) is satisfied by S2 with x=Table. Thus, we have a complete plan. b) Suppose that is not known whether On(B,D) or On(D,B) in the initial state, and that there is an operator "Check(On(x,y))" which results in knowing whether this is true or not. Show how a conditional plan is constructed for the above case. ANS: Same as above, except that after adding S1, we run into trouble when trying to satisfy precondition Clear(B) of S1, as we do not know if we have On(B,D) or On(D,B) (we will assume that we do know that A is wither on B or on D, depending on the above information, otherwise we will need to check that as well. So we add an SC (conditional) step: Check(On(B,D)) PRECONDITION: null POSTCONDITION: Know(On(B,D)) In the "true" branch we continue as before. Now we have the "false" branch which has "On(D,B)". In this case we must add a step S4: PutOn(D,Table) PRECONDITIONS: Clear(D), On(D, x) POSTCONDITIONS: On(D,Table), Clear(x) And now we must duplicate the step PutOn(A,Table) for this contingency, (call this S5) so as to get Clear(D). So in fact the "Check" must be the first action, and we get the plan. INIT -- SC -- (true) --- S2 -- S1 -- S3 -- END (false) -- S5 -- S4 -- S1 -- S3 -- END (also note that the planner may not be smart enough to re-use S1 and S3, in which case it must duplicate those steps as well). 5) Consider the following scenario, occuring in a certain middle-eastern country.* We will assume that you are a risk-neutral, i.e. that your utility function is linear in value unit discused below (morts). Assume also that moral considerations are not part of your utility function. A well-known band of criminals has stolen your FOO, is threatening to destroy it immediately (a damage of 1 mort), and will only return it to you if you give them your mojo (C)** instead. Surrendering your mojo is not of immediate material damage, but may cause you a loss in the future. With probability 0.5 this will cause a loss of 1 mort next fortnight, and additionally with probability 0.4, independent of the above, it will cause a damage of 10 morts the one fortnight after the next. Your options are as follows: a) Do nothing, and then the criminals destroy your FOO. b) Surrender your mojo, and possibly encur the respective future losses. c) Request the police to try a raid to recover your FOO. Since you do not know where it is, the results are as follows: With probability 0.1 they will succeed, and recover your FOO. With probability 0.1, this will cause a loss of 1 mort, regardless of whether they succeed in recovering your FOO or not. 5.1) What is your optimal course of action? Consider two reward accumulation functions: a) Simple sum of rewards. ANS: Do nothing: reward -1 mort. Surrender mojo: expected reward 0.5*(-1)+0.4*(-10) = -4.5 Police action: 0.9*(-1)+0.1*0+0.1*(-1) = -1 So either doing nothing or police action are optimal here (though the latter is more risky). b) Discounted sum of rewards with a discounting factor of 0.2 (per fortnight). ANS: No change of utility for actions a and c since the rewards are immediate. But for action b the rewards are in the future, so we modify the computation to: Surrender mojo, expected discounted reward: 0.5*(-1)*0.2 + 0.4*(-10)*0.2*0.2 = -0.26 So for the discouted reward case the "surrender mojo" action is optimal. 5.2) Would the answer be different, if in surrendering your mojo, there is in addition a probability of 0.5 that the band of criminals will steal your FOO again after 2 fortnights (and then leave you with the same options?) ANS: No. Because in the undiscounted rewards case, the "surrender mojo" action is already bad, and this possibility only makes it worse. In the discounted rewards case, we can bound the damages by 1 mort 2 fortnights ahead, and multiplied by 0.2 squared this is 0.04, so deducting this from -0.26 still leaves the "surrender mojo" action as optimal in this case. 5.3) In 5.1, suppose that an informer can provide information that will increase the probability of success of a police raid to 0.5. How much should you be willing to pay for such information (in both cases a and b)? ANS:Assuming you choose "police action" (optimal anyway), receiving the information will not make you change the decision, but will improve the expected utility of this action. The new expected utility for this action will be: 0.5*(-1)+0.5*0+0.1*(-1) = -0.6 So you should be willing to pay up to the difference, 0.4 mort, for the information. In the discounted rewards case, the improvement in expected discounted reward will not make police action optimal anyway, so this action will not be chosen. Therefore, the information here is worth 0. 6) Consider the following belief-state MDP: the Canadian traveller problem. The (undirected) graph has 4 vertices, as follows: (I, V1, V2, G) All vertices are directly connected by edges. The edges are as follows: (I, V1), weight 1, no ice. ; e1 (I, V2), weight 2, no ice. ; e2 (I, G), weight 10, no ice. ; e3 (V1, V2), weight 3, probability of ice 0.5 ; e4 (V1, G), weight 1, probability of ice 0.8 ; e5 (V2, G), weight 2, probability of ice 0.5 ; e6 b1) Formalize the above problem as a (belief-state) MDP, i.e. what are the states, the transition probability function, and the reward function? ANS: In general, the states of the belief-space MDP can be factored into the (known) agent vertex, and what is known about each edge. Since edge state in the underlying process cannot change, edges known initially to be traversible need not be represented. Thus, the belief space is: B = V x {0, 1, U} ^ |E'| Where V is the set of vertices, and E' the set of unknown edges. For each edge e, "0" denotes blocked, "1" denotes unblocked, and "U" denotes unknown (probabilty p(e) of being blocked). Actions are of the form move(v, u), i.e. traverse the edge from v to u. The transition probabilities for actions not acievable in a belief state (trying to traverse an edge not adjacent to the current location, or a blocked edge) are 1 for staying in the same state, 0 for any other state. For achievable actions, a= move(v, u) we can define the distribution over post-action states in a decomposable manner: P(b' | a, b) = P(v' | a,b) * P(e' | a,b) * P(e'' | b) * P(e''' | a,b) Where P(e'''| v, a) is a distribution over the edges unknown before the action that are not adjacent to u, which remain unknown with probability 1 after the transition, P(v' | v, a) = 1 just when v' = u, and 0 otherwise, and P(e'' | v, e) is a distribution over edges that are already known, and their state remains the same with probability 1 (note that this does not depend on a). Finally, the distribution P(e' | v,a) is over previously unknown edges e' that are adjacent to u, and is a product over the respective edge states: P(e' | v,a) = (prod (e over blocked edges) p(e)) * (prod (e over unblocked edges) 1-p(e)) We will generate the numbers for the example on-the-fly below as needed. Rewards for an achievable action are R(move(v,u)) = -w(v,u) b2) Find the optimal policy. You may assume that the robot starts at vertex I in order to save some computation. In the current example, there are 3 unknown edges, so the belief states are of the form: [v; e4, e5, e6] Where v indicates a vertex, and the ei are the state of the respective edge, out of {0,1,U}. The initial state might be: [I; U, U, U] This can be done by value iteration, where all belief state utilities are initialized to (-infinity), except for (X stands for "any possible configuration of the edge"): U([G;X,X,X]) = 0 ; Being at the goal, no actions needed, utility 0. Since we need to do it manually, we will do this efficiently by beginning with cases that need only be updated once (we hope). First, the ones which have only certain transitions: U([V1;X,1,X]) = U([G;X,X,X]) - w(V1,G) = 0 - 1 = -1 ; at V1, edge e5 to goal clear U([V2;X,X,1]) = U([G;X,X,X]) - w(V2,G) = 0 - 2 = -2 ; at V2, edge e6 to goal clear U([I;X,X,1]) = U([V2;X,X,1]) - w(I,V2) = -2 -2 = -4 ; At I, knowing e6 clear U([I;X,1,X]) = U([V1;X,1,X]) - w(I,V1) = -1 -1 = -2 ; At I, knowing e5 clear ; note that these are ASSIGMENTS, we have e.g. U([I;X,1,1]) improved to -2 here. U([V1;X,0,1]) = U([I;X,X,1]) - w(I,V1) = -4 - 1 = -5 ; At V1, knowing e5 blocked, e6 clear U([V2;X,1,0]) = U([I;X,1,X]) - w(I,V2) = -2 - 2 = -4 ; at V2, knowing e6 blocked, e5 clear ; best actions in these last 2 cases are "return to I". U([I;X,0,0]) = U([G;X,X,X]) - w(I,G) = 0 - 10 = -10 U([V1;X,0,0]) = U([I;X,0,0]) - w(I,V1) = -10 - 1 = -11 U([V2;X,0,0]) = U([I;X,0,0]) - w(I,V2) = -10 -2 = -12 Now we need to handle a few uncertain transitions: U([I;X,0,U]) = max { (U([G;X,X,X]) - w(I,G)) ; = -10 (U([V1;X,0,U) - w(I,V1)) ; = ? (will not help!) 0.5*U([V2;X,0,0]) + 0.5*U([V2;X,0,1])- w(I,V2)) ; = 0.5(-12-2)-2 = -9 } = -9 ; with optimal action: move(I,V2) U([I;X,U,0]) = max { (U([G;X,X,X]) - w(I,G)) ; = -10 (U([V2;X,U,0) - w(I,V2)) ; = ? (will not help!) 0.8*U([V1;X,0,0]) + 0.2*U([V1;X,0,1])- w(I,V1)) ; = 0.8*(-11)+0.2*(-1)-1 } = -10 ; with optimal action either move(I,V1) or move(I,G), since they cause the same value, so we will keep move(I,G) as the optimal (simpler). U([V1;X,0,U]) = U([I;X,0,U]) - w(I,V1) = -9 - 1 = -10 U([V2;X,U,0]) = U([I;X,U,0]) - w(I,V2) = -10 - 2 = -12 And finally we get to update the "initial" state: U([I;U,U,U]) = max { (U([G;X,X,X]) - w(I,G)) ; = -10 0.5*U([V2;X,U,0])+0.5*U([V2;X,U,1])-w(I,V2) ; = 0.5*(-12-2)-2 = -9 0.8*U([V1;X,0,U])+0.2*U([V1;X,1,U])-w(I,V1) ; = 0.8*(-10)+0.2*(-1)-1=-9.2 } = -9 ; With optimal action move(I,V2) Final optimal policy is intocated in the moves above. Stated as a tree starting in the "initial" position it looks like: move(I,V2) --- (e6 unblocked) --- move(V2,G) --- (e6 blocked) --- move(V2,I) --- move(I,G) Expected cost is 9.2, since U([I;U,U,U])= -9.2 Note that we "cheated" since it never paid to use e4 even if it is unblocked, so we "bundled" all respective cases of the state of e4 into "don't care"s. We will get the same result if we don't do this, but the solution will be at least twice as long! 7) A project leader in the IMG4 consortium would like to construct a decision procedure based on results from hiring employees in the KITE consortium. The decision procedure will be based on a decision tree, using information in the case database. Attributes for each employee are: degree level (A), Java programming ability (B), self-motivation (C), grade in AI (D). The target attribute is the "decision", describing the amount of money to pay the employee. a) Construct by hand (using the greedy search using the best information gain heuristic) a consistent decision tree for the following table: input variables decision A B C D -------------------------------------------- 1 F M 95 8 1 T L 70 4 1 F L 70 4 2 T M 95 6 2 T M 80 6 2 T H 95 8 2 T H 80 8 2 F H 95 8 ANS: We need to try all possible attributes (input variables). We will try a qualitative comparison, and if not clear we will actually need to compute entropy. Trying A, we get for A=1: {8, 4, 4} A=2: {8, 8, 8, 6, 6} Trying B, we get for B=F: {8, 6, 6} B=T: {8, 8, 6, 6, 4} The case B=F has the same distribution type as A=1, and just as much information. The case B=T is strictly less informative than that of A=2. So attribute B is worse than A. Trying C, we get for C=L: {4, 4} ; fully classified C=M: {8, 6, 6} C=H: {8, 8, 8} ; fully classified Since the case C=M is like the case A=1, but C=M and C=H are clearly better than the case A=2, then C is better than A (and better than B). The case C=M requires a bit less than 3 bits to fully classify the 3 examples (distribution not fifty-fifty), to get the exact number do: -(log(1/3)+2*log(2/3)) Base of logs is 2 throughout. Trying D, we get for D=70: {4, 4} D=80: {8, 6} D=95: {8, 8, 8, 6} So it looks like C is also better than D, but it is not very obvious. The case D=80 requires 2 bits to classify the 2 examples as distribution is (0.5, 0.5). The case D=95 requires more than 2 more bits total for the 4 examples, to get the exact number do: -(log(1/4)+3*log(3/4)) So we can see that indeed D is worse than C. So we pick C for the root of the tree. We now need to handle just the branch C=M. Here we can pick either A or B (but not D), for a complete classification. For example, pick A. We have: A=1: {8} A=2: {6, 6} So we are done (zero additional bits needed for final classification. So the resulting decision tree is: C? --- C=L : 4 --- C=H : 8 C=M : A? --- A=1 : 8 --- A=2 : 6 b) Is a more compact (fewer internal nodes) decision tree possible for the above table? If no - prove it. If yes, which tree? ANS: Not possible. Proof: The resulting tree has 2 internal nodes. A more compact tree would have only 1 internal node. We have enumerated all possible such trees, and did not get a tree consistent with the training data, so a 2-node tree is the most compact consistent tree.
Deadline: Wednesday, January 13, 2010, at 12 noon.
Note: submission (as well as WORK) on this exercise is SOLO (that is, NOT in pairs or other groups of size > 1)
* Disclaimer: any seeming identification with existing countries, institutions, political parties, or people, is unintentional.
** Mojo: copyright some film studio we do not recall.