Justify all answers!

1) We have the following distribution over the binary variables A, B, C, given as a full table:

   A B C   probability
   -------------------
   F F F   0.125
   F F T   0.125
   F T F   0.125
   F T T   0.125
   T F F   0.02
   T F T   0.08
   T T F   0.08
   T T T   0.32

a) Construct the Bayes network that most compactly describes the distribution (i.e. one that has the smallest number of arcs).

The algorithm presented in class assumed an ordering. We will be a bit more general and first find all the dependencies. From the table we get (denoting A=true by a, A=false by ~a, etc.):

   P(a) = P(~a) = 0.5
   P(b) = P(c) = 0.65       P(~b) = P(~c) = 0.35
   P(a,b) = 0.4    P(~a,b) = 0.25    P(a,~b) = 0.1    P(~a,~b) = 0.25
   P(a,c) = 0.4    P(~a,c) = 0.25    P(a,~c) = 0.1    P(~a,~c) = 0.25
   P(b,c) = 0.445  P(~b,c) = 0.205   P(b,~c) = 0.205  P(~b,~c) = 0.145

Using the above, we can see that:

   P(b,c|a) = P(a,b,c) / P(a) = 0.32 / 0.5 = 0.64
   P(b|a)   = P(a,b) / P(a)   = 0.4 / 0.5  = 0.8
   P(c|a)   = P(a,c) / P(a)   = 0.4 / 0.5  = 0.8

So we have P(b,c|a) = P(b|a) * P(c|a). Likewise, we get P(B,C|A) = P(B|A) * P(C|A) for all possible assignments of A, B, C. Therefore B and C are independent given A, and in the BN we should have d-sep({B},{C} | {A}).

What about other d-sep statements? No other such statements are true. For example, consider d-sep({A},{B} | {}): this does not hold, since P(a,b) = 0.4, which is NOT equal to P(a)*P(b) = 0.5*0.65 = 0.325. The other possible independence (d-separation) statements:

   d-sep({A},{C} | {})
   d-sep({B},{C} | {})
   d-sep({A},{C} | {B})
   d-sep({A},{B} | {C})

also do NOT hold.

Now, consider the construction ordering A, B, C. Obviously, in this ordering, A has no parents. Now consider B: since we do not have d-sep({A},{B} | {}), we need to make A a parent of B. Now consider C: since we do not have d-sep({A},{C} | {}), we need to make A a parent of C. However, since we DO have d-sep({B},{C} | {A}), we do NOT need to make B a parent of C.
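The marginals and the conditional-independence check above can be verified mechanically. A short sketch in Python (the dictionary is just the joint table, with 1 standing for "true"):

```python
# Joint distribution from the table, keyed by (A, B, C) truth values.
P = {
    (0,0,0): 0.125, (0,0,1): 0.125, (0,1,0): 0.125, (0,1,1): 0.125,
    (1,0,0): 0.02,  (1,0,1): 0.08,  (1,1,0): 0.08,  (1,1,1): 0.32,
}

def marg(**fixed):
    """Marginal probability of an assignment to a subset of A, B, C."""
    idx = {"A": 0, "B": 1, "C": 2}
    return sum(p for k, p in P.items()
               if all(k[idx[v]] == val for v, val in fixed.items()))

# Check P(B,C|A) = P(B|A) * P(C|A) for every assignment:
for a in (0, 1):
    pa = marg(A=a)
    for b in (0, 1):
        for c in (0, 1):
            lhs = marg(A=a, B=b, C=c) / pa
            rhs = (marg(A=a, B=b) / pa) * (marg(A=a, C=c) / pa)
            assert abs(lhs - rhs) < 1e-12

# ...but A and B are NOT marginally independent: P(a,b)=0.4 vs P(a)*P(b)=0.325
assert abs(marg(A=1, B=1) - marg(A=1) * marg(B=1)) > 1e-3
```

Running it confirms the factorization for all eight assignments, and that the marginal independence statements fail.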
The resulting network is a tree with A as the root node and B and C as its children: a total of 2 arcs.

b) Is the answer unique?

The answer is not unique. If we had the ordering B, A, C we would have B as a parent of A, and A as a parent of C (no need to make B a parent of C, because we have d-sep({B},{C} | {A})). We get a chain with 2 arcs. Likewise, an ordering C, A, B would result in a chain in the reverse direction. Observe that using an ordering B, C, A would result in a graph with 3 arcs; this graph would not satisfy d-sep({B},{C} | {A}) (and also would not be minimal).

2) Consider the following variables and their immediate causes (parents):

   variable   parents
   -------------------
   A          none
   B          A
   C          B, D, E
   D          A
   E          D
   F          E

a) Is this network a poly-tree?

No: the underlying undirected graph contains a cycle, e.g. A-B-C-D-A (A is a parent of both B and D, and B and D are both parents of C), so the network is not a poly-tree.

b) Determine the truth of the following independence statements (using d-separation):

1) I({A}, {E} | {})
   No. Path A-D-E is not blocked (D is a non-evidence passthrough node).
2) I({A}, {E} | {C})
   No, for the same reason as above.
3) I({A}, {E} | {D})
   Yes. Path A-D-E is blocked (D is an evidence passthrough node).
   Paths A-B-C-E and A-B-C-D-E are blocked by C (a converging non-evidence node).
4) I({D}, {B} | {})
   No. Path B-A-D is not blocked (A is a non-evidence diverging node).
5) I({B}, {D} | {A})
   Yes. Path B-A-D is blocked by A (an evidence diverging node).
   Paths B-C-D and B-C-E-D are blocked by C (a converging non-evidence node).
6) I({B}, {D} | {A, C})
   No. Path B-C-D is not blocked (C is a converging evidence node).
7) I({A}, {F} | {E})
   Yes. Path A-B-C-E-F is blocked by E (a diverging evidence node).
   Paths A-B-C-D-E-F and A-D-E-F are blocked by E (a passthrough evidence node).

c) Assume that the nodes are binary-valued. Suppose that:

   P(A=true) = 0.8
   P(D=true|A=true) = 0.5,  P(D=true|A=false) = 0.1
   P(E=true|D=true) = 0.5,  P(E=true|D=false) = 0.2

Find P(A=true | E=true, F=true). Hint: you may want to preprocess to get rid of nodes before doing any computation.
Answer: Node C has no children, is not evidence and not query, and is thus barren and can be removed. After that, node B is also barren, and removed. Now, we have d-sep({A}, {F} | {E}), so we know:

   P(A | E, F) = P(A | E)

So this is the same as if the query did not have F as evidence. If so, then node F is also barren, and can be removed. We remain with the network A-D-E, and need to compute (using "a" to denote "A=true", etc.):

   P(a|e) = P(a,e) / P(e) = P(a,e) / (P(a,e) + P(~a,e))

But:

   P(a,e) = P(a,d,e) + P(a,~d,e)
          = P(a)*P(d|a)*P(e|d) + P(a)*P(~d|a)*P(e|~d)
          = 0.8*0.5*0.5 + 0.8*(1-0.5)*0.2 = 0.4*(0.5+0.2) = 0.28

   P(~a,e) = P(~a,d,e) + P(~a,~d,e)
           = P(~a)*P(d|~a)*P(e|d) + P(~a)*P(~d|~a)*P(e|~d)
           = (1-0.8)*0.1*0.5 + (1-0.8)*(1-0.1)*0.2 = 0.2*(0.05+0.18) = 0.046

Finally:

   P(a|e) = P(a,e) / (P(a,e) + P(~a,e)) = 0.28 / (0.28 + 0.046) ≈ 0.859

3) You need to construct a 4-input neural network that has value 1 in the output just when an even number of its inputs are 1, and 0 otherwise (assume inputs are in {0, 1}).

a) Can this be done without hidden units?

No. Consider the case where the first 2 inputs are 0 and 1, respectively, and consider the output as a function of the remaining inputs (this is a projection of the 4-dimensional problem onto a 2-dimensional space). The number of 1's in the input is even just when exactly 1 of the remaining inputs is 1. Thus, the desired output is a XOR of the remaining inputs. We already KNOW that XOR is not linearly separable, so this cannot be done without hidden units.

b) Show a network using threshold elements that performs the required computation (specify the weights, too!)

Answer: there are many possibilities, using hidden units to count the number of 1's (actually, to specify whether it is over a threshold).
We will use one unit (A) to indicate 0 inputs active, a second (B) to indicate at least 2 inputs active, a third (C) to indicate at most 2 inputs active, and a fourth (D) to indicate 4 inputs active. The output O should be 1 if either A or D is active, or if both B and C are active (in the latter case this means exactly 2 inputs active). For simplicity, we will let all thresholds be 1, except for the thresholds of A and C, which are -1. The rest of the weights are as follows:

   For unit A, all weights are -1.5 (as a result, A will be active only if none of the inputs are active).
   For unit B, all weights are 0.6 (as a result, any 2 active inputs will activate B).
   For unit C, all weights are -0.4 (as a result, any 3 active inputs will DEactivate C).
   Finally, for unit D, all weights are 0.3 (thus, D will activate only if all 4 inputs are active).

Now, in order to achieve the correct output, make the weights from the respective units to the output unit as follows:

   From A and D: 2
   From B and C: 0.6

As a result, O will be active if either A or D is active, or if both B and C are active. Obviously, there are other correct answers as well.

4) a) Using the perceptron learning rule (the "delta" rule), show the steps of learning the 3-input AND function with a single perceptron and a step activation function with threshold 1. Initially all weights are 0.9, and the learning rate is 0.5.

Assume the following presentation order (and results):

   011: error signal is -1 (need 0, got 1), so update weights 2 and 3 by -0.5 (weights now 0.9, 0.4, 0.4)
   110: error signal is -1 (need 0, got 1), so update weights 1 and 2 by -0.5 (weights now 0.4, -0.1, 0.4)
   111: error signal is 1 (need 1, got 0), so update all weights by 0.5 (weights now 0.9, 0.4, 0.9)
   101: error signal is -1 (need 0, got 1), so update weights 1 and 3 by -0.5 (weights now 0.4, 0.4, 0.4)

We now do the rest of the examples (000, 001, 010, 100), but no errors, and thus no changes, occur (the epoch ends).
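The epoch traced above can be replayed in a few lines. This is a sketch of the update rule exactly as used here (step activation with fixed threshold 1, learning rate 0.5, target = AND of the inputs):

```python
# Replay of the delta-rule trace for the 3-input AND function.
THETA, ETA = 1.0, 0.5

def out(w, x):
    """Step activation: fire iff the weighted sum reaches the threshold."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= THETA else 0

def run_epoch(w, order):
    w = list(w)
    for x in order:
        err = int(all(x)) - out(w, x)   # error signal: target minus output
        for i, xi in enumerate(x):
            w[i] += ETA * err * xi      # only weights of active inputs change
    return w

order = [(0,1,1), (1,1,0), (1,1,1), (1,0,1),
         (0,0,0), (0,0,1), (0,1,0), (1,0,0)]
w = run_epoch([0.9, 0.9, 0.9], order)
print([round(wi, 1) for wi in w])       # [0.4, 0.4, 0.4], as in the trace
assert run_epoch(w, order) == w         # the second epoch makes no changes
```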
On the next epoch, ALL examples pass with no errors (including the above 4), so the algorithm terminates with a correct set of weights.

b) Show (by example) that the order of presenting the examples can change the number of training episodes.

Suppose we now start (weights (0.9, 0.9, 0.9)) with the examples 000, 001, 010, 100, 111. No errors occur in any of these, so no changes. Now we continue (similar to last time):

   011: error signal is -1 (need 0, got 1), so update weights 2 and 3 by -0.5 (weights now 0.9, 0.4, 0.4)
   110: error signal is -1 (need 0, got 1), so update weights 1 and 2 by -0.5 (weights now 0.4, -0.1, 0.4)
   101: no error, so no change.

This ends the epoch with an INCORRECT set of weights, (0.4, -0.1, 0.4): example 111 is now misclassified. Suppose we keep exactly the same presentation order on the next epoch:

   000, 001, 010, 100: no errors, no changes.
   111: error signal is 1 (need 1, got 0: 0.4-0.1+0.4 = 0.7 < 1), so update all weights by 0.5 (weights now 0.9, 0.4, 0.9)
   011: error signal is -1 (need 0, got 1: 0.4+0.9 = 1.3 >= 1), so update weights 2 and 3 by -0.5 (weights now 0.9, -0.1, 0.4)
   110: no error (0.9-0.1 = 0.8 < 1), no change.
   101: error signal is -1 (need 0, got 1: 0.9+0.4 = 1.3 >= 1), so update weights 1 and 3 by -0.5 (weights now 0.4, -0.1, -0.1)

On the third epoch (same order again), 000 through 100 pass, and then:

   111: error signal is 1 (0.4-0.1-0.1 = 0.2 < 1), so update all weights by 0.5 (weights now 0.9, 0.4, 0.4)
   011: no error (0.8 < 1), no change.
   110: error signal is -1 (0.9+0.4 = 1.3 >= 1), so update weights 1 and 2 by -0.5 (weights now 0.4, -0.1, 0.4)
   101: no error (0.8 < 1), no change.

Oops: the third epoch ends with (0.4, -0.1, 0.4), exactly the weights we had at the end of the FIRST epoch. Since the updates are deterministic given the weights and the presentation order, epochs now repeat with period 2 forever, and with this ordering the correct weights will NEVER be reached! (This does not contradict the perceptron convergence theorem: here the threshold is fixed at 1 and never updated, so the usual convergence guarantee does not apply.)

5) A project leader in the DT Prosero project would like to construct a decision procedure based on results from hiring employees in the IMG4 consortium. The decision procedure will be based on a decision tree, using information in the case database. Attributes for each employee are: degree level (A), Java programming ability (B), self-motivation (C), grade in AI (D). The target attribute is the "decision", describing the amount of money to pay the employee.
a) Construct by hand (using greedy search with the best information-gain heuristic) a consistent decision tree for the following table:

   input variables        decision
   A    B    C    D
   --------------------------------
   1    F    H    95         8
   1    T    L    70         4
   1    F    L    70         4
   1    T    M    95         6
   1    T    H    80         6
   2    T    H    95         8
   2    T    H    80         8
   2    F    H    95         8

We consider a greedy assessment of each of the possible attributes.

Trying A, we have:
   A=1 gives an (unnormalized) distribution (2, 2, 1) over the decisions (4, 6, 8), respectively.
   A=2 gives distribution (0, 0, 3), i.e. a full classification.
The remaining information required to classify all examples is thus Inf(2,2,1) + 0.

Trying B, we have:
   B=F gives distribution (1, 0, 2).
   B=T gives distribution (1, 2, 2).
The remaining information required is Inf(1,2,2) + Inf(1,0,2), which is clearly greater (thus worse) than for A, so B is clearly not the best root node. (This shortcut saves us computing the value for B exactly.)

Trying C, we have:
   C=L gives distribution (2, 0, 0), a full classification.
   C=M gives distribution (0, 1, 0), a full classification.
   C=H gives distribution (0, 1, 4).
The remaining information required is thus 0 + 0 + Inf(0,1,4), which is BETTER than for attribute A, because (0,1,4) is STRICTLY more peaked than (2,2,1), with the same number of examples. (By "strictly more peaked" I mean that (0,1,4) is obtained from (2,2,1) by moving the examples of some items INTO another item, with all the rest staying the same; also, permutations make no difference to the value of Inf().)

Trying D, we have:
   D=70 gives distribution (2, 0, 0), a full classification.
   D=80 gives distribution (0, 1, 1).
   D=95 gives distribution (0, 1, 3).
The remaining information required is thus:

   Inf(0,1,1) + Inf(0,1,3) = -(0 + log(0.5) + log(0.5)) - (0 + log(0.25) + 3*log(0.75))
                           = 4 - 3*log(0.75)    (all logs in base 2)
                           ≈ 5.245

(It is not IMMEDIATELY obvious which is better between C and D, so we need to compute.)
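The Inf() values used in this comparison are easy to compute with a small helper; Inf(counts) here is, as above, the total number of bits needed to classify the examples, i.e. the sum over classes of -n_i * log2(n_i/N):

```python
from math import log2

def inf(counts):
    """Total bits needed to classify N examples with the given class counts."""
    n = sum(counts)
    return sum(-c * log2(c / n) for c in counts if c > 0)

print(round(inf((0, 1, 4)), 2))                    # branch C=H: 3.61
print(round(inf((0, 1, 1)) + inf((0, 1, 3)), 3))   # D's impure branches: 5.245
```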
For attribute C we had:

   Inf(0,1,4) = -(0 + log(0.2) + 4*log(0.8)) ≈ 2.32 + 1.29 = 3.61

Thus, the remaining information needed to classify all examples after getting the value of attribute C is smaller, so C is the best attribute and is selected as the root node.

In the recursive calls to the algorithm, for branches L and M no further branching is needed, but for C=H additional attributes are needed, so we look at each of the remaining attributes (only for the examples where C=H):

Trying A, we have:
   A=1 gives distribution (0, 1, 1).
   A=2 gives distribution (0, 0, 3), a full classification.
The remaining information needed is Inf(0,1,1) + 0 = 2 bits.

Trying B, we have:
   B=F gives distribution (0, 0, 2), a full classification.
   B=T gives distribution (0, 1, 2).
The remaining information needed is Inf(0,1,2) = -log(1/3) - 2*log(2/3) ≈ 1.58 + 1.17 = 2.75 bits.

Trying D, we have:
   D=70 gives distribution (0, 0, 0) (no examples, trivially fully classified).
   D=80 gives distribution (0, 1, 1).
   D=95 gives distribution (0, 0, 3), a full classification.
The remaining information needed is Inf(0,1,1) + 0 + 0 = 2 bits.

So either A or D is better than B, and we can pick either according to the information criterion. Let us pick A, because it has fewer children.

In the recursive calls, branch A=2 is a full classification, so we need more attributes only for the case A=1 (and, of course, C=H as before). Trying the remaining attributes we get:

Trying B, we have:
   B=F gives distribution (0, 0, 1), a full classification.
   B=T gives distribution (0, 1, 0), a full classification.
So we are done. But let us still try D as well:
   D=70 gives distribution (0, 0, 0) (no examples).
   D=80 gives distribution (0, 1, 0), a full classification.
   D=95 gives distribution (0, 0, 1), a full classification.
So D seems just as good, but has more children, so we would still prefer B.
The resulting tree (with 3 internal nodes) is:

   C --+-- L --> 4
       +-- M --> 6
       +-- H --> A --+-- 1 --> B --+-- F --> 8
                     |             +-- T --> 6
                     +-- 2 --> 8

b) Is a more compact (fewer internal nodes) decision tree possible for the above table? If no, prove it. If yes, which tree?

We can have a tree with fewer than 3 internal nodes only if there is a tree with 2 internal nodes that classifies the table exactly. So we can simply check all possible trees with 2 internal nodes. Now, we have already seen that no such tree is possible with root C (otherwise it would have been found above). Looking at the other results in step 1, we find:

   With root B, we have 2 branches that are not fully classified, thus requiring at least 2 more internal nodes: no go.
   With root D, again we have 2 branches that are not fully classified, so again no go.

We remain with possible root A, where branch A=2 is a full classification, so we need to check branch A=1, which has distribution (2, 2, 1). Note that there is no way to fully classify this distribution with the BINARY attribute B (a single binary split cannot separate 3 decision values), so we need to check whether attribute C or D can result in a complete classification. For C=H we get distribution (0, 1, 1), which is not a full classification, so C is no good. For D=95 we get distribution (0, 1, 1), which AGAIN is not a full classification, so D is ALSO no good. Therefore, there is no decision tree with fewer than 3 internal nodes consistent with this training set.

6) Consider the following taxation scenario, occurring in a certain middle-eastern country. We will assume that you are risk-neutral, i.e. that your utility function is linear in the amount of money that your assets are worth. Assume also that moral considerations are not part of your utility function. Your accountant has notified you that your taxes for FY 2005-2006 amount to 1,000,000 NIS. You have the following options:

1) Declare the correct taxes verbatim, and pay the full amount.
2) Attempt to use a tax shelter, in which case, if you succeed, you will only need to pay 100,000 NIS. However, a senior tax collection evaluator will then assess your forms, and can either accept them or refuse, in which case you will also pay fines, and your total taxes will be 2,000,000 NIS. There are 4 possible evaluators, equally likely to be on your case:

   Evaluator A will accept your claim with probability 0.5.
   Evaluator B will accept your claim with probability 0.2; in the remaining 80% of cases, B can be convinced that your claim is justifiable with probability 0.9, if 100,000 NIS (of your money) can be redirected to better causes (such as helping out with a certain project of evaluator B's cousin).
   Evaluators C and D will never accept your claim.

a) Which of the above taxation options should you prefer as a rational agent?

Answer: You have 2 possible actions. Action 1 costs 1,000,000 always, so:

   E[U(be honest)] = -1,000,000

Action 2 depends on the evaluator assigned, so we have:

   E[U(cheat)] = 0.25*E[U(A|cheat)] + 0.25*E[U(B|cheat)] + 0.25*E[U(C|cheat)] + 0.25*E[U(D|cheat)]
               = 0.25 * [ (0.5 * -100,000) + (0.5 * -2,000,000)                       // case A
                          + (0.2 * -100,000)
                            + 0.8 * (-100,000 + (0.1 * -2,000,000 + 0.9 * -100,000))  // case B
                          + 2 * -2,000,000 ]                                          // cases C, D
               = 0.25 * (-1,050,000 - 332,000 - 4,000,000)
               = -1,345,500  <  -1,000,000

So cheating is worse!

Note that there is a shortcut here: with probability 50% either C or D will be chosen, in which case we pay double, which already costs 1,000,000 in expectation. Since cases A and B also incur some cost, we can quickly conclude that cheating does NOT pay, WITHOUT doing the computations (but not so in general).

b) Suppose that a "friend" of the accountant working at the treasury department can give you a "hint" about which evaluator will be on your case, if you grease his palm. How much is this information worth, given that your "friend" is perfectly reliable in this respect?
Answer: The "friend" provides perfect information on the evaluator, so we need to compute the VPI for this item of information. Since the "friend" is always right, the distribution over his answers equals the distribution over evaluators. So we take the (weighted) average of the gains for each possible answer, which are as follows:

If you know the evaluator to be A, and you cheat, you pay on average:

   (0.5 * -100,000) + (0.5 * -2,000,000) = -1,050,000

So here it does not pay to cheat; if you know that A will be chosen you will STILL not cheat, and in this case the information is worthless to you. If you know the evaluator to be C or D, it obviously does NOT pay to cheat either, so in these cases the information is ALSO worthless. However, if the evaluator is B and you cheat, you pay only:

   E[U(B|cheat)] = (0.2 * -100,000) + 0.8 * (-100,000 + (0.1 * -2,000,000 + 0.9 * -100,000))
                 = -20,000 + 0.8 * -390,000 = -332,000

So by cheating here you gain 1,000,000 - 332,000 = 668,000. Therefore:

   VPI(which evaluator) = 0.25*0 + 0.25*668,000 + 0.25*0 + 0.25*0 = 167,000

So you should be willing to pay up to 167,000 NIS for the "friend"'s information.

7) Consider the following MDP, where the score is the total value of rewards obtained in 3 steps. The world has 4 locations in a linear sequence:

   F1   I   F2   P

Actions are: R attempts to move right, and L attempts to move left. Locations F1 and F2 contain flags, with values 1 and 10, respectively. A robot visiting a location with a flag gets a reward equal to the flag value, and then the flag disappears. P is an absorbing state, with a reward of -100. Transition probabilities are based on: motion succeeds with probability 0.8, but with probability 0.2 the robot either stays in the same location or moves in the opposite direction (if possible), with probability 0.1 each (0.2 to stay in place if at the end of the sequence).

b1) Formalize the above problem as an MDP, i.e.
what are the states, the transition probability function, and the reward function?

Answer: to define an MDP, we need to define the state space, the action space, the reward function, and the transition distribution.

The states (excluding time) are triples (Location, Flag1, Flag2), where the first component is a location (one of F1, I, F2, P), and the last two are binary (flag present: true/false). The total number of states is thus 4*2*2 = 16. The action space contains two actions: L, R.

Reward: we will assume reward is received based only on the current state, before the action is made, and is also received in the "last" step, when no action is executed. (Thus, "3 steps" above means 3 actions and 3 transitions, but 4 rewards.) The reward function is defined as follows (x and y below stand for "don't care", i.e. any value):

   R(F1, 1, x) = 1
   R(F2, x, 1) = 10
   R(P, x, y)  = -100

All other rewards are zero.

Transition probabilities: stating these explicitly requires an array with 16*2*16 probability entries (state, action, resulting state). However, the array is sparse and well structured, so we state just the non-zero transition probabilities. We will assume that a flag "disappears" only in the transition AFTER reaching it, and thus after receiving the reward. The pit is absorbing: you stay there no matter what action is taken, and the flags never change either.
   P((P, x, y) | (P, x, y), a) = 1   (for either action a)

Starting at location I with action L (flags do not change):

   P((F1, x, y) | (I, x, y), L) = 0.8
   P((I, x, y)  | (I, x, y), L) = 0.1
   P((F2, x, y) | (I, x, y), L) = 0.1

Starting at location I with action R (flags do not change):

   P((F2, x, y) | (I, x, y), R) = 0.8
   P((I, x, y)  | (I, x, y), R) = 0.1
   P((F1, x, y) | (I, x, y), R) = 0.1

Starting at location F1 with action L (flag 1 disappears; moving further left is impossible, so the robot stays in place with probability 0.8+0.1):

   P((F1, 0, y) | (F1, x, y), L) = 0.9
   P((I, 0, y)  | (F1, x, y), L) = 0.1

Starting at location F1 with action R (flag 1 still disappears):

   P((I, 0, y)  | (F1, x, y), R) = 0.8
   P((F1, 0, y) | (F1, x, y), R) = 0.2

Starting at location F2 with action L (flag 2 disappears):

   P((I, x, 0)  | (F2, x, y), L) = 0.8
   P((F2, x, 0) | (F2, x, y), L) = 0.1
   P((P, x, 0)  | (F2, x, y), L) = 0.1

Starting at location F2 with action R (flag 2 still disappears):

   P((P, x, 0)  | (F2, x, y), R) = 0.8
   P((F2, x, 0) | (F2, x, y), R) = 0.1
   P((I, x, 0)  | (F2, x, y), R) = 0.1

b2) Find the optimal policy.

Finding the optimal policy with a finite horizon is done using the value function. But since the horizon is finite, the value function depends on time, so we denote it Vt(state). We will count t by the number of actions remaining, so V0(s) is the value of being at state s when no actions remain to be executed, and thus clearly we just get the reward at s. That is, we have:

   V0(s) = R(s)

Specifically:

   V0(F1, 1, x) = 1
   V0(F2, x, 1) = 10
   V0(P, x, y)  = -100

with V0 = 0 for all other states. To compute Vi(s) we use the recursive equations:

   Vi(s) = R(s) + max over actions a of:  sum over states s' of  P(s'|s,a) * Vi-1(s')

We can now compute V1(s); observe that we can ignore all zero terms in the summation. Also, we will take the liberty of NOT computing the VERY BAD cases exactly, and just use bounds, for conciseness. Note that during the computation we also designate the best action for each state: this is the required optimal policy!
If no action appears below, this means that all actions are equally good (or bad!) for the state.

   V1(P,x,y)  = -200  (all actions equally bad!)
   V1(F1,1,x) = 1     (get flag 1; cannot get any other reward in 1 step)
   V1(F1,0,x) = 0     (same as above, but no flag)
   V1(I,0,0)  = 0     (no flags/rewards available, all actions indifferent)
   V1(I,0,1)  = 8     (optimal action R: get the flag 2 reward next time with probability 0.8)
   V1(I,1,0)  = 0.8   (optimal action L: get the flag 1 reward next time with probability 0.8)
   V1(I,1,1)  = 8.1   (optimal action R: get the flag 2 reward next time with probability 0.8, or the flag 1 reward with probability 0.1)
   V1(F2,x,0) = -10   (best action L; still land in the pit with probability 0.1)
   V1(F2,x,1) = 0     (best action L; reward 10, but still land in the pit with probability 0.1, exactly cancelling the flag reward)

Now we can use the recursive equation to compute V2(s):

   V2(P,x,y)  = -300   (all actions equally bad!)
   V2(F1,1,0) = 1      (get flag 1; no other reward is reachable)
   V2(F1,1,1) = 7.4    (get flag 1; best action R, reaching (I,0,1) with probability 0.8: 1 + 0.8*8 = 7.4)
   V2(F1,0,1) = 6.4    (as above, best action R, but no flag 1)
   V2(F1,0,0) = 0      (no best action)
   V2(I,0,0)  = -1     (optimal action L, but can land in (F2,0,0) with probability 0.1: 0.1*-10 = -1)
   V2(I,0,1)  = 0.8    (optimal action L; can stay in (I,0,1) with probability 0.1: 0.1*8 = 0.8)
   V2(I,1,0)  = -0.12  (optimal action L: get the flag 1 reward next time with probability 0.8, stay in (I,1,0) with probability 0.1, but hit (F2,1,0) with probability 0.1: 0.8*1 + 0.1*0.8 + 0.1*-10 = -0.12)
   V2(I,1,1)  = 1.61   (optimal action L: reach (F1,1,1) with probability 0.8, or stay in (I,1,1) with probability 0.1: 0.8*1 + 0.1*8.1 = 1.61)
   V2(F2,0,0) = -21    (best action L; still land in the pit with probability 0.1, and in (F2,0,0) with probability 0.1: 0.1*-200 + 0.1*-10 = -21)
   V2(F2,0,1) = -11    (best action L; reward 10, but still land in the pit with probability 0.1, and in (F2,0,0) with probability 0.1)
   V2(F2,1,0) = -20.36 (best action L; still land in the pit with probability 0.1, and in (F2,1,0) with probability 0.1, but can reach (I,1,0) with probability 0.8: 0.8*0.8 + 0.1*-10 + 0.1*-200 = -20.36)
   V2(F2,1,1) = -10.36 (best action L; reward 10, but still land in the pit with probability 0.1, in (F2,1,0) with probability 0.1, and reach (I,1,0) with probability 0.8)

Now we can compute V3(s) using V2(s) in a similar manner. Observe that with two transitions remaining, the best action was always L, due to the possible "catastrophe" of the pit, except at location F1, from which the pit is unreachable in 2 steps! With THREE transitions remaining, the optimal policy is L from ANY location, independent of the flag states! (Well, except at location P, where the agent is equally dead no matter what...)

Note that the above is in fact simple expectimax with memoization!
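As a check, the finite-horizon backups can be run mechanically. A sketch in Python, encoding the transition model exactly as formalized in b1 (state = (location, flag1, flag2)):

```python
# Finite-horizon value iteration for the flag-world MDP of question 7.

def R(s):
    loc, f1, f2 = s
    if loc == "F1" and f1: return 1
    if loc == "F2" and f2: return 10
    if loc == "P": return -100
    return 0

def trans(s, a):
    """Non-zero-probability successors of state s under action a."""
    loc, f1, f2 = s
    if loc == "P":                       # absorbing pit
        return {s: 1.0}
    if loc == "I":                       # flags unchanged when leaving I
        left, right = ("F1", f1, f2), ("F2", f1, f2)
        tgt, opp = (left, right) if a == "L" else (right, left)
        return {tgt: 0.8, s: 0.1, opp: 0.1}
    if loc == "F1":                      # flag 1 disappears on the way out
        stay, right = ("F1", 0, f2), ("I", 0, f2)
        return {stay: 0.9, right: 0.1} if a == "L" else {right: 0.8, stay: 0.2}
    # loc == "F2": flag 2 disappears on the way out
    left, stay, pit = ("I", f1, 0), ("F2", f1, 0), ("P", f1, 0)
    return ({left: 0.8, stay: 0.1, pit: 0.1} if a == "L"
            else {pit: 0.8, stay: 0.1, left: 0.1})

states = [(l, f1, f2) for l in ("F1", "I", "F2", "P")
          for f1 in (0, 1) for f2 in (0, 1)]

V = [{s: R(s) for s in states}]          # V[0]: reward only, no actions left
for t in range(3):                       # compute V1, V2, V3
    prev = V[-1]
    V.append({s: R(s) + max(sum(p * prev[s2]
                                for s2, p in trans(s, a).items())
                            for a in "LR")
              for s in states})

print(round(V[2][("F1", 1, 1)], 2))      # V2(F1,1,1) = 7.4
print(round(V[2][("I", 1, 0)], 2))       # V2(I,1,0)  = -0.12
```

Extending the final loop to record the maximizing action at each state reproduces the optimal policy described above.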