Justify all answers!

1) We have the following distribution over the binary variables A, B, C, given as a full table:

 A B C   probability
 --------------------
 F F F   0.81
 F F T   0
 F T F   0
 F T T   0.09
 T F F   0
 T F T   0.09
 T T F   0.01
 T T T   0

a) Construct the Bayes network that most compactly describes the distribution (i.e. one that has the smallest number of arcs).
b) Is the answer unique?

First, we need to check dependencies. We have (by summing up the numbers in the respective rows):
 P(A) = 0.1, P(~A) = 0.9
 P(B) = 0.1, P(~B) = 0.9
 P(C) = 0.18, P(~C) = 0.82
 P(AB) = 0.01
 P(~AB) = 0.09
 P(A~B) = 0.09
 P(~A~B) = 0.81
Since P(AB) = 0.01 = P(A)P(B), and likewise for the other three value combinations, we know that I(A, B | {}), and thus A and B are not immediate neighbors in the BN. Now, we also have:
 P(AC) = 0.09, which is NOT equal to P(A)P(C) = 0.1*0.18 = 0.018,
so there must be a path from A to C. Also:
 P(BC) = 0.09, which is NOT equal to P(B)P(C) = 0.1*0.18 = 0.018,
so there must be a path from B to C.

Thus the graph is connected, and since there is no arc between A and B, the only possible shape is A-C-B; it remains to be seen whether we can determine the directions of the arcs. In fact, since A and B are independent given no evidence, it must be the case that C (which is on the path from A to B) is a converging node on the path A-C-B. Thus the arcs are both directed towards C, and this solution is unique.
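These independence checks are easy to verify mechanically. The following short Python sketch (ours, not part of the original answer; the names are arbitrary) recomputes the marginals from the joint table above and tests the two key (in)dependencies:

    # Joint distribution over (A, B, C): maps a truth assignment to its probability.
    joint = {
        (False, False, False): 0.81, (False, False, True): 0.0,
        (False, True,  False): 0.0,  (False, True,  True): 0.09,
        (True,  False, False): 0.0,  (True,  False, True): 0.09,
        (True,  True,  False): 0.01, (True,  True,  True): 0.0,
    }

    def prob(pred):
        # Total probability of all assignments (a, b, c) satisfying pred.
        return sum(p for s, p in joint.items() if pred(*s))

    pA, pB, pC = prob(lambda a, b, c: a), prob(lambda a, b, c: b), prob(lambda a, b, c: c)
    pAB = prob(lambda a, b, c: a and b)
    pAC = prob(lambda a, b, c: a and c)

    print(abs(pAB - pA * pB) < 1e-9)   # True:  I(A, B | {}) holds
    print(abs(pAC - pA * pC) < 1e-9)   # False: A and C are dependent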
2) Consider the following variables and their immediate causes (parents):

 variable   parents
 -------------------
 A          none
 B          A
 C          B E
 D          A
 E          B D
 F          D

a) Is this network a poly-tree?

No: the underlying undirected graph contains a cycle, e.g. B-C-E-B (B is a parent of both C and E, and E is also a parent of C), so this is not a poly-tree.

b) Determine the truth of the following independence statements (using d-separation):

1) I({B}, {D} | {})
   False: path B-A-D is not blocked (A is a diverging non-evidence node).
2) I({B}, {D} | {A})
   True: path B-A-D is blocked, since A is a diverging evidence node;
         path B-C-E-D is blocked, since C is a converging non-evidence node;
         path B-E-D is blocked, since E is a converging non-evidence node.
3) I({B}, {D} | {A, C})
   False: path B-C-E-D is not blocked by C (a converging evidence node),
          and not blocked by E (a pass-through non-evidence node).
4) I({B}, {F} | {A})
   True: path B-A-D-F is blocked by A (a diverging evidence node);
         path B-E-D-F is blocked by E (a converging non-evidence node);
         path B-C-E-D-F is blocked by C (a converging non-evidence node).
         Observe that this last path is NOT blocked by E, which is here a pass-through non-evidence node: blocking is path-dependent!
5) I({B}, {F} | {A, C})
   False: path B-E-D-F is not blocked by D (a diverging non-evidence node),
          and not blocked by E (a converging node with evidence at its child C).

c) Assume that the nodes are binary-valued. Suppose that:
 P(A=true) = 0.4, P(D=true|A=true) = 0.5, P(D=true|A=false) = 0.1,
 P(E=true|D=true) = 0.5, P(E=true|D=false) = 0.2
Find P(A=true | D=true, F=true).
Hint: you may want to preprocess to get rid of nodes before doing any computation.

We begin by removing the non-evidence childless node C - a barren node. Now remove E (now a non-evidence childless node - barren). Now remove B (likewise now barren). In the remaining graph, note that A is d-separated from F given D, that is, I(A, F | {D}), and thus we have P(A|D,F) = P(A|D). To compute the latter we just apply Bayes rule:
 P(A=t|D=t) = P(D=t|A=t)P(A=t) / (P(D=t|A=t)P(A=t) + P(D=t|A=f)P(A=f))
            = 0.5*0.4 / (0.5*0.4 + 0.1*0.6) = 0.2 / 0.26 = 10/13

3) You need to construct a 5-input neural network that has value 1 in the output just when the number of its inputs that are on is either 0 or 3 (assume inputs are in {0, 1}).

a) Can this be done without hidden units?

No. A single threshold unit outputs 1 just when w1*x1 + ... + w5*x5 > t. The output should be 1 for the input vector 00000, so t < 0. The output should be 0 for 10000, where the single 1 can be in any position, so wi <= t < 0 for every i. But then for 11100 (which should have output 1) the weighted sum is w1+w2+w3 <= 3t < t, so the unit would output 0 - a contradiction. Geometrically, one hyperplane cannot make all the required separations; in fact, it is easy to see that we need THREE hyperplanes to separate all the points that need to be separated (between 0 and 1 active inputs, between 2 and 3, and between 3 and 4).

b) Show a network using threshold elements that performs the required computation (specify the weights, too!)

There are numerous possibilities. The simplest is to have hidden units act as feature detectors, such as "3 or more" (we will call it "3+") and "more than 3" (we will call it "3++"). Then we have an output unit that is 1 whenever:
 A) no circuit input (and hence no detector) is on, or
 B) the "3+" detector is on but the "3++" detector is not.

So we have 3 units (ignoring "input units"). The "3+" detector has as inputs all 5 circuit inputs, each with a weight of 1, and a threshold of 2.5; thus its output is 1 just when at least 3 of its inputs are 1. The "3++" detector also has as inputs all 5 circuit inputs, each with a weight of 1, and a threshold of 3.5; thus its output is 1 just when at least 4 of its inputs are 1. The output unit has as inputs all 5 circuit inputs, each with a weight of -1, the output of the "3+" detector with a weight of 5, and the output of the "3++" detector with a weight of -10; its threshold is -0.5.

Thus, when no circuit input is 1, the weighted sum is 0 (above the -0.5 threshold) and the output is 1. When 1 or 2 of the inputs are 1, no detector turns on, the weighted sum is -1 or -2, and the network output is 0. When exactly 3 of the inputs are 1, the "3+" detector turns on while the "3++" detector stays off, the weighted sum is -3 + 5 = 2, and the network output is again 1. Finally, when more than 3 inputs are 1, the "3++" detector also turns on, and its large negative weight overrides all the other inputs, again making the network output 0, as required.
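As a sanity check, here is a small Python sketch (our own, not part of the required answer) that enumerates all 32 input vectors and confirms that the 3-unit network above computes the target function:

    from itertools import product

    def threshold(weighted_sum, t):
        # Step activation: 1 if the weighted sum exceeds the threshold.
        return 1 if weighted_sum > t else 0

    def network(x):
        s = sum(x)
        h3 = threshold(s, 2.5)    # "3+"  detector: on when >= 3 inputs are 1
        h4 = threshold(s, 3.5)    # "3++" detector: on when >= 4 inputs are 1
        return threshold(-s + 5 * h3 - 10 * h4, -0.5)

    for x in product((0, 1), repeat=5):
        assert network(x) == (1 if sum(x) in (0, 3) else 0)
    print("network matches the target on all 32 inputs")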
4) a) Using the perceptron learning rule (the "delta" rule), show the steps of learning the 2-input NAND function with a single perceptron and a step activation function with initial threshold 1 (note that the threshold can also be learned as a weight). Initially all other weights are minus 0.9, and the learning rate is 0.5. Note: in this case use the constant 1 instead of the derivative of the g() function.

To make the initial threshold 1, we have a 3rd input connected constantly to a "1" signal, with a weight of -1 (we will call it W0). We also make the assumption that if the weighted sum is EXACTLY equal to 0, the perceptron output is 0 (we need to make SOME assumption, as in a step function the value at 0 can be otherwise undefined).

Starting with weights W0 = -1, W1 = -0.9, W2 = -0.9:
 00: weighted sum is -1, result 0, should be 1, error 1. Change by adding 0.5 to the "threshold" input to get W0 = -0.5.
 01: weighted sum is -1.4, result 0, should be 1, error 1. Change by adding 0.5 to W0, W1, to get W0 = 0, W1 = -0.4.
 10: weighted sum is -0.9, result 0, should be 1, error 1. Change by adding 0.5 to W0, W2, to get W0 = 0.5, W2 = -0.4.
 11: weighted sum is -0.3, result 0, correct.
 W0 = 0.5, W1 = -0.4, W2 = -0.4
Now start epoch 2:
 00: weighted sum is 0.5, result 1, correct.
 01: weighted sum is 0.1, result 1, correct.
 10: weighted sum is 0.1, result 1, correct.
 11: weighted sum is -0.3, result 0, correct.
And training is done.
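Traces like these are tedious to produce by hand. The following Python sketch (ours, not part of the original answer) implements the same delta rule and the same convention (output 0 at a weighted sum of exactly 0), and can replay this trace, as well as those of part (b) below, by passing different presentation orders:

    # NAND training set. Following the trace's convention, example "01" is the
    # one whose active input feeds W1, and "10" the one feeding W2.
    EXAMPLES = {"00": ((0, 0), 1), "01": ((1, 0), 1),
                "10": ((0, 1), 1), "11": ((1, 1), 0)}

    def train(order, w=(-1.0, -0.9, -0.9), rate=0.5, max_epochs=100):
        # Runs the perceptron delta rule; returns (number of epochs until an
        # error-free pass, final weights [W0, W1, W2]).
        w = list(w)
        for epoch in range(1, max_epochs + 1):
            errors = 0
            for name in order:
                (x1, x2), target = EXAMPLES[name]
                s = w[0] + w[1] * x1 + w[2] * x2
                out = 1 if s > 0 else 0          # sum exactly 0 -> output 0
                if out != target:
                    errors += 1
                    for i, xi in enumerate((1, x1, x2)):
                        w[i] += rate * (target - out) * xi
            if errors == 0:
                return epoch, w
        return max_epochs, w

    print(train(["00", "01", "10", "11"]))       # done within 2 epochs, as above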
b) Show (by example) that the order of presenting the examples can change the number of training episodes.

Starting with weights W0 = -1, W1 = 0.9, W2 = 0.9, suppose that the order of example presentation is 00, 01, 10, 11. We get:
 00: weighted sum is -1, result 0, should be 1, error 1. Add 0.5 to the "threshold" input: W0 = -0.5.
 01: weighted sum is 0.4, result 1, correct.
 10: weighted sum is 0.4, result 1, correct.
 11: weighted sum is 1.3, result 1, should be 0, error -1. Subtract 0.5 from all weights: W0 = -1, W1 = 0.4, W2 = 0.4.
(Note how this last "fix" spoils the improvement of W0 achieved in the first correction.) This ends the first epoch. Now a 2nd epoch:
 00: weighted sum is -1, result 0, should be 1, error 1. Add 0.5 to W0: W0 = -0.5.
 01: weighted sum is -0.1, result 0, should be 1, error 1. Add 0.5 to W0, W1: W0 = 0, W1 = 0.9.
 10: weighted sum is 0.4, result 1, correct.
 11: weighted sum is 1.3, result 1, should be 0, error -1. Subtract 0.5 from all weights: W0 = -0.5, W1 = 0.4, W2 = -0.1.
Epoch 3:
 00: weighted sum is -0.5, result 0, should be 1, error 1. Add 0.5 to W0: W0 = 0.
 01: weighted sum is 0.4, result 1, correct.
 10: weighted sum is -0.1, result 0, should be 1, error 1. Add 0.5 to W0, W2: W0 = 0.5, W2 = 0.4.
 11: weighted sum is 1.3, result 1, should be 0, error -1. Subtract 0.5 from all weights: W0 = 0, W1 = -0.1, W2 = -0.1.
Epoch 4:
 00: weighted sum is 0, result 0, should be 1, error 1. Add 0.5 to W0: W0 = 0.5.
 01: weighted sum is 0.4, result 1, correct.
 10: weighted sum is 0.4, result 1, correct.
 11: weighted sum is 0.3, result 1, should be 0, error -1. Subtract 0.5 from all weights: W0 = 0, W1 = -0.6, W2 = -0.6.
Epoch 5:
 00: weighted sum is 0, result 0, should be 1, error 1. Add 0.5 to W0: W0 = 0.5.
 01: weighted sum is -0.1, result 0, should be 1, error 1. Add 0.5 to W0, W1: W0 = 1, W1 = -0.1.
 10: weighted sum is 0.4, result 1, correct.
 11: weighted sum is 0.3, result 1, should be 0, error -1. Subtract 0.5 from all weights: W0 = 0.5, W1 = -0.6, W2 = -1.1.
Epoch 6:
 00: weighted sum is 0.5, result 1, correct.
 01: weighted sum is -0.1, result 0, should be 1, error 1. Add 0.5 to W0, W1: W0 = 1, W1 = -0.1.
 10: weighted sum is -0.1, result 0, should be 1, error 1. Add 0.5 to W0, W2: W0 = 1.5, W2 = -0.6.
 11: weighted sum is 0.8, result 1, should be 0, error -1. Subtract 0.5 from all weights: W0 = 1, W1 = -0.6, W2 = -1.1.
Epoch 7:
 00: weighted sum is 1, result 1, correct.
 01: weighted sum is 0.4, result 1, correct.
 10: weighted sum is -0.1, result 0, should be 1, error 1. Add 0.5 to W0, W2: W0 = 1.5, W2 = -0.6.
 11: weighted sum is 0.3, result 1, should be 0, error -1. Subtract 0.5 from all weights: W0 = 1, W1 = -1.1, W2 = -1.1.
Epoch 8:
 00: weighted sum is 1, result 1, correct.
 01: weighted sum is -0.1, result 0, should be 1, error 1. Add 0.5 to W0, W1: W0 = 1.5, W1 = -0.6.
 10: weighted sum is 0.4, result 1, correct.
 11: weighted sum is -0.2, result 0, correct.
 W0 = 1.5, W1 = -0.6, W2 = -1.1
Epoch 9:
 00: weighted sum is 1.5, result 1, correct.
 01: weighted sum is 0.9, result 1, correct.
 10: weighted sum is 0.4, result 1, correct.
 11: weighted sum is -0.2, result 0, correct.
Finally we are done, after 9 epochs. Note how many times fixes to one weight spoiled the fixes done previously to another weight!

Now start again with weights W0 = -1, W1 = 0.9, W2 = 0.9, and suppose the order of presentation in every epoch is 01, 11, 10, 00. We get:
Epoch 1:
 01: weighted sum is -0.1, result 0, should be 1, error 1. Add 0.5 to W0, W1: W0 = -0.5, W1 = 1.4.
 11: weighted sum is 1.8, result 1, should be 0, error -1. Subtract 0.5 from all weights: W0 = -1, W1 = 0.9, W2 = 0.4.
 10: weighted sum is -0.6, result 0, should be 1, error 1. Add 0.5 to W0, W2: W0 = -0.5, W2 = 0.9.
 00: weighted sum is -0.5, result 0, should be 1, error 1. Add 0.5 to W0: W0 = 0.
Epoch 2:
 01: weighted sum is 0.9, result 1, correct.
 11: weighted sum is 1.8, result 1, should be 0, error -1. Subtract: W0 = -0.5, W1 = 0.4, W2 = 0.4.
 10: weighted sum is -0.1, result 0, should be 1, error 1. Add 0.5 to W0, W2: W0 = 0, W2 = 0.9.
 00: weighted sum is 0, result 0, should be 1, error 1. Add 0.5 to W0: W0 = 0.5.
Epoch 3:
 01: weighted sum is 0.9, result 1, correct.
 11: weighted sum is 1.8, result 1, should be 0, error -1. Subtract: W0 = 0, W1 = -0.1, W2 = 0.4.
 10: weighted sum is 0.4, result 1, correct.
 00: weighted sum is 0, result 0, should be 1, error 1. Add 0.5 to W0: W0 = 0.5.
Epoch 4:
 01: weighted sum is 0.4, result 1, correct.
 11: weighted sum is 0.8, result 1, should be 0, error -1. Subtract: W0 = 0, W1 = -0.6, W2 = -0.1.
 10: weighted sum is -0.1, result 0, should be 1, error 1. Add 0.5 to W0, W2: W0 = 0.5, W2 = 0.4.
 00: weighted sum is 0.5, result 1, correct.
Epoch 5:
 01: weighted sum is -0.1, result 0, should be 1, error 1. Add 0.5 to W0, W1: W0 = 1, W1 = -0.1.
 11: weighted sum is 1.3, result 1, should be 0, error -1. Subtract: W0 = 0.5, W1 = -0.6, W2 = -0.1.
 10: weighted sum is 0.4, result 1, correct.
 00: weighted sum is 0.5, result 1, correct.
Epoch 6:
 01: weighted sum is -0.1, result 0, should be 1, error 1. Add 0.5 to W0, W1: W0 = 1, W1 = -0.1.
 11: weighted sum is 0.8, result 1, should be 0, error -1. Subtract: W0 = 0.5, W1 = -0.6, W2 = -0.6.
 10: weighted sum is -0.1, result 0, should be 1, error 1. Add 0.5 to W0, W2: W0 = 1, W2 = -0.1.
 00: weighted sum is 1, result 1, correct.
Epoch 7:
 01: weighted sum is 0.4, result 1, correct.
 11: weighted sum is 0.3, result 1, should be 0, error -1. Subtract: W0 = 0.5, W1 = -1.1, W2 = -0.6.
 10: weighted sum is -0.1, result 0, should be 1, error 1. Add 0.5 to W0, W2: W0 = 1, W2 = -0.1.
 00: weighted sum is 1, result 1, correct.
Epoch 8:
 01: weighted sum is -0.1, result 0, should be 1, error 1. Add 0.5 to W0, W1: W0 = 1.5, W1 = -0.6.
 11: weighted sum is 0.8, result 1, should be 0, error -1. Subtract: W0 = 1, W1 = -1.1, W2 = -0.6.
 10: weighted sum is 0.4, result 1, correct.
 00: weighted sum is 1, result 1, correct.
Epoch 9:
 01: weighted sum is -0.1, result 0, should be 1, error 1. Add 0.5 to W0, W1: W0 = 1.5, W1 = -0.6.
 11: weighted sum is 0.3, result 1, should be 0, error -1. Subtract: W0 = 1, W1 = -1.1, W2 = -1.1.
 10: weighted sum is -0.1, result 0, should be 1, error 1. Add 0.5 to W0, W2: W0 = 1.5, W2 = -0.6.
 00: weighted sum is 1.5, result 1, correct.
Epoch 10:
 01: weighted sum is 0.4, result 1, correct.
 11: weighted sum is -0.2, result 0, correct.
 10: weighted sum is 0.9, result 1, correct.
 00: weighted sum is 1.5, result 1, correct.
And we are done. This ordering took 10 epochs rather than 9: the order of presentation changed the number of training epochs needed.
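Using the train() sketch from part (a), the two orderings can be compared directly:

    # Same starting weights as in part (b); only the presentation order differs.
    epochs1, _ = train(["00", "01", "10", "11"], w=(-1.0, 0.9, 0.9))
    epochs2, _ = train(["01", "11", "10", "00"], w=(-1.0, 0.9, 0.9))
    print(epochs1, epochs2)                      # 9 vs. 10 epochs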
5) Consider the following taxation scenario, occurring in a certain middle-eastern country.* We will assume that you are risk-neutral, i.e. that your utility function is linear in the amount of money that your assets are worth. Assume also that moral considerations are not part of your utility function. Your accountant has notified you that your taxes for FY 2006-2007 amount to 1,000,000 NIS. You have the following options:
1) Declare the correct taxes verbatim, and pay the full amount.
2) Claim that you have used much of the money to fund scientific experiments on monkeys, in which case you will only need to pay 200,000 NIS. However, a senior tax collection evaluator will then assess your forms, and some of the evaluators have militant friends in the ASPCM* who will trash your house, costing you an additional 1,500,000 NIS.
There are 3 possible evaluators, equally likely to be on your case:
Evaluator A has no friends in the ASPCM.
Evaluator B has a militant friend in the ASPCM, but likes the colour green, and if he sees a sufficient amount of it (say $30,000 = 100,000 NIS) he will, with probability 0.5, forget that monkeys have anything to do with animals (and otherwise will take the money and still tell his friend).
Evaluator C will always report to his militant ASPCM friends.

a) Which of the above taxation options should you prefer as a rational agent?

The decision diagram contains 2 random variables: Evaluator (=E) and Default on bribe (=D). It contains 2 decision variables: Pay full (=P) and pay Bribe (=B). The utility node is a child of all the variables. The random variables can be seen as independent, and there is an information arc from the E node to the B decision node.

If you pay the full amount, your (certain, as well as expected) utility is:
 U(pay full) = -1000000
If you try the funding-experiments claim, we have an uncertain outcome, and need to evaluate an expectation:
 E(U(claim experiments)) = (-200000)
   + 1/3 * 0                                        ; case A
   + 1/3 * ((-100000) + (0.5*0 + 0.5*(-1500000)))   ; case B
   + 1/3 * (-1500000)                               ; case C
   = (-200000) + 1/3*(-850000) + 1/3*(-1500000) = approx. (-983333)
Therefore we prefer to claim usage for experiments. Note that the expectation above assumes that in case B we decide to show the colour green; otherwise that case's term is replaced by (-1500000), which is worse.

b) Suppose that a "friend" of the accountant working at the treasury department can give you a "hint" about which evaluator will be on your case, if you grease his palm. How much is this information worth, given that your "friend" is perfectly reliable in this respect?

The decision diagram contains all the variables of the previous diagram, but has one additional decision node F (whether to obtain information from the "friend") and one additional random variable I (the information provided). The I node has as parents the E node and the F decision node. There is now an information arc from the I node to the P decision node. The utility node has all nodes as parents.

Suppose you had the information already. Then your action would change only in cases B and C: in case B you lose 50000 on average by not paying the full amount (expected -1050000 vs. -1000000), and in case C you lose 700000 by not paying the full amount (-1700000 vs. -1000000). These are the expected gains from having the information in these cases, and we must take the expectation over the information possibilities, so:
 VPI(knowing evaluator) = 1/3*0 + 1/3*50000 + 1/3*700000 = 250000
Draw the decision diagram for each case, and solve it.
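The expected-utility and VPI arithmetic can be checked with a few lines of Python (a sketch under the assumptions above; amounts in NIS, names ours):

    TAX_FULL, TAX_CLAIM, BRIBE, TRASH = 1_000_000, 200_000, 100_000, 1_500_000

    def eu_claim(evaluator, bribe_B=True):
        # Utility of the monkey-experiments claim, per evaluator.
        if evaluator == "A":
            return -TAX_CLAIM
        if evaluator == "B":
            if bribe_B:  # pay 100,000; he forgets with probability 0.5
                return -TAX_CLAIM - BRIBE + 0.5 * 0 + 0.5 * -TRASH
            return -TAX_CLAIM - TRASH
        return -TAX_CLAIM - TRASH            # evaluator C always reports

    eu_full = -TAX_FULL
    eu_claim_avg = sum(eu_claim(e) for e in "ABC") / 3
    print(eu_claim_avg)                      # -983333.33..., better than -1000000

    # VPI: with perfect information we take the better option per evaluator.
    eu_informed = sum(max(eu_claim(e), eu_full) for e in "ABC") / 3
    print(eu_informed - eu_claim_avg)        # 250000.0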
6) Consider the following belief-state MDP, where the score is the total value of rewards obtained in 3 steps. The world has 4 locations in a linear sequence, as follows: F1 I F2 P. Actions are: R attempts to move right, L attempts to move left, and S senses location F2. Location F1 contains a flag, which has a value of 1. Location F2 contains a flag, with value either 10 or 100, with probability 0.5 each. A robot visiting a location with a flag gets a reward equal to the flag value, and then the flag disappears. P is an absorbing state, with a reward of -100. Transition probabilities are based on: motion succeeds with probability 0.8, but with probability 0.2 the robot either stays in the same location, or moves in the opposite direction (if possible), with probability 0.1 each (0.2 to stay in place if at the end of the sequence). Sensing, if performed, tells the robot the actual value of the flag at F2.

b1) Formalize the above problem as a (belief-state) MDP, i.e. what are the states, the transition probability function, and the reward function?

First, we define the underlying MDP. The state consists of a location L (one of 1, 2, 3, 4), and the states of flags F1 (0 if collected, + if not) and F2 (0 if collected, + if value 10 and not collected, ++ if value 100 and not collected). Additionally there is time...

If we assume that transition into a location with a flag collects it immediately, then our reward function must depend on the entire transition (both the pre- and post-action state, as well as possibly the action).

Transition probabilities are decomposable - the location does not depend on the flag states, so we start off with that part. For move-right actions:
 P(L'=i+1 | L=i, A=R) = 0.8   ; for i from 1 to 3
 P(L'=i   | L=i, A=R) = 0.1   ; for i from 2 to 3
 P(L'=1   | L=1, A=R) = 0.2
 P(L'=i-1 | L=i, A=R) = 0.1   ; for i from 2 to 3
 P(L'=4   | L=4, A)   = 1     ; for any action A
For move-left actions:
 P(L'=i-1 | L=i, A=L) = 0.8   ; for i from 2 to 3
 P(L'=i   | L=i, A=L) = 0.1   ; for i from 2 to 3
 P(L'=1   | L=1, A=L) = 0.9
 P(L'=i+1 | L=i, A=L) = 0.1   ; for i from 1 to 3
For sense actions:
 P(L'=i | L=i, A=S) = 1       ; for i from 1 to 4

Now the flag collections can be represented deterministically, depending just on the pre- and post-action locations:
 P(F1' = 0 | F1 = 0) = 1
 P(F1' = 0 | L' = 1) = 1
 P(F1' = + | F1 = +, L' > 1) = 1
 P(F2' = 0 | F2 = 0) = 1
 P(F2' = 0 | L' = 3) = 1
 P(F2' = +  | F2 = +,  L' not = 3) = 1
 P(F2' = ++ | F2 = ++, L' not = 3) = 1
Now, the complete MDP transition probability is:
 P(L',F1',F2' | L,F1,F2,A) = P(L'|L,A) * P(F1'|F1,L') * P(F2'|F2,L')

The reward function is zero, except for:
 R({L,F1,F2}, A, {L'=4,F1',F2'}) = -100    ; for all A not = S, and all L, F1, F2, F1', F2'
 R({L,F1=+,F2}, A, {L',F1'=0,F2'}) = 1     ; for all A not = S, and all L, L', F2, F2'
 R({L,F1,F2=+}, A, {L',F1',F2'=0}) = 10    ; for all A not = S, and all L, L', F1, F1'
 R({L,F1,F2=++}, A, {L',F1',F2'=0}) = 100  ; for all A not = S, and all L, L', F1, F1'
 R({L,F1,F2}, A=S, {L',F1',F2'}) = sensing cost ; for all L, L', F1, F1', F2, F2'
Note that some of the above combinations are impossible, so any definition of the reward for these cases is OK. Also, we did not define a sensing cost in the exercise, so we will assume it is 0, although sensing does waste a move... This completes the definition of the MDP.
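The location part of the transition model is small enough to write out and check mechanically. Here is a Python sketch (our own encoding: locations 1..4, actions "R"/"L"/"S") that also verifies each distribution sums to 1:

    def p_loc(l2, l1, a):
        # P(L'=l2 | L=l1, A=a) for the underlying MDP; location 4 is absorbing.
        if l1 == 4 or a == "S":
            return 1.0 if l2 == l1 else 0.0
        step = 1 if a == "R" else -1
        fwd, back = l1 + step, l1 - step
        p = 0.0
        if l2 == fwd and 1 <= fwd <= 4:
            p += 0.8                  # motion succeeds
        if l2 == l1:
            p += 0.1                  # failed motion: stay in place
            if not (1 <= back <= 4):
                p += 0.1              # no opposite direction: stay instead
            if not (1 <= fwd <= 4):
                p += 0.8              # motion off the end is impossible: stay
        if l2 == back and 1 <= back <= 4:
            p += 0.1                  # failed motion: opposite direction
        return p

    for l1 in (1, 2, 3, 4):
        for a in ("R", "L", "S"):
            assert abs(sum(p_loc(l2, l1, a) for l2 in (1, 2, 3, 4)) - 1.0) < 1e-9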
But to define the belief-state MDP one must also account for the observation history. There are only 3 cases which make a difference: no observation, observed F2=+, and observed F2=++. We can model this by adding an additional variable O, with values {U, +, ++, 0} (observing F2 after collection is meaningless, as we know it is no longer there; otherwise we would need an additional possible value). Note that since the observation is perfect, we could do as in the example in class, without the extra variable, and just add the value U to the domain of the F2 variable. Here, however, we will use the additional O variable.

The transitions for this variable are:
 P(O'=+  | O=U, A=S) = 0.5                  ; these are based on the probabilities
 P(O'=++ | O=U, A=S) = 0.5                  ; for the value of F2
 P(O'=U  | O=U, A not = S, L' not = 3) = 1  ; F2 value stays unknown unless sensed or picked up
 P(O'=++ | O=++, L' not = 3) = 1            ; a known value of F2 does not change
 P(O'=+  | O=+,  L' not = 3) = 1
 P(O'=0  | L' = 3) = 1                      ; the agent knows that F2 is picked up at L'=3
 P(O'=0  | O=0) = 1                         ; the agent knows that a picked flag stays picked

We must also change the reward function for F2:
 R({L,F1,O=++}, A, {L',F1',O=0}) = 100 ; for all A not = S, and all L, L', F1, F1'
 R({L,F1,O=+},  A, {L',F1',O=0}) = 10  ; for all A not = S, and all L, L', F1, F1'
 R({L,F1,O=U},  A, {L',F1',O=0}) = 55  ; for all A not = S, and all L, L', F1, F1'
The latter is picking up F2 before its value is known: we must average the rewards weighted by their probabilities (0.5*10 + 0.5*100 = 55). Note that now variable O in fact plays the role of F2 for the agent.

b2) Find the optimal policy. You may assume that the robot starts at location I in order to save some computation.

First, we compute V1, the value function for having only one transition until the end of the world. We denote in the table both the value and the action leading to that value (value/action). Impossible states are marked X.

 V1:          F1 = 0                         F1 = +
 O:       0      +      ++      U        0       +       ++       U
 L=1    0/any  0/any   0/any  0/any      X       X        X        X
 L=2    0/any   8/R    80/R   44/R     0.8/L   8.1/R   80.1/R   44.1/R
 L=3     0/S     X       X      X       0/S      X        X        X
 L=4    -100 (for any action)

Observe how at location 3 we prefer the sensing action, PRECISELY because it does nothing (no chance of falling into the pit). If we are not allowed to do S, then we will need to do L, but this will have a value of -10... (still better than R, which would be -80). Note that for L=2 we need to average over the possible outcomes, and the probability that we reach F2 is only 0.8. On the right-hand side of the table there is also some bonus from the fact that we might end up picking up F1 when we try to pick up F2.

Now we construct the value function and policy for V2:

 V2:          F1 = 0                         F1 = +
 O:       0      +      ++      U        0       +       ++       U
 L=1    0/any  6.4/R   64/R  35.2/R      X       X        X        X
 L=2    0/any  8.8/R   88/R  48.4/R   0.88/L  8.91/R  88.11/R  48.51/R
 L=3     0/S     X       X      X       0/S      X        X        X
 L=4    -200 (for any action)

(For L=2 the V2 values also include the 0.1 probability of staying in place with one useful step still left; e.g. V2(L=2, F1=0, O=+) = 0.8*10 + 0.1*V1(L=2, F1=0, O=+) = 8 + 0.8 = 8.8.)

Note how in location 3 you STILL do S, even if a reachable F1 is still available, because trying to do L has too high a chance of dropping the agent into the pit!
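The V1/V2 tables are just two steps of finite-horizon value iteration over the belief states (L, F1, O). A compact Python sketch (ours; it reuses p_loc() from above, and folds flag pickup and the pit penalty into the transition rewards) is:

    from functools import lru_cache

    F2_EXPECTED = {"U": 55.0, "+": 10.0, "++": 100.0, "0": 0.0}

    def successors(state, action):
        # Yields (probability, reward, next_state) for the belief-state MDP.
        loc, f1, o = state
        if loc == 4:                              # absorbing pit: -100 per step
            yield 1.0, -100.0, state
            return
        if action == "S":                         # sensing never moves the robot
            if o == "U":
                yield 0.5, 0.0, (loc, f1, "+")
                yield 0.5, 0.0, (loc, f1, "++")
            else:
                yield 1.0, 0.0, state
            return
        for nxt in (1, 2, 3, 4):
            p = p_loc(nxt, loc, action)
            if p == 0.0:
                continue
            reward, nf1, no = 0.0, f1, o
            if nxt == 1 and f1:                   # pick up F1
                reward, nf1 = reward + 1.0, False
            if nxt == 3 and o != "0":             # pick up F2 (expected value if unknown)
                reward, no = reward + F2_EXPECTED[o], "0"
            if nxt == 4:
                reward -= 100.0                   # fell into the pit
            yield p, reward, (nxt, nf1, no)

    @lru_cache(maxsize=None)
    def value(state, steps):
        # Optimal finite-horizon value and action: Vn(state) with n = steps.
        if steps == 0:
            return 0.0, None
        return max((sum(p * (r + value(s2, steps - 1)[0])
                        for p, r, s2 in successors(state, a)), a)
                   for a in ("R", "L", "S"))

    print(value((2, True, "U"), 1))   # -> approx (44.1, 'R'), the V1 entry above
    print(value((2, True, "U"), 2))   # -> approx (48.51, 'R'), the V2 entry above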
Finally, the value function and policy for V3. Here we need only do this for the initial state, which is L=2, F1=+, and O=U. In fact, we could have avoided computing values for (e.g.) L=1, F1=0, O=0 even earlier, since it is unreachable from the initial state in 2 moves...

In the initial state, action R results in:
 E(U(R)) = 0.8 * (55 + 0)     ; success in moving right
         + 0.1 * (0 + 48.51)  ; stay in the same place
         + 0.1 * (1 + 35.2)   ; moved left instead
         = 44 + 4.851 + 3.62 = approx. 52.5
(In the parentheses above, the left numbers are immediate rewards, and the right numbers are from the respective positions in the table for V2.)
For action L:
 E(U(L)) = 0.8 * (1 + 35.2)   ; success in moving left
         + 0.1 * (0 + 48.51)  ; stay in the same place
         + 0.1 * (55 + 0)     ; moved right instead
         = 28.96 + 4.851 + 5.5 = approx. 39.3
For action S:
 E(U(S)) = 0.5 * 8.91 + 0.5 * 88.11 = 48.51
So the best initial action is to move R, for an expected utility of approximately 52.5.
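The final step can be checked per action with the same sketch:

    # Q-values of the three actions at the initial belief state (L=2, F1=+, O=U),
    # one step of lookahead on top of V2 (assumes successors()/value() above).
    for a in ("R", "L", "S"):
        q = sum(p * (r + value(s2, 2)[0])
                for p, r, s2 in successors((2, True, "U"), a))
        print(a, round(q, 2))          # R 52.47, L 39.31, S 48.51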
Deadline: Wednesday, April 9, 2008, at 12 noon.
Note: submission (as well as WORK) on this exercise is SOLO (that is, NOT in pairs or other groups of size > 1)
* Disclaimer: any seeming identification with existing countries, institutions, or people, is unintentional.