Sequential decision making under uncertainty using belief-state MDP for decision-making: the EU refugee problem. (This is an obscure variant of the Canadian treveler problem.)
The domain description is silmilar to that described in assignment 1, except that again we do not know the locations of the border police, and food, similar to assignment 4. For simplicity, however, we will assume that the police and food appear independently, with a known given probability, at each vertex. They are revealed with certainty when a refugee reaches a neigbouring vertex.
Thus, in the current problem to be solved, we are given a weighted undirected graph, where each vertex has a known probability of having police and food. The probabilities of police are mutually independent, and they do not move. If there is food at a node, there is a sufficient amount that it is never exhausted. The refugee's only action are traveling between vertices. Traversal costs are just the weight of the edge in this variant, except when starting the action in a vertex with food, (giving the refugee additional energy), in which case the travel cost of the edge is divided by 2. Also for simplicity, we will assume only one agent, starting at s, and only one goal at vertex t, The problem is to find a policy that brings the agent from s to t, that has minimal expected cost, without encountering police.
The graph can be provided in a manner similar to previous assignments, for example:
#V 4 ; number of vertices n in graph (from 1 to n) #V 1 0 0 ; Vertex 1, 0 probability of police, 0 probability of food. #V 3 0.3 0.1 ; Vertex 3, 0.3 probability of police, 0.1 probability of food. etc. #E 1 2 W3 ; Edge from vertex 1 to vertex 2, weight 3 #E 2 3 W2 ; Edge from vertex 2 to vertex 3, weight 2 #E 3 4 W3 ; Edge from vertex 3 to vertex 4, weight 3 #E 2 4 W4 ; Edge from vertex 2 to vertex 4, weight 4 #E 1 4 W100 ; Edge from vertex 1 to vertex 4, weight 100 #Start 1 #Goal 4
The start and goal vertices, should be determined via some form of user input (as in the file, or querying the user). For example, in the above graph the start vertex is 1, and the goal vertex is 4.
The Canadian traveller problem is known to be PSPACE-complete (and it is likely that the refugee refugee traveler problem is also PSPACE complete) so you will be required to solve only very small instances. We will require that the entire belief space be stored in memory explicitly, and thus impose a limit of at most 10 vertices with possible police and/or food in the graph. Your program should initialize belief space value functions and use a form of value iteration (discussed in class) to compute the value function for the belief states. Maintain the optimal action during the value iteration for each belief state, so that you have the optimal policy at convergence.
Your program should read the data, including the parameters (start goal vertices). You should construct the policy, and present it in some way. Provide at least the following types of output:
Official deadline: January 21, 2016