Introduction to Artificial Inteligence

Assignment 6

Written assignment: inference and learning

Justify all answers!

1) Write down the axioms and facts for example below in FOPC:

Predicates: 
   OS(x)           ; x is an operating system
   Windows(x)      ; x is a windows type object
   Linux(x)        ; x is a Linux type object
   XP(x)           ; x is windows XP
   NT(x)	   ; x is windows NT
   Security(x, y)  ; object x has security y
   OnmyPC(x)       ; x is on my computer
   Hates(x, y)     ;  x hates y
   Likes(x)        ; x likes y
   UserFriendly(x) ; x is user friendly
Constants:
   Good, Lousy, I

 The taxonomic hierarchy for operating systems contains WINDOWS and
 LINUX type systems, which are disjoint and exhaustive.

   1. forall x  Linux(x) -> OS(x)
   2. forall x  Windows(x) -> OS(x)
   3. forall x  OS(x) -> Windows(x) v Linux(x)
   4. forall x  not Linux(x) v not Windows(x)

 WINDOWS type systems contain Windows XP and Windows NT systems.
   6. forall x  NT(x) -> Windows(x)
   7. forall x  XP(x) -> Windows(x)

 All windows operating systems are both user-friendly and have
   lousy security.
   8. forall x  Windows(x) -> (Security(x, Lousy) ^ Userfriendly(x))

 The system on my computer is LINUX.
  ("all operating systems on my computer are linux")
   9. forall x (OnmyPC(x) ^ OS(x)) -> Linux(x)
   

 I hate all systems with lousy security.
   10. forall x (OS(x) ^ Security(x, lousy)) -> hates(I, x)

 LINUX has good security.
   11. forall x Linux(x) -> Security(x, Good)

 I always either hate an operating system or like it, but never both.
   12. forall x OS(x) => 
      ((Hates(I, x) ^ not Likes(I, x)) v (not Hates(I, x) ^ Likes(I, x))) 

 Security of an OS cannot be both good and lousy.
   13. forall x  OS(x) => not (Security(x, Good) ^ Security(x, Lousy))

   a) Convert the knowledge base into CNF.
   In answers, lower-case letters are universally quantified variables.
   Assume "automatic standartization apart" between clauses.

   1. not Linux(x) v OS(x)
   2. not Windows(x) v OS(x)
   3. not  OS(x) v Windows(x) v Linux(x)
   4. not Linux(x) v not Windows(x)

   6. not NT(x) v Windows(x)
   7. not XP(x) v Windows(x)

   8.  not Windows(x) v Security(x, Lousy)
   8a. not Windows(x) v Userfriendly(x))

   9. not OnmyPC(x) v not OS(x)) v Linux(x)
   
   10. not OS(x) v not Security(x, lousy) v Hates(I, x)

   11. not Linux(x) v Security(x, Good)

   12.  not OS(x) v Hates(I, x) v Likes(I, x)
   12a. not OS(x) v  not Hates(I, x) v not Likes(I, x)

   13. not OS(x) v not Security(x, Good) v not Security(x, Lousy)

   b) Show how this knowledge-base should be indexed - using
      a table.

We will denote the index using clause number dot position in clause.

Name        Positive         Negative        Arg 1       Arg 2
--------------------------------------------------------------
OS          1.2              3.1
            2.2              9.2
                             10.1
                             12.1
                             12a.1
                             13.1
--------------------------------------------------------------
Linux       3.3              1.1
            9.3              4.1
                             11.1
--------------------------------------------------------------
Windows     3.2              2.1
            6.2              4.2
            7.2              8.1
                             8a.1
--------------------------------------------------------------
NT                           6.1
--------------------------------------------------------------
XP                           7.1
--------------------------------------------------------------
Security    8.2              10.2
            11.1             13.2
                             13.3
--------------------------------------------------------------
Userfriendly  8a.1
--------------------------------------------------------------
OnmyPC                       9.1
--------------------------------------------------------------
Hates       10.3             12a.2
            12.2 
--------------------------------------------------------------
Likes       12.3             12a.3
--------------------------------------------------------------
I                                           12.2
                                            12a.2
                                            12.3
                                            12a.3
--------------------------------------------------------------
Good                                                     11.2
                                                         13.2
--------------------------------------------------------------
Lousy                                                    8.2
                                                         10.2
                                                         13.3  
--------------------------------------------------------------


   c) Prove or disprove by using resolution  (or argue that neither 
      can be done):
      1) I hate all WINDOWS systems.
         Need to show:
         Q1: forall x  Windows(x) -> Hates(I, x)
         (in CNF, this is: not Windows(x) v Hates(I, x))
         Its negation is: Exists x Windows(x) ^ not Hates(I, x)
         For CNF, we need a skolem constant, say W123, and now:
         Q1': Windows(W123)
         Q1'a: not Hates(I, W123)

         From 8 and 10, we get:
         A: not Windows(x) v not OS(x) v Hates(I, x)

         From A and 2 we get:
         B: not Windows(x) v Hates(I, x)  
         which is what we need to prove. Using a refutation proof, we need
         to resolve A with Q1' (unifier {x/A123}) and then resolve the result
         with Q1a' to get an empty clause.

      2) If the system on my machine is WINDOWS NT, then I like it.
         Need to prove:
         Q2: forall x (OnmyPC(x) ^ NT(x)) -> Likes(I, x)
         Negation gives Q2': exists x  OnmyPC(x) ^ NT(x) ^ not Likes(I, x)
         In CNF we get, after skolemizing (skolem constant N8):
          Q2a': OnmyPC(N8)
          Q2b': NT(N8)
          Q3c': not Likes(I, x)        (actually, this last one is redundant)

         From 6 and Q2b', unifier {x/N8} we get:
         B: windows(N8)

         From B and 2, unifier {x/N8} we get:
         C: OS(N8)

         From 9, using C and then Q2a' on the result, we get:
         D: Linux(N8)

         Then from 4, using D and then B on the result we get the empty clause.

         (Intuitively, this is odd - the only reason I like the all NT on
          my machine is because there is NO NT on my machine - the false
          antecedent makes the senetence true!)

      3) I do not hate the operating system on my computer.
          This is a bit ambiguous. Does it mean: I do not hate ANY of
          the operating systems on my computer, or: there is ONE OS on
          my computer, and I don't hate it? We will assume the former:
           Q3: forall x  (OS(x) ^ OnmyPC(x)) -> not Hates(I, x)
          Negating, we get:
           Q3': exists x  OS(x) ^ OnmyPC(x) ^ Hates(I, x)
          In CNF, with skolem constant O2, we get:
           Q3':  OS(O2)
           Q3a': OnmyPC(O2)
           Q3b': Hates(I, O2)

          Using 9 with Q3' (unifier {X/O2}), and then the result with Q3a' we
          get:
           G: Linux(O2)

          We can continue to get, from G and 11, that:
           H: Security(O2, Good)
          but then from  H and 10 and Q3' we can only get:
             Hates(I, O2)
          and then we are stuck! So we cannot either prove or disprove.

          

      4) If I like all systems with good security, then I like the operating
         system on my computer.
         (Again,  a bit ambiguous, so we will be making the same assumption
          as above).
           Q4: (forall y (OS(x) ^ Security(x, Good)) -> Likes(I, x)) ->
                   (forall x  (OS(x) ^ OnmyPC(x)) -> Likes(I, x))

           Negated, this becomes: 
	    (forall y (OS(x) ^ Security(x, Good)) -> Likes(I, x)) ^
            (exists x  OS(x) ^ OnmyPC(x) ^ Hates(I, x))
           The last part is just like in the previous case, and can be treated
           similar to last time, (in CNF, with skolem constant O3, we get):
           (note that y is not really within the scope of the universal
            quantifier)
            Q4':  OS(O3)
            Q4a': OnmyPC(O3)
            Q4b': not Likes(I, O3)
           to which we add a clause for the first part:
            Q4c': not OS(x) v not Security(x, Good)) v Likes(I, x))

           We can show:
            K: Security(O3, Good)
           In the same we we did above for O2. This time, however, we
           are NOT stuck. We Use Q4c' with K, and then the result with
           O4', and then the result with Q4b' to get the empty clause.

      5) If a system is user-friendly then it has lousy security.

        This cannot be either proved or disproved. 

      SHOW ALL STEPS, including the relevant unifiers, etc.

   d) Would using INPUT RESOLUTION only change the answers to c?

      Item 1 was proved where all resolutions had at least one clause from
         the KB or negated query, so there is NO CHANGE for INPUT RESOLUTION.

      Item 2 contained a step "using B on the result", that was NOT input
      resolution. It seems that this step is unavoidable, since both
       "Linux(N8)" and "Windows(N8)" are NOT in the input, and there seems
      to be no alternate proof.

      Items 3-4 also used only input resolution, so no change.

      In 5, resolution is complete - but we could not prove or disprove, so
      obviously we can do no better with just input resolution.

   e) Convert the KB into implication normal form.

   1. Linux(x) -> OS(x)
   2. Windows(x) -> OS(x)
   3. OS(x) -> Windows(x) v Linux(x)
   4. Linux(x) ^ Windows(x) -> FALSE

   6. NT(x) -> Windows(x)
   7. XP(x) -> Windows(x)

   8.  Windows(x) -> Security(x, Lousy)
   8a. Windows(x) -> Userfriendly(x))

   9. (OnmyPC(x) ^ OS(x)) -> Linux(x)
   
   10. (OS(x) ^ Security(x, lousy)) -> Hates(I, x)

   11. Linux(x) -> Security(x, Good)

   12.  OS(x) -> Hates(I, x) v Likes(I, x)
   12a. OS(x) ^ Hates(I, x) ^ Likes(I, x) -> FALSE

   13.  OS(x) ^ Security(x, Good) ^ Security(x, Lousy) -> FALSE

   f) Can you prove all the above using only either forward or
      backward chaining?

      Cannot do any of those, because axioms 3, 12 are not Horn, and
      also the queries contain disjunctions.


2) Consider the following variables and their immediate causes (parents):

     variable         parents
-------------------------------
        A               none
        E               none
        B               A
        C               E
        D               B C
        F               B C

    a) Is this network a poly-tree?
        No, because there is a cycle in the underlying undirected graph:
            B-D-C-F-B

    b) Is this network directed-path singly connected?
         Yes, only 1 directed path, at most, between any pair of vertices.

    c) Determine the truth of the following independence
       statements (using d-separation):
       1)  I({A}, {E} | {})
            Yes. Al paths go through either D (or F), s.t. D (resp. F)
          is a converging node on the path, with no evidence at D (resp.
          F), thus the path is blocked there.

       2)  I({A}, {E} | {D})
           No. Now the path A-B-D-C-E is unblocked, because D is a converging
           node that is part of the evidence.

       3)  I({A}, {E} | {D, B})
           Yes, because the path through D is now blocked by B, a pass-through
           node on this path that is an the evidence set.

       4)  I({A}, {E} | {D, B, F})
           Yes, because B also blocks the path A-B-F-C-E, for the
           same reason as above.

       5)  I({A}, {D, F} | {B})
           Yes, B is a blocking node on the path from A to either D or F,
           same as above.

       6)  I({A}, {F} | {B, C})
           Yes, again same reason as above.

    d) Determine at least 2 different ways to convert the network into
       a poly-tree using cutset conditioning.
         Can break the only cycle by conditioning on EITHER B or C.

    e) Suppose that P(A) = 0.8, P(E) = 0.2, P(B|A) = 0.1, P(B|~A)  = 0.8.
       Find P(A|E).
     
        Very easy, since A is independent of E given no evidence, so
          P(A|E) = P(A) = 0.8
        (the rest data are just red herrings).

3) You need to construct a 4-input neural network that has value 1 in the 
   output just when an even number of its inputs are 1, and 0 otherwise
   (assume inputs are in {0, 1}).
   a) Can this be done without hidden units?
      No, this function is not linearly separable.

   b) Show a network using threshold elements that performs
      the required computation (specify the weights, too!)

      We can do this by having internal units that essentially COUNT
      the number of inputs which are 1. The simplest (but NOT most compact)
      method is to have 4 hidden units, call them A-D, all of which
      are connected to all the 4 inputs. The output unit is connented to
      all the hidden units, and also has a "constant" input, i.e. an input
      which is always 1. 
      The units A-D will represent: "at least 1 ones", "at least 2 ones",
      "at least 3 ones", and "4 ones" respectively. This is easily done:
       Assume that the threshold of all units is 1. 
       The input weights for the hidden units are as follows:
       Unit A: all weights are 2
       Unit B: all weights are 0.7
       Unit C: all weights are 0.4
       Unit D: all weights are 0.3

       The weights for the output unit O are as follows:
       constant input -> O: 2
       A->O: -2
       B->O: 2
       C->O: -2
       D->O: 2

       It is easy to see that the above network computes the required function.

4) a) Using the perceptron learning rule (the "delta" rule), show the steps
      of learning 3-input function AND with a single perceptron
      and a step activation function with threshold 1. Initially all
      weights are 0.7, and the learning rate is 0.5.

      Suppose that the order of presentation is the same as counting in
      binary. Then, in first pass:

        000: no error, no change
        001: no error, no change
        010: no error, no change
        011: should be 0, gives 1. Fix 2nd and 3rd weights to 0.2
        100: no error, no change
        101: no error, no change
        110: no error, no change
        111: no error, no change

      A second pass through all examples gives no errors, and in fact just one
      pass resulted in a correct output.


   b) Show (by example) that the order of presenting the examples can
      change the number of training episodes.

      In the above example it does not matter. However, suppose that
      the original weights were: 1.1, 0.2, 0.3.
      Using the example 100 first, we get:
        100: should be 0, we get 1, fix first weight to 0.6
      And only one pass is needed, because the result is correct for all
      examples.

      However, if the order were:
        000: no error
        001: no error
        010: no error
        011: no error
        111: no error
        110: should be 0, we get 1, fix first weight to 0.6 and 2nd to -0.3
        101: no error
        100: no error
      Now the result is incorrect, because 111 will now give us 0 instead
      of 1, and at least 1 more pass necessary to fix it!


   c) Is it possible to combine back-propagation with genetic algorithms, and
      if so, how?

      Yes. For example, perform backpropagation on any member of the population
      in a GA. This is called "local improvement". As a result, every member
      of the population will be at its own local optimum, and convergence
      of te GA may be faster.


5) a) Construct by hand (using the greedy search using the best information
      gain heuristic) a decision tree for the following table, a set of
      examples for considering desirability of a potential employee in
      a high-tech company. Data is degree level (A),
      existence of work experience (B),
      self-motivation (C), and  last grade average (D).

     input variables               decision
   A      B      C      D
--------------------------------------------
   1      T      F      95             good
   1      T      F      80             good
   1      T      F      70             good
   1      T      T      70             good
   1      F      F      90             good
   2      F      T      80             bad
   2      F      T      95             bad
   2      T      F      70             average
   2      T      F      80             average

Amount of information needed to classify riginal data is accouding to
the likelihood ratio (5, 2, 2), i.e. probabilities (5/9 2/9, 2/9). Total
information required is:
  I = 9(- 5/9 log (5/9) - 2/9 log(2/9) - 2/9 log(2/9)) =approx 13 bits

Now, we try question A. If answer is 1, remaining needed information is 0.
      If answer is 2, (w. 4 examples), they are divided 2-2, so 1 bit per
      example is needed. Total remainder is 4 bits. (Thus answer to A
      gives approx. 9 bits of information for  all the test set.

Trying question B. If the answer is T, we have 6 examples, divided 4-2.
     Needed remaining info in this branch is:
     6(-2/3 log 2/3 - 1/3 log 1/3) = approx 5.5
     which is already greater than 4, so B cannot be better than A.

For Question C: If the answer if D again we get 6 examples, divided 4-2,
    (remaining information needed in this branche approx. 5.5)
    so C is also worse than A.

For Question D: Answer 70: 3 examples divided 1-2, need approx 2.75 bits,
                Answer 80: 3 examples divided 1-1-1, need approx 4.5 bits,
    So already more than 4 bits altogether.

So, clearly A is best from greedy perspective. Now we need to look at
2 subtrees separately: A=1 and A=2. 

For the case A=2, clearly using
either question B or question C will allow exact classification between
"bad" and "average", and question D will not, so we can pick either
arbitrarily.

For the case A=1, we already have perfect classification, so no more
need be done. We thus are done, we have a tree with only 2 internal nodes.

   b) Is a more compact decision tree possible? If yes, which tree?

No, because there is no 1-internal-node tree that results in complete
classification. But in general, the greedy algorithm is NOT guaranteed
to find the optimal tree.

6) A) Use situation calculus (i.e. the result function, and fluents) to
   write down the axioms for the following simplified domain.
   The robot can grab an item, and as a result it will hold it.
   The robot can put item A inside item B, but only if it does not
   hold an item. In the initial situation, S0, the robot holds only item A.

   Actions are: 
   a) Insert: will cause item A to be inside item B, but only if the robot
      holds no items. 
   b) Grab(item1): if the robot is not holding any other item, the
      result will be that the robot will hold item1, and any other item
      inside item1.
   c) Release: as a result of this action, the robot will hold nothing

    Fluents (situation-dependent predicates) are: 
        Inside(item1, item2, s)         (Item1 is inside item2)
        Holds(item, s)                  (The robot holds item)
        Success(s)                      (The robot holds both A and B)


     Axioms for actions:

     a) forall x, y, s    (not exists z  Holds(z, s)) ->
          Inside(x, y, Result(Insert, s))

     (Note "bug" that if x=y, this may result in an object being inside itself.
      But then we didn't say it should not be so in the specifications...)

     b) forall x, s      (not exists y Holds(y, s)) ->
          Holds(x, Result(Grab, s))

     b1)   forall x, y, s   (Holds(x, s) ^ Inside(y, x, s)) -> Holds(y, s)

      (Note ambiguity in spec - will it hold things that were
       inside in the pre-action situation, or things that are inside in the
       post-action situation? The above is the 2nd option. A fine distinction,
       but potentially significant!)

    c) forall x, s      not Holds(x, Result(Release, s))

    d) forall s    (holds(A, s) ^ Holds(B, s)) -> Success(s)

    This looks like it should do, but it is NOT, as we can see when trying
    to prove what we need.

    B) Now, you need to prove that, for the initial situation S0,
       that the action sequence:
            Release, Insert, Grab(B)
       results in success, that is, prove that:
         Success(Result(Grab(B), Result(Insert, Result(Release, S0)))) 

      First, from c and b we get:
       R1:    Inside(A, B, Result(Insert, Result(Release, S0)))

      Now, we want to use a to get that:
       Q:  Holds(B,   Result(Grab(B), Result(Insert, Result(Release, S0))))
      but we CANNOT, because we cannot prove that:
       Pre:  not Holds(x, Result(Insert, Result(Release, S0)))
      Although the above is intuitive, it does not follow from our axioms.
      We need frame axioms for things that do NOT change. E.g. Insert
      should not cause things to be held! The following is incomplete,
      but will do for our proof:

      F:  forall x, s   (not Holds(x, s)) -> (not Holds(x, Result(Insert, s)))

      Now we can use F and c to  prove Pre, and can then use Pre, R1 and a
      to get Q. But what about object A? We also need to prove that:
       Q1: Holds(A,   Result(Grab(B), Result(Insert, Result(Release, S0))))
      which looks OK, except that we cannot prove that:
       Pre1: Inside(A, B, Result(Grab(B), Result(Insert, Result(Release, S0))))
      
      We need another frame axiom for "Inside", e.g. (also intuitive, but
      needs to be stated!):
       forall x, y, z, s   Inside(x, y, s) -> Inside(x, y, Result(Grab(z), s))
      and THEN we can prove Pre1, and then Q1!

     Now from Q and Q1 and axiom d for "success" we finally get what we need.