Michael Elhadad

XML Background


Objectives

References

Obtain the Xerces XML parser from Apache.

IBM XML Online Courses: XML courses that can be accessed freely.

The XML FAQ

The XML/SGML homepage: a big, serious page on XML with lots of good pointers.

The XML home page on the WWW Consortium: the official reference for all standard activity.

XMLSoftware and XMLInfo: two sites with news and info about XML products and software.

Project Cool Introduction to XML: good intro to XML.

The Microsoft page on XML.

XML Tree: a bit complex, lots of links but also lots of good data.

Web developer's virtual library page on XML.

Cafe con Leche on XML.

Notes on SOAP
SOAP (Simple Object Access Protocol) is a recent (2000) XML-based protocol that allows software components to communicate over TCP/IP connections. Messages are encoded in XML.

SOAP FAQ by Don Box.

Apache SOAP: SOAP in Java with messaging over HTTP or SMTP.
Local documentation and the local jar files are also available.

SOAP v1.1 specification

What is XML

XML is a syntax specification based on the idea of markup languages. A markup language is a language in which the structure of the document is explicitly encoded using tags.

XML documents conceptually contain two parts: the DTD (Document Type Definition), which defines the grammar of the language used in the document, and the content itself.

DTDs encode a context-free language and are equivalent to BNF grammars.

Because XML documents encode both the grammar and the content explicitly, parsers can rely on the self-documenting nature of the document.
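To make the grammar/content split concrete, here is a minimal sketch of a validating parse using the standard Java JAXP interface. The tiny DTD, the class name DtdValidationSketch, and the method validityErrors are made up for this illustration; they are not taken from Xerces or from the course material.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class DtdValidationSketch {
    // Hypothetical, minimal grammar carried in the document's internal DTD subset.
    static final String DTD =
        "<!DOCTYPE personnel [\n"
      + "  <!ELEMENT personnel (person)+>\n"
      + "  <!ELEMENT person (#PCDATA)>\n"
      + "  <!ATTLIST person id ID #REQUIRED>\n"
      + "]>\n";

    /** Parses with DTD validation on and returns the validity errors the parser reports. */
    static List<String> validityErrors(String xml) {
        final List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // check the content against the DTD
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* ignored in this sketch */ }
                @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
                @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Reached when the text is not even well-formed, or cannot be read.
            throw new IllegalArgumentException("document could not be parsed", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        String valid = DTD + "<personnel><person id=\"one.worker\">One</person></personnel>";
        String invalid = DTD + "<personnel><person>One</person></personnel>"; // required id is missing
        System.out.println("valid document, errors:   " + validityErrors(valid));
        System.out.println("invalid document, errors: " + validityErrors(invalid));
    }
}
```

The parser needs nothing beyond the document itself: the grammar travels with the content, so a violation such as the missing id attribute is reported without any application-specific checking code.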

The following example shows a very simple XML encoding of information about employees in a company (taken from the Apache Xerces package). The DTD (personal.dtd) comes first, followed by a document that conforms to it (personal.xml):


<?xml encoding="US-ASCII"?>

<!-- Revision: 60 1.2 data/personal.dtd, docs, xml4j2, xml4j2_0_9  -->

<!ELEMENT personnel (person)+>
<!ELEMENT person (name,email*,url*,link?)>
<!ATTLIST person id ID #REQUIRED>
<!ELEMENT family (#PCDATA)>
<!ELEMENT given (#PCDATA)>
<!ELEMENT name (#PCDATA|family|given)*>
<!ELEMENT email (#PCDATA)>
<!ELEMENT url EMPTY>
<!ATTLIST url href CDATA #REQUIRED>
<!ELEMENT link EMPTY>
<!ATTLIST link
  manager IDREF #IMPLIED
  subordinates IDREFS #IMPLIED>


<?xml version="1.0"?>
<!DOCTYPE personnel SYSTEM "personal.dtd">

<!-- Revision: 61 1.3 data/personal.xml, docs, xml4j2, xml4j2_0_0  -->

<personnel>

  <person id="Big.Boss" >
    <name><family>Boss</family> <given>Big</given></name>
    <email>chief@foo.com</email>
    <link subordinates="one.worker two.worker three.worker four.worker five.worker"/>
  </person>

  <person id="one.worker">
    <name><family>Worker</family> <given>One</given></name>
    <email>one@foo.com</email>
    <link manager="Big.Boss"/>
  </person>

  <person id="two.worker">
    <name><family>Worker</family> <given>Two</given></name>
    <email>two@foo.com</email>
    <link manager="Big.Boss"/>
  </person>

  <person id="three.worker">
    <name><family>Worker</family> <given>Three</given></name>
    <email>three@foo.com</email>
    <link manager="Big.Boss"/>
  </person>

  <person id="four.worker">
    <name><family>Worker</family> <given>Four</given></name>
    <email>four@foo.com</email>
    <link manager="Big.Boss"/>
  </person>

  <person id="five.worker">
    <name><family>Worker</family> <given>Five</given></name>
    <email>five@foo.com</email>
    <link manager="Big.Boss"/>
  </person>

</personnel>
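As an illustration of how an application might read such a document, the following sketch parses an abbreviated copy of the personnel data (two persons, no DOCTYPE line) with the standard JAXP DOM interface and extracts the manager of each person from the link element. The class name PersonnelLinks and its methods are hypothetical names chosen for this sketch.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PersonnelLinks {
    // Abbreviated copy of the personnel document: only two of the six persons.
    static final String SAMPLE =
        "<personnel>"
      + "<person id=\"Big.Boss\">"
      + "<name><family>Boss</family> <given>Big</given></name>"
      + "<email>chief@foo.com</email>"
      + "<link subordinates=\"one.worker\"/>"
      + "</person>"
      + "<person id=\"one.worker\">"
      + "<name><family>Worker</family> <given>One</given></name>"
      + "<email>one@foo.com</email>"
      + "<link manager=\"Big.Boss\"/>"
      + "</person>"
      + "</personnel>";

    /** Parses XML text into a DOM tree (non-validating). */
    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    /** Maps each person id to the id of its manager, for persons that declare one. */
    static Map<String, String> managerOf(Document doc) {
        Map<String, String> result = new LinkedHashMap<>();
        NodeList persons = doc.getElementsByTagName("person");
        for (int i = 0; i < persons.getLength(); i++) {
            Element person = (Element) persons.item(i);
            NodeList links = person.getElementsByTagName("link");
            if (links.getLength() == 0) {
                continue; // link is optional in the DTD (link?)
            }
            // getAttribute returns "" when the attribute is absent (manager is #IMPLIED).
            String manager = ((Element) links.item(0)).getAttribute("manager");
            if (!manager.isEmpty()) {
                result.put(person.getAttribute("id"), manager);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(managerOf(parse(SAMPLE)));
    }
}
```

The sketch reads the id and manager attributes as plain strings. Because the abbreviated copy has no DTD attached, the parser does not know that id is of type ID, which is why the code does not rely on ID-based lookup.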


Software Engineering Issues addressed by XML

  1. Information Interchange: XML is a standard format for exchanging information among software components. The format is self-documenting, and the only interface the components must agree on is the DTD. In that sense, XML documents replace ASCII files and formats like CSV (comma-separated values).
  2. Structured Information: XML can encode nested information and lists, and it supports required and default values. In that sense, XML documents can replace a relational, SQL-like encoding of data with a more "object-like" encoding.
  3. Openness and Standard: because XML is open and standard, software components can use standard libraries (for example, the Apache Xerces library) to parse and manipulate XML documents and the corresponding DOM trees.
  4. Robustness and Versioning: because XML documents are self-documenting (they include the tags), applications can check that documents are well-formed. When the format evolves, different versions of an application can still coexist, each using the parts of the documents it finds relevant.
  5. Availability of Specialized XML Languages: SMIL is an example of a powerful standard language for multimedia presentations that is defined as an XML application. Similarly, many industries are defining well-specified document standards to help applications interoperate.
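The robustness and versioning point can be sketched as follows. An application that only cares about email elements keeps working when a later version of the format adds an element it has never seen, and it can reject text that is not well-formed. The phone element, the class name TolerantReader, and its methods are hypothetical and exist only for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class TolerantReader {
    // "Version 1" of the format.
    static final String V1 =
        "<personnel><person id=\"one.worker\"><email>one@foo.com</email></person></personnel>";
    // "Version 2" adds a phone element (hypothetical) that older readers do not know.
    static final String V2 =
        "<personnel><person id=\"one.worker\"><email>one@foo.com</email>"
      + "<phone>555-0100</phone></person></personnel>";

    /** Returns the parsed document, or null if the text is not well-formed XML. */
    static Document parseOrNull(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler throws on fatal errors and stays silent otherwise.
            builder.setErrorHandler(new DefaultHandler());
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            return null; // not well-formed
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Collects the text of every email element and ignores everything else. */
    static List<String> emails(Document doc) {
        List<String> result = new ArrayList<>();
        NodeList nodes = doc.getElementsByTagName("email");
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("v1 emails: " + emails(parseOrNull(V1)));
        System.out.println("v2 emails: " + emails(parseOrNull(V2)));
        System.out.println("broken text parsed: " + (parseOrNull("<personnel><person></personnel>") != null));
    }
}
```

The reader selects elements by name rather than by position, which is what lets old and new versions of the document coexist. A fixed-column format such as CSV would need every reader updated when a field is added.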

Using Xerces

The Xerces package provided by the XML Apache project implements a full XML parser and the DOM specification. It supports many character encodings (including ASCII and EBCDIC), so it can handle documents written in many languages.

To use the library, you only need to add the JAR file to your classpath. For example:

% java -cp xerces.jar DOMCount data/personal.xml

A JAR file is a Java archive that contains all the class files defined in a library. It plays the role that a library file (lib or dll) plays in other languages. Put the local copy of the Xerces jar file in a convenient location.

The examples in the Xerces samples directory will help you. You can download the whole zip file and unpack it on your own machine for easy compilation. To run the examples, download the Samples jar file.
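For orientation, here is a small sketch in the spirit of an element-counting program. It is not the source of the Xerces sample; it is an independent illustration written against the standard JAXP and DOM interfaces, which Xerces implements. The class name ElementCounter and its methods are made up for this sketch.

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class ElementCounter {
    /** Counts the element nodes in the subtree rooted at the given node. */
    static int countElements(Node node) {
        int count = (node.getNodeType() == Node.ELEMENT_NODE) ? 1 : 0;
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            count += countElements(child);
        }
        return count;
    }

    /** Parses the source into a DOM tree and counts its elements. */
    static int countElementsIn(InputSource source) {
        try {
            return countElements(
                DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(source));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java ElementCounter <file.xml>");
            return;
        }
        // Passing a system id (a file URI) lets the parser resolve a relative DTD reference.
        String systemId = new File(args[0]).toURI().toString();
        System.out.println(args[0] + ": " + countElementsIn(new InputSource(systemId)) + " elements");
    }
}
```

When the input declares an external DTD, as personal.xml does with personal.dtd, the DTD file must be reachable next to the document, because even a non-validating parse normally reads it.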


Last modified November 7th, 2000 Michael Elhadad