Get Started on XML

XML is everywhere. Whether you are an Oracle system administrator, a .NET developer, or a J2EE analyst, XML is undoubtedly something that you bump into. In this article, I attempt to give you the basics you need to jumpstart your understanding of XML.

Introduction


XML stands for Extensible Markup Language. XML is never compiled, nor is it interpreted by an interpreter. This is because an XML document is a plain text file as far as your computer is concerned. It just has tags, just like an HTML file has. Since it has tags, it is a 'markup' language. XML and HTML have the same ancestor, namely SGML (Standard Generalized Markup Language).

Purpose of XML

The purpose of XML is to 'structure' data in tags. If you are not familiar with the concept of tags, you can refer to the February 2004 article at orafaq.com/articles. Because an XML document is nothing but a text file with organized data, it can even serve as a simple database. XML is also replacing EDI (Electronic Data Interchange), an industry standard for exchanging flat files. So you can use XML wherever your system needs to exchange or transmit high-volume data. One of the golden reasons why XML is spreading like wildfire is that it is platform-independent, hardware-independent, and software-independent.

Confusing HTML with XML

HTML has 'pre-defined' tags, like <TITLE>, <BODY>, <P>, etc. You cannot have your own tag. The browser will not understand a '<NEWTAG>' in your HTML file. It will ignore the unknown tag and display whatever is inside it as regular text. A browser attaches meaning only to HTML tags. It attaches no built-in meaning to the tags you invent in XML.

In XML, you create your own tags. If you are organizing your book collection, you might come up with tags like <AUTHOR>, <TITLE>, <PUBLISHER>, etc. For a program reading your XML document to know which tags are allowed and how they relate to each other (for example, that author and title are not the same entity), you have to have a supporting document. This supporting document, the grammar of your document, is a DTD or a schema. DTD stands for Document Type Definition.
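
To make the book collection concrete, here is a minimal sketch in Python (my choice of language for illustration only). The <LIBRARY> and <BOOK> wrapper tags and all the values are hypothetical; only <AUTHOR>, <TITLE>, and <PUBLISHER> come from the text above.

```python
import xml.etree.ElementTree as ET

# Hypothetical book collection. None of these tag names is predefined by XML;
# we made them up, which is exactly what XML allows.
BOOKS_XML = """<?xml version="1.0"?>
<LIBRARY>
  <BOOK>
    <AUTHOR>First Author</AUTHOR>
    <TITLE>First Title</TITLE>
    <PUBLISHER>First Publisher</PUBLISHER>
  </BOOK>
  <BOOK>
    <AUTHOR>Second Author</AUTHOR>
    <TITLE>Second Title</TITLE>
    <PUBLISHER>Second Publisher</PUBLISHER>
  </BOOK>
</LIBRARY>
"""

# Parse the text into a tree and pull out the data by tag name.
library = ET.fromstring(BOOKS_XML)
titles = [book.findtext("TITLE") for book in library.findall("BOOK")]
authors = [book.findtext("AUTHOR") for book in library.findall("BOOK")]

print(titles)
print(authors)
```

Because the data is labeled by tag, the program can ask for titles or authors by name instead of guessing at column positions, as it would have to with a flat file.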

XML Components

Some of the common components a typical XML file contains are: a DTD, tags, PIs (Processing Instructions), and comments. A Processing Instruction always starts with "<?" and ends with "?>". PIs are passed to the application by the parser. Parsers are discussed later. For now, a parser is an application that 'reads' the XML document. Comments start with "<!--" and end with "-->".
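
As a rough illustration of how a parser tells these components apart, here is a small Python sketch using the standard library's minidom module. The document text, including the "books.xsl" file name and the <LIBRARY> tag, is a made-up example.

```python
from xml.dom import minidom, Node

# Hypothetical document containing a processing instruction, a comment,
# and a root element.
DOC = """<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="books.xsl"?>
<!-- sample comment -->
<LIBRARY><BOOK>Sample</BOOK></LIBRARY>
"""

dom = minidom.parseString(DOC)

# Walk the top-level nodes and record what kind each one is.
kinds = []
for node in dom.childNodes:
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
        kinds.append(("pi", node.target))
    elif node.nodeType == Node.COMMENT_NODE:
        kinds.append(("comment", node.data.strip()))
    elif node.nodeType == Node.ELEMENT_NODE:
        kinds.append(("element", node.tagName))

print(kinds)
```

The processing instruction arrives with its target name, so the application can decide whether the instruction is meant for it.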

DTDs and things common to all XML files

Let us take a look at a sample DTD. The DTD for our very own web.xml is at java.sun.com/dtd/web-app_2_3.dtd. Also, if you have Tomcat on your machine (see my July 2004 article introducing Tomcat at orafaq.com/articles), browse to the 'webapps' directory, and then go to any directory beneath it. Each directory beneath webapps is the root directory of a web application. Go to the WEB-INF directory of any application, and pull up the web.xml file. Now you have the XML file and the DTD tête-à-tête. An XML file usually starts with a line like this:

<?xml version="1.0" encoding="ISO-8859-1"?>


This line gives the version of XML and the character encoding used. The second line, the document type declaration, is mandatory for a deployment descriptor using the Servlet 2.3 specification:
<!DOCTYPE web-app PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN"
"http://java.sun.com/dtd/web-app_2_3.dtd">

In XML nomenclature, tags are called elements. An XML document has one and only one 'root' element. It is the first tag, and the document ends with its matching end tag. (The syntax of tags and end tags is similar to that of HTML tags.)
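
Here is a minimal Python sketch of how a program sees the document type declaration and the root element. The web.xml content is a cut-down, hypothetical example (a real deployment descriptor has more in it), and minidom is just one parser I picked for illustration. It records the DTD's identifiers without downloading the DTD.

```python
from xml.dom import minidom

# Hypothetical, stripped-down deployment descriptor held in a string.
WEB_XML = """<?xml version="1.0"?>
<!DOCTYPE web-app PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN"
"http://java.sun.com/dtd/web-app_2_3.dtd">
<web-app>
  <display-name>Demo</display-name>
</web-app>
"""

dom = minidom.parseString(WEB_XML)

# The document type declaration names the root element and points at the DTD.
doctype_name = dom.doctype.name
public_id = dom.doctype.publicId
system_id = dom.doctype.systemId

# The single root element of the document.
root_name = dom.documentElement.tagName

print(doctype_name, root_name)
print(system_id)
```

Notice that the name in the declaration and the name of the root element are the same.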

<!ELEMENT web-app (icon?, display-name?, description?, distributable?,
context-param*, filter*, filter-mapping*, listener*, servlet*,
servlet-mapping*, session-config?, mime-mapping*, welcome-file-list?,
error-page*, taglib*, resource-env-ref*, resource-ref*, security-constraint*,
login-config?, security-role*, env-entry*, ejb-ref*, ejb-local-ref*)>

"web-app" is the root element. The words in parentheses are the 'child' elements. The number of times a child element may appear is determined by the symbol after the word. A question mark means the element is optional (zero or one), an asterisk means there can be zero or more of its kind, and a plus sign means one or more. A word with no symbol must appear exactly once. Notice that I am saying 'XML document' (because that is the jargon), but I am saying 'HTML file'.

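The meaning of the symbols can be spelled out in a few lines of Python. This is only a sketch: the child_limits function is a hypothetical helper of mine, and it assumes a flat, comma-separated list of children like the one in the web-app declaration, not the nested groups or choices that DTDs also allow.

```python
import re

# Occurrence symbols in a DTD content model, as (minimum, maximum).
# None stands for "no upper limit".
OCCURRENCE = {"": (1, 1), "?": (0, 1), "*": (0, None), "+": (1, None)}

def child_limits(content_model):
    """Map each child name in a flat content model to its (min, max) count."""
    limits = {}
    # Each match is a name followed by an optional ?, * or + symbol.
    for name, symbol in re.findall(r"([A-Za-z][\w.-]*)([?*+]?)", content_model):
        limits[name] = OCCURRENCE[symbol]
    return limits

limits = child_limits("(icon?, display-name?, servlet*, servlet-mapping*)")
for name, (low, high) in limits.items():
    print(name, low, "unbounded" if high is None else high)
```

So a web.xml with no servlet element at all still satisfies this part of the DTD, because the asterisk allows zero occurrences.
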
Well-Formed and Valid

A well-formed document is one that is syntactically correct: no unmatched open tags and no loose elements (in other words, all tags are properly nested). A valid document is one that also adheres to its DTD or schema. A well-formed document may not be valid. Who does the validation? A 'validating parser' does it. One example is Xerces, a product of the Apache Software Foundation, which supports XML 1.1, the latest version of XML at the time of writing. Sun also has a parser that you get when you download the latest Java SDK 1.4.2.
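
Checking well-formedness is easy to try out. Below is a minimal sketch in Python. The is_well_formed function is a hypothetical helper of mine, and the ElementTree parser it uses is non-validating, so it checks syntax only and never consults a DTD.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the text parses as XML, False on any syntax error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<a><b>x</b></a>"))   # tags properly nested
print(is_well_formed("<a><b>x</a></b>"))   # tags overlap
print(is_well_formed("<a><b>x</b>"))       # <a> is never closed
print(is_well_formed("<a/><b/>"))          # two root elements
```

Only the first call passes. Validity is a separate, stricter question that needs a validating parser such as the ones named above.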

If the DTD has something like this:

<!ELEMENT env-entry-value (#PCDATA)>

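it means the value between the tags is going to be parsed by the parser, so markup and entity references such as "&lt;" inside it are recognized. PCDATA stands for Parsed Character Data. There is no '#CDATA' keyword for element content. To have text taken as-is, with no parsing, you wrap it in a CDATA section in the document itself, which starts with "<![CDATA[" and ends with "]]>".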
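
The difference between parsed text and a CDATA section can be seen in a short Python sketch. The element name is borrowed from the DTD line above, and the values are made up.

```python
import xml.etree.ElementTree as ET

# Parsed character data: special characters must be written as entity
# references, and the parser turns them back into the real characters.
parsed = ET.fromstring("<env-entry-value>a &lt; b &amp; c</env-entry-value>")

# CDATA section: the text is taken as-is, so < and & need no escaping.
literal = ET.fromstring(
    "<env-entry-value><![CDATA[a < b & c]]></env-entry-value>"
)

# Writing the raw characters outside a CDATA section is a syntax error.
try:
    ET.fromstring("<env-entry-value>a < b & c</env-entry-value>")
    raw_is_accepted = True
except ET.ParseError:
    raw_is_accepted = False

print(parsed.text)
print(literal.text)
print(raw_is_accepted)
```

Both of the accepted forms hand the same text to the application. The only difference is how it has to be written in the document.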

XML Schema

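A schema serves the same purpose as a DTD, meaning XML documents are validated against it. Schemas are written in XSD (XML Schema Definition), a language based on XML, so an XML Schema is itself an XML document. Schemas came later, and offer much more flexibility in design. Let us take a look at a sample schema. Note that this sample is written in the older XML-Data Reduced (XDR) style, with <Schema> and <ElementType> tags; a W3C XSD schema expresses the same ideas with <xs:schema> and <xs:element>: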

<?xml version='1.0'?>
<Schema name="mySchema" xmlns="my:namespace" xmlns:dt="datatypes">

<ElementType name="firstName" content="textOnly"/>
<ElementType name="lastName" content="textOnly"/>
<ElementType name="age" content="textOnly" dt:type="int"/>

</Schema>

As you can see, you can specify data types in a schema, which you cannot do in a DTD. The allowable data types include integer, float, and so on. XML Schema has a DTD of its own too, so that a schema can itself be checked as a 'valid' document.

You can view the specifications at http://www.w3.org. W3C stands for World Wide Web Consortium, a platform where experts agree on specifications for the web.

Note on XSL

XSL stands for Extensible Stylesheet Language. It is a technology to format XML documents for presentation in the front-end. CSS (Cascading Style Sheets) is used to format HTML documents, while XSL 'transforms' XML into HTML.

XML Parsers

A parser takes an XML document, checks it for well-formedness and, optionally, validity, and builds a tree. The front-end then displays the tree. XSL comes into the picture now that the XML is ready to be displayed. Parsers may or may not validate a document; one that does is a 'validating parser'. XML parsers are written in C++, Java, and many other languages. Microsoft Corporation, IBM, and Sun Microsystems all have their own parsers. Java has two APIs: DOM (Document Object Model), which builds the whole tree in memory, and SAX (Simple API for XML), which reports the document to your code as a stream of events.
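
To compare the two styles side by side, here is a minimal sketch. It uses Python rather than Java, purely to keep the example short; Python's standard library happens to ship both a DOM and a SAX interface. The document text and the TitleHandler class are hypothetical.

```python
import xml.sax
from xml.dom import minidom

XML_TEXT = (
    "<LIBRARY>"
    "<BOOK><TITLE>First Title</TITLE></BOOK>"
    "<BOOK><TITLE>Second Title</TITLE></BOOK>"
    "</LIBRARY>"
)

# DOM style: parse everything into a tree first, then query the tree.
dom = minidom.parseString(XML_TEXT)
dom_titles = [node.firstChild.data for node in dom.getElementsByTagName("TITLE")]

# SAX style: the parser calls our handler as it meets each piece of the
# document; no tree is kept, so we collect what we need as we go.
class TitleHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def startElement(self, name, attrs):
        if name == "TITLE":
            self.in_title = True
            self.titles.append("")

    def characters(self, content):
        # Text may arrive in several chunks, so append rather than assign.
        if self.in_title:
            self.titles[-1] += content

    def endElement(self, name):
        if name == "TITLE":
            self.in_title = False

handler = TitleHandler()
xml.sax.parseString(XML_TEXT.encode("utf-8"), handler)
sax_titles = handler.titles

print(dom_titles)
print(sax_titles)
```

Both approaches arrive at the same titles. DOM is more convenient when you need to move around the document freely, while SAX uses less memory on large documents because it never holds the whole tree.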

Certification

IBM offers a certification 'XML Solutions Analyst'. javaranch.com has a forum for this certification. Also, jdiscuss.com is a good site for XML certification discussion.

Summary

XML is being used in the financial industry, real estate, mathematics, and voice recognition, to name a few. They are all creating their own languages based on XML. XML is a vast topic, and the purpose of this article was only to break the ice. I hope it did. Some gold mines on XML: w3schools.com/xml, xml.com. You are encouraged to post messages and questions right here on this page. Feel free to browse our discussion forum at orafaq.com/msgboard.



Prepared by Saikat Goswami, Boston, Massachusetts, sai_nyc@hotmail.com

Comments

Excellent document on XML. Your article cleared my few basic doubts. Thanks a lot.

Useful For a beginner like me

Very good for those who are learning XML for the first time. The overall view and purpose are highlighted here.