The Promise of XML

The Internet can be described as a massive system of interconnected computers that exchange millions of documents every second. These documents may be web pages, e-mail, news postings, and other kinds of data. The World Wide Web exists as it does today largely because of a document format called HTML, which offers an easy way to transfer information between computers of all types.

Even though HTML has been so successful, the technology is aging. It cannot define exactly what type of information a document contains. This leaves the Internet poorly organized, with billions of virtually undefined documents waiting for someone to find them, often by chance. With the Internet in obvious need of a better information infrastructure, a new format called XML has been developed to address the data definition problem. XML has the potential to change the Internet into a much more efficient medium.

The Origin of the Markup Languages

To understand these document formats, and how they compare, it is useful to understand their relationships. HTML and XML are related formats: HTML was defined as an application of SGML, and XML is a simplified subset of SGML. SGML itself is a very complex format whose predecessor was GML. GML was the initial format of this family, called the markup languages.

GML is an abbreviation for Generalized Markup Language. IBM first developed it under the name Text Description Language, with the explicit purpose of storing law office documentation. IBM quickly realized how universally useful this format was and began to test it in other computer applications. However, using different document types on a computer in the early 1970s was no small task, as computers at that time were very specialized. GML was useful because a document could be moved easily between different types of computers, with no need to build a whole new computer system to use it.

An example of GML is demonstrated here:

.ce
Centered Title
.ll 39
.ss

This is a paragraph that is single spaced with a line length of 39 characters.

A document created this way would then be saved as plain text. A computer with GML capability would read the document, process the commands, and display the result on a monitor or print it like this:

Centered Title

This is a paragraph that is single
spaced with a line length of 39
characters.
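
The processing step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `render` and the default width of 72 are invented here, the sketch understands only the three commands used in the example (`.ce`, `.ll`, `.ss`), and it is not a reconstruction of IBM's actual formatter.

```python
import textwrap

def render(source: str, default_width: int = 72) -> str:
    """Format the dot-command example: .ce centers the next line,
    .ll N sets the line length, .ss selects single spacing."""
    width = default_width   # assumed default, not taken from the article
    center_next = False
    out = []                # finished output lines
    paragraph = []          # text lines collected for the current paragraph

    def flush():
        # Re-wrap the collected paragraph to the current line length.
        if paragraph:
            out.extend(textwrap.wrap(" ".join(paragraph), width=width))
            out.append("")
            paragraph.clear()

    for line in source.splitlines():
        if line.startswith(".ce"):
            flush()
            center_next = True
        elif line.startswith(".ll"):
            flush()
            width = int(line.split()[1])
        elif line.startswith(".ss"):
            flush()         # single spacing is already what this sketch does
        elif not line.strip():
            flush()
        elif center_next:
            out.append(line.strip().center(width).rstrip())
            out.append("")
            center_next = False
        else:
            paragraph.append(line.strip())
    flush()
    return "\n".join(out).rstrip()

SOURCE = """.ce
Centered Title
.ll 39
.ss

This is a paragraph that is single spaced with a line length of 39 characters."""

print(render(SOURCE))
```

The markup file stays plain text, and all layout decisions are made by the program that reads it, which is what made such documents portable between machines.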

This example shows how a short GML document can produce a simple formatted document with a title and a defined paragraph. GML was also adapted for repetitive types of data, such as lists. This use made GML a raw data storage format as well, something that did not happen again for a markup language until XML was developed.

SGML is the second iteration of the markup language, and it dates back to its conception in 1974 by Charles Goldfarb, a co-developer of GML at IBM. SGML stands for Standard Generalized Markup Language, and it became an international document standard, ISO 8879, in 1986.

The major goal of SGML was to be useful for electronic manuscripts and documentation. SGML is still widely used for these applications today, in a much more advanced form.

Goldfarb wanted SGML to become a standardized format to ensure its usefulness and compatibility between computers:

SGML is designed to make your information last longer than the computer systems that created it. Such longevity also implies immunity to short-term changes – such as a change from one application to another – so SGML is also inherently designed for re-purposing and portability. [. . .] But the real key to SGML’s success – both politically and technically – is the fact that SGML is a bona fide International Standard, not the creation of a dominant vendor or a consortium. I say "politically" because large users feel they can safely invest millions to convert to SGML because the SGML specification is stable and is maintained by a neutral organization.

SGML also introduced a new syntax that was much easier to read, more versatile, and less prone to error when compared to GML. This syntax has also been carried from SGML to HTML and XML. The best way to explain the concept is by comparison:

"This is a statement."

The statement is marked as a quotation by the double quotes before and after it. In typeset text the opening and closing quotes even differ in shape, so each one shows whether the quotation is beginning or ending.

Put into an arbitrary SGML format, this quotation would look like this:

<quotation>This is a statement.</quotation>

The double quotes have been replaced by what are called tags: a beginning tag and an ending tag. The statement is still defined as a quotation, because of the tags surrounding it.

Multiple tags can enclose a statement, or any other marked-up data, and each additional tag defines the data more precisely. For example, to define a statement as both a quotation and a reference to Goldfarb, we could do this:

<goldfarb>
    <quotation>This is a statement.</quotation>
</goldfarb>

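As shown above, tags can be nested inside one another, but it is the sequence and nesting of the tags that defines the document: each inner tag must be closed before the tag that encloses it. For readability, all marked-up documents should also be well organized, for example by indenting nested tags.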
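
To make the role of nesting concrete, here is a minimal Python sketch. It assumes an XML parser (Python's standard `xml.etree.ElementTree`) as a stand-in for an SGML parser, since the small example above is valid in both; the helper names `labelled_text` and `is_well_formed` are invented for this illustration.

```python
import xml.etree.ElementTree as ET

def labelled_text(element, ancestors=()):
    """Yield (path of enclosing tags, text) for each element holding text."""
    path = ancestors + (element.tag,)
    if element.text and element.text.strip():
        yield "/".join(path), element.text.strip()
    for child in element:
        yield from labelled_text(child, path)

def is_well_formed(markup):
    """Return True if the parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# Properly nested: the inner tag closes before the outer one.
nested = "<goldfarb><quotation>This is a statement.</quotation></goldfarb>"
# Crossed: the outer tag closes while the inner tag is still open.
crossed = "<goldfarb><quotation>This is a statement.</goldfarb></quotation>"

print(list(labelled_text(ET.fromstring(nested))))
print(is_well_formed(nested), is_well_formed(crossed))
```

The statement is reported together with every tag that encloses it, so it is known to be both a quotation and related to Goldfarb, while the crossed version is rejected outright.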

The full SGML specification of today is extremely complex. For example, tags not only define the data; the tags themselves are defined by a Document Type Definition, or DTD, which specifies which tags may appear and how they may be nested. The result is a full definition language with specific rules.
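
The idea of rules that govern the tags themselves can be approximated with a small Python sketch. The table `ALLOWED_CHILDREN` and the function `violations` are hypothetical and cover only the tiny Goldfarb example; a real DTD is written in its own syntax and can express far more, such as ordering, repetition, and attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a DTD: element name -> names allowed directly inside it.
ALLOWED_CHILDREN = {
    "goldfarb": {"quotation"},
    "quotation": set(),   # may contain text only
}

def violations(element):
    """Return a list of rule violations found in element and its descendants."""
    if element.tag not in ALLOWED_CHILDREN:
        return [f"undeclared element <{element.tag}>"]
    problems = []
    for child in element:
        declared = child.tag in ALLOWED_CHILDREN
        if declared and child.tag not in ALLOWED_CHILDREN[element.tag]:
            problems.append(f"<{child.tag}> not allowed inside <{element.tag}>")
        problems.extend(violations(child))
    return problems

good = ET.fromstring("<goldfarb><quotation>This is a statement.</quotation></goldfarb>")
inverted = ET.fromstring("<quotation><goldfarb>This is a statement.</goldfarb></quotation>")
unknown = ET.fromstring("<goldfarb><note>This is a statement.</note></goldfarb>")

for document in (good, inverted, unknown):
    print(violations(document))
```

All three documents are well formed, but only the first obeys the declared rules, which is the extra layer of checking that a DTD adds.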

Since SGML can be so complex, computer programs have been written to aid in the creation of SGML documents, reducing errors and increasing production speed dramatically.

The Rise of HTML

Because SGML is such a large specification, it is mostly suited to industrial and professional applications where data integrity is a priority. When the World Wide Web was created, it was clear that full SGML would be too heavy for it, so a new, much simpler language based on SGML had to be created.

This new language was called HTML, or HyperText Markup Language. Tim Berners-Lee created it at CERN, and the first public description of its tags appeared in 1991. The World Wide Web Consortium, or W3C, an Internet standards organization formed in 1994, later took over development of the HTML specification. HTML was designed as a simple language whose tags are predefined by a standard DTD shared by all HTML documents. The tags focus mostly on items such as titles, paragraphs, and their properties, much like the GML example. HTML also added tags for embedded images and hyperlinks, making it more useful as an Internet medium.

An HTML web page is displayed on a computer using a web browser. The web browser reads HTML sent to the computer across the Internet by a server. A server is a computer that has the task of talking to other computers by sending and receiving data. To receive an HTML document, you request it by typing in a web address or by clicking on a link in a web page. The two actions produce exactly the same type of request, even though they seem very different in their use.
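
The claim that typing an address and clicking a link produce the same request can be illustrated with Python's standard `urllib`. The address `http://www.example.com/john/` is a made-up placeholder, and the sketch only builds the request objects without contacting any server.

```python
from urllib.parse import urljoin
from urllib.request import Request

# Hypothetical address of the page the reader is currently viewing.
page_url = "http://www.example.com/john/index.html"

# Case 1: the reader types the full address of the resume.
typed = Request("http://www.example.com/john/resume.html")

# Case 2: the reader clicks a link whose target is the relative name "resume.html".
# The browser first resolves it against the address of the current page.
clicked = Request(urljoin(page_url, "resume.html"))

print(typed.get_method(), typed.full_url)
print(clicked.get_method(), clicked.full_url)
```

Once the relative link has been resolved, both cases ask the server for the same address with the same method, so the server cannot tell them apart.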

HTML makes creating a basic web page easy because it is such a simple format. However, making a complex web page becomes tedious because not all web browsers read HTML the same way, understand the same tags, or conform to the standards. This is the type of fragmentation that Goldfarb was able to avoid by setting up a strict standards system for SGML. Even though HTML is a standard, intense competition in the Internet software industry has been the cause of fragmentation.

A basic HTML document can be shown by example:

<html>
   <body>
     <img src="johnsphoto.jpg"/>

     <b><font color="red">John Doe. </font></b>
     <a href="resume.html">Link to my Resume.</a>
   </body>
</html>

This document would display a photo with a bold, red "John Doe" next to it. Next would be a link to his resume, which is a separate HTML document. Even a handful of HTML tags can make a useful web page.
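
As a rough illustration of how a program reads such a page, the following Python sketch uses the standard `html.parser` module to list the image and the link that the document refers to. The class name `ResourceLister` is invented for this example, and a real browser does far more than this.

```python
from html.parser import HTMLParser

PAGE = """<html>
   <body>
     <img src="johnsphoto.jpg"/>

     <b><font color="red">John Doe. </font></b>
     <a href="resume.html">Link to my Resume.</a>
   </body>
</html>"""

class ResourceLister(HTMLParser):
    """Collect the files a page points at: embedded images and hyperlinks."""

    def __init__(self):
        super().__init__()
        self.images = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "img" and "src" in attributes:
            self.images.append(attributes["src"])
        elif tag == "a" and "href" in attributes:
            self.links.append(attributes["href"])

lister = ResourceLister()
lister.feed(PAGE)
print("images:", lister.images)
print("links:", lister.links)
```

The tags tell the program where the photo and the resume are, but nothing in them says that "John Doe" is a person's name.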

The Need for a New Language

A drawback of HTML is that it cannot define the data within it. HTML tags describe how a document is structured and displayed but say nothing about what the data means, which leaves the data ambiguous. Searching HTML data can therefore return inaccurate results, and time is lost sorting through them manually.

This drawback of HTML can be shown by example. For instance, we have an HTML document with some items for sale:

<html>
   <body>
      <p>Red chair for sale, $40.</p>
      <p>Blue table for sale, $60.</p>
   </body>
</html>

Imagine that you are a buyer looking for a red table. In your search, you are going to get the above example document in your search results because it contains both the words "red" and "table". However, this document does not contain a red table for sale, and time was lost looking at this irrelevant page.
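
The false match is easy to reproduce. The sketch below assumes a very naive search that simply checks whether every query word occurs somewhere in the page text; the function name `keyword_match` is invented here, and real search engines are more sophisticated, though they face the same ambiguity.

```python
import re

# The visible text of the HTML example, with the tags stripped away.
PAGE_TEXT = "Red chair for sale, $40. Blue table for sale, $60."

def keyword_match(text, query):
    """Return True if every word of the query appears somewhere in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return all(term in words for term in query.lower().split())

print(keyword_match(PAGE_TEXT, "red table"))
print(keyword_match(PAGE_TEXT, "green table"))
```

The page matches "red table" even though "red" belongs to the chair and "table" belongs to the blue item, because the search has no way to know which words go together.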

What if we could combine the precise data definition of SGML and the Internet capabilities of HTML? Such a language would likely have saved us time in our search for a red table. There is such a language, and it is XML. With XML, it is even possible to tell what a document contains by looking only at the tags, without reading the content itself. If we rewrite the above example in XML, we could have something like this:

<?xml version="1.0"?>
  <sale>
    <red>
      <chair>Red chair for sale, $40.</chair>
    </red>
    <blue>
      <table>Blue table for sale, $60.</table>
    </blue>
  </sale>

The sole <table> tag is defined inside of <blue>, therefore defining that table as blue. There is not a <table> tag inside of <red>, but if the seller had another table that was red, it would go there.

Since no item is defined as a red table, a search that understands these tags would not return this document in our results, therefore saving time. For applications like this, the benefits of XML can be seen in making the Internet more useful and reliable.
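
A tag-aware search over the XML example might look like the following Python sketch, which uses the standard `xml.etree.ElementTree` module. The function name `for_sale` is invented for this illustration, and it assumes the color-then-item layout of this one example rather than any general XML convention.

```python
import xml.etree.ElementTree as ET

XML_DOC = """<?xml version="1.0"?>
  <sale>
    <red>
      <chair>Red chair for sale, $40.</chair>
    </red>
    <blue>
      <table>Blue table for sale, $60.</table>
    </blue>
  </sale>"""

root = ET.fromstring(XML_DOC)

def for_sale(root, color, item):
    """Return the listings for an item tag nested directly inside a color tag."""
    return [element.text for element in root.findall(f"./{color}/{item}")]

print(for_sale(root, "red", "table"))
print(for_sale(root, "blue", "table"))
```

Because the query follows the nesting of the tags instead of matching loose words, the search for a red table comes back empty and the blue table is found only when it is actually asked for.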







Copyright © 1999-2007 TargetPC.com. All rights reserved. Privacy information.

