The Promise of XML

The Internet can be described as a massive system of interconnected computers that
send millions of documents every second. These documents could be web pages, e-mail,
news postings, advanced information documents, and more. The Internet exists as
it does today because of a document format called HTML that offers an easy way
to transfer information between computers of all types.

Even
though HTML has been so successful, the technology is aging. It does not have
the ability to define exactly what type of information is contained in a document.
This leaves the Internet very disorganized, with billions of virtually
undefined documents just waiting for someone to find them, often by chance.

With the Internet in obvious need of a better information infrastructure, a new
format called XML has been developed to address these data definition problems.
XML has the potential to change the Internet into a much more efficient medium.

The Origin of the Markup Languages

To
understand these document formats, and how they compare, it is useful to understand
their relationships. HTML and XML are related formats; both derive from a third,
very complex format called SGML. HTML was defined as an application of SGML, and
XML is a simplified subset of it. SGML in turn descends from GML, the initial
format of this family, called the markup languages.

GML
is an abbreviation for Generalized Markup Language, and IBM first developed it
under the name Text Description Language with the explicit purpose of storing
law office documentation. IBM quickly realized how universally useful this format
was, and began to test it on other computer applications. However, using different
document types on a computer in the early 1970s was no small task, as computers
at that time were very specialized. GML was useful because it could be ported
to different types of computers without building a whole new computer system
to use it.

An example of GML is demonstrated here:

.ce
Centered Title
.ll 39
.ss
This is a paragraph that is single spaced with a line length of 39 characters.

A document created as
such would then be saved as plain text. A computer with GML capability reading
this document would then process the information and display it on the monitor
or printer like this:

            Centered Title
This is a paragraph that is single
spaced with a line length of 39
characters.
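
The processing step described above can be sketched in a few lines of Python. This is only an illustration, not IBM's actual formatter: the meanings of `.ce`, `.ll`, and `.ss` are inferred from the example itself, and the `render` function and its default line length of 39 are assumptions made for the sketch.

```python
import textwrap

# The sample document from the example above, stored as plain text.
SAMPLE = """.ce
Centered Title
.ll 39
.ss
This is a paragraph that is single spaced with a line length of 39 characters.
"""

def render(source, default_line_length=39):
    """Format a tiny GML-like document (hypothetical, simplified rules)."""
    line_length = default_line_length
    center_next = False
    pending = []   # text lines waiting to be filled into a paragraph
    output = []

    def flush():
        # Wrap the collected text to the current line length.
        if pending:
            output.extend(textwrap.wrap(" ".join(pending), width=line_length))
            pending.clear()

    for line in source.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("."):
            # A line beginning with a period is a formatting command.
            flush()
            word, *args = line[1:].split() or [""]
            if word == "ce":
                center_next = True           # center the next text line
            elif word == "ll" and args:
                line_length = int(args[0])   # set the line length
            elif word == "ss":
                pass                         # single spacing is all this sketch supports
        elif center_next:
            output.append(line.center(line_length).rstrip())
            center_next = False
        else:
            pending.append(line)

    flush()
    return "\n".join(output)

print(render(SAMPLE))
```

The point of the sketch is that the stored document is nothing more than text plus a few commands, so any computer with a suitable program can reproduce the formatted result.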

This
example has shown how a short GML document can be created to produce a
simple, formatted document, with a title and a defined paragraph. However, GML
was also adapted for use with repetitive types of data, such as lists. This application
let GML serve as a raw data storage format, something that would not happen again for
a markup language until XML was developed.

SGML
is the second iteration of the markup language, and it dates back to its conception
in 1974 by Charles Goldfarb, a co-developer of GML at IBM. SGML stands for Standard
Generalized Markup Language, and it has become an internationally used document
standard.

The major goal of SGML was to be useful for electronic manuscripts and documentation.
These applications for SGML are still widely used today in a much more advanced
state.

Goldfarb wanted SGML to become a standardized format to ensure its usefulness and compatibility
between computers:

SGML
is designed to make your information last longer than the computer systems that
created it. Such longevity also implies immunity to short-term changes – such
as a change from one application to another – so SGML is also inherently designed
for re-purposing and portability. [. . .] But the real key to SGML’s success –
both politically and technically – is the fact that SGML is a bona fide International
Standard, not the creation of a dominant vendor or a consortium. I say "politically"
because large users feel they can safely invest millions to convert to SGML because
the SGML specification is stable and is maintained by a neutral organization.

SGML
also introduced a new syntax that was much easier to read, more versatile, and
less prone to error when compared to GML. This syntax has also been carried from
SGML to HTML and XML. The best way to explain the concept is by comparison:

“This is a statement.”

The
statement is defined as a quotation because of the double-quotes before and after
it. You can also notice that there is a difference in the double-quotes, with
a beginning and ending type.

Put into an arbitrary SGML format, this quotation would look like this:

<quotation>This is a statement.</quotation>

There
is a beginning tag and an ending tag, and these tags have replaced
the double-quotes. However, the statement is still defined as a quotation because
of the tags surrounding it.

There can be multiple tags encapsulating a statement, or any type of marked-up data.
By doing this, the data can be better defined. For example, if we wanted to define
a statement as both a quotation, and referring to Goldfarb, we could do this:
<goldfarb>
<quotation>This is a statement.</quotation>
</goldfarb>

As shown above, the tags can be placed in any manner; however, it is the sequence
of the tags that defines the document. For readability, all marked-up documents
should be well organized.

The
full SGML specification of today is extremely complex. For example, not only do
tags define the data; the tags are also defined by something called a Document
Type Definition, or DTD, resulting in a full definition language with specific
rules.

Since SGML can be so complex, computer programs have been written to aid in the creation
of SGML documents, reducing errors and increasing production speed
dramatically.

The Rise of HTML

With
SGML being a very large specification, it is only suitable for industrial and
professional applications where data integrity is a priority. With the creation
of the Internet, it was obvious that SGML would not be suited for Internet applications,
so a new, simpler language based on it had to be created.

This new specification was called HTML, or HyperText Markup Language, and it emerged
in the early 1990s. The World Wide Web Consortium, or W3C, an Internet standards
organization formed in 1994, went on to maintain the HTML specification as a simple
language. The HTML tags were predefined by a standard DTD to be used by all HTML
documents. The tags that were defined focused mostly on defining such items as
titles, paragraphs and their properties, much like the GML example. However, simplified
syntax was borrowed from SGML for the use of embedded images and hyperlinks,
making it more useful as an Internet medium.

An
HTML web page is displayed on a computer using a web browser. The web browser
reads HTML sent to the computer across the Internet by a server. A server is a
computer that has the task of talking to other computers by sending and receiving
data. In order to receive an HTML document, you have to go out and request it
by typing in a web address, or clicking on a link in a web page. The two processes
are exactly the same type of request, even though they seem very different in
their use.

HTML
makes creating a basic web page easy because it is such a simple format. However,
making a complex web page becomes tedious because not all web browsers read HTML
the same way, understand the same tags, or conform to the standards. This is
the type of fragmentation that Goldfarb was able to avoid by setting up a strict
standards system for SGML. Even though HTML is a standard, intense competition
in the Internet software industry has caused this fragmentation.

A basic HTML document can be shown by example:

<html>
  <body>
    <img src="johnsphoto.jpg"/>
    <b><font color="red">John Doe. </font></b>
    <a href="resume.html">Link to my Resume.</a>
  </body>
</html>

This document
would display a photo with a bold, red "John Doe" next to it. Next would
be a link to his resume, which is a separate HTML document. Even with a handful
of HTML tags, a web page can be made that proves useful.

The Need for a New Language

A
drawback of HTML is that it cannot define the data within it. The tags in HTML
do not say anything about the meaning of the data, leaving it ambiguous. Searching HTML
data can return inaccurate results, and time is then lost sorting the results
by hand.

This drawback of HTML can be shown by example. For instance, we have an HTML document
with some items for sale:

<html>
  <body>
    <p>Red chair for sale, $40.</p>
    <p>Blue table for sale, $60.</p>
  </body>
</html>

Imagine that you are a buyer
looking for a red table. In your search, you are going to get the above example
document in your search results because it contains both the words "red"
and "table". However, this document does not contain a red table for
sale, and time was lost looking at this irrelevant page.

What if we could combine the precise data definition of SGML and the Internet capabilities
of HTML? Such a language would likely have saved us time in our search for a red
table. There is such a language, and it is XML. With XML, it is even possible
to tell what is contained in a document without actually looking at the content,
just the tags. If we rewrite the above example in XML, we could have something
like this:

<?xml version="1.0"?>
<sale>
  <red>
    <chair>Red chair for sale, $40.</chair>
  </red>
  <blue>
    <table>Blue table for sale, $60.</table>
  </blue>
</sale>

The sole <table> tag
is defined inside of <blue>, therefore defining that table as blue. There
is not a <table> tag inside of <red>, but if the seller had another
table that was red, it would go there.

Since there is not an item defined as a red table, we would not get this document in
our search results, therefore saving time. For applications like this, the benefits
of XML can be seen in making the Internet more useful and reliable.
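
The difference between the two searches can be sketched in Python. This is a minimal illustration under stated assumptions, not how any real search engine works: `keyword_match` and `structured_match` are hypothetical helper names, and the structured search assumes the color-then-item tag layout used in the XML example above.

```python
import re
import xml.etree.ElementTree as ET

# The two example documents from this section.
HTML_DOC = """<html><body>
<p>Red chair for sale, $40.</p>
<p>Blue table for sale, $60.</p>
</body></html>"""

XML_DOC = """<?xml version="1.0"?>
<sale>
  <red><chair>Red chair for sale, $40.</chair></red>
  <blue><table>Blue table for sale, $60.</table></blue>
</sale>"""

def keyword_match(document, words):
    """Naive text search: strip the tags, then check that every word appears."""
    text = re.sub(r"<[^>]+>", " ", document).lower()
    tokens = set(re.findall(r"[a-z]+", text))
    return all(word in tokens for word in words)

def structured_match(document, color, item):
    """Tag-based search: look for an <item> element nested inside <color>."""
    root = ET.fromstring(document)
    return root.find(f"./{color}/{item}") is not None

# The keyword search matches because "red" and "table" both occur somewhere.
print(keyword_match(HTML_DOC, ["red", "table"]))
# The tag-based search only matches a table that sits inside <red>.
print(structured_match(XML_DOC, "red", "table"))
print(structured_match(XML_DOC, "blue", "table"))
```

The keyword search cannot tell that "red" describes the chair and not the table, while the tag-based search reads that relationship directly from the nesting of the tags.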