X Marks the Spot - What is XML?
By Lasa Information Systems Team
XML (eXtensible Markup Language) can be a powerful tool for exchanging and repurposing information. This article provides an introduction to XML, its uses and potential.
Introduction
Organisations hold ever more information on computer systems. But often, different systems won't share that information - pages on your web site don't easily turn into printed leaflets, or information from your accounts package can't be easily incorporated into word processor files. The situation can be even worse when you want to share information with other agencies or with funders - which most agencies increasingly need to do.
One answer may be a technology called XML. XML can be used to store many kinds of structured information, and to exchange that information between different applications using different kinds of computer - systems that would otherwise be unable to communicate. All this can be done using relatively inexpensive software.
What is XML?
To understand XML, we need to briefly explain three technologies that work together as part of these systems: XML itself, Document Type Definitions (DTD) and Stylesheets.
XML
The best place to start is with HTML, the language in which web pages are written. HTML formats text on web pages and provides "hyperlinks" which link different pages together.
Here's an example of a very simple web page:
7 days in Greece - £250
14 days in Majorca - £350
14 days in Bournemouth - £350
The HTML for the page might look something like this:
[HTML code listing not preserved in this copy]
The HTML is basically text, formatted by "tags" - the bits in angled brackets like <this>.
So far, so good. But what if we want to discount all the holiday prices by ten percent? There is no easy way to locate the price information in the HTML code. We could search for the numbers that follow a particular formatting tag and the £ sign. But the HTML may not be written consistently - in this case, for example, that tag in the last line is in a slightly different place.
Now, let's rewrite this page as an XML document. While HTML limits users to a repertoire of pre-defined tags, XML allows you to make up your own tags, or "elements" as it calls them. So the XML code for our web page might look like this:
<holidayoffers>
<holiday><days>7</days> days in <location>Greece</location> - £<price>250</price></holiday>
<holiday><days>14</days> days in <location>Majorca</location> - £<price>350</price></holiday>
<holiday><days>14</days> days in <location>Bournemouth</location> - £<price>350</price></holiday>
</holidayoffers>
This makes it much easier to modify particular elements, as we want to do with <price>. However, you can't just make up elements as you go along - that way chaos lies. You have to write a Document Type Definition (DTD).
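To see why this helps, here is a minimal sketch in Python, using the standard library's xml.etree.ElementTree module, that applies the ten percent discount to every <price> element. The function name discount_prices and the two-decimal output format are our own assumptions for illustration; they are not part of XML itself.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# The holiday offers document from the article.
HOLIDAY_XML = """<holidayoffers>
<holiday><days>7</days> days in <location>Greece</location> - £<price>250</price></holiday>
<holiday><days>14</days> days in <location>Majorca</location> - £<price>350</price></holiday>
<holiday><days>14</days> days in <location>Bournemouth</location> - £<price>350</price></holiday>
</holidayoffers>"""


def discount_prices(xml_text, percent):
    """Return the XML with every <price> reduced by `percent` percent."""
    root = ET.fromstring(xml_text)
    # The element name tells us exactly where the prices are.
    for price in root.iter("price"):
        reduced = Decimal(price.text) * (100 - percent) / 100
        price.text = format(reduced, ".2f")
    return ET.tostring(root, encoding="unicode")


discounted = discount_prices(HOLIDAY_XML, 10)
new_prices = [p.text for p in ET.fromstring(discounted).iter("price")]
```

Nothing here depends on where the £ sign sits or how the line is laid out; the code only looks for the <price> element.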
Document Type Definition
The DTD specifies which elements are allowed, and includes some information about their content.
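As an illustration, a DTD for the holiday example might look like the one embedded at the top of the document in the sketch below. The declarations are our own guess at a suitable DTD, not one taken from a real system. Note also that Python's standard XML parser reads a DTD but does not enforce it, so the small unexpected_elements function is a hand-written stand-in for what a validating parser would do; it and the ALLOWED_CHILDREN table are hypothetical names.

```python
import xml.etree.ElementTree as ET

# A hypothetical DTD, included at the start of the XML document itself.
DOCUMENT = """<!DOCTYPE holidayoffers [
  <!ELEMENT holidayoffers (holiday+)>
  <!ELEMENT holiday (#PCDATA | days | location | price)*>
  <!ELEMENT days (#PCDATA)>
  <!ELEMENT location (#PCDATA)>
  <!ELEMENT price (#PCDATA)>
]>
<holidayoffers>
<holiday><days>7</days> days in <location>Greece</location> - £<price>250</price></holiday>
</holidayoffers>"""

# The same rules written out by hand: which child elements each element may hold.
ALLOWED_CHILDREN = {
    "holidayoffers": {"holiday"},
    "holiday": {"days", "location", "price"},
    "days": set(),
    "location": set(),
    "price": set(),
}


def unexpected_elements(root):
    """List elements that appear where the rules above do not allow them."""
    problems = []
    if root.tag not in ALLOWED_CHILDREN:
        problems.append(root.tag)
    for parent in root.iter():
        allowed = ALLOWED_CHILDREN.get(parent.tag, set())
        for child in parent:
            if child.tag not in allowed:
                problems.append(f"{child.tag} inside {parent.tag}")
    return problems


good = unexpected_elements(ET.fromstring(DOCUMENT))
bad = unexpected_elements(
    ET.fromstring("<holidayoffers><holiday><hotel>Sea View</hotel></holiday></holidayoffers>")
)
```

The first document follows the rules, so nothing is reported; the second uses a made-up <hotel> element and is flagged.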
The DTD can be included at the start of an XML file, but will usually go in a separate file of its own. So you have your data, structured with elements defined in your DTD. But how will your browser display it? Browsers know about HTML tags but your browser has no information about how to format the elements <location> or <price>. This is where XML uses a third technology, the Stylesheet.
Stylesheets
XML makes use of another technology, eXtensible Stylesheet Language (XSL). A formatting tool combines the XSL stylesheet with the XML document to produce an HTML document with the desired formatting.
In this case the <location> element is formatted in bold, and the <price> element is formatted in italic. What's more, stylesheets can combine with XML documents and appropriate formatting tools to produce many other kinds of files as well as HTML ones - they can generate text or database files, for example.
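The idea behind a stylesheet can be sketched in a few lines of Python. This is not XSL itself, just a stand-in that shows the principle: a set of rules, kept apart from the data, decides how each element is presented. The FORMATTING table and the function name holiday_to_html are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from html import escape

# Presentation rules, kept separate from the data: element name -> HTML tag.
FORMATTING = {"location": "b", "price": "i"}


def holiday_to_html(holiday):
    """Turn one <holiday> element into a line of HTML."""
    parts = [escape(holiday.text or "")]
    for child in holiday:
        text = escape(child.text or "")
        tag = FORMATTING.get(child.tag)
        # Elements with a rule are wrapped in its tag; others are left as plain text.
        parts.append(f"<{tag}>{text}</{tag}>" if tag else text)
        # Text that follows the child element, such as " days in ".
        parts.append(escape(child.tail or ""))
    return "<p>" + "".join(parts) + "</p>"


holiday = ET.fromstring(
    "<holiday><days>7</days> days in <location>Greece</location>"
    " - £<price>250</price></holiday>"
)
html_line = holiday_to_html(holiday)
```

Changing the look of every page then means changing the rules in one place, not editing each holiday by hand.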
Simple but powerful technologies
We hope you've got this far - because that's the end of the technical bit. Now we want to explain why the way of working we've just described has the potential to be enormously useful:
XML structures data, rather than specifying presentation
HTML is only concerned with how a document looks - which bits are bold, which italic, which are headings and so forth. There is no way of structuring the data according to its meaning - which is what XML elements allow you to do.
Separating presentation and content like this has advantages for web page development. Many web pages contain embedded scripting - the programming that makes them very dynamic and easy to use - as well as the actual content of the pages.
This presents problems, because both programmers and content authors have to have access to the same files - either group can accidentally "break" the other's work.
By separating the code, which controls presentation, from the content, XML allows access to each type of file to be restricted. It becomes possible to edit the site's content without any danger of damaging presentation systems, and vice versa.
Because HTML's essential purpose is to display documents, web browsers will usually ignore mistakes in a document's code and make a best guess at what it should look like. For example, if you leave off the closing tag of a paragraph in an HTML document, it will still display correctly. By contrast, the rules for XML files are strict. A forgotten tag makes an XML file unusable; applications cannot try to second-guess the file's creator - it's broken, and the application has to stop right there and report an error.
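This strictness is easy to see in practice. In the short Python sketch below, the standard library's parser accepts a complete fragment and rejects one with a forgotten closing tag. The helper name is_well_formed is our own.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


complete = is_well_formed("<holiday><price>250</price></holiday>")
# The closing </price> tag has been forgotten here.
broken = is_well_formed("<holiday><price>250</holiday>")
```

The parser does not try to guess where the missing tag should have gone; it simply raises an error.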
Reusing structured data in different applications
Imagine that you publish a range of information items - on paper as leaflets, and electronically as web pages. How can you ensure that the leaflets and web pages always contain the same information?
The answer is to store that information as XML documents, with elements such as <headline>, <author> and so forth. You then use a stylesheet to automatically generate HTML pages from your XML information. And you use a different stylesheet to produce a file which your desktop publishing software, or word processing software, can open - with the formatting information already included.
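Here is a rough sketch of that "one source, two outputs" idea in Python. The two functions stand in for two stylesheets; a real system would more likely use XSL. The <headline> and <author> elements come from the text above, while the <article> and <body> elements, the sample content and the function names are made up for this example.

```python
import xml.etree.ElementTree as ET
from html import escape

# A made-up information item, stored once as XML.
ARTICLE_XML = """<article>
<headline>Claiming housing benefit</headline>
<author>Advice Team</author>
<body>Check your entitlement before you apply.</body>
</article>"""


def to_web_page(article):
    """First 'stylesheet': produce an HTML fragment for the web site."""
    return (
        f"<h1>{escape(article.findtext('headline'))}</h1>\n"
        f"<p><em>{escape(article.findtext('author'))}</em></p>\n"
        f"<p>{escape(article.findtext('body'))}</p>"
    )


def to_leaflet_text(article):
    """Second 'stylesheet': produce plain text laid out for a printed leaflet."""
    headline = article.findtext("headline")
    return "\n".join([
        headline.upper(),
        "=" * len(headline),
        article.findtext("body"),
        f"Written by {article.findtext('author')}",
    ])


article = ET.fromstring(ARTICLE_XML)
web_page = to_web_page(article)
leaflet = to_leaflet_text(article)
```

Both outputs are generated from the same stored document, so a correction made to the XML shows up in the web page and the leaflet alike.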
The print leaflets could make use of multiple columns, page headers and footers, tables of contents, indices, high-resolution graphics and other features particularly suited to print. Web output could make use of moderate page sizes, intra- and inter-page linking, automatic stripping of irrelevant information, low-resolution graphics to speed up page downloading, and links from information to actions - for example, clicking on a leaflet or book to order it.
Specific elements need to behave differently depending on the medium - for example, a style sheet for printing content can be set up so that footnotes show up at the end of each article or the bottom of each page. For the same content displayed on the Web, footnotes could be displayed in a different colour.
XML makes use of a Document Type Definition file (DTD), one or more eXtensible Stylesheet Language (XSL) files, and appropriate XSL formatting tools to enable information to be output in different formats, as represented in the diagram below:
Recent versions of Adobe's Acrobat software are one example of an application designed to work with XML. From Version 5 onwards, the application stores extra information about the contents of PDF files. Text includes formatting information, and pictures can be stored alongside text descriptions of their content. All this makes the files accessible for visually impaired people in a way that PDF files aren't if produced with Acrobat 4 or earlier.
XML and information sharing
XML Isn't Just for the Web. One of the most important points to understand is that XML isn't just a replacement for HTML. HTML only exists to control the appearance of web pages. XML can do much, much more. If you look back at the XML file above, it may strike you that it looks quite like a database table. Indeed, the process of defining the elements in an XML document is very like defining the fields in a database. XML is a useful technology for storing many kinds of structured information.
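To make the database comparison concrete, the Python sketch below reads the holiday offers into a list of records, one per <holiday>, with one field per element. The function name holidays_as_rows is our own, and treating days and prices as whole numbers is an assumption that happens to fit this data.

```python
import xml.etree.ElementTree as ET

HOLIDAY_XML = """<holidayoffers>
<holiday><days>7</days> days in <location>Greece</location> - £<price>250</price></holiday>
<holiday><days>14</days> days in <location>Majorca</location> - £<price>350</price></holiday>
<holiday><days>14</days> days in <location>Bournemouth</location> - £<price>350</price></holiday>
</holidayoffers>"""


def holidays_as_rows(xml_text):
    """Return one dictionary per <holiday>, much like rows in a database table."""
    root = ET.fromstring(xml_text)
    return [
        {
            "days": int(holiday.findtext("days")),
            "location": holiday.findtext("location"),
            "price": int(holiday.findtext("price")),
        }
        for holiday in root.iter("holiday")
    ]


rows = holidays_as_rows(HOLIDAY_XML)
# A simple "query": which holidays last a fortnight?
fortnights = [row["location"] for row in rows if row["days"] == 14]
```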
XML Enables Information Sharing. So you can use XML to store information as you would in a database. Other agencies with which you need to exchange information can also save their information as XML, or export it from a database as an XML document. Hurrah. But how can you be sure that the two XML documents are compatible? What if you've called an element one thing, and the other agency has called it another? For example, what if they say <client></client> and you say <user></user>?
There are two solutions here. The first is to be sure that you're using the same elements in the first place. In many different fields, schemes are being developed to allow people to exchange information smoothly.
For example, Chemical Markup Language allows specialists to share information about molecular composition. Commerce XML describes catalogue data, and is used for business transactions.
Legal XML is used in the US to describe legal documents, improving communication between lawyers and the court system.
The second solution depends on using stylesheets. A stylesheet can combine with an XML document to produce information in various formats - an HTML document, a text document, or even a second XML document. So you can use a stylesheet to change the <user> element into the <client> element, or vice versa, and share your information.
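The renaming step can be sketched in Python as follows. A real exchange would more likely use an XSL stylesheet; this stand-in only shows how mechanical the change is. The <user> and <client> elements come from the example above, while the <records> and <name> elements, the sample content and the function name rename_elements are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Our records use <user>; the other agency expects <client>.
OUR_XML = "<records><user><name>A. Person</name></user></records>"


def rename_elements(xml_text, old_tag, new_tag):
    """Return the XML with every `old_tag` element renamed to `new_tag`."""
    root = ET.fromstring(xml_text)
    # Collect the matches first, then rename them; the content is left untouched.
    for element in list(root.iter(old_tag)):
        element.tag = new_tag
    return ET.tostring(root, encoding="unicode")


their_xml = rename_elements(OUR_XML, "user", "client")
back_again = rename_elements(their_xml, "client", "user")
```

Because the same function works in either direction, each agency can keep its own element names and convert only at the point of exchange.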
XML in the Real World
XML makes it possible for people to share information, while allowing them to keep information in the format that best suits them. Many applications such as databases, word processors and web browsers have XML support.
RSS (Really Simple Syndication), a lightweight XML format, is widely used for aggregating and sharing news headlines and other web content.
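Because an RSS feed is ordinary XML, a few lines of Python are enough to pull the headlines out of one. The feed below is a made-up example following the usual RSS 2.0 layout of a <channel> containing <item> elements; the titles, the example.org addresses and the function name headlines are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up feed in the RSS 2.0 layout.
FEED = """<rss version="2.0">
<channel>
<title>Example news</title>
<link>http://www.example.org/</link>
<description>A made-up feed for illustration</description>
<item><title>First headline</title><link>http://www.example.org/1</link></item>
<item><title>Second headline</title><link>http://www.example.org/2</link></item>
</channel>
</rss>"""


def headlines(feed_text):
    """Return a (title, link) pair for every <item> in the feed."""
    root = ET.fromstring(feed_text)
    return [
        (item.findtext("title"), item.findtext("link"))
        for item in root.iter("item")
    ]


latest = headlines(FEED)
```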
Most applications that manipulate data (such as document and web content management systems, or directory services) have or promise some level of XML capability.
In the voluntary sector we could also be collaborating to develop XML to describe our resources and processes, classify and exchange our information.
Imagine the advantages of a "Funding Markup Language" that could be used to agree the information required for any funding application.
Specific information could be extracted from the XML and reformatted to meet the needs of individual funders. Financial and accounting information standards could be captured using XML.
Resources such as advice leaflets and caselaw could be described and classified using XML, readily exchanged between agencies and adapted for specific purposes.
A Referrals Markup Language could be used to agree and define processes for referring clients between organisations. The possibilities are endless ... XML is likely to become the language for computer communication.
Unlike Microsoft's Office applications or Adobe's PDF format, XML is a standard that no one company controls. XML is already supported by many widely used programming languages, and XML users have access to a large and growing choice of tools - indeed, there may well already be a tool out there that does what you need.
About the author
Lasa Information Systems Team
Lasa Information Systems Team provides a range of services to community and voluntary organisations including ICT Health Checks and consulting on the best application of technology in your organisation.
Lasa IST is responsible for maintaining the ICT Hub Knowledgebase.
Published: 30th August 2004 Reviewed: 9th February 2007
Copyright © 2004 Lasa Information Systems Team
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.0 UK: England & Wales License.