mod_autoindex Meets XML

Author: John Tigue
URL: http://www.tigue.com/collection-indexing/presentations/2001-04-06/
Event: ApacheCon US 2001

1  Abstract 

When Apache serves up files from a disk based file system, it maps between HTTP URLs and the file system's directories and files. In the case of directories, Apache can be configured such that if a GET request is addressed to an URL with a trailing slash then the response will contain an HTML page which lists resources available within the corresponding file system directory. As its name implies the module, mod_autoindex, automatically indexes file system directories for Apache.

This paper considers several different ways of patching mod_autoindex such that it generates pages which are XML enabled as well as backwards compatible with HTML browsers. Trivial changes to the source code of mod_autoindex.c can accommodate several technologies including XML, XHTML, XSL, and XLink. Benefits of these patches include prettier pages, less load on the server CPU, and most significantly out-of-the-box Apache becomes an XLink application server.

2  Table of Contents 

 1  Abstract
 2  Table of Contents
 3  Introduction
 4  Terminology
 5  Motivational Examples
 6  Well-formedness
 7  XHTML
 8  More Tags
     8.1  Separating the Property Values from Each Other
     8.2  Grouping Property Values by Resource
 9  XLink
     9.1  Collection Indexing is the Physical Link Structure
 10  Miscellaneous Syntax Refinements
     10.1  Date and Time Formats
     10.2  Media Type
     10.3  Less Dependencies Between the XML Element Ordering and XSLTs
     10.4  Property Value Truncation
 11  Implications and Broader Context
     11.1  Implicit Freebie Benefits
     11.2  Collection Indexing without Web Servers
 12  Wrench
 13  Call for Standardization
 14  Further Information
 15  Bibliography

3  Introduction 

This paper considers ways of modifying mod_autoindex such that the HTML documents it generates incorporate various XML technologies. The proposed changes to the code in mod_autoindex.c are minimal and essentially involve modifying existing string constants. None of the patches presented involve using an XML parser on the server. The changes affect only the auto-generated HTML documents and not the HTTP headers.

For lack of a better term, this XML enlightened directory listing behavior is termed "Collection Indexing." The term "directory listing" is undesirable as it is not implementation agnostic as protocol terminology should be. That is, "directory" is file system specific terminology. Even though this paper only covers a file system-based implementation (Apache's mod_autoindex), Collection Indexing can be described in HTTP terminology which is implementation agnostic. Collection Indexing is simply a politically correct, and distinct, term for something which seems to have not yet been formally defined.

The first patch presented causes the pages generated by mod_autoindex to be well-formed XML documents which conform with the W3C's XHTML 1.0 Recommendation. The changes involve nothing more than removing the DOCTYPE declaration and adding some slashes to empty elements (for example, "<HR>" becomes "<HR />"). As a result, XSL Transformations (XSLTs) can be applied to the Collection Index pages while maintaining backwards compatibility with deployed non-XML Web browsers. Also, sorting the list of resources by name, size, etc. (that is, "FancyIndexing") can be done client-side.

After XML well-formedness is established, further patches add more markup elements and attributes to the Collection Index pages. The benefits of XHTML are considered. Then the basis of columnar formatting is changed from mono-spaced <pre> to <table>.

Another patch adds in XLink attributes. A Collection Index page can be viewed as essentially just an XLink extended link with one arc to each resource contained in the collection. The most significant benefit of this patch is that out-of-the-box, vanilla Apache becomes the basis for XLink applications.

Cumulatively, these patches result in XHTML pages which are XLink-enabled and sortable via client-side XSLT. Not all the patches need be implemented for benefits to be realized. The syntax modifications part of the paper ends by considering sundry machine readability issues such as ISO 8601 formatting of dates.

After presenting the patches, the broader context of Collection Indexing is explored. Collection Indexing can be thought of as "the physical storage structure API for XML applications". Collection Indexing is a trivial, incremental innovation but it greatly increases the utility of vanilla web servers for XML-based Web applications. This is demonstrated with example Collection Indexing-aware clients. Included in the examples is "wrench," a Web browser-based equivalent to Microsoft's file system Explorer. Finally, the paper closes with a call for a standardization of Collection Indexing.

4  Terminology 

Briefly, and informally, some terms are defined.

Resource:
Quoting the HTTP/1.1 RFC: "a network data object or service that can be identified by a URI".

Collection:
A resource addressed by an URL which ends in a '/'. Note that this is not the same as a WebDAV Collection. As the WebDAV spec says in section 5.2: "A resource MAY be a collection but not be WebDAV compliant."

Collection Index:
In an HTTP context, a special case of the message returned in response to a GET request addressed to an URL ending in a slash. In Apache, when a GET request is received for an URL which ends in slash, one of three things can happen:

  1. The response reports 404, access denied, or other error messages.
  2. The response contains a default document (commonly named index.html or default.htm).
  3. The response contains an HTML document with links to other resources available off the requested URL.

In the third case, mod_autoindex is called on to automatically generate an HTML index page which lists available resources immediately relative to the URL requested, that is, the parent resource and the child resources. In this paper, these mod_autoindex generated documents shall be termed "Collection Indices" or also "Collection Index Pages".

5  Motivational Examples 

There are two main use cases for Collection Indexing. The first is that of a human browsing a Web site's URL tree. The second is a piece of software operating without the help of a human.

The case of a human browsing a Web site is widely experienced. As the next section will show, minimally XML compliant (that is, well-formed) mod_autoindex's pages can demonstrate relative advantage over Apache 1.3.19's mod_autoindex pages. Essentially, better UI styling can be done (via XSLT) and Collection Index pages can be sorted client-side. This is a "better, faster, cheaper" value proposition. Collection Indexing benefits in this case may be sufficient to justify adoption of the patch to mod_autoindex.

The more interesting benefits of Collection Indexing are probably associated with the second use case in which a piece of software is operating without the help of a human. In the former use case, it is the human who initiates link traversal and models the Web site structure. In the latter use case, the client application does the traversal and modeling internally (much like a Web crawling robot). This is a "brave new world" value proposition. This paper is focused primarily on this second use case. This paper will consider how to modify mod_autoindex such that XML reading robots can get the most out of the Collection Index pages. A motivating example will help demonstrate this.

Consider the case of an XML-based client application using a Web server for data storage. Some URL subtree of the Web server has been allocated for the application's use. The application's data is stored in multiple resources which are XML documents all located in the allocated URL subtree.

On the URL subtree the XML data documents are leaf nodes of the tree and the Collection Index pages are internal, non-leaf nodes. The internal tree nodes should be XML as well, otherwise the XML client application cannot traverse and enumerate its own data space. The client's XML parser will choke if the Collection Index pages are malformed HTML. Simply making mod_autoindex's pages well-formed would be enough to enable this case.

This "example" will be returned to later in the paper.
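The traversal described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the PAGES dictionary is a hypothetical in-memory stand-in for a Web server, the example.com URLs are invented, and the function name crawl is not from any existing client. The Collection Index pages are assumed to be well-formed, since the XML parser raises an error otherwise.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin, urlsplit

# Hypothetical stand-in for a Web server: URL -> response body.
PAGES = {
    "http://example.com/data/":
        '<HTML><BODY><A HREF="a.xml">a.xml</A> <A HREF="sub/">sub/</A></BODY></HTML>',
    "http://example.com/data/sub/":
        '<HTML><BODY><A HREF="/data/">Parent Directory</A> <A HREF="b.xml">b.xml</A></BODY></HTML>',
    "http://example.com/data/a.xml": "<doc/>",
    "http://example.com/data/sub/b.xml": "<doc/>",
}

def crawl(collection_url, fetch, seen=None):
    """Return the leaf URLs found below a collection URL (one ending in '/')."""
    seen = set() if seen is None else seen
    if collection_url in seen:
        return []
    seen.add(collection_url)
    # ET.fromstring raises ParseError if the index page is not well-formed.
    root = ET.fromstring(fetch(collection_url))
    leaves = []
    for anchor in root.iter("A"):
        href = anchor.get("HREF")
        if not href:
            continue
        child = urljoin(collection_url, href)
        # Skip sort links (query terms), self links, and anything outside
        # the subtree, such as the parent collection.
        if (urlsplit(child).query or child == collection_url
                or not child.startswith(collection_url)):
            continue
        if child.endswith("/"):
            leaves.extend(crawl(child, fetch, seen))  # internal node
        else:
            leaves.append(child)                      # leaf node
    return leaves

data_documents = crawl("http://example.com/data/", PAGES.__getitem__)
```

The fetch argument is passed in so that a real HTTP client could be substituted for the dictionary lookup without changing the traversal logic.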

6  Well-formedness 

In this section, Apache 1.3.19's mod_autoindex is patched such that it puts out HTML documents which conform to XML 1.0 [XML]. The goal of this section is to change mod_autoindex.c just enough to cause it to generate documents which are minimally XML conformant. XML processors are required to fail when parsing a document which is not well-formed. So, for an XML client to read Apache mod_autoindex pages, well-formedness is the minimum requirement. Benefits can be realized with just well-formedness. Once the documents are well-formed, XML clients can begin to process them, yet downlevel Web browsers can still render the documents as HTML. After establishing well-formedness, later sections of this paper consider further refinements.

DirectoryIndex is the directive which configures Apache to serve up (usually) static documents for URLs which end in slashes. When servicing a GET request on a slash-terminated URL, if the DirectoryIndex directive does not resolve to something to respond with and the Indexes option is set, then Apache calls on mod_autoindex to generate an index page on-the-fly.

Here is an example HTML page generated by Apache 1.3.19's mod_autoindex (configured for FancyIndexing).

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
 <HEAD>
  <TITLE>Index of /</TITLE>
 </HEAD>
 <BODY>
<H1>Index of /</H1>
<PRE><IMG SRC="/blank.gif" ALT="     "> <A HREF="?N=D">Name</A>                    <A HREF="?M=A">Last modified</A>       <A HREF="?S=A">Size</A>  
<HR>
<IMG SRC="/img/folder.gif" ALT="[DIR]"> <A HREF="/">Parent Directory</A>        07-Dec-2000 11:16      -  
<IMG SRC="/img/folder.gif" ALT="[DIR]"> <A HREF="bar/">bar/</A>                    13-Mar-2001 10:28      -  
<IMG SRC="/img/text.gif" ALT="[TXT]"> <A HREF="foo.html">foo.html</A>                01-Dec-2000 19:41     1k  
<IMG SRC="/img/text.gif" ALT="[TXT]"> <A HREF="junk.html">junk.html</A>               01-Dec-2000 19:41     1k  
<IMG SRC="/img/sound2.gif" ALT="[SND]"> <A HREF="noise.ram">noise.ram</A>               15-Nov-2000 11:33     1k  
<IMG SRC="/img/unknown.gif" ALT="[   ]"> <A HREF="theSecretOfLifeIsTo.hmph">theSecretOfLifeIsTo...></A> 18-Oct-2000 22:26    40k  
</PRE><HR>
</BODY></HTML>

When the above document is read by an XML processor, the parse will fail. First, the DOCTYPE declaration is not a valid XML DOCTYPE declaration. In XML, PUBLIC DOCTYPE declarations need to be of the following form.

<!DOCTYPE foo PUBLIC "bar" "bas" >

XML 1.0 does not require a DOCTYPE declaration so it can be omitted.

After removing the DOCTYPE declaration, the parse will still fail. The problem is that the HR and IMG elements are not well-formed. Simply changing "<HR>" to "<HR />" and making the IMG element an empty element will solve the problem. (Note that "<HR/>" would also be valid XML but causes trouble with some deployed HTML browsers.) After these trivial changes, the parse will succeed.

Here is the result of applying these changes to the above example document.

<HTML>
 <HEAD>
  <TITLE>Index of /</TITLE>
 </HEAD>
 <BODY>
<H1>Index of /</H1>
<PRE><IMG SRC="/blank.gif" ALT="     " /> <A HREF="?N=D">Name</A>                    <A HREF="?M=A">Last modified</A>       <A HREF="?S=A">Size</A>  
<HR />
<IMG SRC="/img/folder.gif" ALT="[DIR]" /> <A HREF="/">Parent Directory</A>        07-Dec-2000 11:16      -  
<IMG SRC="/img/folder.gif" ALT="[DIR]" /> <A HREF="bar/">bar/</A>                    13-Mar-2001 10:28      -  
<IMG SRC="/img/text.gif" ALT="[TXT]" /> <A HREF="foo.html">foo.html</A>                01-Dec-2000 19:41     1k  
<IMG SRC="/img/text.gif" ALT="[TXT]" /> <A HREF="junk.html">junk.html</A>               01-Dec-2000 19:41     1k  
<IMG SRC="/img/sound2.gif" ALT="[SND]" /> <A HREF="noise.ram">noise.ram</A>               15-Nov-2000 11:33     1k  
<IMG SRC="/img/unknown.gif" ALT="[   ]" /> <A HREF="theSecretOfLifeIsTo.hmph">theSecretOfLifeIsTo...></A> 18-Oct-2000 22:26    40k  
</PRE><HR />
</BODY></HTML>
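
The effect of these changes can be checked with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the three short test strings are abbreviated stand-ins for the documents above, and the function name is_well_formed is chosen here for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if an XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Abbreviated stand-ins for the example documents.
ORIGINAL = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">\n'
            '<HTML><BODY><PRE><IMG SRC="/blank.gif" ALT=" "> <HR></PRE></BODY></HTML>')
NO_DOCTYPE = '<HTML><BODY><PRE><IMG SRC="/blank.gif" ALT=" "> <HR></PRE></BODY></HTML>'
PATCHED = '<HTML><BODY><PRE><IMG SRC="/blank.gif" ALT=" " /> <HR /></PRE></BODY></HTML>'

results = [is_well_formed(doc) for doc in (ORIGINAL, NO_DOCTYPE, PATCHED)]
```

Only the third string parses: removing the DOCTYPE declaration is not enough on its own, because the unclosed IMG and HR elements still leave the end tags mismatched.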

The costs of these changes are trivial. The changes to the HTML documents are rather insignificant. The changes to mod_autoindex.c involve only modifying string constants and no new lines of code. Also, the network and CPU costs are minimal. In terms of network costs, a few more bytes go over the network because of the extra slashes and spaces. Perhaps some very weak robots would be confused by the changes. Note that compatibility with deployed HTML browsers is not lost by introducing these changes.

In terms of benefits, the generated HTML documents are now also well-formed XML documents. This is the minimum requirement for documents to be parsable by an XML processor. Given just this, an XML client can now process an entire URL subtree as a set of XML documents, as mentioned in the previous section. An XML client can crawl a Web site where previously the directory listing pages would have caused fatal XML parsing errors. There is much more on this point later in the section entitled Implications and Broader Context.

There are also benefits for the case of humans viewing Collection Indexing pages. First, since the mod_autoindex documents are now well-formed they can be used as input to XSL transforms. This enables prettier pages which are easier to read.

(Note that all the above can be applied similarly to mod_autoindex when it is not configured for FancyIndexing. When FancyIndexing is not enabled, mod_autoindex just lists the resource names and no other properties. The resources are listed within a <UL>. The rest of this section is only appropriate to the FancyIndexing case.)

Not only can the mod_autoindex pages be easier on human eyes, they can also be easier on the server's CPU. Apache's FancyIndexing sorting can now be implemented client-side. FancyIndexing is a feature of mod_autoindex. It allows the resources tabularly listed in the Collection Index page to be sorted by property values. These sortable properties include name, size, date of last modification, and description. Note that it does not include sorting by Content-Type. Property values can be sorted in ascending or descending order. Each time a human clicks on a column header the server is asked to re-sort the listed resources in the new order. This is accomplished by specifying the sort order in the URL query term (for example, http://www.example.com/foo?S=A will GET the collection sorted by size ascending).
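
To make the query term convention concrete, here is a small Python sketch which decodes such a URL. The letters N, M, and S appear in the example page above; D for description and the A/D order codes are as documented for mod_autoindex. The function name sort_request is hypothetical and this is not code from Apache.

```python
from urllib.parse import urlsplit, parse_qs

# Query keys used by FancyIndexing column headers.
COLUMNS = {"N": "name", "M": "last modified", "S": "size", "D": "description"}
ORDERS = {"A": "ascending", "D": "descending"}

def sort_request(url):
    """Return (column, order) requested by a FancyIndexing URL, or None."""
    query = parse_qs(urlsplit(url).query)
    for key, column in COLUMNS.items():
        if key in query:
            order = ORDERS.get(query[key][0])
            if order is not None:
                return column, order
    return None

requested = sort_request("http://www.example.com/foo?S=A")
```

Every distinct query term is a separate request to the server for the same information in a different order, which is the round trip that client-side sorting avoids.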

With FancyIndexing-like sorting happening on the client, the server does less work. On many networks, the UI can respond quicker as re-sorting no longer requires a round trip to the server to re-fetch the same information sorted in a different order. In this case XSLT can be demonstrated to have relative advantage over CSS.

NUT is a client XML application which does the sorting client-side. It is an HTML frameset-based application which loads well-formed mod_autoindex pages from a server. The pages are then parsed and transformed such that when the user requests that the list be re-sorted, the click event is intercepted by NUT and the sort is performed on the client.

NUT has limitations though. It only works with Apache with the patch of this section. For example, it can't handle Microsoft's IIS Directory Browsing pages (even if they were well-formed). Further improvements to the Collection Index pages would need to be made to get interoperability. This will be addressed in later sections.

7  XHTML 

In this section, Apache 1.3.19's mod_autoindex is patched such that it puts out HTML pages which conform with XHTML 1.0 [XHTML]. The W3C Recommendation XHTML 1.0 modifies HTML 4 to make it conform to XML 1.0 [XML].

For this effort, it is hard to demonstrate technical benefits of XHTML over the results of the previous section (that is, just plain old well-formed documents). None the less, XHTML would certainly seem to be relevant to a paper entitled "mod_autoindex Meets XML". Displaying the XHTML seal of approval on mod_autoindex is a concise way to denote the fact that its documents are HTML and XML friendly. Perhaps some unknown XHTML clients will benefit from mod_autoindex putting out XHTML conformant documents. For example, some imaginary XHTML client may look for the XHTML DOCTYPE declaration and abort if one is not found.

The XHTML spec does mention various issues which are relevant here. Appendix C of XHTML 1.0 "summarizes design guidelines for authors who wish their XHTML documents to render on existing HTML user agents". For example, Appendix C mentions that "empty elements must end in ' />' for the benefit of downlevel HTML browsers." One side effect of adopting XHTML is that a DOCTYPE declaration can be added back to the Collection Index pages.

In order to make the last example document from the previous section conform to XHTML, all that is required is that the element and attribute names be in lower case. Adding the DOCTYPE declaration is optional. Here is that last example document modified to conform with XHTML 1.0.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd" >
<html>
 <head>
  <title>Index of /</title>
 </head>
 <body>
<h1>Index of /</h1>
<pre><img src="/blank.gif" alt="     " /> <a href="?N=D">Name</a>                    <a href="?M=A">Last modified</a>       <a href="?S=A">Size</a>  
<hr />
<img src="/img/dir.gif" alt="[DIR]" /> <a href="/">Parent Directory</a>        07-Dec-2000 11:16      -  
<img src="/img/dir.gif" alt="[DIR]" /> <a href="bar/">bar/</a>                    13-Mar-2001 10:28      -  
<img src="/img/text.gif" alt="[text/html]" /> <a href="foo.html">foo.html</a>                01-Dec-2000 19:41     1k  
<img src="/img/text.gif" alt="[text/html]" /> <a href="junk.html">junk.html</a>               01-Dec-2000 19:41     1k  
<img src="/img/sound2.gif" alt="[SND]" /> <a href="noise.ram">noise.ram</a>               15-Nov-2000 11:33     1k  
<img src="/img/unknown.gif" alt="[   ]" /> <a href="theSecretOfLifeIsTo.hmph">theSecretOfLifeIsTo...></a> 18-Oct-2000 22:26    40k  
</pre><hr />
</body></html>

The necessary changes to mod_autoindex.c involve only modifying string constants and no new lines of code.

XHTML is not so very interesting by itself (curious given the exuberance evidenced in the spec: "The XHTML family is the next step in the evolution of the Internet.") It is a basic technology which is useful for blending HTML with other markup.

For the purposes of this effort, the main value of XHTML is that within a well-formed XHTML document, additional XML technologies can be expressed via attribute assignments and processing instructions. Such new attributes do not extend XHTML. For HTML page rendering purposes they are completely ignored as they have no semantic significance to the page rendering process. In other words, this is not a proposal for extending HTML browsers but rather for how to extend the utility of Collection Indexing pages to non-HTML clients. All the remaining techniques introduced in this paper assume that the Collection Index pages have to be XHTML documents. In this way, the pages can be made more friendly to XML client software while still being renderable as HTML documents.

As a prelude to the changes which will be introduced in the next section, a final constraint of XHTML should be mentioned. XHTML 1.0, Appendix B is normative. It mentions that "pre cannot contain the img, [...], or sup elements." Apache's mod_autoindex can be configured (via the AddIcon* directives) to not include img elements. When img elements are used, though, they will need to be contained within some element besides pre in order for mod_autoindex's pages to conform with XHTML 1.0.
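
This one constraint is easy to test for mechanically. The sketch below is not a validator: it checks only the pre/img rule quoted above, it assumes the page is already well-formed and uses no XML namespace declaration (as in the example document of this section), and the function name img_inside_pre is made up for illustration.

```python
import xml.etree.ElementTree as ET

def img_inside_pre(xhtml):
    """Return True if any pre element has an img element as a descendant."""
    root = ET.fromstring(xhtml)
    return any(pre.find(".//img") is not None for pre in root.iter("pre"))

WITH_ICON = ('<html><body><pre><img src="/img/text.gif" alt="[TXT]" /> '
             '<a href="foo.html">foo.html</a></pre></body></html>')
WITHOUT_ICON = '<html><body><pre><a href="foo.html">foo.html</a></pre></body></html>'

violations = [img_inside_pre(WITH_ICON), img_inside_pre(WITHOUT_ICON)]
```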

8  More Tags 

In the previous sections it was demonstrated that well-formedness alone provides valuable utility for XML clients. None the less, mod_autoindex does more than just put out arbitrary HTML documents: it puts out HTML documents which describe resources contained in a Collection. The main goal of this effort is to make mod_autoindex useful to both HTML clients and XML clients. Further utility can be realized in Collection Index pages by adding more markup tags in such a way that XML clients can more readily model the resources and their properties. In other words, a page which is well-formed is not necessarily well marked up.

For each child resource that mod_autoindex lists, it can include the following properties: name, date of last modification, size, and description.

Consider this snippet of a mod_autoindex document which describes two resources.

<img src="/text.gif" alt="[TXT]" /> <a href="foo.html">foo.html</a>          03-Dec-2000 19:13     1k  
<img src="/text.gif" alt="[TXT]" /> <a href="junk.html">junk.html</a>         01-Dec-2000 19:41     1k  

The name of the first resource is foo.html, its size is 1 kilobyte, and it was last modified on December 3rd, 2000 at 7:13PM. These properties can be extracted from the document via some rather convoluted XSLT.
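
The awkwardness is not specific to XSLT. As an illustration, here is a Python sketch of the same extraction, assuming a line like the first one described above (foo.html, last modified 03-Dec-2000 19:13, 1k) has been wrapped in a hypothetical row element so that it can be parsed on its own. The function name parse_pre_line is invented for this example.

```python
import xml.etree.ElementTree as ET

LINE = ('<img src="/text.gif" alt="[TXT]" /> <a href="foo.html">foo.html</a>'
        '          03-Dec-2000 19:13     1k  ')

def parse_pre_line(line):
    """Recover name, date, and size from one whitespace-aligned listing line."""
    row = ET.fromstring("<row>" + line + "</row>")
    link = row.find("a")
    # The date and size are not elements; they are loose text trailing the
    # a element, so they must be picked apart by whitespace position.
    day, clock, size = link.tail.split()
    return {"name": link.text, "lastMod": day + " " + clock, "size": size}

first_resource = parse_pre_line(LINE)
```

The split on whitespace is the fragile part: it works only because neither value here contains unexpected spaces, and it would break as soon as a description column with free text were present.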

The problem is that mod_autoindex's pages are going to have to perform two tasks. Firstly, they need to show the resource properties to humans. This means that the property values need to be element content, not attribute values, otherwise they will not render as text. Secondly, mod_autoindex needs to mark up the information for XML clients. Humans and software do not process information in the same way. For example, consider date and time formatting. A human may like "01-Dec-2000 19:41" and find ISO 8601 format (that is, "2000-12-01T19:41") more difficult to parse. A computer will have the opposite opinion. A simple solution would be to have both formats in the Collection Index page: the human readable format is expressed as element content and the machine readable format is expressed as an attribute value. Here's an example (more on the "lastMod" attribute name later).

<span lastMod="2000-12-01T19:41">01-Dec-2000 19:41</span>
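
A client reading this span might proceed as in the following Python sketch. The function name read_last_mod is hypothetical. Note that parsing the human readable form relies on English month abbreviations, which is exactly the kind of dependency the attribute value avoids.

```python
from datetime import datetime
import xml.etree.ElementTree as ET

SPAN = '<span lastMod="2000-12-01T19:41">01-Dec-2000 19:41</span>'

def read_last_mod(span_xml):
    """Return (machine, human) datetimes read from a lastMod span."""
    span = ET.fromstring(span_xml)
    # Machine readable form: fixed, numeric, locale independent.
    machine = datetime.strptime(span.get("lastMod"), "%Y-%m-%dT%H:%M")
    # Human readable form: %b depends on the locale's month abbreviations.
    human = datetime.strptime(span.text, "%d-%b-%Y %H:%M")
    return machine, human

machine, human = read_last_mod(SPAN)
```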

In terms of specific syntax, this section adds more element tags to the Collection Index pages. Later sections turn to adding more attributes to elements, such as that lastMod attribute. The goal is to get the element structure of the documents to mirror the structure of the information being expressed. The tags of the elements should explicitly denote where a property value begins and ends.

This section is dependent on the previous sections. Mod_autoindex's legacy is that of generating HTML documents. For backwards compatibility with HTML browsers, any new tags added to the Collection Index documents must be HTML tags. For the new XML clients, well-formedness is the minimum requirement for XML 1.0 conformance. XHTML satisfies these two requirements. Therefore, in this section, all element names in the Collection Index pages are constrained to those of XHTML.

8.1  Separating the Property Values from Each Other 

Apache 1.3.19 mod_autoindex's Collection Index pages separate some of the property values of the listed resources with nothing more than spaces. Separating those property values with markup tags would make the information structure more explicit to XML clients.

Theoretically, a span element without a style or class attribute assignment has no effect on HTML page layout. This visually innocuous markup is here used to add more structure to the Collection Indexing documents without affecting the HTML layout. The point is that since the answer has to involve HTML elements, any element will do for XML clients and span (without attributes) is the least disruptive to the HTML clients.

Here, repeated, is the last example document snippet from the previous section.

<img src="/text.gif" alt="[TXT]" /> <a href="foo.html">foo.html</a>          03-Dec-2000 19:13     1k  
<img src="/text.gif" alt="[TXT]" /> <a href="junk.html">junk.html</a>         01-Dec-2000 19:41     1k  

Here is that same snippet with span tags added.

<img src="/text.gif" alt="[TXT]" /> <a href="foo.html">foo.html</a>          <span>03-Dec-2000 19:13</span>     <span>1k</span>  
<img src="/text.gif" alt="[TXT]" /> <a href="junk.html">junk.html</a>         <span>01-Dec-2000 19:41</span>     <span>1k</span>  

In this way, the XML element structure of the document's markup more closely mirrors the structure of the information expressed in the document. With these changes, an XML client can more readily identify a resource's date of last modification and size (and description, as well, if mod_autoindex had been configured to generate it, as was not the case in this example). At this point, individual property values are separated from each other by markup tags. Each property value is contained in its own element.
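
As a rough illustration of what this buys an XML client, the Python sketch below reads one span-tagged line. The line is wrapped in a hypothetical row element so that it can be parsed on its own, and the function name parse_span_line is invented here.

```python
import xml.etree.ElementTree as ET

LINE = ('<img src="/text.gif" alt="[TXT]" /> <a href="foo.html">foo.html</a>'
        '          <span>03-Dec-2000 19:13</span>     <span>1k</span>  ')

def parse_span_line(line):
    """Read name, date, and size from a line whose values are in span elements."""
    row = ET.fromstring("<row>" + line + "</row>")
    # Each property value is now the content of its own element, so no
    # whitespace counting is needed to tell where a value begins and ends.
    last_mod, size = [span.text for span in row.findall("span")]
    return {"name": row.find("a").text, "lastMod": last_mod, "size": size}

resource = parse_span_line(LINE)
```

The client still depends on the order of the span elements to know which value is which, a dependency which attributes on the elements could later remove.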

Note that the HTML spec says that HTML browsers should ignore tags which they do not recognise. So, theoretically a random non-HTML element name could be substituted for the spans in the above example and there would still be no effect on HTML page layout. Technically, such a document would not validate against the XHTML DTDs. Validation of a document against a DTD can detect elements not defined in the DTD but no errors are caused by the presence of extra attributes not defined in the DTD. This is a weak point especially given that XHTML was designed to be mixed with other markup. In later sections, XHTML markup will be used with the intent of affecting HTML layout. Such usage will add more justification for sticking to XHTML tags.

In terms of costs, the span tags require additional bytes to go out over the network. Still, no new lines of code are required to implement this change in mod_autoindex.c.

Even with the changes proposed in this section, some of the structure of the index pages is still not explicitly XML syntaxed. That leads to the next proposed change.

8.2  Grouping Property Values by Resource 

Although the property values are now contained in separate elements, the values are still not grouped together by XML markup such that there is one element which contains all the properties specific to an individual resource. HTML clients do not need this but XML clients would be assisted if it were the case.

Mod_autoindex's documents are syntaxed to leverage page layout features of the widely deployed HTML browsers. Some of these features are called for in the (X)HTML specs and some are simply widely implemented.

Apache 1.3.19's mod_autoindex wraps all resource descriptions together within a big pre which causes columnar layout during HTML rendering. During the rendering process, line breaks within most XHTML elements are treated as just more whitespace. But within pre elements, line breaks are significant for HTML rendering. Mod_autoindex groups all the properties of each listed resource on a separate line within the pre element. That is, the line break is what delineates one resource description from the next.

The other browser feature which causes mod_autoindex's pages to render with column alignment is that pre element content is commonly rendered with fixed-pitch fonts. So, mod_autoindex uses whitespace padding to align the data cells into columns.

Here is repeated an earlier example mod_autoindex page which illustrates the use of a pre element. This example document was generated by Apache 1.3.19's mod_autoindex (configured for FancyIndexing) without any patches applied to the code.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
 <HEAD>
  <TITLE>Index of /</TITLE>
 </HEAD>
 <BODY>
<H1>Index of /</H1>
<PRE><IMG SRC="/blank.gif" ALT="     "> <A HREF="?N=D">Name</A>                    <A HREF="?M=A">Last modified</A>       <A HREF="?S=A">Size</A>  
<HR>
<IMG SRC="/img/folder.gif" ALT="[DIR]"> <A HREF="/">Parent Directory</A>        07-Dec-2000 11:16      -  
<IMG SRC="/img/folder.gif" ALT="[DIR]"> <A HREF="bar/">bar/</A>                    13-Mar-2001 10:28      -  
<IMG SRC="/img/text.gif" ALT="[TXT]"> <A HREF="foo.html">foo.html</A>                01-Dec-2000 19:41     1k  
<IMG SRC="/img/text.gif" ALT="[TXT]"> <A HREF="junk.html">junk.html</A>               01-Dec-2000 19:41     1k  
<IMG SRC="/img/sound2.gif" ALT="[SND]"> <A HREF="noise.ram">noise.ram</A>               15-Nov-2000 11:33     1k  
<IMG SRC="/img/unknown.gif" ALT="[   ]"> <A HREF="theSecretOfLifeIsTo.hmph">theSecretOfLifeIsTo...></A> 18-Oct-2000 22:26    40k  
</PRE><HR>
</BODY></HTML>

These HTML layout tricks that mod_autoindex uses work well with the vast majority of HTML browsers. But they do not work for the XML clients. This is another case in which adding more markup tags helps XML clients more readily model the resources and their properties.

Also note that in the above example, there are img elements contained within the pre element. As mentioned earlier, XHTML says that pre shouldn't contain img. This may be a good point at which to start considering some different XHTML elements.

Consider for a moment what Collection Index pages could look like if XHTML elements were not a requirement. The following example is one possibility.

<collectionIndex>
  <resource>
    <name>foo.html</name>
    <lastMod>03-Dec-2000 19:13</lastMod>
    <size>1k</size>
  </resource>
  <resource>
    <name>junk.html</name>
    <lastMod>01-Dec-2000 19:41</lastMod>
    <size>1k</size>
  </resource>
</collectionIndex>

As a side note, readers familiar with RDF may wonder why the above example is not RDF. For this paper, it simply complicates the syntax. None the less, mapping to RDF could probably be done with a trivial XSLT. Collection Indexing needs to be compared to RDF but is not in this paper.

So, a requirement is to come up with a set of HTML tags which have the same element structure as above but which do not involve pre. Another requirement is that the information still render with column alignment in HTML browsers. An HTML table would satisfy these constraints. Changing only the element names and none of the PCDATA, this is how the above example would look.

<table>
  <tr>
    <td>foo.html</td>
    <td>03-Dec-2000 19:13</td>
    <td>1k</td>
  </tr>
  <tr>
    <td>junk.html</td>
    <td>01-Dec-2000 19:41</td>
    <td>1k</td>
  </tr>
</table>

With these proposed syntax changes, there are sufficient markup tags such that each property of each resource is contained in a separate element. Also, for each resource there is an element which contains all the properties of that resource and the element does not contain properties of other resources.
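
With this structure, modeling the resources and re-sorting them on the client becomes straightforward. The following Python sketch assumes the three-column table shown above (name, date of last modification, size, in that order); the function names read_resources and sort_resources are hypothetical.

```python
import xml.etree.ElementTree as ET

TABLE = """<table>
  <tr><td>foo.html</td><td>03-Dec-2000 19:13</td><td>1k</td></tr>
  <tr><td>junk.html</td><td>01-Dec-2000 19:41</td><td>1k</td></tr>
</table>"""

def read_resources(table_xml):
    """Turn each tr into one record holding that resource's properties."""
    resources = []
    for tr in ET.fromstring(table_xml).findall("tr"):
        name, last_mod, size = [td.text for td in tr.findall("td")]
        resources.append({"name": name, "lastMod": last_mod, "size": size})
    return resources

def sort_resources(resources, key, descending=False):
    """Re-sort on the client; no round trip to the server is needed."""
    return sorted(resources, key=lambda r: r[key], reverse=descending)

resources = read_resources(TABLE)
by_name_descending = sort_resources(resources, "name", descending=True)
```

Sorting by name works on the text as is. Sorting by date or size would first require converting the human readable values into comparable ones, since "03-Dec-2000" and "1k" do not order correctly as plain strings.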

The costs of these changes include more bytes being transmitted over the network.

When mod_autoindex was first written, there were widely deployed clients which did not handle tables. So, using the pre element for formatting was a good choice, especially since there were no XML clients. Presently, HTML tables still have an undesirable effect on some HTML clients: tables prevent progressive rendering. Tables cannot be rendered by some HTML browsers until the entire table has been received by the client. One solution would be to use other elements besides table, tr, and td. For example, div, p, and span could respectively replace the HTML table elements.

One user interface benefit of tables is that non fixed-pitch fonts can be used and the data will still align in columns.

9  XLink 

Collection Index pages describe parent to child links between a Collection and its contained Resources. In this section those links are expressed using XLink [XLink] attributes.

As the XLink spec says, XLink "allows elements to be inserted into XML documents in order to create and describe links between resources." The spec also says, "XLink's namespace provides global attributes for use on elements that are in any arbitrary namespace. The global attributes are type, href, role, arcrole, title, show, actuate, label, from, and to. Document creators use the XLink global attributes to make the elements in their own namespace, or even in a namespace they do not control, recognizable as XLink elements." As described earlier, Collection Index pages should only contain XHTML elements. These XLink attributes are added to the XHTML elements in order to identify the links from parent to child.

Note that as of this writing, XLink is a W3C Proposed Recommendation ("But today I am still just a bill"). In the W3C, the issue of how XHTML's a element will be recognised as an XLink link element is currently unresolved. Specifically, in the 2000-12-20 version of XLink, Section 4.5 "Using XLink with Legacy Markup" explicitly says that the href attribute of the XHTML namespace is not the same as the href attribute defined in the XLink namespace. The syntax proposed in this section sidesteps the issue by not using a elements to express the XLinks. Once the a-as-XLink-element issue is resolved, it may well be more natural to use it as such in Collection Index pages. So, this section may soon be stale in terms of the specific XLink syntax, but that should not affect the underlying data model.

There is a potential for confusion between XHTML's built-in linking mechanism and the XLink machinery that is added to the XHTML Collection Index pages in this section. In XLink terminology, XHTML's built-in linking mechanism is a "simple link." In this section, an XLink "extended link" is added to the Collection Index pages (more on this later). XHTML's built-in linking mechanism is not sufficient to describe an extended link.

This section applies the XLink global attributes to the syntax developed in the previous section, that is, they are added to an XHTML table and to elements contained within the table. As previously discussed, other XHTML elements besides table, tr, and td could be used. In other words, there is no dependency between the XLink attributes and the specific XHTML elements they are applied to. This is quite different from the case of XHTML's built-in linking as expressed via the a element and its href attribute. This independence between the XLink attributes and the elements they are applied to is simply a feature of XLink.

Note that this section assumes that there are already sufficient tags such that each property of each resource is contained in a separate element. It is also assumed that for each resource there is an element which contains all the properties of that resource and no properties of any other resource. The techniques of the previous section provide the necessary element tags for these assumptions. In this section, only attributes are added to the Collection Index documents. No new elements are added. There is also a dependency between this section and the section on XHTML because XML well-formedness, as provided by XHTML, is a prerequisite for XLink.

Here is where there should be example Collection Index pages which include the XLink global attributes. Also, there should be analysis of the costs, benefits, and limitations of adding those attributes to the pages. These are missing as of this release of this document.
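
Pending those examples, the following is only a rough sketch of what adding XLink global attributes to the table syntax might look like. It assumes, without support from this paper, that the table plays the role of the extended link and that each tr becomes a locator whose xlink:href repeats the href of the row's a element. The function name add_xlink_attributes and the label value "child" are hypothetical. Python is used purely for illustration; only attributes are added, never elements.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK_NS)


def add_xlink_attributes(table_xml):
    """Return the table with XLink attributes added; no elements are added."""
    table = ET.fromstring(table_xml)
    # Assumption: the table as a whole plays the role of the extended link.
    table.set("{%s}type" % XLINK_NS, "extended")
    for row in table.iter("tr"):
        anchor = row.find(".//a")
        if anchor is None or anchor.get("href") is None:
            continue  # header rows and the like describe no Resource
        # Assumption: each row is a locator pointing at one child Resource.
        row.set("{%s}type" % XLINK_NS, "locator")
        row.set("{%s}href" % XLINK_NS, anchor.get("href"))
        row.set("{%s}label" % XLINK_NS, "child")
    return ET.tostring(table, encoding="unicode")


example = (
    "<table>"
    '<tr><td><a href="foo.html">foo.html</a></td><td>312k</td></tr>'
    "</table>"
)
print(add_xlink_attributes(example))
```

Whether a locator-only extended link is the right model, and how arcs from parent to child would be declared without new elements, is left open here.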

9.1  Collection Indexing is the Physical Link Structure 

Collection Indexing separates physical links from logical links. This is analogous to the XML 1.0 spec, which makes a distinction between the physical structure and the logical structure of a document. XML 1.0 only addresses the physical structure of a single document, whereas Collection Indexing addresses the physical link structure of a set of documents. An application's XML elements and links are its logical structure (think TopicMaps). The physical structure of the application's data is represented by the Collection Index links of an URL subtree. Collection Index links are about "where", not "what" or "why".

The set of Collection Index pages for an URL subtree is the "spine" of the data. The spine's root is the application's topmost Collection Index page. The whole set is the physical map of an application's data, and the information it carries is equivalent to the manifest of an archive.

By adding XLink attributes to Collection Index pages, a vanilla Web server can be used as a dynamic XLink storage system. That is to say, with Collection Indexing, vanilla Apache becomes an XLink application as defined in section 3.3, Application Conformance, of the XLink spec.

10  Miscellaneous Syntax Refinements 

This section briefly sketches several possibilities for further refinement of the Collection Indexing syntax.

10.1  Date and Time Formats 

Different servers format dates and times in Collection Index pages differently. Here is an example of how Apache does it.

18-Oct-2000 22:36

Here is an example of how Microsoft's IIS formats the same date/time.

Wednesday, October 18, 2000 10:36 PM

Interoperability would be desirable. Also, a format which accounts for internationalization would be a plus. For dates and times, XML Schema [XMLSchemaDatatypes] builds on the work of ISO 8601. This is how the same example date/time would be formatted a la XML Schema.

2000-10-18T22:36

If machine readable date/time formats go in attributes and human readable formats go in element content, then the human readability of ISO 8601 is not a concern. If ISO 8601 values go in element content, then human readability needs to be taken into account.
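
As a small illustration of the interoperability point, the two server-specific strings shown above can be normalized to the same XML Schema style value. This is a sketch only: the two layout strings are inferred from the single examples in this section, and the function name to_iso8601 is hypothetical.

```python
from datetime import datetime

# Assumed input layouts, inferred from the two examples in this section.
APACHE_FORMAT = "%d-%b-%Y %H:%M"
IIS_FORMAT = "%A, %B %d, %Y %I:%M %p"


def to_iso8601(text, layout):
    """Parse a server-specific date/time string and return ISO 8601 to the minute."""
    # %b, %B, %A and %p depend on the process locale; English names are assumed.
    return datetime.strptime(text, layout).isoformat(timespec="minutes")


print(to_iso8601("18-Oct-2000 22:36", APACHE_FORMAT))
print(to_iso8601("Wednesday, October 18, 2000 10:36 PM", IIS_FORMAT))
```

Both calls should yield 2000-10-18T22:36, so a client would need only one date parser regardless of which server produced the page.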

10.2  Media Type 

Apache 1.3.19's mod_autoindex pages do not explicitly denote the Content-Type of the Resources listed. An iconic indication is possible via the various AddIcon* directives, and a textual indication is possible via the various AddAlt* directives. These latter directives determine the values of <img alt="...">. As seen in the examples in this paper, '[' and ']' are also added to the alt values. These brackets seem superfluous. If they were removed then the value of the alt could be the media type (for example, "text/html").

10.3  Less Dependencies Between the XML Element Ordering and XSLTs 

If table tags are used as argued in section 8.2, then each property value is contained in a td element. Consider the following snippet.

<tr>
<td><a href="foo.html">foo.html</a></td> 
<td>2000-10-18T22:36</td> 
<td>312k</td> 
</tr> 

The problem is that there is no markup which says "this is the date last modified" and "this is the content-length". The same element name, td, is ambiguously used for multiple property values. There could be an implicit convention based on element ordering. For example, "the first td is the Resource name, the second td is..." That seems weak though. Alternatively, each property type could have a distinct element name wrapping the values. This would be visually awkward when rendered as HTML though.

More explicit markup would remove the dependency on element ordering. Here is one possible solution. (The attribute name http-equiv is used here only because of its familiarity from HTML's meta.)

<tr>
<td http-equiv="url" ><a href="foo.html">foo.html</a></td> 
<td http-equiv="last-modified" >2000-10-18T22:36</td> 
<td http-equiv="content-length" >312k</td> 
</tr> 

With these additional attributes the element ordering is no longer significant. Different servers could have different column ordering. For example, Apache lists (in order) name, date last modified, and size while Microsoft's IIS lists just date last modified and then name. But if both used the same markup to distinguish which columns correspond with which property values then one XSLT could be written which could handle both of these servers as well as others.
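
The effect can be sketched from the client side. The following assumes rows marked up with the imagined http-equiv attribute exactly as in the snippet above; the function name row_properties and the second, reordered row are hypothetical. Python stands in here for what an XSLT would do by matching on the attribute rather than on position.

```python
import xml.etree.ElementTree as ET


def row_properties(tr_xml):
    """Map each http-equiv value in a row to the text of its cell."""
    row = ET.fromstring(tr_xml)
    properties = {}
    for cell in row.iter("td"):
        name = cell.get("http-equiv")
        if name is not None:
            # itertext() also picks up text nested in a child such as <a>.
            properties[name] = "".join(cell.itertext()).strip()
    return properties


apache_style = (
    '<tr><td http-equiv="url"><a href="foo.html">foo.html</a></td>'
    '<td http-equiv="last-modified">2000-10-18T22:36</td>'
    '<td http-equiv="content-length">312k</td></tr>'
)
# Hypothetical row from a server that lists the date before the name.
other_style = (
    '<tr><td http-equiv="last-modified">2000-10-18T22:36</td>'
    '<td http-equiv="url"><a href="foo.html">foo.html</a></td></tr>'
)
print(row_properties(apache_style))
print(row_properties(other_style))
```

The same lookup by property name works for both rows even though their column order differs.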

The above example skips over the issue of what namespace the imagined http-equiv attribute comes from. It would be useful if there were a standard which defined a set of global attributes which would be "the Dublin Core of common RFC 822 headers." Properties such as "byte size" and "media type" occur in both HTTP message headers and MIME message headers. There should be a standard which addresses how to recognise the same information within an XML document. Such a namespace could then be put to use in Collection Index pages and many other situations. Of interest to such an effort would be the HTTP/1.1 spec, section 19.4, "Differences Between HTTP Entities and MIME Entities."

This http-equiv spec would also hopefully pin down what units byte lengths should be expressed in. Some servers format size in units of bytes while others use kilobytes. If Collection Index pages were marked up such that human readable formats are used in element content and machine readable formats are used in attributes, then it would seem logical for the machine readable format to use bytes and not kilobytes as the unit.
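
A minimal sketch of that split between attribute and content follows. The attribute name bytes is invented for illustration, as are the function names, and the rounding rule (nearest kilobyte below one megabyte, tenths of a megabyte above) is an assumption rather than a transcription of any server's code.

```python
def human_size(byte_count):
    """Round a byte count for display: kilobytes below 1 MB, 0.1 MB steps above."""
    if byte_count < 1024 * 1024:
        # Assumption: round to the nearest kilobyte, showing at least 1k.
        return "%dk" % max(1, (byte_count + 512) // 1024)
    return "%.1fM" % (byte_count / (1024 * 1024))


def size_cell(byte_count):
    """Render a td with the exact size in an attribute and a rounded size as content."""
    # "bytes" is a hypothetical attribute name, not part of any standard.
    return '<td http-equiv="content-length" bytes="%d">%s</td>' % (
        byte_count, human_size(byte_count))


print(size_cell(319488))
```

A human reader sees the rounded value in the cell while a program can read the exact byte count from the attribute.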

Perhaps the column-to-property-value mapping could be done on a th element so that the attribute assignment would need to be expressed only once per column instead of on each td. There would then be sufficient syntax to identify which Resource property each column enumerates, yet no unnecessary repetition of the information.
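
The th variant might be consumed as follows. This is a sketch assuming the first tr of the table holds the th cells and every later tr holds one Resource; the function name rows_by_header and the sample listing are hypothetical.

```python
import xml.etree.ElementTree as ET


def rows_by_header(table_xml):
    """Use http-equiv on th cells to name the td cells in every later row."""
    table = ET.fromstring(table_xml)
    rows = table.findall("tr")
    # Assumption: the first row is the header row.
    column_names = [th.get("http-equiv") for th in rows[0].findall("th")]
    resources = []
    for row in rows[1:]:
        values = ["".join(td.itertext()).strip() for td in row.findall("td")]
        resources.append(dict(zip(column_names, values)))
    return resources


listing = (
    "<table>"
    '<tr><th http-equiv="url">Name</th>'
    '<th http-equiv="last-modified">Last modified</th>'
    '<th http-equiv="content-length">Size</th></tr>'
    '<tr><td><a href="foo.html">foo.html</a></td>'
    "<td>2000-10-18T22:36</td><td>312k</td></tr>"
    "</table>"
)
print(rows_by_header(listing))
```

Position still matters within one page, but the header row declares it, so different servers remain free to order their columns differently.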

One last point about XSLTs: perhaps there could be a mod_autoindex directive for specifying which XSLT should be applied to the pages mod_autoindex generates. This would use the syntax defined in the W3C Recommendation Associating Style Sheets with XML documents [AssociatingStyleSheets]. The abstract of that Recommendation says: "This document allows a style sheet to be associated with an XML document by including one or more processing instructions with a target of xml-stylesheet in the document's prolog."
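
What such a directive would have to emit can be sketched in a few lines. The function name with_stylesheet, the stylesheet path, and the default type value "text/xsl" are assumptions made for illustration; no such mod_autoindex directive exists in Apache 1.3.19.

```python
from xml.sax.saxutils import quoteattr


def with_stylesheet(document, xslt_href, media_type="text/xsl"):
    """Place an xml-stylesheet processing instruction in the document's prolog."""
    pi = "<?xml-stylesheet type=%s href=%s?>" % (
        quoteattr(media_type), quoteattr(xslt_href))
    if document.startswith("<?xml "):
        # An XML declaration, when present, must remain the very first thing.
        end = document.index("?>") + 2
        return document[:end] + "\n" + pi + document[end:]
    return pi + "\n" + document


page = "<html><body><table></table></body></html>"
# The stylesheet path below is hypothetical.
print(with_stylesheet(page, "/xsl/collection-index.xsl"))
```

Because the change is a single constant string ahead of the root element, it fits the paper's pattern of patches that only touch string output.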

10.4  Property Value Truncation 

As mentioned previously, in Apache 1.3.19 mod_autoindex the column alignment of property values is realized via the fixed-pitch font implied by pre and whitespace padding. Sometimes property values are truncated in the interest of columnar formatting. This is demonstrated in the example document which has been used throughout this paper. Here the relevant snippet is repeated.

<IMG SRC="/img/unknown.gif" ALT="[   ]"> <A HREF="theSecretOfLifeIsTo.hmph">theSecretOfLifeIsTo...></A> 18-Oct-2000 22:26    40k  

The URL of the resource is /theSecretOfLifeIsTo.hmph but several of the tail characters are truncated. If the Resource descriptions were embedded within a table then truncation would not be required, as column alignment for tables is implemented automatically by HTML browsers.

The same point could be made about sizes. Apache rounds size values to the nearest kilobyte or .1 megabyte. IIS does not do any rounding. Rounding can be desirable for humans reading the Collection Index pages. This is another case where having both the human readable and machine readable formats may be a good thing.

Even though truncation would no longer be required for column alignment purposes, it may still be desirable. Perversely long property value strings could still cause wastefully sparse HTML layout. For example, consider a Collection which contains several Resources all having short names except for one which has a very long name. In the rendered HTML page the name column would be wide in order to accommodate the one long name, yet all but one of the cells in the column would contain lots of blank space.
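
One way to keep optional truncation from destroying information is to shorten only the displayed text and leave the href complete. The sketch below assumes a display limit of 23 characters and a trailing "..." marker, both chosen for illustration; the function name name_cell and the sample name are hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr


def name_cell(name, max_chars=23):
    """Shorten only the displayed text; the href keeps the complete name."""
    shown = name if len(name) <= max_chars else name[:max_chars - 3] + "..."
    return "<td><a href=%s>%s</a></td>" % (quoteattr(name), escape(shown))


print(name_cell("aVeryLongResourceNameIndeed.html"))
print(name_cell("foo.html"))
```

A client that reads the href attribute always gets the full name, whatever the human readable content shows.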

11  Implications and Broader Context 

This section considers the implications of the syntactical modifications introduced in previous sections. As each patch was introduced earlier, specific and immediate benefits were illustrated. This section discusses broader benefits of Collection Indexing which are not attributable to any specific syntactical modification of mod_autoindex's documents.

11.1  Implicit Freebie Benefits 

The major search engines (for example, Google, Altavista, and FAST) only care about HTML links and not about XML. Collection Index pages are a freebie discovery mechanism for search engine integration. So, Collection Index pages are a cheap and easy way to make XHTML-encoded application data accessible to the search engines.

Collection Indexing is a safe mechanism for data PUTs by multiple clients. On a vanilla, file system-based Web server, Collection Indexing is the only dynamic mechanism. When a new Resource is added to a Collection, the Web server automatically updates the Collection Index page. Collection Index pages are "written to" only by the Web server, that is, by a single party, so there are no race conditions. By contrast, having clients repeatedly update a shared index.html invites a race condition. This way a dynamic XML-based application can be hosted on the simplest of Web servers.

11.2  Collection Indexing without Web Servers 

This paper has only considered Collection Indexing in the context of a Web server. Collection Indexing can be applied to other situations as well. Coming from an HTTP protocol perspective it could be argued that Collection Indexing is superfluous. All this Collection Indexing information could have been gleaned from Apache 1.3.19 without any patches. For example, for a given slash-terminated URL a client could:

  1. Perform a GET request on the URL.
  2. Rip up the response document looking for HTTP URLs in <a href= ... </a>.
  3. Perform a HEAD request on each URL discovered.

This would result in the client having the same information as provided by a Collection Index page. Granted, it would cost more HTTP round trips to get the information, but no changes to deployed software are required.
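
The three steps can be sketched as follows. To keep the example runnable without a network, the HTTP calls are passed in as functions: get(url) is assumed to return the page text and head(url) a dictionary of response headers. All names here, including describe_collection and the example.invalid host, are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect the href value of every a element in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def describe_collection(url, get, head):
    """Steps 1 to 3: GET the listing, extract links, HEAD each child."""
    collector = AnchorCollector()
    collector.feed(get(url))
    children = {}
    for href in collector.hrefs:
        child = urljoin(url, href)
        # Keep only resources below the collection; skip sort and parent links.
        if child.startswith(url) and child != url and "?" not in href:
            children[child] = head(child)
    return children


# Stand-ins for real HTTP calls so the sketch runs without a network.
def fake_get(url):
    return ('<pre><a href="?N=D">Name</a> <a href="/">Parent</a> '
            '<a href="foo.html">foo.html</a></pre>')


def fake_head(url):
    return {"Content-Type": "text/html", "Content-Length": "319488"}


print(describe_collection("http://example.invalid/docs/", fake_get, fake_head))
```

The loop makes the cost visible: one HEAD request per child, on top of the initial GET.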

The above argument is a protocol-centric one. Collection Indexing is primarily concerned with XML content, not the HTTP protocol. That is, the core focus is sets of interlinked XML documents, not sets of HTTP Resources. The following examples illustrate the use of Collection Indexing in non-HTTP server contexts.

An example of non-HTTP Collection Indexing is the case of an HTML browser rendering a slash-terminated file:// URL. In this situation, the browser uses the client OS file system APIs to find out what files are in a directory. It then auto-generates an HTML page which represents the directory contents. Opera renders the directory listing as a table, Netscape renders it as a big pre much like mod_autoindex, and MSIE5 has its Web Views. If the browsers were to generate Collection Index pages for file:// URLs then an XML application could work whether loaded from an http:// URL or a file:// URL.

Another example of how Collection Indexing could come into play in a non-HTTP situation is the case of RFC 2387, The MIME Multipart/Related Content-type. Every level of the multipart message has a start entity. These start entities could be Collection Index pages. In this way a Web site could be archived to a multipart/related file.
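
A single level of such an archive might be assembled as in the sketch below, which uses Python's standard email package. The Content-ID value, the function name archive_collection, and the sample page are hypothetical, and only text/html Resources are handled.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def archive_collection(index_html, resources):
    """Pack a Collection Index page and its Resources into one multipart/related message.

    `resources` maps a relative URL to its HTML text.
    """
    start_id = "<index@archive.invalid>"  # hypothetical Content-ID
    # The start parameter names the root entity: the Collection Index page.
    message = MIMEMultipart("related", type="text/html", start=start_id)
    index_part = MIMEText(index_html, "html")
    index_part["Content-ID"] = start_id
    message.attach(index_part)
    for location, body in resources.items():
        part = MIMEText(body, "html")
        part["Content-Location"] = location
        message.attach(part)
    return message


archive = archive_collection(
    '<table><tr><td><a href="foo.html">foo.html</a></td></tr></table>',
    {"foo.html": "<p>foo</p>"},
)
print(archive.get_content_type())
```

A nested Collection would presumably become a nested multipart/related part with its own start entity, which this sketch does not attempt.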

A final example of Collection Indexing is caching for offline use. A browser-based JavaScript application could use the Collection Index page set to enumerate its data store and load it into the browser's cache so that all the data will be available later if the original storage system is not available to the JavaScript application.

As discussed previously, one benefit of Collection Indexing is that Apache's FancyIndexing can be implemented client-side. The next section demonstrates software which does just that and more.

12  Wrench 

This section introduces wrench. The name "wrench" is an acronym for "Web Resource Explorer for Navigation Collection Hierarchies". Wrench is a JavaScript1.1 client application which runs within the context of a JavaScript-enabled HTML browser.

Wrench can be thought of as a Web-only equivalent to Microsoft's Explorer. Explorer is essentially a hard drive file system navigator. Explorer can also be used to navigate networked file systems, including a WebDAV server's URL space. In contrast, wrench only reads Collection Index pages served up over HTTP via GETs.

Like Explorer, wrench's user interface consists mainly of a tree on the left and a table on the right. The tree provides visual context by displaying a hierarchy of Collections. When the user selects a specific node in the tree, the table on the right is loaded with the properties of the Resources in the corresponding Collection. The column headers in the table can be clicked on to cause the Resources in the table to be sorted by that column. Resources can be sorted by name, size, date of last modification, and MIME type. In this way wrench implements the same behavior as Apache's FancyIndexing. The relative advantage of wrench over FancyIndexing is that the sorting is performed on the client without returning to the server to load the same information sorted in a different order.
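
The client-side sorting idea is simple enough to sketch. The following is not wrench's code (wrench is JavaScript); it is an illustration in Python with hypothetical record fields, showing that once the Resource properties are held on the client, re-sorting needs no further request.

```python
def sort_resources(resources, column, descending=False):
    """Sort already-downloaded Resource records by one property, with no server round trip."""
    return sorted(resources, key=lambda r: r[column], reverse=descending)


# Hypothetical records as a client might hold them after parsing one index page.
resources = [
    {"name": "b.html", "size": 1024, "modified": "2000-10-18T22:36", "type": "text/html"},
    {"name": "a.png", "size": 40960, "modified": "2000-09-01T08:00", "type": "image/png"},
]
print([r["name"] for r in sort_resources(resources, "name")])
print([r["name"] for r in sort_resources(resources, "modified", descending=True)])
```

Note that ISO 8601 date/time strings of equal precision sort chronologically as plain text, which is one more argument for that format.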

In the interest of full disclosure it should be mentioned that although wrench is implemented in JavaScript1.1, it needs a non-JavaScript mechanism which actually performs HTTP GET requests. Wrench starts from within an HTML page. The page is loaded into a browser and runs just like regular JavaScript embedded within an HTML page. The distinction between wrench and most other JavaScript is that wrench can read multiple Collection Index pages without itself being reloaded. Usually, when a page is read by a browser, it replaces the previous page rendered in the browser. With wrench, though, the Collection Index pages are not directly rendered by the browser. That is, the browser parses the HTML document which contains wrench's JavaScript code, but after that wrench itself parses the XML of the Collection Index pages.

The point is that early implementations of JavaScript in Web browsers have no built-in way of providing the content of URLs to running bits of JavaScript code. So, wrench needs a support mechanism in order to process Collection Index pages. The support mechanism could be a Java Applet or MSIE's XMLHTTP object. The support mechanism sends an HTTP GET request message to a Web server and provides the contents of the response message to the JavaScript.

Wrench has nice user interface properties. But it is more than NUT with a context tree. Wrench builds a model of the URL subtree. NUT just knows about a single collection at a time while Wrench remembers the tree. So, Collection Indexing pages are the basis for browser-based applications which can model URL subtrees not just individual nodes in the tree. This in concert with a traditional Web search engine can be very powerful.
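
Building a model of an URL subtree from Collection Index pages can be sketched as a recursive walk. This is an illustration in Python, not wrench's implementation; it assumes that a trailing slash identifies a child Collection, and it takes the page-fetching function as a parameter so that canned pages can stand in for a server. The names build_tree, LinkCollector, and the example.invalid host are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every a element in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self.hrefs.append(href)


def build_tree(url, get):
    """Return {child_url: subtree_or_None}; None marks a non-collection Resource."""
    collector = LinkCollector()
    collector.feed(get(url))
    tree = {}
    for href in collector.hrefs:
        child = urljoin(url, href)
        if not child.startswith(url) or child == url or "?" in href:
            continue  # ignore parent, self, and sort links
        # Assumption: a trailing slash identifies a child Collection.
        tree[child] = build_tree(child, get) if child.endswith("/") else None
    return tree


# Canned pages standing in for a server, keyed by URL.
pages = {
    "http://example.invalid/": '<a href="docs/">docs/</a> <a href="a.html">a.html</a>',
    "http://example.invalid/docs/": '<a href="/">Parent</a> <a href="b.html">b.html</a>',
}
print(build_tree("http://example.invalid/", pages.__getitem__))
```

Because only links strictly below the current URL are followed, the walk terminates even though each page links back to its parent.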

13  Call for Standardization 

Collection Indexing is widely implemented. For example, IIS has "directory browsing" and Apache has mod_autoindex "FancyIndexing". Historically, Collection Index pages have primarily been used to assist humans in navigating Web server URL trees. Additionally, they have been used by Web crawling robots such as search engines. There is no standard which defines the structure of and information contained in Collection Index pages. This has led to divergent implementations. Yet that has not caused much of a problem to date, because the pages were marked up well enough for humans to understand and crawlers just ripped up the HTML looking for <a href=".... XML clients, however, cannot use the current index pages.

The following statement seems uncontroversial:

The vast majority of currently deployed Web servers are capable of generating HTML pages in response to GET requests on URLs ending in '/'. These generated pages contain links to resources available at URLs which are prefixed with the URL to which the GET request message was addressed. The pages may also enumerate properties of those resources such as byte length, MIME type, and last modified date.

This paper has only examined Apache's implementation of directory listing, but many different Web servers do the same thing, and each vendor's implementation produces pages which express the information in slightly different ways. A situation like this is a prime candidate for standardization. Further, the simplicity of the proposal and the trivial nature of the code modification make it all the easier to adopt.

Despite the seemingly uncontroversial nature of Collection Indexing, things get complicated when labels (such as "children" or "collection") are applied to this commonly implemented behavior. And the controversy becomes even more heated when comparisons are made to existing relevant standards. See [CollectionIndexing] for a reference to a paper which discusses the possibilities for Collection Indexing standardization in the context of such specs as HTTP, WebDAV, and Relative URLs.

Hopefully, this paper has demonstrated how simple it is to add XML technologies to the directory listings of the current crop of Web servers. Even without a relevant standard, Collection Indexing is a simple yet powerful idea which makes Web servers more useful to XML applications. Apache could be made XML client friendly immediately using some or all of the techniques presented in this paper. Even better would be an interoperability-increasing standard defining Collection Indexing such that any XML client could work with any Web server's Collection Index pages, whether generated by Apache or by other offerings.

14  Further Information 

For more information on Collection Indexing, please visit [CollectionIndexing].

Email about Collection Indexing can be sent to john.tigue@tigue.com.

Code patches to mod_autoindex and related XSL Transforms are available at [CollectionIndexing].

15  Bibliography 

[CollectionIndexing]
A collection of Tigue's documents on Collection Indexing including the code samples from this presentation.
http://www.tigue.com/collection-indexing/
[XLink]
XML Linking Language (XLink) Version 1.0
http://www.w3.org/TR/xlink/
[RFC1808]
RFC 1808 Relative Uniform Resource Locators
http://www.ietf.org/rfc/rfc1808.txt
Note that RFC 1808 was updated by [RFC2396]
[RFC2396]
RFC 2396 Uniform Resource Identifiers (URI): Generic Syntax
http://www.ietf.org/rfc/rfc2396.txt
Note that RFC 2396 was updated by [RFC2732]
[RFC2732]
RFC 2732 Format for Literal IPv6 Addresses in URL's
http://www.ietf.org/rfc/rfc2732.txt
[RFC2616]
RFC 2616 Hypertext Transfer Protocol -- HTTP/1.1
http://www.ietf.org/rfc/rfc2616.txt
[RFC2518]
RFC 2518 HTTP Extensions for Distributed Authoring -- WEBDAV
http://www.ietf.org/rfc/rfc2518.txt
[XMLSchemaDatatypes]
W3C Recommendation XML Schema Part 2: Datatypes
http://www.w3.org/TR/xmlschema-2/
[DateAndTimeFormats]
W3C Note: Date and Time Formats
http://www.w3.org/TR/NOTE-datetime
This document is a profile of ISO 8601 : 1988 (E), "Data elements and interchange formats - Information interchange - Representation of dates and times".
[XHTML]
W3C Recommendation XHTML 1.0: The Extensible HyperText Markup Language
http://www.w3.org/TR/xhtml1/
[XML]
W3C Recommendation Extensible Markup Language (XML) 1.0 (Second Edition)
http://www.w3.org/TR/2000/REC-xml-20001006
[AssociatingStyleSheets]
W3C Recommendation Associating Style Sheets with XML documents Version 1.0
http://www.w3.org/TR/xml-stylesheet/

Copyright © 1999-2001 John Tigue Inc. All rights reserved.