The Hyperties Database Format

William J. Weiland
Catherine Plaisant-Schwenn
Ben Shneiderman

Human-Computer Interaction Laboratory
Department of Computer Science
University of Maryland
College Park, MD 20742

March 1, 1988

Introduction

Hyperties has been in use for some time as a pure hypertext system running under MS-DOS. The original Hyperties is highly specialized to this environment and to a particular set of capabilities and interface features. As a result, the system architecture is "closed" and difficult to extend.

There have been ongoing efforts to explore a variety of extended capabilities: large, bitmapped displays and multiple windows (on the SUN workstation); alternate input devices (touchscreen); graphics and other media; global textual search; and numerous possible interface variations. From these efforts have sprung a number of divergent Hyperties systems, each customized for a particular investigative task, but based on the original Hyperties architecture.

A more modular and extensible Hyperties architecture is proposed which, it is hoped, will facilitate future research by providing an open framework for experimentation and revision, will provide for maximal reuse of software, and will ultimately allow current work to be integrated into one system.

This paper describes the proposed database structure; others will examine the organization and implementation of the system software, the adaptation of the core of the system to different interface systems (e.g., NeWS, X-Windows) and use in a networked environment, and the document-description (formatting) language.

General Considerations

The Hyperties database system is implemented in the standard file systems available in most operating systems (though UNIX and MS-DOS are of primary concern). In general, a document (i.e. a node in the hypertext network) is represented by a collection of files, one of which serves as the primary description of the document's format and content.
Other, associated files contain graphical images (or other data, such as digitized sound, that will not normally be represented as human-readable text) and information used to allow rapid presentation of the document at browsing time. One special file, the master index, exists to provide the association between the document files and the database as a whole.

The new Hyperties database format permits great freedom in the organization of a database's component files. An author can impose a hierarchical structure (or any other that is suitably mnemonic), using the features of the standard filesystem found in UNIX and MS-DOS, to simplify access to components of the database when a fully-integrated authoring environment is not available, and to "flatten" the structure of a database to optimize use of storage when authoring is fully supported. Also, the new format will allow spreading the component files of a database across multiple devices, allowing larger databases to be used on floppy-based systems.

Several simplifying assumptions have been made, which impose limits on the browser interface. These, however, are not inherent in the database structure, but have been introduced out of considerations of efficiency in implementation. In future versions, some of these restrictions may be relaxed.

First, it is assumed that the articles are to be displayed in windows whose size is fixed ahead of browsing time -- the current notion is that construction of a database (and its index), and formatting of the component articles is to be done ahead of time, in a separate "compilation" phase. Furthermore, articles are to be viewed by paging rather than scrolling, so that formatting produces a collection of page images. On more powerful systems, the formatting could be performed at browsing time for resizable and/or scrolling windows. Finally, a Hyperties database is assumed to be static, i.e. the contents of a database will not change during browsing.
With a more powerful database management system to handle file requests and arbitrate access conflicts, dynamic databases should be possible as well.

Database Components

In general, the files of a Hyperties database fall into two categories, portable (system-independent) files, and non-portable (system-specific) files. Of the portable files, the most basic file type is the storyboard, which contains a pure text description of an individual article in the system. Pictures and targets may also be portable, provided that the host browser supports the file format, and can compensate for differences in screen aspect ratio, etc. Non-portable files are the master index, which gives the correspondence between identifiers (see below) and system-specific file names, and the display files, which describe the images of displayed pages in their formatted form. The latter are non-portable since they are binary files, and make use of system-specific coordinate systems and window sizes.

Since a Hyperties database will consist of several types of component files, there will be a set of standard 3-letter extensions used to identify file types to the system. These extensions will identify the nature of the contents of a file, its system-specificity, and a general version number, to permit some evolution of the internal file formats without "orphaning" existing databases. For example, the initial portable storyboard files will have the extension ".sb0", and a later version of the (non-portable) PS/2 display file may be ".ps3".

A further convention adopted is that all database components are referred to by an identifier, which is a user-supplied name of arbitrary length, and that components contain their own names, so that they may be identified and collected automatically, without requiring user intervention.
These identifiers may contain embedded whitespace, but not leading or trailing whitespace (it will be removed automatically); sequences of embedded whitespace will be treated as single, "generic" whitespace characters for purposes of comparison.

In addition to the implicit type specified by the extension, the display files contain the images of (typed) internal system objects, whose methods permit these objects to be loaded according to their explicitly-stated types. These images may contain pointers to other object images (on disk), which are simply an encoded filename and offset. These pointers, and the handling of these objects, are described in greater detail in the object-system document.

Format of Components

As mentioned above, the storyboard files are pure text descriptions of the individual documents (nodes in the hypertext network) of a Hyperties database. These files are broken into five sections, containing the title (the identifier of the document), synonyms (alternate identifiers by which the document may be referenced), description (a brief textual, graphical, etc. piece, usually used to summarize the document), content (a lengthy multi-media piece -- the document proper), and notes (an optional collection of textual remarks maintained by the author). The content (and possibly the description) may contain references to other documents, which are simply the titles or synonyms of those documents, embedded in the text (or connected to a graphic), and marked by the author. Since the storyboard is a pure text document, it uses a special formatting language (see document) to indicate page layout, font types, and to specify the inclusion of graphics.

The master index provides the mapping of identifiers to actual file names in a file system. It thus defines the collection of files that constitute a database.
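Because the master index is keyed by identifier, the whitespace rule given earlier for identifiers determines when two names refer to the same component. The following is a minimal Python sketch of that rule, not the system's actual code: the function names are hypothetical, and case-sensitive comparison is an assumption, since case handling is not specified here.

```python
def normalize_identifier(name: str) -> str:
    """Return the form of an identifier used for comparison.

    str.split() with no argument drops leading and trailing whitespace
    and splits on runs of whitespace, so joining with one space collapses
    each embedded run into a single "generic" whitespace character.
    """
    return " ".join(name.split())


def same_identifier(a: str, b: str) -> bool:
    """Compare two identifiers under the whitespace rule.

    Assumption: comparison is case-sensitive.
    """
    return normalize_identifier(a) == normalize_identifier(b)
```

Under this sketch, "Space   Shuttle" and " Space Shuttle " compare equal, while "SpaceShuttle" and "Space Shuttle" do not.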
The master index is constructed automatically by the database compiler, but is a human-readable text file, so that an author may make use of it to locate database files in the absence of an authoring environment, which would automatically provide reference by identifier. The format of this file is as follows:

1) a header line, indicating that document records follow, followed by a blank line
2) a collection of document records, each consisting of:
   - a title, surrounded by quotes, with "" for embedded quotes
   - a pathname, relative to the database directory (or common ancestor of all database directories), no extension
   - (on following lines) synonyms, formatted like the title, without pathname
3) a header line, indicating that picture records follow, preceded and followed by blank lines
4) a collection of picture records, each consisting of:
   - identifier (like title)
   - pathname with extension
   - byte offset (since one file may contain several pictures)
5) a header line, indicating that target records follow, preceded and followed by blank lines
6) a collection of target records, identical to picture records

The display files are created when individual storyboards are formatted for a specific system and configuration. These files, essentially, contain nothing but formatting information -- the actual text of a document comes from the associated storyboard. In this way, formatting on-the-fly is eliminated, allowing a document to be displayed very rapidly, but without having to keep two copies of every document (the formatted and unformatted versions).

As mentioned above, the display files are binary images of the internal objects used to represent components to be displayed. Each record of this file contains a type ("class") identifier that is used by the object system to invoke the appropriate functions to read (or skip) the record.
In this way, the format of the display file need not remain fixed as the capabilities of the browser are extended; such extensions are incorporated by creating new classes of objects, while the old classes continue to be supported. These display files are organized as follows:

Display file:
- DocumentHeader type code, header info (title, no. of pages, etc)
- Description type code, description data object(s)
- Page type code, page data object(s)

Description and page data objects:
- Text type code, font info, # of strings, string specifier(s)
or
- TextTarget type code, font info, # of strings, string specifier(s)
or
- Picture type code, file name specifier
or
- PictureTarget type code, target name specifier, reference name specifier

String and name specifiers: (actual string data contained in storyboard)
- x, y, length (in characters), offset in storyboard file

A Hyperties database will also contain a number of auxiliary files, which allow images (and associated targets), generated by various graphics editors, to be included. These exist simply to provide an identifier, and description of the format, for such images, without having to modify the image files themselves (so they remain editable). These auxiliary files, which are human-readable, consist of an arbitrary number of records, each of the following form:

- type (identifies picture or target, image file format, etc)
- identifier
- name of picture or target data file

Database Compilation

The process of compiling a Hyperties database consists of several phases. First, the given directory or directories which contain the database are scanned for files with the appropriate extensions, and the three parts of the index (storyboards, pictures, and targets) are created by examining these files for their identifiers. This phase also involves some checks for name conflicts.
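The scanning phase just described might be sketched in Python as follows. This is an illustration under stated assumptions rather than the compiler's design: the function and table names are hypothetical; of the extensions, only ".sb0" appears in this paper, while ".pc0" and ".tg0" are invented for the example; a conflict is assumed to mean two different files in the same part of the index claiming the same identifier. Reading an identifier out of a component file is left out, so the sketch takes (path, identifier) pairs as input.

```python
from pathlib import PurePosixPath

# Assumption: ".pc0" and ".tg0" are made-up extensions for this example.
KIND_BY_EXTENSION = {
    ".sb0": "storyboards",
    ".pc0": "pictures",
    ".tg0": "targets",
}


def canonical(identifier):
    """Trim the identifier and collapse embedded whitespace runs."""
    return " ".join(identifier.split())


def build_index(found):
    """Build the three index parts from (path, identifier) pairs.

    Returns (index, conflicts), where each conflict is a tuple
    (kind, identifier, first_path, clashing_path).
    """
    index = {kind: {} for kind in KIND_BY_EXTENSION.values()}
    conflicts = []
    for path, identifier in found:
        kind = KIND_BY_EXTENSION.get(PurePosixPath(path).suffix.lower())
        if kind is None:
            continue  # not a recognized database component
        key = canonical(identifier)
        part = index[kind]
        if key in part and part[key] != path:
            conflicts.append((kind, key, part[key], path))
        else:
            part[key] = path
    return index, conflicts
```

In this sketch the first file found keeps the identifier and later claimants are only reported, which leaves the decision about how to resolve a conflict to the author.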
Then, the individual storyboards are processed by the formatter to create the display files and extract references to other documents in the database. At this point, all references are checked against the index -- if any identifiers are missing from the index, the user is requested to supply their location if possible. The formatting process generates the display files, using information contained in the current environment description. Last, any uncorrectable errors in the database are signalled to the user, and various cross-reference tables may be produced, if requested.

Ultimately, the compiler will support partial compilation, for cases where only a small number of documents changed. This is a necessity for large databases, where the cost of recompiling an entire database could be substantial.

The database structure and compilation process described above allow for a variety of authoring aids, short of a full authoring environment, to be constructed simply. One possibility is a simple file selection utility, which would allow the author to select a document or picture by its identifier (a long, mnemonic name), and, using information contained in the master index, automatically invoke a word or graphics processor on the appropriate file(s), and recompile the database as changes are made. This could very likely be written using shell scripts or batch files in a UNIX or MS-DOS environment. Another tool of great utility would be an improved facility for selecting targets in graphical images.
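As an illustration of the file selection utility suggested above, the following Python sketch parses a master index and looks up a component by identifier. The paper gives the contents of each record but not the exact layout, so several details here are assumptions: the header lines are taken to be the literal words DOCUMENTS, PICTURES and TARGETS; a document title and its pathname are assumed to share one line; picture and target fields are assumed to be separated by whitespace. All function names and sample file names are hypothetical.

```python
import re

# A quoted name with "" standing for an embedded quote, then the rest of the line.
RECORD = re.compile(r'"((?:[^"]|"")*)"\s*(.*)$')

# Assumption: the header lines are these literal words.
HEADERS = {"DOCUMENTS": "documents", "PICTURES": "pictures", "TARGETS": "targets"}


def canonical(identifier):
    """Trim the identifier and collapse embedded whitespace runs."""
    return " ".join(identifier.split())


def parse_master_index(text):
    """Return {section: {identifier: location}} for a master index."""
    index = {name: {} for name in HEADERS.values()}
    section = None
    current_path = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line in HEADERS:
            section, current_path = HEADERS[line], None
            continue
        match = RECORD.match(line)
        if match is None or section is None:
            raise ValueError(f"unrecognized index line: {raw!r}")
        name = canonical(match.group(1).replace('""', '"'))
        rest = match.group(2).strip()
        if section == "documents":
            if rest:
                current_path = rest  # title line: carries the pathname
            elif current_path is None:
                raise ValueError(f"synonym before any title: {raw!r}")
            index[section][name] = current_path  # synonyms share the path
        else:
            path, offset = rest.rsplit(None, 1)  # pathname, then byte offset
            index[section][name] = (path, int(offset))
    return index


def locate(index, identifier):
    """Return (section, location) for an identifier, or None if absent."""
    key = canonical(identifier)
    for section, table in index.items():
        if key in table:
            return section, table[key]
    return None


SAMPLE = '''\
DOCUMENTS

"Space Shuttle" articles/shuttle
"Orbiter"
"The ""Enterprise"""

PICTURES

"Launch" images/launch.pc0 0

TARGETS

"Launch pad" images/launch.tg0 128
'''

if __name__ == "__main__":
    print(locate(parse_master_index(SAMPLE), "Orbiter"))
```

A utility built on this would append the storyboard extension to the returned document pathname (for example ".sb0") before invoking an editor on the file, and would then trigger recompilation.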