+ Page 1 +

-----------------------------------------------------------------
The Public-Access Computer Systems Review

Volume 5, Number 3 (1994) ISSN 1048-6542
-----------------------------------------------------------------

To retrieve an article file as an e-mail message, send the GET
command given after the article information to
listserv@uhupvm1.uh.edu. (Files are also available from the
University of Houston Libraries' Gopher server: info.lib.uh.edu,
port 70.)

CONTENTS

COMMUNICATIONS

Using the World-Wide Web to Deliver Complex Electronic Documents:
Implications for Libraries

By John Price-Wilkin (pp. 5-21)

To retrieve this file: GET PRICEWIL PRV5N3 F=MAIL

The World-Wide Web (also called the Web) is a very promising tool
for libraries to use to explore the delivery of rich and complex
documents. Nevertheless, there are many limitations in the Web's
HTML markup language and the ability of Web servers to deliver
structured information. This paper explores the benefits and
limitations of the Web in the context of several projects taking
place at the University of Virginia, both in the Library and in
the University's Institute for Advanced Technology in the
Humanities. A gateway between the Web and the SGML-based PAT
system that helps to overcome the Web's inherent limitations is
also described.

+ Page 2 +

-----------------------------------------------------------------
The Public-Access Computer Systems Review
-----------------------------------------------------------------

Editor-in-Chief

Charles W. Bailey, Jr.
University Libraries
University of Houston
Houston, TX 77204-2091
(713) 743-9804
Internet: lib3@uhupvm1.uh.edu

Associate Editors

Columns: Leslie Pearse, OCLC
Communications: Dana Rooks, University of Houston

Editorial Board

Ralph Alberico, University of Texas, Austin
George H. Brett II, Clearinghouse for Networked Information
     Discovery and Retrieval
Priscilla Caplan, University of Chicago
Steve Cisler, Apple Computer, Inc.
Walt Crawford, Research Libraries Group
Lorcan Dempsey, University of Bath
Pat Ensor, University of Houston
Nancy Evans, Pennsylvania State University, Ogontz
Charles Hildreth, READ, Ltd.
Ronald Larsen, University of Maryland
Clifford Lynch, Division of Library Automation,
     University of California
David R. McDonald, Tufts University
R. Bruce Miller, University of California, San Diego
Paul Evan Peters, Coalition for Networked Information
Mike Ridley, University of Waterloo
Peggy Seiden, Skidmore College
Peter Stone, University of Sussex
John E. Ulmschneider, North Carolina State University

+ Page 3 +

Technical Support

Tahereh Jafari, University of Houston

Publication Information

Published on an irregular basis by the University Libraries,
University of Houston. Technical support is provided by the
Information Technology Division, University of Houston.
Circulation: 8,202 subscribers in 65 countries (PACS-L) and 2,562
subscribers in 52 countries (PACS-P).

Back issues are available from listserv@uhupvm1.uh.edu. To
retrieve a cumulative index to the journal, send the following
e-mail message to the list server: GET INDEX PR F=MAIL.

Back issues are also available from the University of Houston
Libraries' Gopher server. Point your Gopher client at
info.lib.uh.edu, port 70, and follow this menu path:

     Looking for Articles
     Electronic Journals
     University of Houston Libraries E-Journals
     The Public-Access Computer Systems Review

The journal's URL is
gopher://info.lib.uh.edu:70/11/articles/e-journals/uhlibrary/pacsreview.

The first three volumes of The Public-Access Computer Systems
Review are also available in book form from the American Library
Association's Library and Information Technology Association
(LITA). The price of each volume is $17 for LITA members and $20
for non-LITA members. All three volumes can be ordered as a set
for $45 (indicate that you want the PACS Review set, order number
7712-X). To order, contact: ALA Publishing Services, Order
Department, 50 East Huron Street, Chicago, IL 60611-2729, (800)
545-2433.

+ Page 4 +

-----------------------------------------------------------------
The Public-Access Computer Systems Review is an electronic
journal that is distributed on the Internet and on other computer
networks. There is no subscription fee.

To subscribe, send an e-mail message to
listserv@uhupvm1.uh.edu that says: SUBSCRIBE PACS-P First Name
Last Name.

The Public-Access Computer Systems Review is Copyright (C)
1994 by the University Libraries, University of Houston. All
Rights Reserved.

Copying is permitted for noncommercial use by academic
computer centers, computer conferences, individual scholars, and
libraries. Libraries are authorized to add the journal to their
collection, in electronic or printed form, at no charge. This
message must appear on all copied material. All commercial use
requires permission.
-----------------------------------------------------------------

+ Page 5 +

-----------------------------------------------------------------
Price-Wilkin, John. "Using the World-Wide Web to Deliver Complex
Electronic Documents: Implications for Libraries." The Public-
Access Computer Systems Review 5, no. 3 (1994): 5-21. To
retrieve this file, send the following e-mail message to
listserv@uhupvm1.uh.edu: GET PRICEWIL PRV5N3 F=MAIL. (The file
is also available from the University of Houston Libraries'
Gopher server: info.lib.uh.edu, port 70.)
-----------------------------------------------------------------

1.0 Introduction

The World-Wide Web (also called the Web) is a very promising tool
for libraries to use to explore the delivery of rich and complex
documents. [1] Nevertheless, there are many limitations in the
Web's HTML markup language and the ability of Web servers to
deliver structured information. This paper explores the benefits
and limitations of the Web in the context of several projects
taking place at the University of Virginia, both in the Library
and in the University's Institute for Advanced Technology in the
Humanities. A gateway between the Web and the SGML-based PAT
system that helps to overcome the Web's inherent limitations is
also described.

2.0 SGML and TEI

The most worthwhile products that libraries can buy are ones that
conform to standards and are not tied to a specific software
package or operating system. These are the only products with
enduring value. Certainly, there are exciting electronic
resources being produced for specific software packages and
operating systems, but the extent to which libraries can build
collections of hypertext resources that are usable in the future
will depend entirely on the conformance of their resources to
true national and international standards. The most important
standard for this discussion is SGML, a standard designed to
express the organization of documents and to accommodate even the
most complex multimedia materials.

+ Page 6 +

A brief (and admittedly superficial) discussion of SGML and
the Text Encoding Initiative may be helpful. SGML (Standard
Generalized Markup Language, ISO 8879) is a standard approved by
the ISO for the descriptive markup of documents. The language of
SGML is sufficiently flexible that the sense of "document" has
been expanded to include coordinated time-based elements of
hypermedia (e.g., animated dance, music, and character-based
score and choreography moving in synchrony at a pace controllable
by the user). SGML is not a tag set: there are no pre-set tags.
Instead, SGML is a set of rules (or a grammar) for articulating
a vocabulary of tags. These rules are sufficiently rigorous that
specialized software can check the validity or conformance of a
document. The specification of a particular vocabulary and its
grammar is a DTD (Document Type Definition); the DTD can also
serve to document many decisions about the organization of a
text. Without that validity--i.e., without being parsed against
a DTD--the document is not SGML encoded, although it may share
many of the characteristics of SGML.

For our work at Virginia, the most notable of these
characteristics has been the descriptive nature of the tagging.
Rather than saying that an element of the text appears in bold,
17-point Helvetica, centered at the top of a new page, we use the
tags to define the function of a textual element (e.g., a title).
The tag set used must necessarily elaborate the elements of the
texts we see in an academic environment: a tag set designed for
articles or documentation, for example, will omit important
elements needed for encoding poetry. To serve those needs, the
Text Encoding Initiative (or TEI) has published a set of
guidelines for the application of SGML to texts in the
humanities. Functions of the text or hypertext, expressed
descriptively and with a standard language, are freed from the
constraints of a specific software package or application. SGML-
encoded works can serve a variety of functions, depending on the
user's needs and available software.

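To make the idea of descriptive tagging concrete, the following
minimal Python sketch renders one functionally tagged fragment in
two different ways. It is an illustration only. The tag names
(poem, title, stanza, l) are hypothetical and are not taken from
the TEI guidelines, and well-formed XML parsed with Python's
standard xml.etree.ElementTree module stands in for SGML, which
that module does not parse. Two lines of Rossetti's "The Blessed
Damozel" serve as sample text.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical descriptive markup: each tag names a function, not a look.
SOURCE = """<poem>
<title>The Blessed Damozel</title>
<stanza>
<l>The blessed damozel leaned out</l>
<l>From the gold bar of Heaven;</l>
</stanza>
</poem>"""


def render_plain(root):
    """Render for a character terminal: upper-case title, indented verse."""
    out = []
    for el in root.iter():
        if el.tag == "title":
            out.append((el.text or "").upper())
            out.append("")
        elif el.tag == "l":
            out.append("    " + (el.text or ""))
    return "\n".join(out)


def render_html(root):
    """Render the same source as simple HTML for a Web client."""
    out = []
    for el in root.iter():
        if el.tag == "title":
            out.append("<H1>" + escape(el.text or "") + "</H1>")
        elif el.tag == "stanza":
            lines = [escape(l.text or "") for l in el.findall("l")]
            out.append("<P>" + "<BR>".join(lines) + "</P>")
    return "\n".join(out)


poem = ET.fromstring(SOURCE)
print(render_plain(poem))
print(render_html(poem))
```

Because the source records only what each element is, a new kind
of display needs only a new rendering function; the encoded text
itself is left untouched.
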
+ Page 7 +

3.0 The Potential of the Web

The Web uses a client/server architecture. Sophisticated Web
clients, such as Mosaic, offer an exciting sense of the
possibilities of electronic publishing on the network. Several
revolutionary, long-anticipated concepts are incipient in all
aspects of the relationship between client, server, and
publication. These concepts are:

o Open systems--the ability to make resources available
  to a variety of operating systems and a variety of
  applications is evident throughout the Web. Computers
  running X Windows, Microsoft Windows, and the Macintosh
  System 7 all participate equally. In addition to
  Mosaic, other clients, such as Cello and OmniWeb, are
  available. Multimedia tools, such as image viewers,
  are a matter of personal choice.

o Standards--given the Web's use of HTML, the importance
  of standards is heightened, and HTML is inexorably
  moving toward greater expressiveness and greater
  conformance to the SGML standard.

o Distributed information--the notion of a universe of
  distributed information, scattered throughout the
  Internet while being conceptually linked to other
  information, is becoming a reality through the use
  of the Web.

4.0 Representative Web Projects

Over the past two years at the University of Virginia, faculty
and staff involved in several projects began to develop a variety
of electronic materials using the SGML standard. Partly this was
to serve already apparent needs, but it was also to take
advantage of the potential of electronic publishing. While the
Library's Electronic Text Center and, later, its Digital Image
Center began to develop skills in creating electronic materials
in standard formats for networked access, scholars at the
Institute for Advanced Technology in the Humanities undertook the
daunting task of composing advanced, standards-based electronic
research materials without having the tools with which to publish
these materials. With the introduction of Mosaic, the Web was
quickly seen as a way to deliver these materials, and, with
relative ease, large bodies of SGML-encoded material were
converted to HTML for Web access. In order to focus on
particular aspects of those projects, the following example
projects are divided into sections on editions, history, image
archives, and instruction.

+ Page 8 +

4.1 Editions

In general, the Web offers creators of editions of literary or
other works the ability to represent a vast, interconnected web
of scholarly resources in a variety of different ways. The user
might view the resources simply, as in an edition of a work
without the introduction of a critical apparatus. A more complex
approach is also possible, with the user following the critical
apparatus at every turn. And finally, a rich and scholarly
approach is possible, allowing the user to view manuscript (or
printing) evidence or to examine the editor's assessment of the
evidence by comparing high-quality scans of original pages to the
marked-up transcriptions. With proper markup, an edition can be
viewed in as many ways as the reader desires. It can be a
variorum, a study edition, a critical edition, or historical
evidence. The form the edition takes is defined by the user's
needs or preferences.

4.1.1 British Poetry

The British Poetry Archive documents are perhaps the simplest of
those discussed here. (The project's URL is
http://www.lib.virginia.edu/etext/britpo/britpo.html.)

The two texts now available were transcribed by students in
Jerome McGann's graduate courses. In addition to the SGML-
encoded text itself, each work includes material such as
introductions, notes, and glosses as well as high-quality digital
facsimiles of pages from the original editions. The materials
are freely available on the Internet, and Mr. McGann hopes that
others will contribute to the archive. These texts represent the
simplest of the hypertext editions available on the University of
Virginia's Web, with supporting materials providing potential
deviations from an otherwise linear progression. The texts were
encoded in TEI-conformant SGML with the assistance of the
Library's Electronic Text Center, and they were then converted to
HTML for the purpose of making them available on the Web.

+ Page 9 +

4.1.2 Dante Gabriel Rossetti

To date, the most fully developed project is Jerome McGann's
ongoing edition--or archive--of the works of Dante Gabriel
Rossetti. (The project's URL is
http://jefferson.village.virginia.edu/rossetti/rossetti.html.)
According to McGann, the Rossetti archive is:

     a hypermedia environment for studying the works of the
     Pre-Raphaelite poet and painter D. G. Rossetti (1828-1882).
     The archive is a structured database holding digitized
     images of Rossetti's works in their original documentary
     forms. Rossetti's poetical manuscripts, early printed texts
     --including proofs and first editions--as well as his
     drawings and paintings are stored in the archive, in full
     color as needed. The materials are marked up for electronic
     search and analysis, and they are supplied with full
     scholarly annotations and notes. [2]

The organization of the archive is designed to capitalize on the
uniquely intertwined nature of Rossetti's artistic process,
linking image to text and text to image. When Rossetti
accompanied a painting by sonnets, the poems are included in the
archive along with an image of the painting. When Rossetti
illustrated a poem with a painting, an image of the painting is
included. Since Rossetti frequently designed his own editions,
electronic versions of his print works, with linked text and
images, are also available. McGann describes the difficulty of
studying Rossetti's works in a traditional print environment, and
then sets about trying to overcome those difficulties by melding
the resources in a way that allows the reader to follow the
threads of art, poetry, or translations without losing access to
the other materials.

4.1.3 Piers Plowman

The third project was begun in the 1994-95 academic year by one
of the most recent Institute fellows, Hoyt Duggan. (The
project's URL is
http://jefferson.village.virginia.edu/piers/archive.goals.html.)

+ Page 10 +

Mr. Duggan, an accomplished editor of Middle English texts,
created an edition of the Piers Plowman B text using the Web.
More in the model of the traditional scholarly edition, Mr.
Duggan's project brings together transcription and facsimile to
resolve vexing editorial problems. When the scribe uses an
abbreviation to represent a letter combination (e.g., a barred
"p" for "pre"), the reader typically wants the editor's best
judgement in rendering what was intended (i.e., "pre"). Many of
those decisions deal with unambiguous evidence, and some with
less certain evidence. Through SGML, both the suspension or
abbreviation and the reading of the character are registered.

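As a rough illustration of registering both the scribal form and
the editor's reading, the Python sketch below stores the
abbreviation as element content and the reading as an attribute,
then produces either view on request. The element and attribute
names (abbr, expan) and the sample line are assumptions made for
this sketch; the line is invented and is not a transcription from
any Piers Plowman manuscript. Well-formed XML again stands in for
SGML.

```python
import xml.etree.ElementTree as ET

# Invented sample line (not a manuscript transcription). The "p" followed
# by a combining bar (U+0304) stands in for the scribe's barred "p".
SAMPLE = '<l>to <abbr expan="pre">p\u0304</abbr>che the peple</l>'


def render(line, expand):
    """Return the line with the editor's readings (expand=True) or with
    the scribal abbreviations left as written (expand=False)."""
    parts = [line.text or ""]
    for child in line:
        if child.tag == "abbr" and expand:
            # Use the editor's reading recorded in the attribute.
            parts.append(child.get("expan", ""))
        else:
            # Keep the element content exactly as transcribed.
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return "".join(parts)


line = ET.fromstring(SAMPLE)
print(render(line, expand=True))   # reading text
print(render(line, expand=False))  # scribal form
```

Since neither form is discarded, a reader who doubts the editor's
judgement can always switch back to the scribal form.
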
To the greatest extent possible, digital facsimiles of all
seventeen surviving manuscripts will be included. With facsimile
evidence, it is always possible to return to something resembling
the original document to evaluate the editor's decision. Duggan
has also found that it is possible to create extremely
high-resolution images that, with enlargement and other digital
treatments, can reveal important new information about the
original composition.

4.2 History

With new technological tools, historians are offered both
challenges and opportunities. Electronic resources allow them to
blend evidence and interpretation in ways that help both student
and researcher. A simple approach in using the materials is
possible, where the reader follows the argument without examining
evidence. It is also possible for the reader to examine the
methodology of the researcher, either to scrutinize the research
or to be instructed in the methodology of research. The process
of bringing evidence and interpretation together poses
challenges of immense proportions. For example, the role
geography plays in defining an event can be brought to bear on
the problem, but it may involve the use of sophisticated systems
of geographic analysis. Two projects at the Institute have used
many diverse resources to explore their topics, incorporating
nineteenth-century Census data, geographic models, and animated
sequences.

4.2.1 Ayers (Valley of the Shadow)

Edward Ayers, a historian of the Civil War and the
Reconstruction, was one of the Institute's first two fellows.
(The project's URL is
http://jefferson.village.virginia.edu/vshadow/vshadow.html.)
According to Ayers, the project:

+ Page 11 +

     interweaves the histories of places on both sides of the
     Mason-Dixon line. It is the story of two communities
     relatively close to one another, sharing considerable prewar
     characteristics and similar experiences in the war itself.
     There was one area in the United States for which that was
     most clearly the case: the Great Valley that stretched from
     Pennsylvania, through Maryland and Virginia, into Tennessee.
     [3]

Ayers focuses on two towns--Staunton, Virginia and Chambersburg,
Pennsylvania--as representative communities from that Valley that
served as such an important economic, cultural, and military
locus of the War. The Web serves the historical ends by
balancing narrative--a filtering or interpretation of evidence--
with the presentation of that evidence. Ayers has described one
dilemma of the historian as a tight-rope act between providing
access to evidence and creating an organizing argument that does
not also obscure that evidence. His approach, providing the
deepening layers of evidence as "rhizomes" beneath the surface of
narrative, has been well-supported by the Web.

4.2.2 Dobbins (The Forum at Pompeii)

Dobbins, a classical archaeologist, reconstructs Pompeii from
archaeological evidence in a virtual space to advance his
argument. (The project's URL is
http://jefferson.village.virginia.edu/pompeii/page-1.html.)

He uses computer-aided design (CAD) tools to bring precision
to his reconstruction. Animation is being added to the CAD
representations to provide a three-dimensional perspective of
buildings and space. Structures that are normally seen in
isolation from each other are assembled in a total vision of
Pompeii that may suggest a degree of planning and coordination.

4.3 Image Archives

The Digital Image Center's image collections can be seen as
passive collections of standards-based images. (The project's
URL is http://www.lib.virginia.edu/dic/class/arh102.)

The image collections are organized to reflect the focus of
an individual class or an art exhibit. All of the images are
TIFF files subjected to JPEG compression. As such, they can be
examined with a variety of image tools, ranging from simple
viewers to software with analytical capabilities. Most
importantly, the tool used is largely the choice of the user. As
a result of planning and philosophy, all images are durable
enough to stand close scrutiny: they were scanned in 24-bit color
at a sufficiently high resolution to be enlarged several times
without significant degradation.

+ Page 12 +

The most developed collection is representative of this
archival philosophy. William Westphal's graduate architectural
history course on urban form includes hundreds of architectural
images, primarily from the Italian Renaissance, organized around
his lectures. Students can access these resources at all times
over the network as well as in a closed classroom environment
designed to efficiently access the images. Since they were
scanned at high resolutions, the images compare favorably with
the original slides, and they can be examined closely on screen.
The original slides have frequently degraded or had imperfections
that were corrected in the scanning process.

4.4 Instruction

The final project demonstrates the instructional capabilities of
the Web. (The project's URL is
http://www.lib.virginia.edu/etext/scanner.html.)

Using the Web to provide access to training materials has
many strengths. It gives variation to what would otherwise be a
flat, linear document. The document is dynamic and can easily
accommodate other elements as they are created by staff.

Scanning text is one of the most repetitive training
operations provided in the Electronic Text Center. Unlike
searching electronic texts, where every research need may entail
a different approach and different training needs, many of the
scanning decisions are generalizable and can be represented in a
training document. The project's instructional Web pages on
scanning were designed to reduce the amount of staff intervention
and give a greater degree of freedom to users.

4.5 Evaluation of the Projects

While the majority of the projects discussed here could be
supported by numerous stand-alone, operating-system specific
hypertext products, the Web has several advantages.

The projects' electronic resources are widely available on
the Internet, and users can access them on a variety of computer
platforms, regardless of the fact that the Web server is running
on a UNIX computer. (Attractive graphical Web clients, such as
Mosaic and OmniWeb, are available for Macintoshes, IBM-compatible
computers using Microsoft Windows, UNIX computers with X Windows,
and NeXTs.)

+ Page 13 +

Another key advantage is that the source material for the
editions either conforms to or is in the process of being
composed using international standards; it is marked up to
suggest the functional characteristics of the collections, rather
than their representational characteristics. Elements, such as
titles, quotations, and headings, are marked to suggest their
functional role in the document, rather than any presumed display
value. Displays depend instead on the capabilities of the user's
software, which utilizes the functional characteristics of the
elements to determine how to present the information.

This reliance on functional--not representational--
characteristics means that the same materials can be used in a
variety of different ways, supporting the creation of editions
with other software packages (e.g., Electronic Book Technology's
DynaText), use with different analytical tools (e.g.,
morphological parsers), and access through different database
schemes (e.g., text-specific systems or relational database
managers designed for images). A high degree of flexibility,
viability, and multi-platform access can be maintained.

Each of the mentioned editions and historical analyses was
first composed in a very rich SGML format that was designed to
discriminate between the functional characteristics of low-level
elements. They were subsequently converted (as automatically as
possible) to static HTML versions for use with the Web.
Elements, such as discrete descriptive bibliographic
characteristics, became simple list items, and most complex prose
and verse elements were reduced to paragraphs and line breaks.
After this conversion, it was discouraging to see that richness
disappear, but the original document remained unchanged.

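The loss of richness in such a conversion can be sketched in a
few lines of Python. This is not the conversion software used at
Virginia, whose details are not given here; it is a toy
down-converter under stated assumptions. The source tag names
(stanza, l, castList, castItem) are hypothetical, and well-formed
XML stands in for SGML.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical rich source: distinct tags for distinct functions.
RICH = """<text>
<stanza><l>first verse line</l><l>second verse line</l></stanza>
<castList><castItem>First speaker</castItem><castItem>Second speaker</castItem></castList>
</text>"""

# In the HTML target, both kinds of block become paragraphs and both
# kinds of line become text separated by line breaks.
BLOCK_LIKE = {"stanza", "castList"}
LINE_LIKE = {"l", "castItem"}


def down_convert(root):
    """Reduce every known block to <P> and every known line to <BR>."""
    blocks = []
    for block in root:
        if block.tag in BLOCK_LIKE:
            lines = [escape(c.text or "") for c in block if c.tag in LINE_LIKE]
            blocks.append("<P>" + "<BR>".join(lines) + "</P>")
    return "\n".join(blocks)


html = down_convert(ET.fromstring(RICH))
print(html)
```

In the output, the stanza and the cast list are marked up
identically, so nothing in the HTML records which was which. The
distinction survives only in the unchanged source file.
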
There is a continued expectation by the scholars who created
these resources that better tools will be developed to tap the
inherent complexity of these materials. The standards-based
format of the materials ensures that these scholars will be able
to take advantage of these new tools when they become available.

5.0 The Web as an Authoring and Document Delivery Environment

The authoring and document delivery capabilities of the Web are
significantly limited for documents of even moderate complexity.
Authoring for the Web is usually done in HTML. HTML has many
virtues, not least of which is its striving for expressiveness
and SGML validity. It is, however, an impoverished tag set with
little ability to reflect the complexities of most of the
documents discussed earlier, despite their being offered through
the Web. It is important to note that the Web is a limited
document delivery environment. Its inability to recognize or use
structural features of documents forces unpleasant administrative
decisions that will likely restrict the later use of these
documents.

+ Page 14 +

5.1 HTML's Lack of Expressiveness

The range of HTML tags available to users is limited. In
contrast to the hundreds of tags made available by the TEI
guidelines, roughly two dozen tags are made available in HTML.
While HTML will be expanded with HTML+ to give greater precision
in areas such as tabular data, HTML+ cannot be expected to
provide the breadth needed to support literary and historical
documents, or even to support standard journal literature.

This lack of expressiveness and insufficient breadth of tags
also leads to the author's inability to differentiate important
elements with HTML. In HTML, the same small set of tags is
necessarily used for diverse sets of elements. For example, the
<BR> code (line break) is used for verse lines, table elements,
stanza divisions, dramatis personae, and many other features.
Authors are also left with little ability to represent the
structural organization of a document. Where the author wishes
to define a bounded segment of text, such as a stanza or chapter,
no tag is available for this purpose. Instead, authors rely
extensively on dividing documents into files representing major
structural divisions. Elements that are normally defined as
structural tags in SGML, such as the paragraph (or <P>) tag, are
not defined by HTML in a way that reliably delimits the contents
of a paragraph.

This paucity of tags in HTML results in the author of any
document of moderate complexity using many tags to effect a
desired appearance, rather than to characterize the content.
This type of tagging confuses function and appearance.

The inability of HTML to represent complexity is often
closely linked to the inability of Web servers to provide access
to complex representations of documents. This inability is
fundamentally linked to the notion of structure. Where
structural distinctions exist in the markup language, there is no
inherent ability in the Web to deliver that individual element.
So, for example, HTML defines glossaries and glossary entries,
but, in order to provide access to an individual glossary entry
from a hypertext link, the server must send the entire file
(i.e., the file containing the glossary) to the user. Smaller
glossaries cause few problems, but this approach makes it
effectively impossible to provide access to individual "glossary"
entries in a document such as the Oxford English Dictionary,
where all 500 MB would be transferred across the network. While
Web browsers are intelligent enough to move automatically within
the file to the chosen glossary entry, the file transfer paradigm
is impractical for large-scale information delivery. Given this,
it must also be pointed out that there are very few HTML tags
that define structural relationships. Structures such as
chapters, sections, or poems are not represented.

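The difference between transferring a file and delivering a
single structural element can be sketched as follows. The two
functions are hypothetical and do not describe how any actual Web
server behaves; the glossary, its tag names, and its id values
are invented for the example, and well-formed XML stands in for
SGML.

```python
import xml.etree.ElementTree as ET

# Hypothetical glossary held in one file; the ids are invented.
GLOSSARY = """<glossary>
<entry id="dtd"><term>DTD</term><def>Document Type Definition</def></entry>
<entry id="sgml"><term>SGML</term><def>Standard Generalized Markup Language</def></entry>
<entry id="tei"><term>TEI</term><def>Text Encoding Initiative</def></entry>
</glossary>"""


def whole_file_response(document):
    """File-transfer model: the server returns the entire file."""
    return document


def element_response(document, entry_id):
    """Structure-aware model: return only the entry that was asked for,
    or None when no entry carries that id."""
    root = ET.fromstring(document)
    for entry in root.iter("entry"):
        if entry.get("id") == entry_id:
            return ET.tostring(entry, encoding="unicode").strip()
    return None


full = whole_file_response(GLOSSARY)
one = element_response(GLOSSARY, "sgml")
print(len(full), len(one))
```

With three entries the saving is trivial, but the response in the
second model stays the size of one entry however large the file
grows. Note that this sketch still parses the whole file on the
server for every request; a system with a structural index would
avoid that cost as well.
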
+ Page 15 +

The Web's deficiency with regard to structural features
leads to decisions with serious negative administrative
consequences. Because the Web does not include structure
awareness in its protocol and because HTML markup provides so
little support for structural representation of features, the
author and the administrator are forced to fragment documents
into sets of reasonably sized components. In converting the
ARL book University Libraries and Scholarly Communication (URL:
http://www.lib.virginia.edu/mellon/mellon.html) to HTML, I found
that, using the Web and HTML alone, it was necessary to divide
the dozen chapters into separate files. While this may not sound
onerous, extending this practice to a large collection of
documents--or even a small collection of large documents--would
be very difficult. An HTML version of the OED would become a set
of 300,000 files. Chadwyck-Healey's English Poetry Database
would become either 2,500 files (if the administrator wished to
provide access at the volume level) or 65,000 files (if access to
individual poems were supported). Even this severe approach does
not meet needs that might arise for substructures, such as
quotations and definitions within the OED or specific stanzas
within a poem.

5.2 Overall Limitations of HTML

For documents of limited complexity, HTML is an effective
authoring environment; however, it seriously limits the ways in
which a more complex document or a set of documents can be used.
No differentiation of important elements (e.g., stanzas and
subdivisions of prose) can take place, and it will be necessary
to upgrade the coding of HTML documents within the year.

The Web also lacks inherent document management or document
access capabilities. In part because of the limitations of the
markup language and in part because of the design of the
protocol, there is a paucity of structure represented and no
structure recognized. I emphasize "inherent," however, because
the Web also provides a gateway capability that can more than
compensate for this deficiency.

6.0 Exploring Alternatives
|
|
|
|
I have been developing a gateway from the Web to an indexed
collection of texts in an SGML-aware system to take advantage of
the complexity of the documents and yet make them available
through the Web. The texts are nearly all in fully validated
SGML tag sets, each with significant expressiveness. In contrast
to an HTML collection, potentially consisting of many files
representing the many component parts of the collection, each
text is a single file with as many as hundreds of thousands of
structural components.
|
|
|
|
+ Page 16 +
|
|
|
|
6.1 Collections
|
|
|
|
Three diverse examples illustrate the nature of the collections
used in the gateway.
|
|
|
|
6.1.1 University of Virginia Middle English Collection
|
|
|
|
The Middle English collection assembled by the University of
Virginia's Electronic Text Center comprises approximately thirty
texts in a single file. (The collection's URL is
http://etext.virginia.edu/Mideng.query.html.)
     Texts vary in size from several dozen pages to several
hundred pages. One of the Library's smaller collections, it is
approximately 11 MB of raw text, but it grows as new materials
become available. The markup language used is SGML complying
with the Oxford Text Archive's DTD, a tag set that will
eventually represent a valid subset of the TEI DTD. The tags
differentiate major structural elements, such as tales in the
Canterbury Tales, bibliographic elements, and elements of
composition (e.g., verse lines, stanzas, and paragraphs). Markup
is rich enough to support a wide range of analytical
requirements, and the texts have been made available for the
purpose of analysis to the University of Virginia community for
much of the past two years. With the permission of Open Text,
the Oxford Text Archive, and creators of individual texts, access
to this collection is unrestricted. It can be accessed in a
variety of ways, including the Web.
|
|
|
|
6.1.2 Chadwyck-Healey English Poetry Database
|
|
|
|
The Chadwyck-Healey English Poetry Database is purchased on tape
from the publisher and made available indexed by PAT. Access to
this collection is restricted to a consortium of five
universities in Virginia. As yet incomplete, the collection
currently consists of nearly 1,600 works with more than 64,000
poems and 233,000 pages. The raw text is relatively large (340
MB), but, indexed with PAT, searches usually yield results in
less than one second. The SGML used with the English Poetry
Database is a very rich set of tags designed in consultation with
a TEI representative. It is more than adequately expressive
about the poems, including structural markup for poems, poem
divisions such as stanzas, lineation, and attributes such as
whether rhyme is used.
|
|
|
|
+ Page 17 +
|
|
|
|
6.1.3 Oxford English Dictionary
|
|
|
|
The Oxford English Dictionary is the largest and arguably the
most complex resource made available through this service. The
570 MB document contains approximately 300,000 entries, many with
more than fifty subelements. Strictly speaking, it is not in
SGML form because it has not been validated against a DTD. The
electronic version was, however, designed to take advantage of
SGML's characteristics, and it significantly benefits from the
file's structural and descriptive markup.
|
|
|
|
6.2 Web to PAT Gateway
|
|
|
|
I have constructed a gateway between the Web and the more
sophisticated SGML texts using the Web's CGI (Common Gateway
Interface) and PAT, an SGML-aware text retrieval program. Text
is returned from PAT to the Web in the richer SGML, and it is
converted on the fly to HTML, primarily using HTML to control the
appearance of the text on the screen. This gateway is being
documented elsewhere (URL:
http://sansfoy.lib.virginia.edu/pub/www-to-pat/), but several
facets are relevant to this discussion.
|
|
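As an illustration only, the general shape of such a gateway can
be sketched in Python. This is not the gateway described here:
the search_texts function is a trivial stand-in for a retrieval
engine such as PAT, and the parameter name "q" and the function
names are assumptions made for the example. The sketch shows only
the round trip: a query string arrives from the Web, a search is
run, and the results are wrapped in HTML for display.

```python
from html import escape
from urllib.parse import parse_qs


def search_texts(term, corpus):
    """Hypothetical stand-in for the retrieval engine.

    Returns every line of the corpus that contains the term,
    ignoring case. A real gateway would call out to the indexed
    text system instead.
    """
    needle = term.lower()
    return [line for line in corpus if needle in line.lower()]


def handle_request(query_string, corpus):
    """Turn a CGI-style query string into an HTML results page."""
    params = parse_qs(query_string)
    # "q" is an assumed form field name, not one from the article.
    term = params.get("q", [""])[0].strip()
    if not term:
        return "<p>No search term supplied.</p>"
    hits = search_texts(term, corpus)
    items = "".join(f"<li>{escape(hit)}</li>" for hit in hits)
    return f"<h1>Results for {escape(term)}</h1><ul>{items}</ul>"
```

In a CGI setting, the query string would typically come from the
QUERY_STRING environment variable, and the returned page would be
written to standard output after a Content-type header.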
|
|
6.2.1 Expressive Representation of Text is Retained
|
|
|
|
The original unmodified texts are accessed through the gateway
without compromising the expressiveness of the original markup.
Although the sophisticated SGML markup is dynamically rendered as
HTML as the user retrieves results, the text remains in the
original rich SGML form behind the Web representation. Decisions
about the way that the fuller tag set maps to HTML are registered
in filters, and, as HTML becomes more expressive, a better match
between the original tags and the HTML can be made.
|
|
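The idea of registering tag-mapping decisions in a filter can be
sketched as follows. This is a minimal Python illustration, not
the gateway's actual filter: the element names (title, stanza,
l), the FILTER table, and the sgml_to_html function are all
assumptions made for the example, and real SGML (entities,
minimized tags) would need a proper parser rather than a regular
expression.

```python
import re

# Hypothetical filter table: source element name -> (HTML emitted
# for the start tag, HTML emitted for the end tag).
FILTER = {
    "title": ("<h2>", "</h2>"),
    "stanza": ("<p>", "</p>"),
    "l": ("", "<br>"),  # a verse line becomes text plus a line break
}

# Matches a start or end tag and captures the element name.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)[^>]*>")


def sgml_to_html(text, rules=FILTER):
    """Rewrite tags according to the filter table.

    Elements without a rule are dropped, but their text content
    is kept, so unmapped markup degrades to plain text.
    """
    def convert(match):
        closing, name = match.group(1), match.group(2).lower()
        open_html, close_html = rules.get(name, ("", ""))
        return close_html if closing else open_html

    return TAG_RE.sub(convert, text)
```

Because the mapping lives in a table rather than in the stored
text, improving the display as HTML becomes more expressive means
editing the table only; the source files are untouched.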
|
|
6.2.2 Simple Queries and Simple Access
|
|
|
|
Users need not be familiar with PAT's query language to search
texts and take advantage of the structural characteristics of the
more expressive markup. A word or phrase search returns
keywords-in-context (KWIC) views to the user, from which a view
of larger context is possible. Eventually, this process may lead
the user to retrieval of entire sections (e.g., chapters or
acts). All expanded views are made from hypertext links that
initiate structural retrievals such as "the chapter that includes
this search result."
|
|
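A keywords-in-context view can be illustrated with a short Python
sketch. It is a simplified assumption of how such a view might be
built, not PAT's implementation: the kwic function, its width
parameter, and the use of character offsets are choices made for
the example. Each hit is returned with its offset so that a later
step could use the offset to request a larger context.

```python
import re


def kwic(text, term, width=30):
    """Return (offset, snippet) pairs for each occurrence of term.

    The snippet is the match plus up to `width` characters on each
    side, with runs of whitespace collapsed to single spaces.
    """
    results = []
    pattern = re.escape(term)
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        snippet = " ".join(text[start:end].split())
        results.append((match.start(), snippet))
    return results
```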
|
|
+ Page 18 +
|
|
|
|
6.2.3 Menu-Driven Structural Queries
|
|
|
|
It is possible to facilitate complex queries through menus. For
example, in the OED, the word lookup function facilitated by the
Web includes queries such as: "give me entries that include my
word within the Lookup field of the Headword Group field," or
"give me entries that include my word in the Variant Form field."
The user is not aware of the complexity of the query taking
place, but can modify the type of query by selecting different
options on the search menus. Boolean queries that ask for the
intersection of document structures have been challenging to
users employing command-line and analytically oriented
interfaces. However, through simple fill-out forms and menu
selections, queries such as "(stanzas including [word/phrase])
INTERSECT (stanzas including [word/phrase])" are executed without
the user needing to understand the system's command syntax.
While we also offer access through several complex, analytical
interfaces (PatMotif and PowerSearch from Open Text as well as a
locally developed VT100 interface), most users can avoid these
more complicated interfaces.
|
|
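The following Python sketch suggests how form selections might be
turned into a structural Boolean query. It does not use PAT's
query syntax, which is not shown in this article; the form field
names (region, term1, term2, operator), the in-memory collection,
and both functions are hypothetical. The point is only that the
user supplies menu choices and words, and the program composes
the set operation.

```python
def regions_containing(regions, phrase):
    """Return the indexes of the regions whose text includes phrase."""
    needle = phrase.lower()
    return {i for i, text in enumerate(regions) if needle in text.lower()}


def run_form_query(form, collection):
    """Execute the query described by a fill-out form.

    `form` is a dict of assumed field names; `collection` maps a
    region type (e.g., "stanza") to a list of region texts.
    """
    regions = collection[form["region"]]
    first = regions_containing(regions, form["term1"])
    second = regions_containing(regions, form["term2"])
    operator = form.get("operator", "intersect")
    if operator == "intersect":
        matched = first & second
    elif operator == "union":
        matched = first | second
    else:
        raise ValueError(f"unsupported operator: {operator}")
    # Return matching regions in document order.
    return [regions[i] for i in sorted(matched)]
```

For example, a form with region "stanza", terms "rose" and "red",
and operator "intersect" would return only those stanzas that
contain both words.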
|
|
6.2.4 Access to Structure
|
|
|
|
Finally, the administrator of a collection need not resort to
fragmenting files to provide access to the component parts of a
collection. As mentioned earlier, an HTML approach to the OED
would require us to divide it into 300,000 files. Using this
strategy, I was recently able to represent the dozens of parts,
chapters, sections, and subsections of a voluminous SGML
technical document, creating hypertext links and making each
component accessible by utilizing the fairly rich markup;
however, the document remained a single file. Resource
management is made more reasonable through a system cognizant of
a file's structure.
|
|
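A structural retrieval of the kind "the chapter that includes
this search result" can be sketched in Python without splitting
the file. This is an assumption-laden illustration rather than a
description of PAT: the function names are invented, the element
name is supplied by the caller, and the regular expression
assumes that regions of the requested type are explicitly closed
and not nested inside one another.

```python
import re


def find_regions(text, tag):
    """Return (start, end) offsets for each <tag>...</tag> region."""
    name = re.escape(tag)
    pattern = re.compile(
        rf"<{name}\b[^>]*>.*?</{name}>", re.DOTALL | re.IGNORECASE
    )
    return [(m.start(), m.end()) for m in pattern.finditer(text)]


def enclosing_region(text, tag, offset):
    """Return the region of type `tag` that contains `offset`, if any."""
    for start, end in find_regions(text, tag):
        if start <= offset < end:
            return text[start:end]
    return None
```

A hypertext link attached to a search hit would need to carry
only the element name and the hit's offset; the component is cut
from the single file at retrieval time. An indexed system would
look the region boundaries up rather than rescanning the text as
this sketch does.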
|
|
+ Page 19 +
|
|
|
|
6.2.5 Future Approaches
|
|
|
|
This strategy has many possibilities. Journal literature coded
in SGML may be successfully accessed through this sort of
strategy. For example, a journal run marked up according to the
more elaborate Association of American Publishers DTD could
return articles to the user through PAT queries. Another
approach would facilitate browsing by recognizing the structural
relationship of author and abstract to article, article to issue,
issue to volume, and volume to collection. Throughout, the
collection would exist as a single file, searchable across all
articles by a single query. The collection would not need to be
compromised by converting the articles to HTML, but would instead
remain in the more expressive AAP SGML format, filtered for
display in the process of retrieving information. Through this
strategy, the Web can be an effective means of accessing the
original files in a fuller SGML, without resorting to fragmenting
the material into files corresponding to the individual articles
or even parts of articles. Similar strategies for books and
documentation are possible.
|
|
|
|
7.0 What Does the Web Offer Libraries?
|
|
|
|
The Web is a complex system with great potential and serious
limitations. We should use caution as we consider composing in
HTML: it is a short-term coding strategy. Documents composed in
HTML will have limited expressiveness, and, because HTML is not
yet stable, they are likely to need continuing enhancement to be
used in the Web. There is much to be excited about with the Web:
it is a viable system that suggests what electronic publishing on
the Internet can be. We have lacked credible, demonstrable
examples of standards-based, networked hypertext in the past, and
the Web has changed that. There is a great deal of untapped
potential in the Web. By exploiting the Web's ability to talk to
other more sophisticated programs, we can begin to take advantage
of that potential and make tomorrow's promise real today.
|
|
|
|
+ Page 20 +
|
|
|
|
     A subtext of this article has been the importance of
standards--both employing them in creating hypertexts and
extending the Web to take greater advantage of them. Standards
have been attractive to libraries because they help ensure
long-term viability. However, as Jefferson remarked in 1790,
standards are also an important key to information being
generally useful, regardless of context:
|
|
|
|
     Measures, weights and coins, thus referred to standards
     unchangeable in their nature . . . will themselves be
     unchangeable. These standards, too, are such as to be
     accessible to all persons, in all times and places. The
     measures and weights derived from them . . . are within the
     calculation of every one who possesses the first elements of
     arithmetic, and of easy comparison, both for foreigners and
     citizens, with the measures, weights, and coins of other
     countries. [4]
|
|
|
|
Notes
|
|
|
|
1. A version of this article was presented as a paper at the Yale
Hypertext Conference, May 1994. An HTML version of the original
speech, with active links to the resources discussed, is
available via the World-Wide Web; URL:
http://sansfoy.lib.virginia.edu/pub/yale.html.
|
|
|
|
2. Jerome McGann, The Complete Writings and Pictures of Dante
Gabriel Rossetti: A Hypermedia Research Archive (Charlottesville,
VA: Institute for Advanced Technology in the Humanities,
University of Virginia, 1994). (Electronic document available
via the World-Wide Web; URL:
http://jefferson.village.virginia.edu/rossetti/rossetti.html.)
|
|
|
|
3. Edward Ayers, The Valley of the Shadow: Living the Civil War
in Pennsylvania and Virginia (Charlottesville, VA: Institute for
Advanced Technology in the Humanities, University of Virginia,
1994). (Electronic document available via the World-Wide Web;
URL: http://jefferson.village.virginia.edu/vshadow/vshadow.html.)
|
|
|
|
4. Thomas Jefferson, "Public Papers," in Writings (New York:
Literary Classics of the U.S., 1984), 410.
|
|
|
|
|
|
About the Author
|
|
|
|
John Price-Wilkin, Systems Librarian for Information Services,
Alderman Library, University of Virginia, Charlottesville, VA
22903. Internet: jpw@virginia.edu.
|
|
|
|
+ Page 21 +
|
|
|
|
-----------------------------------------------------------------
|
|
The Public-Access Computer Systems Review is an electronic
journal that is distributed on the Internet and on other computer
networks. There is no subscription fee.
     To subscribe, send an e-mail message to
listserv@uhupvm1.uh.edu that says: SUBSCRIBE PACS-P First Name
Last Name.
     This article is Copyright (C) 1994 by John Price-Wilkin.
All Rights Reserved.
     The Public-Access Computer Systems Review is Copyright (C)
1994 by the University Libraries, University of Houston. All
Rights Reserved.
     Copying is permitted for noncommercial use by academic
computer centers, computer conferences, individual scholars, and
libraries. Libraries are authorized to add the journal to their
collection, in electronic or printed form, at no charge. This
message must appear on all copied material. All commercial use
requires permission.
|
|
-----------------------------------------------------------------
|