"FTXT" IFF Formatted Text

Date:    November 15, 1985
From:    Steve Shaw and Jerry Morrison, Electronic Arts and
         Bob "Kodiak" Burns, Commodore-Amiga
Status:  Draft 2.6

DRAFT DRAFT DRAFT DRAFT DRAFT DRAFT

1. Introduction

This memo is the IFF supplement for FORM FTXT. An FTXT is an IFF "data
section" or "FORM type" (which can be an IFF file or a part of one)
containing a stream of text plus optional formatting information.
"EA IFF 85" is Electronic Arts' standard for interchange format files.
(See the IFF reference.)

An FTXT is an archival and interchange representation designed for
three uses. The simplest use is for a "console device" or "glass teletype"
(the minimal 2-D text layout means): a stream of "graphic" ("printable")
characters plus the positioning characters "space" ("SP") and line
terminator ("LF"). This is not intended for cursor movements on a screen,
although it does not conflict with standard cursor-moving characters. The
second use is text that has explicit formatting information (or "looks")
such as font family and size, typeface, etc. The third use is as the
lowest layer of a structured document that also has "inherited" styles
to implicitly control character looks. For that use, FORMs FTXT would
be embedded within a future document FORM type. The beauty of FTXT
is that these three uses are interchangeable; that is, a program written
for one purpose can read and write the others' files. So a word processor
does not have to write a separate plain text file to communicate with
other programs.

Text is stored in one or more "CHRS" chunks inside an FTXT. Each CHRS
contains a stream of 8-bit text compatible with ISO and ANSI data
interchange standards. FTXT uses just the central character set from
the ISO/ANSI standards. (These two standards are henceforth called
"ISO/ANSI" as in "see the ISO/ANSI reference".)

Since it's possible to extract just the text portions from future
document FORM types, programs can exchange data without having to
save both plain text and formatted text representations.

Character looks are stored as embedded control sequences within CHRS
chunks. This document specifies which class of control sequences to
use: the CSI group. This document does not yet specify their meanings,
e.g. which one means "turn on italic face". Consult ISO/ANSI.

Section 2 defines the chunk types character stream "CHRS" and font
specifier "FONS". These are the "standard" chunks. Specialized chunks
for private or future needs can be added later. Section 3 outlines
an FTXT reader program that strips a document down to plain unformatted
text. Appendix A is a code table for the 8-bit ISO/ANSI character
set used here. Appendix B is an example FTXT shown as a box diagram.
Appendix C is a racetrack diagram of the syntax of ISO/ANSI control
sequences.

Reference:

Amiga[tm] is a trademark of Commodore-Amiga, Inc.

Electronic Arts[tm] is a trademark of Electronic Arts.

IFF: "EA IFF 85" Standard for Interchange Format Files describes the
underlying conventions for all IFF files.

ISO/ANSI: ISO/DIS 6429.2 and ANSI X3.64-1979. International Organization
for Standardization (ISO) and American National Standards Institute
(ANSI) data-interchange standards. The relevant parts of these two
standards documents are identical. ISO standard 2022 is also relevant.

2. Standard Data and Property Chunks

The main contents of a FORM FTXT are in its character stream "CHRS"
chunks. Formatting property chunks may also appear. The only formatting
property yet defined is "FONS", a font specifier. A FORM FTXT with
no CHRS represents an empty text stream. A FORM FTXT may contain nested
IFF FORMs, LISTs, or CATs, although a "stripping" reader (see section
3) will ignore them.

Character Set

FORM FTXT uses the core of the 8-bit character set defined by the
ISO/ANSI standards cited at the start of this document. (See Appendix
A for a character code table.) This character set is divided into
two "graphic" groups plus two "control" groups. Eight of the control
characters begin ISO/ANSI standard control sequences. (See "Control
Sequences", below.) Most control sequences and control characters
are reserved for future use and for compatibility with ISO/ANSI. Current
reader programs should skip them.

*   C0 is the group of control characters in the range NUL (hex
    00) through hex 1F. Of these, only LF (hex 0A) and ESC (hex 1B) are
    significant. ESC begins a control sequence. LF is the line terminator,
    meaning "go to the first horizontal position of the next line". All
    other C0 characters are not used. In particular, CR (hex 0D) is not
    recognized as a line terminator.

*   G0 is the group of graphic characters in the range hex 20 through
    hex 7F. SP (hex 20) is the space character. DEL (hex 7F) is the delete
    character, which is not used. The rest are the standard ASCII printable
    characters "!" (hex 21) through "~" (hex 7E).

*   C1 is the group of extended control characters in the range
    hex 80 through hex 9F. Some of these begin control sequences. The
    control sequence starting with CSI (hex 9B) is used for FTXT formatting.
    All other control sequences and C1 control characters are unused.

*   G1 is the group of extended graphic characters in the range
    NBSP (hex A0) through "ÿ" (hex FF). It is one of the alternate graphic
    groups proposed for ISO/ANSI standardization.
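The four groups above map directly onto ranges of byte values. The following is a minimal sketch in C of how a reader might classify a character; the names FtxtGroup, ftxt_group, ftxt_is_printable, and ftxt_starts_sequence are illustrative assumptions and are not part of the FTXT standard.

```c
#include <stdbool.h>

/* Character groups as described in the text. The names are hypothetical. */
typedef enum { FTXT_C0, FTXT_G0, FTXT_C1, FTXT_G1 } FtxtGroup;

/* Classify one 8-bit character code into its group. */
static FtxtGroup ftxt_group(unsigned char c)
{
    if (c <= 0x1F) return FTXT_C0;   /* NUL through hex 1F */
    if (c <= 0x7F) return FTXT_G0;   /* SP through DEL */
    if (c <= 0x9F) return FTXT_C1;   /* hex 80 through hex 9F */
    return FTXT_G1;                  /* NBSP through hex FF */
}

/* True for characters a plain-text reader would output directly:
 * SP through "~" and all of G1. DEL is in G0 but is not used. */
static bool ftxt_is_printable(unsigned char c)
{
    return (c >= 0x20 && c <= 0x7E) || c >= 0xA0;
}

/* True for the eight control characters that begin a control sequence. */
static bool ftxt_starts_sequence(unsigned char c)
{
    switch (c) {
    case 0x1B:                          /* ESC */
    case 0x8E: case 0x8F:               /* SS2, SS3 */
    case 0x90:                          /* DCS */
    case 0x9B:                          /* CSI */
    case 0x9D: case 0x9E: case 0x9F:    /* OSC, PM, APC */
        return true;
    default:
        return false;
    }
}
```

A reader would typically test ftxt_starts_sequence first, then ftxt_is_printable, and discard anything else except LF.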

Control Sequences

Eight of the control characters begin ISO/ANSI standard "control sequences"
(or "escape sequences"). These sequences are described below and diagrammed
in Appendix C.

    G0        ::= (SP through DEL)
    G1        ::= (NBSP through "ÿ")

    ESC-Seq   ::= ESC (SP through "/")* ("0" through "~")
    ShiftToG2 ::= SS2 G0
    ShiftToG3 ::= SS3 G0
    CSI-Seq   ::= CSI (SP through "?")* ("@" through "~")
    DCS-Seq   ::= (DCS | OSC | PM | APC) (SP through "~" | G1)* ST

"ESC-Seq" is the control sequence ESC (hex 1B), followed by zero or
more characters in the range SP through "/" (hex 20 through hex 2F),
followed by a character in the range "0" through "~" (hex 30 through
hex 7E). These sequences are reserved for future use and should be
skipped by current FTXT reader programs.

SS2 (hex 8E) and SS3 (hex 8F) shift the single following G0 character
into yet-to-be-defined graphic sets G2 and G3, respectively. These
sequences should not be used until the character sets G2 and G3 are
standardized. A reader may simply skip the SS2 or SS3 (taking the
following character as a corresponding G0 character) or replace the
two-character sequence with a character like "?" to mean "absent".

FTXT uses "CSI-Seq" control sequences to store character formatting
(font selection by number, type face, and text size) and perhaps layout
information (position and rotation). "CSI-Seq" control sequences start
with CSI (the "control sequence introducer", hex 9B). Syntactically,
the sequence includes zero or more characters in the range SP through
"?" (hex 20 through hex 3F) and a concluding character in the range
"@" through "~" (hex 40 through hex 7E). These sequences may be skipped
by a minimal FTXT reader, i.e. one that ignores formatting information.

Note: A future FTXT standardization document will explain the uses
of CSI-Seq sequences for setting character face (light weight vs.
medium vs. bold, italic vs. upright, height, pitch, position, and
rotation). For now, consult the ISO/ANSI references.

"DCS-Seq" is a control sequence starting with DCS (hex 90), OSC
(hex 9D), PM (hex 9E), or APC (hex 9F), followed by zero or more
characters, each of which is in the range SP through "~" (hex 20 through
hex 7E) or else a G1 character, and terminated by an ST (hex 9C). These
sequences are reserved for future use and should be skipped by current
FTXT reader programs.
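The grammar above can be checked mechanically. As a rough sketch (not a definitive implementation), the C function below measures one control sequence at the start of a buffer; the name ftxt_sequence_length and the choice to return 0 for incomplete or malformed input are assumptions made for this example.

```c
#include <stddef.h>

/* Return the length in bytes of the control sequence starting at buf[0],
 * or 0 if buf[0] does not begin one, or if the sequence is incomplete or
 * malformed within the n available bytes. Hypothetical helper. */
static size_t ftxt_sequence_length(const unsigned char *buf, size_t n)
{
    unsigned char lo, hi;      /* range of the repeated middle characters */
    unsigned char tlo, thi;    /* range of the terminating character */
    int allow_g1 = 0;          /* DCS-Seq also accepts G1 in the middle */
    size_t i = 1;

    if (n == 0) return 0;
    switch (buf[0]) {
    case 0x1B:                                     /* ESC-Seq */
        lo = 0x20; hi = 0x2F; tlo = 0x30; thi = 0x7E; break;
    case 0x9B:                                     /* CSI-Seq */
        lo = 0x20; hi = 0x3F; tlo = 0x40; thi = 0x7E; break;
    case 0x90: case 0x9D: case 0x9E: case 0x9F:    /* DCS, OSC, PM, APC */
        lo = 0x20; hi = 0x7E; tlo = 0x9C; thi = 0x9C; allow_g1 = 1; break;
    case 0x8E: case 0x8F:                          /* SS2, SS3 */
        /* Exactly one following G0 character (SP through DEL). */
        return (n >= 2 && buf[1] >= 0x20 && buf[1] <= 0x7F) ? 2 : 0;
    default:
        return 0;
    }
    /* Consume the repeated part, then require one terminator. */
    while (i < n && ((buf[i] >= lo && buf[i] <= hi) ||
                     (allow_g1 && buf[i] >= 0xA0)))
        i++;
    if (i < n && buf[i] >= tlo && buf[i] <= thi)
        return i + 1;
    return 0;
}
```

A writer could use such a check to confirm that every sequence it emits is well formed and lies entirely within one CHRS chunk.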

Data Chunk CHRS

A CHRS chunk contains a sequence of 8-bit characters abiding by the
ISO/ANSI standards cited at the start of this document. This includes
the character set and control sequences as described above and summarized
in Appendices A and C.

A FORM FTXT may contain any number of CHRS chunks. Taken together,
they represent a single stream of textual information. That is, the
contents of CHRS chunks are effectively concatenated except that (1)
each control sequence must be completely within a single CHRS chunk,
and (2) any formatting property chunks appearing between two CHRS
chunks affect the formatting of the latter chunk's text. Any formatting
settings set by control sequences inside a CHRS carry over to the
next CHRS in the same FORM FTXT. All formatting properties stop at
the end of the FORM since IFF specifies that adjacent FORMs are independent
of each other (although not independent of any properties inherited
from an enclosing LIST or FORM).

Property Chunk FONS

The optional property "FONS" holds a FontSpecifier as defined in the
C declaration below. It assigns a font to a numbered "font register"
so it can be referenced by number within subsequent CHRS chunks. (This
function is not provided within the ISO and ANSI standards.) The font
specifier gives both a name and a description for the font so the
recipient program can do font substitution.

By default, CHRS text uses font 1 until it selects another font. A
minimal text reader always uses font 1. If font 1 hasn't been specified,
the reader may use the local system font as font 1.

    typedef struct {
        UBYTE id;
            /* 0 through 9 is a font id number referenced by an
             * SGR control sequence selective parameter of 10
             * through 19. Other values are reserved for future
             * standardization.
             */
        UBYTE pad1;   /* reserved for future use; store 0 here */
        UBYTE proportional;
            /* proportional font? 0 = unknown, 1 = no, 2 = yes */
        UBYTE serif;
            /* serif font? 0 = unknown, 1 = no, 2 = yes */
        char name[];
            /* A NUL-terminated string naming preferred font. */
        } FontSpecifier;

Fields are filed in the order shown. The UBYTE fields are byte-packed
(2 per 16-bit word). The field pad1 is reserved for future standardization.
Programs should store 0 there for now.

The field proportional indicates if the desired font is proportional
width as opposed to fixed width. The field serif indicates if the
desired font is serif as opposed to sans serif. [Issue: Discuss font
substitution!]
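Because the FontSpecifier fields are byte-packed and the name is a variable-length string, a reader can decode a FONS chunk body one byte at a time. The following C sketch assumes the chunk body is already in memory; ParsedFons, parse_fons, fons_sgr_parameter, and the 64-byte name limit are hypothetical choices for this example, not part of the standard.

```c
#include <stddef.h>
#include <string.h>

#define FONS_NAME_MAX 64   /* arbitrary limit chosen for this sketch */

typedef struct {
    unsigned char id;            /* font register number, 0 through 9 */
    unsigned char proportional;  /* 0 = unknown, 1 = no, 2 = yes */
    unsigned char serif;         /* 0 = unknown, 1 = no, 2 = yes */
    char name[FONS_NAME_MAX];    /* NUL-terminated copy of the font name */
} ParsedFons;

/* Decode a FONS chunk body of `size` bytes. Returns 0 on success, or -1
 * if the body is too short, the name is unterminated, or it is too long
 * for this sketch's fixed buffer. */
static int parse_fons(const unsigned char *body, size_t size, ParsedFons *out)
{
    const unsigned char *name, *nul;
    size_t name_len;

    if (size < 5) return -1;     /* four UBYTE fields plus at least a NUL */
    name = body + 4;
    nul = memchr(name, 0, size - 4);
    if (nul == NULL) return -1;
    name_len = (size_t)(nul - name);
    if (name_len >= FONS_NAME_MAX) return -1;

    out->id = body[0];           /* body[1] is pad1 and is ignored */
    out->proportional = body[2];
    out->serif = body[3];
    memcpy(out->name, name, name_len + 1);   /* includes the NUL */
    return 0;
}

/* SGR selective parameter that refers to font register `id` (10 through
 * 19 for ids 0 through 9), or -1 for a reserved id. */
static int fons_sgr_parameter(unsigned char id)
{
    return id <= 9 ? 10 + id : -1;
}
```

For a body holding the bytes 1, 0, 2, 2 followed by "Roman" and a NUL, parse_fons fills in id 1, proportional 2 (yes), serif 2 (yes), and the name "Roman".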

Future Properties

New optional property chunks may be defined in the future to store
additional formatting information. They will be used to represent
formatting not encoded in standard ISO/ANSI control sequences and
for "inherited" formatting in structured documents. Text orientation
might be one example.

Positioning Units

Unless otherwise specified, position and size units used in FTXT formatting
properties and control sequences are in decipoints (720 decipoints/inch).
This is ANSI/ISO Positioning Unit Mode (PUM) 2. While a metric standard
might be nice, decipoints allow the existing U.S.A. typographic units
to be encoded easily, e.g. "12 points" is "120 decipoints".
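The unit arithmetic is simple enough to show directly. This small C sketch uses only the two ratios stated above (720 decipoints per inch, 10 decipoints per point); the function and macro names are made up for illustration.

```c
/* Positioning units: 720 decipoints per inch, 10 decipoints per point. */
#define DECIPOINTS_PER_INCH  720L
#define DECIPOINTS_PER_POINT 10L

/* Convert whole points to decipoints. */
static long points_to_decipoints(long points)
{
    return points * DECIPOINTS_PER_POINT;
}

/* Convert whole inches to decipoints. */
static long inches_to_decipoints(long inches)
{
    return inches * DECIPOINTS_PER_INCH;
}

/* Convert decipoints to points, rounded to the nearest point.
 * Assumes a non-negative input. */
static long decipoints_to_points(long decipoints)
{
    return (decipoints + DECIPOINTS_PER_POINT / 2) / DECIPOINTS_PER_POINT;
}
```

With these helpers, points_to_decipoints(12) gives 120, matching the example in the text.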

3. FTXT Stripper

An FTXT reader program can read the text and ignore all formatting
and structural information in a document FORM that uses FORMs FTXT
for the leaf nodes. This amounts to stripping a document down to a
stream of plain text. It would do this by skipping over all chunks
except FTXT.CHRS (CHRS chunks found inside a FORM FTXT) and within
the FTXT.CHRS chunks skipping all control characters and control sequences.
(Appendix C diagrams this text scanner.) It may also read FTXT.FONS
chunks to find a description for font 1.

Here's a Pascal-ish program for an FTXT stripper. Given a FORM (a
document of some kind), it scans for all FTXT.CHRS chunks. This would
likely be applied to the first FORM in an IFF file.

    PROCEDURE ReadFORM4CHRS();  {Read an IFF FORM for FTXT.CHRS chunks.}
    BEGIN
      IF the FORM's subtype = "FTXT"
        THEN ReadFTXT4CHRS()
        ELSE WHILE something left to read in the FORM DO BEGIN
          read the next chunk header;
          CASE the chunk's ID OF
            "LIST", "CAT ": ReadCAT4CHRS();
            "FORM": ReadFORM4CHRS();
            OTHERWISE skip the chunk's body;
            END
          END
    END;

    {Read a LIST or CAT for all FTXT.CHRS chunks.}
    PROCEDURE ReadCAT4CHRS();
    BEGIN
      WHILE something left to read in the LIST or CAT DO BEGIN
        read the next chunk header;
        CASE the chunk's ID OF
          "LIST", "CAT ": ReadCAT4CHRS();
          "FORM": ReadFORM4CHRS();
          "PROP": IF we're reading a LIST AND the PROP's subtype = "FTXT"
                    THEN read the PROP for "FONS" chunks;
          OTHERWISE error--malformed IFF file;
          END
        END
    END;

    PROCEDURE ReadFTXT4CHRS();  {Read a FORM FTXT for CHRS chunks.}
    BEGIN
      WHILE something left to read in the FORM FTXT DO BEGIN
        read the next chunk header;
        CASE the chunk's ID OF
          "CHRS": ReadCHRS();
          "FONS": BEGIN
            read the chunk's contents into a FontSpecifier variable;
            IF the font specifier's id = 1 THEN use this font;
            END;
          OTHERWISE skip the chunk's body;
          END
        END
    END;

    {Read an FTXT.CHRS. Skip all control sequences and unused control
     chars.}
    PROCEDURE ReadCHRS();
    BEGIN
      WHILE something left to read in the CHRS chunk DO
        CASE read the next character OF
          LF: start a new output line;
          ESC: SkipControl([' '..'/'], ['0'..'~']);
          IN [' '..'~'], IN [NBSP..'ÿ']: output the character;
          SS2, SS3: ;  {Just handle the following G0 character
                        directly, ignoring the shift to G2 or G3.}
          CSI: SkipControl([' '..'?'], ['@'..'~']);
          DCS, OSC, PM, APC: SkipControl([' '..'~'] + [NBSP..'ÿ'], [ST]);
          END
    END;

    {Skip a control sequence of the format (rSet)* (tSet), i.e. any number
     of characters in the set rSet followed by a character in the set tSet.}
    PROCEDURE SkipControl(rSet, tSet);
    VAR c: CHAR;
    BEGIN
      REPEAT c := read the next character
        UNTIL c NOT IN rSet;
      IF c NOT IN tSet
        THEN put character c back into the input stream;
    END;
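For readers who prefer compilable code, here is one possible C rendering of ReadCHRS and SkipControl that works on a CHRS chunk body already held in memory. It is a sketch under that assumption, and the names strip_chrs and skip_control are invented for this example. The output buffer must be at least as large as the input.

```c
#include <stddef.h>

/* Skip (rlo..rhi, plus G1 if g1 is set)* followed by one terminator in
 * tlo..thi, starting at in[i]. Like SkipControl, a character that ends
 * the run but is not a terminator is left in the input. Returns the
 * index of the next unread character. */
static size_t skip_control(const unsigned char *in, size_t n, size_t i,
                           unsigned char rlo, unsigned char rhi, int g1,
                           unsigned char tlo, unsigned char thi)
{
    while (i < n && ((in[i] >= rlo && in[i] <= rhi) ||
                     (g1 && in[i] >= 0xA0)))
        i++;
    if (i < n && in[i] >= tlo && in[i] <= thi)
        i++;
    return i;
}

/* Copy the plain text of a CHRS chunk body into out (capacity >= n).
 * Returns the number of bytes written. */
static size_t strip_chrs(const unsigned char *in, size_t n,
                         unsigned char *out)
{
    size_t i = 0, o = 0;

    while (i < n) {
        unsigned char c = in[i++];
        if (c == 0x0A)                                   /* LF */
            out[o++] = '\n';
        else if ((c >= 0x20 && c <= 0x7E) || c >= 0xA0)  /* printable */
            out[o++] = c;
        else if (c == 0x1B)                              /* ESC-Seq */
            i = skip_control(in, n, i, 0x20, 0x2F, 0, 0x30, 0x7E);
        else if (c == 0x9B)                              /* CSI-Seq */
            i = skip_control(in, n, i, 0x20, 0x3F, 0, 0x40, 0x7E);
        else if (c == 0x90 || c == 0x9D || c == 0x9E || c == 0x9F)
            i = skip_control(in, n, i, 0x20, 0x7E, 1, 0x9C, 0x9C);
        /* SS2, SS3, and all other control characters are dropped; the
         * character after SS2 or SS3 is then handled as ordinary G0. */
    }
    return o;
}
```

Here a line break is written as '\n'; a real program would start a new output line in whatever way its host system expects.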

The following program is an optimized version of the above routines
ReadFORM4CHRS and ReadCAT4CHRS for the case where you're ignoring
fonts as well as formatting. It takes advantage of certain facts of
the IFF format to read a document FORM and its nested FORMs, LISTs,
and CATs without a stack. In other words, it's a hack that ignores
all fonts and faces to cheaply get to the plain text of the document.

    {Cheap scan of an IFF FORM for FTXT.CHRS chunks.}
    PROCEDURE ScanFORM4CHRS();
    BEGIN
      IF the document FORM's subtype = "FTXT"
        THEN ReadFTXT4CHRS()
        ELSE WHILE something left to read in the FORM DO BEGIN
          read the next chunk header;
          IF it's a group chunk (LIST, FORM, PROP, or CAT)
            THEN read its subtype ID;
          CASE the chunk's ID OF
            "LIST", "CAT ": ;  {NOTE: See explanation below.*}
            "FORM": IF this FORM's subtype = "FTXT" THEN
                      ReadFTXT4CHRS()
                    ELSE;  {NOTE: See explanation below.*}
            OTHERWISE skip the chunk's body;
            END
          END
    END;

*Note: This implementation is subtle. After reading a group header
other than FORM FTXT it just continues reading. This amounts to reading
all the chunks inside that group as if they weren't nested in a group.
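The stackless scan can also be sketched in C. The version below assumes the whole IFF file is in memory and relies on the general IFF chunk layout (a 4-byte ID, a 4-byte big-endian size, the body, and a pad byte after an odd-sized body). The function name collect_chrs and the decision to copy raw CHRS bodies into one output buffer are assumptions for illustration; control sequences are not stripped here.

```c
#include <stddef.h>
#include <string.h>

/* Read a 4-byte big-endian chunk size. */
static unsigned long be32(const unsigned char *p)
{
    return ((unsigned long)p[0] << 24) | ((unsigned long)p[1] << 16) |
           ((unsigned long)p[2] << 8) | (unsigned long)p[3];
}

/* Compare a 4-character chunk or type ID. */
static int id_is(const unsigned char *p, const char *id)
{
    return memcmp(p, id, 4) == 0;
}

/* Concatenate the bodies of all FTXT.CHRS chunks in buf[0..n) into out.
 * Returns the number of bytes written, or -1 on truncated input or if
 * out (capacity cap) is too small. */
static long collect_chrs(const unsigned char *buf, size_t n,
                         unsigned char *out, size_t cap)
{
    size_t pos = 0, written = 0;
    size_t ftxt_end = 0;    /* end of the FORM FTXT being read, or 0 */

    while (pos <= n && n - pos >= 8) {
        const unsigned char *id = buf + pos;
        unsigned long size = be32(buf + pos + 4);
        size_t body = pos + 8;

        if (size > n - body) return -1;         /* truncated chunk */
        if (pos >= ftxt_end) ftxt_end = 0;      /* left the FORM FTXT */

        if (ftxt_end != 0) {
            /* Inside a FORM FTXT: keep CHRS, skip everything else. */
            if (id_is(id, "CHRS")) {
                if (size > cap - written) return -1;
                memcpy(out + written, buf + body, size);
                written += size;
            }
        } else if (id_is(id, "FORM") || id_is(id, "LIST") ||
                   id_is(id, "CAT ")) {
            if (size < 4) return -1;            /* no room for the subtype */
            if (id_is(id, "FORM") && id_is(buf + body, "FTXT"))
                ftxt_end = body + size;
            pos = body + 4;     /* keep reading inside the group */
            continue;
        }
        pos = body + size + (size & 1);         /* skip body and pad byte */
    }
    return (long)written;
}
```

As in the pseudocode, a group other than FORM FTXT is entered simply by reading past its header, so no stack is needed. PROP chunks and any chunk outside a FORM FTXT are skipped whole.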

Appendix A: Character Code Table

This table corresponds to the ISO/DIS 6429.2 and ANSI X3.64-1979 8-bit
character set standards. Only the core character set of those standards
is used in FTXT.

Two G1 characters aren't defined in the standards and are shown as
dark gray entries in this table. Light gray shading denotes control
characters. (DEL is a control character although it belongs to the
graphic group G0.) The following five rare G1 characters are left
blank in the table below due to limitations of available fonts: hex
A8, D0, DE, F0, and FE.

ISO/DIS 6429.2 and ANSI X3.64-1979 Character Code Table

(figure named "TextTable", viewable by ShowILBM or SeeILBM)

[_____] [_______________________] [_____] [____________________________]
Control       Graphic Group       Control         Graphic Group
 Group             G0              Group               G1
  C0                                C1

"NBSP" is a "non-breaking space"
"SHY" is a "soft-hyphen"

Appendix B. FTXT Example

Here's a box diagram for a simple example: "The quick brown fox jumped.Four
score and seven", written in a proportional serif font named "Roman".

    +-----------------------------------+
    |'FORM'         24070               |   FORM 24070 ILBM
    +-----------------------------------+
    |'ILBM'                             |
    +-----------------------------------+
    | +-------------------------------+ |
    | | 'BMHD'      20                | |   .BMHD 20
    | | 320, 200, 0, 0, 3, 0, 0, ...  | |
    | +-------------------------------+ |
    | | 'CMAP'      21                | |   .CMAP 21
    | | 0, 0, 0; 32, 0, 0; 64,0,0; .. | |
    | +-------------------------------+ |
    | +-------------------------------+ |
    | |'BODY'       24000             | |   .BODY 24000
    | |0, 0, 0, ...                   | |
    | +-------------------------------+ |
    +-----------------------------------+

The "0" after the CMAP chunk is a pad byte.

Appendix B. Standards Committee

The following people contributed to the design of this IFF standard:

    Bob "Kodiak" Burns, Commodore-Amiga
    R. J. Mical, Commodore-Amiga
    Jerry Morrison, Electronic Arts
    Greg Riker, Electronic Arts
    Steve Shaw, Electronic Arts
    Barry Walsh, Commodore-Amiga

Appendix C. ISO/ANSI Control Sequences

This is a racetrack diagram of the ISO/ANSI characters and control
sequences as used in FTXT CHRS chunks.

                     line terminator
-----+-------------------> LF --------------------------------------->
     |               ESC-Seq
     +-------------------> ESC ---+>----------------+--> 0 thru ~ --->
     |                            |                 |
     |                            +-- SP thru / <---+
     |               printable
     +---------------+---> SP thru ~ --+->--------------------------->
     |               |                 |
     |               +---> G1 -------->+
     |               shift to G2
     +-------------------> SS2 ----> G0 ---> (produces a G2 character)
     |               shift to G3
     +-------------------> SS3 ----> G0 ---> (produces a G3 character)
     |               CSI-Seq
     +-------------------> CSI ---+>----------------+--> @ thru ~ --->
     |                            |                 |
     |                            +-- SP thru ? <---+
     |               DCS-Seq
     +----------> DCS,OSC,PM,or APC --+>-------------+--+-> ST -+---->
     |                                |              |  |       |
     |                                +- SP thru ~ <-+  +-> G1 -+
     |               discard
     +----------> any other character ------------------------------->

Of the various control sequences, only CSI-Seq is used for FTXT character
formatting information. The others are reserved for future use and
for compatibility with ISO/ANSI standards. Certain character sequences
are syntactically malformed, e.g. CSI followed by a C0, C1, or G1
character. Writer programs should not generate reserved or malformed
sequences and reader programs should skip them.

Consult the ISO/ANSI standards for the meaning of the CSI-Seq control
sequences.

The two character set shifts SS2 and SS3 may be used when the graphic
character groups G2 and G3 become standardized.