333 lines
17 KiB
Perl
333 lines
17 KiB
Perl
|
[To be published in Computer Music Journal 18:4 (Winter 94)]
|
||
|
|
||
|
Examples of ZIPI Applications
|
||
|
|
||
|
Matthew Wright
|
||
|
Center for New Music and Audio Technologies (CNMAT)
|
||
|
Department of Music, University of California, Berkeley
|
||
|
1750 Arch St.
|
||
|
Berkeley, California 94720 USA
|
||
|
Matt@CNMAT.Berkeley.edu
|
||
|
|
||
|
In this article, we will list some sample applications of ZIPI: alternate
|
||
|
controllers, transfers of large amounts of data, and some common
|
||
|
applications from the MIDI world. We show how they would be implemented in
|
||
|
ZIPI, and, where appropriate, evaluate ZIPI's performance in each case.
|
||
|
Basic knowledge of ZIPI is assumed.
|
||
|
|
||
|
|
||
|
A ZIPI Guitar
|
||
|
=============
|
||
|
|
||
|
Zeta Music will soon release a combination ZIPI controller and synthesizer.
|
||
|
The controller will be a general-purpose sound-to-ZIPI converter, taking
|
||
|
arbitrary sound sources as its input and producing ZIPI control data. The
|
||
|
synthesizer portion is the opposite, taking incoming ZIPI control data and
|
||
|
synthesizing sounds.
|
||
|
|
||
|
The controller will work with any instrument, but for now assume it is
|
||
|
connected to an guitar with a hexaphonic pickup. ("Hexaphonic" means one
|
||
|
output channel for each of the six guitar strings, instead of the usual case
|
||
|
in which the sound of all six guitar strings comes out of one output.) The
|
||
|
controller will track pitch, loudness, even/odd balance, pitched/unpitched
|
||
|
balance, and brightness information for each of the six strings of a guitar
|
||
|
in real time, updating each parameter every 8-10 msec. It will also produce
|
||
|
articulation information, noting when trigger and release events occur on
|
||
|
each string.
|
||
|
|
||
|
The built-in synthesizer will use sample playback and will be able to vary
|
||
|
the pitch, loudness, even/odd balance, pitched/unpitched balance, and
|
||
|
brightness of a sound. Thus, the sounds produced by the synthesis will be
|
||
|
able to follow all these nuances of the acoustic sound, not just its pitch
|
||
|
and volume.
|
||
|
|
||
|
The idea is to give an instrumentalist---a guitar player---more expressive
|
||
|
control over the sounds produced by a synthesizer. When the guitarist picks
|
||
|
closer to the bridge, a synthesizer producing an organ sound would make it
|
||
|
brighter. When the guitarist plays a 12th-fret harmonic, a trumpet sound
|
||
|
would go up an octave and change to a quieter, more delicate timbre. When
|
||
|
the guitarist partially mutes the strings with his or her palm, a saxophone
|
||
|
sound would have a more breathy, noisy character. When the guitarist
|
||
|
completely mutes the strings with his or her left hand and scratches
|
||
|
rhythmically, a piano patch would produce just the sound of hammers hitting
|
||
|
strings, with no pitched content.
|
||
|
|
||
|
This instrument will use ZIPI to control an external synthesizer with the
|
||
|
same high-bandwidth continuous control information that it uses internally.
|
||
|
As the default configuration, the controller will send the information from
|
||
|
each guitar string to a separate MPDL note address. It would be useful for
|
||
|
these six MPDL notes to be in the same instrument, to make it possible to
|
||
|
control all of the guitar's synthesized sound with a single message, e.g.,
|
||
|
pan or amplitude.
|
||
|
|
||
|
How much bandwidth does this require? Every 10 msec the controller will send
|
||
|
a ZIPI packet that looks like this:
|
||
|
|
||
|
Address of note 1;
|
||
|
pitch, loudness, brightness, even/odd balance, noise balance;
|
||
|
new address: note 2;
|
||
|
pitch, loudness, brightness, even/odd balance, noise balance;
|
||
|
and so on for the other four notes.
|
||
|
|
||
|
Pitch and loudness are 3-byte messages (including the note descriptor ID);
|
||
|
the other three are 2-byte messages. So we have 12 bytes (3 + 3 + 2 + 2 + 2)
|
||
|
of data per string, and 72 data bytes (6 * 12) altogether. The note address
|
||
|
at the beginning of the frame is three bytes; the other five note addresses
|
||
|
must be specified with 5-byte "new address" messages, for a total of 28
|
||
|
bytes (3 + 5*5) of addresses. Including the seven overhead bytes for each
|
||
|
ZIPI packet, that's a total of 107 bytes per update.
|
||
|
|
||
|
107 eight-bit bytes every 10 msec requires a bandwidth of 85.6 kBaud---just
|
||
|
over 1/3 of ZIPI's bandwidth at the slowest speed, and over 2.5 times the
|
||
|
bandwidth of MIDI. This does not include articulation messages for
|
||
|
triggering and releasing notes, but these would be much more than 10 msec
|
||
|
apart on average, and would not take up any appreciable amount of bandwidth.
|
||
|
|
||
|
What happens to the bandwidth if we add in many other ZIPI synthesizers to
|
||
|
layer the sound? Nothing. ZIPI synthesis modules will typically only listen
|
||
|
to the network and not send their own messages. So there could be one or ten
|
||
|
synthesizers connected to this controller, with the same network performance
|
||
|
in either case.
|
||
|
|
||
|
|
||
|
A ZIPI Keyboard
|
||
|
===============
|
||
|
|
||
|
ZIPI keyboards, like MIDI keyboards, would send messages whenever a key is
|
||
|
pressed or released. Probably, a ZIPI keyboard would pre-allocate as many
|
||
|
ZIPI notes as it has keys, sending each of them a pitch message only once as
|
||
|
part of a setup routine. (For a keyboard, a note's address truly is the same
|
||
|
as its pitch, so MIDI's model, in which the address is the same as pitch,
|
||
|
applies.)
|
||
|
|
||
|
A "note-on" packet would have a loudness note descriptor computed from the
|
||
|
key's velocity, followed by an articulation note descriptor ("trigger"). A
|
||
|
"note-off" packet would just consist of a single articulation note
|
||
|
descriptor---"release." Seven overhead bytes, plus a 3-byte address, three
|
||
|
bytes for the loudness note descriptor, and two bytes for articulation is 15
|
||
|
bytes. At 250 kBaud, this takes 480 msec to transmit.
|
||
|
|
||
|
Like all ZIPI controllers, a ZIPI keyboard should be able to send raw
|
||
|
controller measurements instead of mapping those measurements onto
|
||
|
parameters such as loudness and pitch. In this mode, key number would
|
||
|
replace pitch, and key velocity would replace loudness.
|
||
|
|
||
|
It is possible that a ZIPI keyboard controller would want to implement its
|
||
|
own note-stealing algorithm rather than to rely on that of the synthesizer
|
||
|
being controlled. In that case, it could allocate as many notes as the
|
||
|
receiving synthesizer has voices of polyphony. In each packet that contains
|
||
|
a trigger message, there would also be a pitch message. Now, the keyboard
|
||
|
knows that any note it asks the synthesizer to play will be played, and if a
|
||
|
note needs to be turned off, the keyboard can choose which one. The keyboard
|
||
|
controller could even control multiple synthesizers of the same type,
|
||
|
allocating notes among them according to the polyphony capabilities.
|
||
|
|
||
|
On a ZIPI keyboard, alternate tunings would be a feature of the keyboard
|
||
|
controller, not the synthesizer. Remember that pitch in ZIPI is a 16-bit
|
||
|
quantity. The keyboard already has a mapping between key numbers and
|
||
|
pitches; for example, it knows that the middle C key has the ZIPI pitch with
|
||
|
the hexadecimal value 7900. If the musician wanted alternate tunings, the
|
||
|
keyboard could just use a different mapping, perhaps associating the middle
|
||
|
C key with ZIPI pitch hexadecimal 78E2 or hex 792A.
|
||
|
|
||
|
|
||
|
How to do Multis
|
||
|
================
|
||
|
|
||
|
Most MIDI timbre modules are "multi-timbral," meaning that the same
|
||
|
synthesizer can produce pitches with different timbres on different MIDI
|
||
|
channels. Since MIDI has only 16 channels, it is important to choose which
|
||
|
16 timbres to use at a time. Therefore, many synthesizers have the concept
|
||
|
of a "multi," which is a collection of up to 16 timbres. Sometimes it is
|
||
|
possible to select a whole multi all at once, which is the equivalent of
|
||
|
sending program change messages on all 16 MIDI channels. What is the
|
||
|
equivalent mechanism in ZIPI?
|
||
|
|
||
|
In ZIPI, there are 8001 instruments in 63 families, so it is easy to set up
|
||
|
an instrument with every timbre you are ever going to need, and then choose
|
||
|
timbre just by selecting a particular instrument to trigger.
|
||
|
|
||
|
However, if you like the idea of "swapping in" a set of instruments, i.e.,
|
||
|
changing 16 timbres at once, it would still be easy via ZIPI. The ZIPI
|
||
|
controller would have a data structure similar to a MIDI multi, but of any
|
||
|
size, and possibly spanning multiple synthesizers. At the push of a button,
|
||
|
the controller could send program change messages to all the appropriate
|
||
|
instruments of all the appropriate synthesizers.
|
||
|
|
||
|
|
||
|
Y-splitters and Mergers
|
||
|
=======================
|
||
|
|
||
|
In MIDI, two common tools are a Y-splitter and a merger. The Y-splitter has
|
||
|
one MIDI input and multiple MIDI outputs, all of which are copies of the
|
||
|
input signal. (Thus, it looks like the letter "Y.") It is useful to send the
|
||
|
control information from one computer or musical instrument to multiple
|
||
|
synthesizers, e.g., to make them play in unison. The merger performs the
|
||
|
opposite function, taking some number of MIDI inputs and combining the MIDI
|
||
|
messages logically into one single stream of MIDI data at the output. It is
|
||
|
useful for controlling the same synthesizer or computer with two different
|
||
|
MIDI instruments.
|
||
|
|
||
|
Neither of these is required in ZIPI. Any collection of ZIPI devices can be
|
||
|
on the same network, and any of them can send a message to any other. If two
|
||
|
devices want to send messages to the same synthesizer, they both send data
|
||
|
to the appropriate network device address. Likewise, if a device wants to
|
||
|
send a message to two synthesizers, nothing needs to change in the ring
|
||
|
configuration. The sending device can just send two messages addressed to
|
||
|
the two devices, and both will reach their destinations. If the message is
|
||
|
intended for all devices in the ring it can be broadcast, meaning that every
|
||
|
device sees the same message.
|
||
|
|
||
|
|
||
|
A ZIPI Vocoder
|
||
|
==============
|
||
|
|
||
|
A vocoder is a musical instrument that applies the frequency spectrum of one
|
||
|
sound source to a second sound source. The first input signal is analyzed
|
||
|
for its frequency content, with the result used as the parameters of a
|
||
|
filter applied to the second input signal. See, for example, (Moore 1990) or
|
||
|
(Buchla 1974) for an explanation of this technique.
|
||
|
|
||
|
A vocoder might consist of the components shown in Figure 1. The first input
|
||
|
signal passes through the analysis filter bank and envelope followers, which
|
||
|
produce a set of continuously varying control signals. These signals
|
||
|
represent the amplitude of the control signal in each frequency region. The
|
||
|
control signals are then used to control the gains of the components of a
|
||
|
filter bank.
|
||
|
|
||
|
[Figure 1 would go here if this weren't the ASCII version]
|
||
|
|
||
|
One interesting variant on vocoding is applying some transformation to the
|
||
|
control signals between the time when they are produced by the analysis
|
||
|
portion and when they are used to control the output filter bank (Buchla
|
||
|
1974). For example, one might want to accentuate the effect of the first
|
||
|
input signal, e.g., by squaring the control signals. Another interesting
|
||
|
transformation would be to have the overall amplitude of the first input
|
||
|
signal control the brightness imposed by the output filter bank, by
|
||
|
selectively increasing the control signals for the higher-frequency filters.
|
||
|
Yet another possibility would be to frequency-shift the control signals, so
|
||
|
that each control signal would control the gain of the next higher filter.
|
||
|
|
||
|
These kinds of transformations are easy if the two halves of the vocoder
|
||
|
communicate via ZIPI. Suppose we have a vocoder with 20 filter segments,
|
||
|
spaced in some reasonable way across the frequency spectrum. We would then
|
||
|
use ZIPI to send the 20 control signals. For the usual vocoder behavior, we
|
||
|
would simply connect the control ZIPI stream directly to the synthesis
|
||
|
filter bank. For more elaborate behaviors, we would apply some
|
||
|
transformation to the ZIPI signal, presumably with a computer program.
|
||
|
|
||
|
Exactly what would we send over ZIPI? In the best case, the analysis filter
|
||
|
bank would produce logarithmic amplitudes and the synthesis filter banks
|
||
|
would expect a logarithmic control signal, so we could get by with 8-bit
|
||
|
data. Since all 20 of these numbers would be produced at once, there is no
|
||
|
need to set them individually, so we would probably send them as a single
|
||
|
20-data-byte note descriptor. We would need to pick an undefined n-byte note
|
||
|
descriptor, e.g., hexadecimal C9 for this. We might sample the envelope
|
||
|
followers in the analysis bank every 5 msec, so we would update this note
|
||
|
descriptor every 5 msec. The update packet would have seven bytes of
|
||
|
overhead, a 3-byte address, a 1-byte note descriptor ID, another two bytes
|
||
|
to indicate 20 data bytes, and the data bytes themselves, for a total of 33
|
||
|
bytes. So at 250 kBaud, it would take just over a millisecond to send this
|
||
|
packet, thus using about 1/5 of ZIPI's bandwidth.
|
||
|
|
||
|
If the filter banks used linear control signals instead of logarithmic ones,
|
||
|
eight bits would not be enough resolution, so we would switch to 16-bit
|
||
|
control signals. Thus, our packets would be 53 data bytes, taking 1.7 msec
|
||
|
to send. In the worst case, the two filter banks would use different units,
|
||
|
requiring that there be some mapping function, e.g., linear-to-logarithmic
|
||
|
conversion.
|
||
|
|
||
|
|
||
|
Sample Dumps
|
||
|
============
|
||
|
|
||
|
ZIPI will have a separate application layer for audio samples, with a fixed
|
||
|
header format for specifying the sampling rate, number of channels, etc.
|
||
|
This header is likely to be the same as an existing sound file standard (van
|
||
|
Rossum 1993), but might be specially designed for ZIPI.
|
||
|
|
||
|
In most cases, sample dumps are a type of packet that would be sent with
|
||
|
"guaranteed delivery," meaning that the receiving device has to acknowledge
|
||
|
successful receipt of the packet.
|
||
|
|
||
|
The longest legal ZIPI packet has 4096 data bytes, plus seven overhead
|
||
|
bytes, so samples will have to be broken up into multiple ZIPI packets.
|
||
|
After getting one of these packets, the receiving device would have to send
|
||
|
back an acknowledgment of around 8 bytes. Therefore, it takes 4111 bytes
|
||
|
over the network to reliably transmit 4096 data bytes---99.64 percent
|
||
|
efficiency. ZIPI's lower network levels do a hardware CRC checksum, so if
|
||
|
there is any corruption of the data the receiving device will know it. (Of
|
||
|
course, if there is data corruption it will be necessary to re-transmit
|
||
|
those data bytes, slowing down the process. But that is better than a
|
||
|
garbled sample file!)
|
||
|
|
||
|
Assuming 1-bit monophonic PCM data, 4096 bytes is 2048 samples. At a 44.1
|
||
|
kHz sampling rate, that would be about 46 msec of sound. In a ZIPI network
|
||
|
running at 1 MBaud, it would take about 33 msec to transmit 4111 bytes, so
|
||
|
it would be about 40 percent faster than real time to transmit CD-quality
|
||
|
samples. Of course, a ZIPI network running at 20 MBaud would be 20 times
|
||
|
faster, transmitting those 4111 bytes in 1.64 msec---28 times faster than
|
||
|
real-time.
|
||
|
|
||
|
|
||
|
Real-time Digital Audio
|
||
|
=======================
|
||
|
|
||
|
Real-time digital audio has something in common with sample dumps---PCM audio
|
||
|
data is sent at high speeds over ZIPI---but there are also other issues that
|
||
|
must be resolved. First of all, obviously, the network must be able to send
|
||
|
the data at faster than real-time!
|
||
|
|
||
|
Variations in network latency also add complication to real-time digital
|
||
|
audio. To solve this, the sending device should time-tag all outgoing audio
|
||
|
messages. The receiving device can then buffer data before playing it,
|
||
|
imposing a small but fixed delay on the audio stream. Anderson and Kuivila
|
||
|
(1986) give an explanation of this process.
|
||
|
|
||
|
For example, consider the case of transmitting CD-quality monophonic digital
|
||
|
audio over a 1 MBaud ZIPI network. As mentioned above, there's more than
|
||
|
enough bandwidth, by 40 percent, for this. What are the expected and
|
||
|
worst-case latencies? A maximum-length ZIPI packet and its acknowledgment
|
||
|
require 33 msec to transmit, so that is the expected (and best-case)
|
||
|
latency, but if a packet is lost, it will require another 33 msec to
|
||
|
re-transmit. To be conservative, let's say that the worst-case latency is
|
||
|
200 msec.
|
||
|
|
||
|
The receiving device would allocate a buffer of 8820 samples, i.e., 200
|
||
|
msec. When it receives a packet of samples, it copies them into the buffer
|
||
|
at the appropriate location, based on the packet's time stamp. It would read
|
||
|
data out of the buffer 200 msec after the device sends it, so the buffer
|
||
|
will usually be close to full as long as no packets are lost and there is no
|
||
|
unexpected network activity. If a packet is lost, the buffer will continue
|
||
|
to be emptied by the playing device, but will temporarily stop being filled
|
||
|
by the sending device. However, we expect that the sending device will get
|
||
|
some data through before the 200 msec buffer empties entirely. Since the
|
||
|
network can transmit at 40 percent faster than real-time, the sending
|
||
|
synthesizer can use that 40 percent to catch up, eventually nearly filling
|
||
|
the buffer again.
|
||
|
|
||
|
Note that the maximum packet size of 4096 bytes is not the only possible
|
||
|
packet size for digital audio. Smaller packet sizes would require more
|
||
|
bandwidth (since the amount of overhead per packet is constant), but would
|
||
|
give less latency. 256-byte packets would reduce efficiency to 94.5 percent,
|
||
|
but would reduce the expected latency to 2.168 msec. In an extreme case,
|
||
|
8-data-byte packets would reduce efficiency to under 35 percent, but would
|
||
|
give 0.184 msec latency.
|
||
|
|
||
|
|
||
|
References
|
||
|
==========
|
||
|
|
||
|
Anderson, D., and R. Kuivila. 1986. "Accurately Timed Generation of Discrete
|
||
|
Musical Events." *Computer Music Journal* 10(3): 48-56.
|
||
|
|
||
|
Buchla, D. 1974. *Model 296 Spectral Processor Manual.* Berkeley,
|
||
|
California: Buchla and Associates.
|
||
|
|
||
|
Moore, F. R. 1990. *Elements of Computer Music*. Englewood Cliffs, New
|
||
|
Jersey: Prentice Hall.
|
||
|
|
||
|
van Rossum, Guido. 1993. *FAQ: Audio File Formats (version 2.10)*. From
|
||
|
Internet server mitpress.mit.edu, file
|
||
|
/pub/Computer-Music-Journal/Documents/SoundFiles/AudioFormats2.10.t.
|