NFS Tracing By Passive Network Monitoring

Matt Blaze
Department of Computer Science
Princeton University
mab@cs.princeton.edu

ABSTRACT

Traces of filesystem activity have proven to be useful for a wide variety of purposes, ranging from quantitative analysis of system behavior to trace-driven simulation of filesystem algorithms. Such traces can be difficult to obtain, however, usually entailing modification of the filesystems to be monitored and runtime overhead for the period of the trace. Largely because of these difficulties, a surprisingly small number of filesystem traces have been conducted, and few sample workloads are available to filesystem researchers.

This paper describes a portable toolkit for deriving approximate traces of NFS [1] activity by non-intrusively monitoring the Ethernet traffic to and from the file server. The toolkit uses a promiscuous Ethernet listener interface (such as the Packetfilter[2]) to read and reconstruct NFS-related RPC packets intended for the server. It produces traces of the NFS activity as well as a plausible set of corresponding client system calls. The tool is currently in use at Princeton and other sites, and is available via anonymous ftp.

1. Motivation

Traces of real workloads form an important part of virtually all analysis of computer system behavior, whether it is program hot spots, memory access patterns, or filesystem activity that is being studied. In the case of filesystem activity, obtaining useful traces is particularly challenging. Filesystem behavior can span long time periods, often making it necessary to collect huge traces over weeks or even months. Modification of the filesystem to collect trace data is often difficult, and may result in unacceptable runtime overhead. Distributed filesystems exacerbate these difficulties, especially when the network is composed of a large number of heterogeneous machines. As a result of these difficulties, only a relatively small number of traces of Unix filesystem workloads have been conducted, primarily in computing research environments. [3], [4] and [5] are examples of such traces.

Since distributed filesystems work by transmitting their activity over a network, it would seem reasonable to obtain traces of such systems by placing a "tap" on the network and collecting trace data based on the network traffic. Ethernet[6]-based networks lend themselves to this approach particularly well, since traffic is broadcast to all machines connected to a given subnetwork. A number of general-purpose network monitoring tools are available that "promiscuously" listen to the Ethernet to which they are connected; Sun's etherfind[7] is an example of such a tool. While these tools are useful for observing (and collecting statistics on) specific types of packets, the information they provide is at too low a level to be useful for building filesystem traces. Filesystem operations may span several packets, and may be meaningful only in the context of other, previous operations.

Some work has been done on characterizing the impact of NFS traffic on network load. In [8], for example, the results of a study are reported in which Ethernet traffic was monitored and statistics gathered on NFS activity. While useful for understanding traffic patterns and developing a queueing model of NFS loads, these previous studies do not use the network traffic to analyze the file access traffic patterns of the system, focusing instead on developing a statistical model of the individual packet sources, destinations, and types.

This paper describes a toolkit for collecting traces of NFS file access activity by monitoring Ethernet traffic. A "spy" machine with a promiscuous Ethernet interface is connected to the same network as the file server. Each NFS-related packet is analyzed and a trace is produced at an appropriate level of detail. The tool can record the low-level NFS calls themselves or an approximation of the user-level system calls (open, close, etc.) that triggered the activity.

We partition the problem of deriving NFS activity from raw network traffic into two fairly distinct subproblems: that of decoding the low-level NFS operations from the packets on the network, and that of translating these low-level commands back into user-level system calls. Hence, the toolkit consists of two basic parts, an "RPC decoder" (rpcspy) and the "NFS analyzer" (nfstrace). rpcspy communicates with a low-level network monitoring facility (such as Sun's NIT [9] or the Packetfilter [2]) to read and reconstruct the RPC transactions (call and reply) that make up each NFS command. nfstrace takes the output of rpcspy and reconstructs the system calls that occurred as well as other interesting data it can derive about the structure of the filesystem, such as the mappings between NFS file handles and Unix file names. Since there is not a clean one-to-one mapping between system calls and lower-level NFS commands, nfstrace uses some simple heuristics to guess a reasonable approximation of what really occurred.

1.1. A Spy's View of the NFS Protocols

It is well beyond the scope of this paper to describe the protocols used by NFS; for a detailed description of how NFS works, the reader is referred to [10], [11], and [12]. What follows is a very brief overview of how NFS activity translates into Ethernet packets.

An NFS network consists of servers, to which filesystems are physically connected, and clients, which perform operations on remote server filesystems as if the disks were locally connected. A particular machine can be a client or a server or both. Clients mount remote server filesystems in their local hierarchy just as they do local filesystems; from the user's perspective, files on NFS and local filesystems are (for the most part) indistinguishable, and can be manipulated with the usual filesystem calls.

The interface between client and server is defined in terms of 17 remote procedure call (RPC) operations. Remote files (and directories) are referred to by a file handle that uniquely identifies the file to the server. There are operations to read and write bytes of a file (read, write), obtain a file's attributes (getattr), obtain the contents of directories (lookup, readdir), create files (create), and so forth. While most of these operations are direct analogs of Unix system calls, notably absent are open and close operations; no client state information is maintained at the server, so there is no need to inform the server explicitly when a file is in use. Clients can maintain buffer cache entries for NFS files, but must verify that the blocks are still valid (by checking the last write time with the getattr operation) before using the cached data.

An RPC transaction consists of a call message (with arguments) from the client to the server and a reply message (with return data) from the server to the client. NFS RPC calls are transmitted using the UDP/IP connectionless unreliable datagram protocol[13]. The call message contains a unique transaction identifier which is included in the reply message to enable the client to match the reply with its call. The data in both messages is encoded in an "external data representation" (XDR), which provides a machine-independent standard for byte order, etc.

Note that the NFS server maintains no state information about its clients, and knows nothing about the context of each operation outside of the arguments to the operation itself.

2. The rpcspy Program

rpcspy is the interface to the system-dependent Ethernet monitoring facility; it produces a trace of the RPC calls issued between a given set of clients and servers. At present, there are versions of rpcspy for a number of BSD-derived systems, including ULTRIX (with the Packetfilter[2]), SunOS (with NIT[9]), and the IBM RT running AOS (with the Stanford enet filter).

For each RPC transaction monitored, rpcspy produces an ASCII record containing a timestamp, the name of the server, the client, the length of time the command took to execute, the name of the RPC command executed, and the command-specific arguments and return data. Currently, rpcspy understands and can decode the 17 NFS RPC commands, and there are hooks to allow other RPC services (for example, NIS) to be added reasonably easily.

The output may be read directly or piped into another program (such as nfstrace) for further analysis; the format is designed to be reasonably friendly to both the human reader and other programs (such as nfstrace or awk).

Since each RPC transaction consists of two messages, a call and a reply, rpcspy waits until it receives both of these components and emits a single record for the entire transaction. The basic output format is seven vertical-bar-separated fields:

    timestamp | execution-time | server | client | command-name | arguments | reply-data

where timestamp is the time the reply message was received, execution-time is the time (in microseconds) that elapsed between the call and reply, server is the name (or IP address) of the server, client is the name (or IP address) of the client followed by the userid that issued the command, command-name is the name of the particular program invoked (read, write, getattr, etc.), and arguments and reply-data are the command-dependent arguments and return values passed to and from the RPC program, respectively.

The exact format of the argument and reply data is dependent on the specific command issued and the level of detail the user wants logged. For example, a typical NFS command is recorded as follows:

    690529992.167140 | 11717 | paramount | merckx.321 | read | {"7b1f00000000083c", 0, 8192} | ok, 1871

In this example, uid 321 at client "merckx" issued an NFS read command to server "paramount". The reply was issued at (Unix time) 690529992.167140 seconds; the call command occurred 11717 microseconds earlier. Three arguments are logged for the read call: the file handle from which to read (represented as a hexadecimal string), the offset from the beginning of the file, and the number of bytes to read. In this example, 8192 bytes are requested starting at the beginning (byte 0) of the file whose handle is "7b1f00000000083c". The command completed successfully (status "ok"), and 1871 bytes were returned. Of course, the reply message also included the 1871 bytes of data from the file, but that field of the reply is not logged by rpcspy.
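
For illustration only, a record in this format can be pulled apart with a few lines of Python; this sketch is not part of the toolkit, and the field handling simply follows the format described above:

    def parse_rpcspy_record(line):
        """Split one vertical-bar-separated rpcspy record into a dict."""
        fields = [f.strip() for f in line.split("|")]
        timestamp, exec_us, server, client, command = fields[:5]
        host, _, uid = client.partition(".")       # "merckx.321" -> host, uid
        return {
            "timestamp": float(timestamp),         # reply arrival time (Unix seconds)
            "exec_us": int(exec_us),               # call-to-reply latency (microseconds)
            "server": server,
            "client": host,
            "uid": uid,
            "command": command,
            "rest": fields[5:],                    # arguments and reply data, left uninterpreted
        }

    rec = parse_rpcspy_record('690529992.167140 | 11717 | paramount | '
                              'merckx.321 | read | {"7b1f00000000083c", 0, 8192} | ok, 1871')
    print(rec["command"], rec["exec_us"])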

rpcspy has a number of configuration options to control which hosts and RPC commands are traced, which call and reply fields are printed, which Ethernet interfaces are tapped, how long to wait for reply messages, how long to run, etc. While its primary function is to provide input for the nfstrace program (see Section 3), judicious use of these options (as well as such programs as grep, awk, etc.) permits its use as a simple NFS diagnostic and performance monitoring tool. A few screens of output give a surprisingly informative snapshot of current NFS activity; we have used the program to quickly identify several problems that were otherwise difficult to pinpoint. Similarly, a short awk script can provide a breakdown of the most active clients, servers, and hosts over a sampled time period.
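
A comparable breakdown can also be done in Python; the sketch below (hypothetical, not part of the distribution) reads rpcspy output from standard input and tallies records per client and per command:

    import collections
    import sys

    by_client = collections.Counter()
    by_command = collections.Counter()

    for line in sys.stdin:
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 5:
            continue                               # skip malformed or truncated lines
        _, _, server, client, command = fields[:5]
        by_client[client.split(".")[0]] += 1       # drop the trailing userid
        by_command[command] += 1

    for name, count in by_client.most_common(10):
        print(f"{count:8d}  {name}")
    for name, count in by_command.most_common():
        print(f"{count:8d}  {name}")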

2.1. Implementation Issues

The basic function of rpcspy is to monitor the network, extract those packets containing NFS data, and print the data in a useful format. Since each RPC transaction consists of a call and a reply, rpcspy maintains a table of pending call packets that are removed and emitted when the matching reply arrives. In normal operation on a reasonably fast workstation, this rarely requires more than about two megabytes of memory, even on a busy network with unusually slow file servers. Should a server go down, however, the queue of pending call messages (which are never matched with a reply) can quickly become a memory hog; the user can specify a maximum size the table is allowed to reach before these "orphaned" calls are searched out and reclaimed.
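
In outline, the call/reply matching might look like the following sketch; keying on the client address plus the RPC transaction identifier, and the specific limits, are assumptions for illustration rather than rpcspy's actual code:

    import time

    MAX_PENDING = 100_000        # table size that triggers a sweep for orphans
    ORPHAN_AGE = 120.0           # seconds after which an unmatched call is reclaimed

    pending = {}                 # (client, xid) -> (call_time, call_info)

    def saw_call(client, xid, call_info, now=None):
        now = time.time() if now is None else now
        pending[(client, xid)] = (now, call_info)
        if len(pending) > MAX_PENDING:
            for key, (t, _) in list(pending.items()):
                if now - t > ORPHAN_AGE:
                    del pending[key]

    def saw_reply(client, xid, reply_info, now=None):
        entry = pending.pop((client, xid), None)
        if entry is None:
            return None          # a reply whose call was never seen; ignore it
        now = time.time() if now is None else now
        call_time, call_info = entry
        exec_us = (now - call_time) * 1e6
        return (now, exec_us, call_info, reply_info)   # one complete transaction record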

File handles pose special problems. While all NFS file handles are a fixed size, the number of significant bits varies from implementation to implementation; even within a vendor, two different releases of the same operating system might use a completely different internal handle format. In most Unix implementations, the handle contains a filesystem identifier and the inode number of the file; this is sometimes augmented by additional information, such as a version number. Since programs using rpcspy output generally will use the handle as a unique file identifier, it is important that there not appear to be more than one handle for the same file. Unfortunately, it is not sufficient to simply consider the handle as a bitstring of the maximum handle size, since many operating systems do not zero out the unused extra bits before assigning the handle. Fortunately, most servers are at least consistent in the sizes of the handles they assign. rpcspy allows the user to specify (on the command line or in a startup file) the handle size for each host to be monitored. The handles from that server are emitted as hexadecimal strings truncated at that length. If no size is specified, a guess is made based on a few common formats of a reasonable size.
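
The truncation itself is trivial; a sketch of the idea, with made-up per-server sizes standing in for the user-supplied configuration:

    HANDLE_BYTES = {"paramount": 8, "basso": 8}    # significant bytes per server (user-supplied)
    DEFAULT_BYTES = 16                             # fall back to the full handle

    def normalize_handle(server, handle_hex):
        nbytes = HANDLE_BYTES.get(server, DEFAULT_BYTES)
        return handle_hex[:2 * nbytes]             # two hex digits per byte

    print(normalize_handle("paramount", "7b1f00000000083c0000000000000000"))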

It is usually desirable to emit IP addresses of clients and servers as their symbolic host names. An early version of the software simply did a nameserver lookup each time this was necessary; this quickly flooded the network with a nameserver request for each NFS transaction. The current version maintains a cache of host names; this requires only a modest amount of memory for typical networks of less than a few hundred hosts. For very large networks or those where NFS service is provided to a large number of remote hosts, this could still be a potential problem, but as a last resort remote name resolution could be disabled or rpcspy configured not to translate IP addresses.
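
The name cache amounts to little more than a dictionary in front of the resolver; a minimal sketch of the idea (not rpcspy's actual code):

    import socket

    _names = {}                                    # dotted-quad -> host name (or the address itself)

    def host_name(addr, resolve=True):
        if not resolve:
            return addr                            # translation disabled
        if addr not in _names:
            try:
                _names[addr] = socket.gethostbyaddr(addr)[0]
            except OSError:
                _names[addr] = addr                # unresolvable; cache the failure too
        return _names[addr]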

UDP/IP datagrams may be fragmented among several packets if the datagram is larger than the maximum size of a single Ethernet frame. rpcspy looks only at the first fragment; in practice, fragmentation occurs only for the data fields of NFS read and write transactions, which are ignored anyway.

3. nfstrace: The Filesystem Tracing Package

Although rpcspy provides a trace of the low-level NFS commands, it is not, in and of itself, sufficient for obtaining useful filesystem traces. The low-level commands do not by themselves reveal user-level activity. Furthermore, the volume of data that would need to be recorded is potentially enormous, on the order of megabytes per hour. More useful would be an abstraction of the user-level system calls underlying the NFS activity.

nfstrace is a filter for rpcspy that produces a log of a plausible set of user-level filesystem commands that could have triggered the monitored activity. A record is produced each time a file is opened, giving a summary of what occurred. This summary is detailed enough for analysis or for use as input to a filesystem simulator.

The output format of nfstrace consists of seven fields:

    timestamp | command-time | direction | file-id | client | transferred | size

where timestamp is the time the open occurred, command-time is the length of time between open and close, direction is either read or write (mkdir and readdir count as write and read, respectively), file-id identifies the server and the file handle, client is the client and user that performed the open, transferred is the number of bytes of the file actually read or written (cache hits have a 0 in this field), and size is the size of the file (in bytes).

An example record might be as follows:

    690691919.593442 | 17734 | read | basso:7b1f00000000400f | frejus.321 | 0 | 24576

Here, userid 321 at client frejus read file 7b1f00000000400f on server basso. The file is 24576 bytes long and was able to be read from the client cache. The command started at Unix time 690691919.593442 and took 17734 microseconds at the server to execute.
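
Again purely for illustration (this is not part of the toolkit), a record in this form can be read into a named tuple; a transferred count of zero marks an (estimated) client cache hit:

    from collections import namedtuple

    Open = namedtuple("Open",
                      "timestamp duration_us direction file_id client transferred size")

    def parse_nfstrace_record(line):
        f = [x.strip() for x in line.split("|")]
        return Open(float(f[0]), int(f[1]), f[2], f[3], f[4], int(f[5]), int(f[6]))

    rec = parse_nfstrace_record("690691919.593442 | 17734 | read | "
                                "basso:7b1f00000000400f | frejus.321 | 0 | 24576")
    print(rec.direction, "cache hit" if rec.transferred == 0 else "server read")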

Since it is sometimes useful to know the name corresponding to the handle and the mode information for each file, nfstrace optionally produces a map of file handles to file names and modes. When enough information (from lookup and readdir commands) is received, new names are added. Names can change over time (as files are deleted and renamed), so the times during which each mapping can be considered valid are recorded as well. The mapping information may not always be complete, however, depending on how much activity has already been observed. Also, hard links can confuse the name mapping, and it is not always possible to determine which of several possible names a file was opened under.

What nfstrace produces is only an approximation of the underlying user activity. Since there are no NFS open or close commands, the program must guess when these system calls occur. It does this by taking advantage of the observation that NFS is fairly consistent in what it does when a file is opened. If the file is in the local buffer cache, a getattr call is made on the file to verify that it has not changed since the file was cached. Otherwise, the actual bytes of the file are fetched as they are read by the user. (It is possible that part of the file is in the cache and part is not, in which case the getattr is performed and only the missing pieces are fetched. This occurs most often when a demand-paged executable is loaded.) nfstrace assumes that any sequence of NFS read calls on the same file issued by the same user at the same client is part of a single open for read. The close is assumed to have taken place when the last read in the sequence completes. The end of a read sequence is detected when the same client reads the beginning of the file again or when a timeout with no reading has elapsed. Writes are handled in a similar manner.
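
The read-grouping heuristic can be sketched as follows; the timeout value and the bookkeeping details are illustrative assumptions, not nfstrace's actual implementation:

    READ_TIMEOUT = 15 * 60        # seconds of inactivity that close an implied open

    open_reads = {}               # (client_user, file_id) -> [open_time, last_time, bytes]
    finished = []                 # completed open-for-read records

    def note_read(timestamp, client_user, file_id, offset, nbytes):
        key = (client_user, file_id)
        session = open_reads.get(key)
        # Re-reading the start of the file, or a long quiet gap, ends the previous open.
        if session and (offset == 0 or timestamp - session[1] > READ_TIMEOUT):
            finished.append(("read", key, tuple(session)))
            session = None
        if session is None:
            session = [timestamp, timestamp, 0]
            open_reads[key] = session
        session[1] = timestamp    # most recent activity
        session[2] += nbytes      # bytes actually fetched from the server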

Reads that are entirely from the client cache are a bit harder; not every getattr command is caused by a cache read, and a few cache reads take place without a getattr. A user-level stat system call can sometimes trigger a getattr, as can an ls -l command. Fortunately, the attribute caching used by most implementations of NFS seems to eliminate many of these extraneous getattrs, and ls commands appear to trigger a lookup command most of the time. nfstrace assumes that a getattr on any file that the client has read within the past few hours represents a cache read; otherwise it is ignored. This simple heuristic seems to be fairly accurate in practice. Note also that a getattr might not be performed if a read occurs very soon after the last read, but the time threshold is generally short enough that this is rarely a problem. Still, the cached reads that nfstrace reports are, at best, an estimate (generally erring on the side of over-reporting). There is no way to determine the number of bytes actually read for cache hits.

The output of nfstrace is necessarily produced out of chronological order, but may be sorted easily by a post-processor.

nfstrace has a host of options to control the level of detail of the trace, the lengths of the timeouts, and so on. To facilitate the production of very long traces, the output can be flushed and checkpointed at a specified interval, and can be automatically compressed.

4. Using rpcspy and nfstrace for Filesystem Tracing

Clearly, nfstrace is not suitable for producing highly accurate traces; cache hits are only estimated, the timing information is imprecise, and data from lost (and duplicated) network packets are not accounted for. When such a highly accurate trace is required, other approaches, such as modification of the client and server kernels, must be employed.

The main virtue of the passive-monitoring approach lies in its simplicity. In [5], Baker et al. describe a trace of a distributed filesystem which involved low-level modification of several different operating system kernels. In contrast, our entire filesystem trace package consists of less than 5000 lines of code written by a single programmer in a few weeks, involves no kernel modifications, and can be installed to monitor multiple heterogeneous servers and clients with no knowledge of even what operating systems they are running.

The most important parameter affecting the accuracy of the traces is the ability of the machine on which rpcspy is running to keep up with the network traffic. Although most modern RISC workstations with reasonable Ethernet interfaces are able to keep up with typical network loads, it is important to determine how much information was lost due to packet buffer overruns before relying upon the trace data. It is also important that the trace be, indeed, non-intrusive. It quickly became obvious, for example, that writing the trace log itself to an NFS filesystem can be problematic.

Another parameter affecting the usefulness of the traces is the validity of the heuristics used to translate from RPC calls into user-level system calls. To test this, a shell script was written that performed ls -l, touch, cp and wc commands randomly in a small directory hierarchy, keeping a record of which files were touched and read and at what time. After several hours, nfstrace was able to detect 100% of the writes, 100% of the uncached reads, and 99.4% of the cached reads. Cached reads were over-reported by 11%, even though ls commands (which cause the "phantom" reads) made up 50% of the test activity. While this test provides encouraging evidence of the accuracy of the traces, it is not by itself conclusive, since the particular workload being monitored may fool nfstrace in unanticipated ways.

As in any research where data are collected about the behavior of human subjects, the privacy of the individuals observed is a concern. Although the contents of files are not logged by the toolkit, it is still possible to learn something about individual users from examining what files they read and write. At a minimum, the users of a monitored system should be informed of the nature of the trace and the uses to which it will be put. In some cases, it may be necessary to disable the name translation from nfstrace when the data are being provided to others. Commercial sites where filenames might reveal something about proprietary projects can be particularly sensitive to such concerns.

5. A Trace of Filesystem Activity in the Princeton C.S. Department

A previous paper[14] analyzed a five-day-long trace of filesystem activity conducted on 112 research workstations at DEC-SRC. The paper identified a number of file access properties that affect filesystem caching performance; it is difficult, however, to know whether these properties were unique artifacts of that particular environment or are more generally applicable. To help answer that question, it is necessary to look at similar traces from other computing environments.

It was relatively easy to use rpcspy and nfstrace to conduct a week-long trace of filesystem activity in the Princeton University Computer Science Department. The departmental computing facility serves a community of approximately 250 users, of which about 65% are researchers (faculty, graduate students, undergraduate researchers, postdoctoral staff, etc.), 5% office staff, 2% systems staff, and the rest guests and other "external" users. About 115 of the users work full-time in the building and use the system heavily for electronic mail, netnews, and other such communication services as well as other computer science research oriented tasks (editing, compiling, and executing programs, formatting documents, etc.).

The computing facility consists of a central Auspex file server (fs) (to which users do not ordinarily log in directly), four DEC 5000/200s (elan, hart, atomic and dynamic) used as shared cycle servers, and an assortment of dedicated workstations (NeXT machines, Sun workstations, IBM-RTs, Iris workstations, etc.) in individual offices and laboratories. Most users log in to one of the four cycle servers via X window terminals located in offices; the terminals are divided evenly among the four servers. There are a number of Ethernets throughout the building. The central file server is connected to a "machine room network" to which no user terminals are directly connected; traffic to the file server from outside the machine room is gatewayed via a Cisco router. Each of the four cycle servers has a local /, /bin and /tmp filesystem; other filesystems, including /usr, /usr/local, and users' home directories are NFS mounted from fs. Mail sent from local machines is delivered locally to the (shared) fs:/usr/spool/mail; mail from outside is delivered directly on fs.

The trace was conducted by connecting a dedicated DEC 5000/200 with a local disk to the machine room network. This network carries NFS traffic for all home directory access and access to all non-local cycle-server files (including most of the actively used programs). On a typical weekday, about 8 million packets are transmitted over this network. nfstrace was configured to record opens for read and write (but not directory accesses or individual reads or writes). After one week (Wednesday to Wednesday), 342,530 opens for read and 125,542 opens for write were recorded, occupying 8 MB of (compressed) disk space. Most of this traffic was from the four cycle servers.

No attempt was made to "normalize" the workload during the trace period. Although users were notified that file accesses were being recorded, and provided an opportunity to ask to be excluded from the data collection, most users seemed to simply continue with their normal work. Similarly, no correction is made for any anomalous user activity that may have occurred during the trace.

5.1. The Workload Over Time

Intuitively, the volume of traffic can be expected to vary with the time of day. Figure 1 shows the number of reads and writes per hour over the seven days of the trace; in particular, the volume of write traffic seems to mirror the general level of departmental activity fairly closely.

An important metric of NFS performance is the client buffer cache hit rate. Each of the four cycle servers allocates approximately 6MB of memory for the buffer cache. The (estimated) aggregate hit rate (percentage of reads served by client caches) as seen at the file server was surprisingly low: 22.2% over the entire week. In any given hour, the hit rate never exceeded 40%. Figure 2 plots (actual) server reads and (estimated) cache hits per hour over the trace week; observe that the hit rate is at its worst during periods of the heaviest read activity.
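
The hourly hit rate is straightforward to recompute from the nfstrace output; the following is a sketch of such a computation (assuming the record format shown earlier, with transferred == 0 taken as an estimated cache hit), not the exact analysis code used for the figures:

    import collections
    import sys

    reads = collections.Counter()      # hour -> total opens for read
    hits = collections.Counter()       # hour -> estimated cache hits

    for line in sys.stdin:
        f = [x.strip() for x in line.split("|")]
        if len(f) < 7 or f[2] != "read":
            continue
        hour = int(float(f[0])) // 3600
        reads[hour] += 1
        if int(f[5]) == 0:
            hits[hour] += 1

    for hour in sorted(reads):
        print(hour, f"{100.0 * hits[hour] / reads[hour]:.1f}% hit rate")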

Past studies have predicted much higher hit rates than the aggregate observed here. It is probable that since most of the traffic is generated by the shared cycle servers, the low hit rate can be attributed to the large number of users competing for cache space. In fact, the hit rate was observed to be much higher on the single-user workstations monitored in the study, averaging above 52% overall. This suggests, somewhat counter-intuitively, that if more computers were added to the network (such that each user had a private workstation), the server load would decrease considerably. Figure 3 shows the actual cache misses and estimated cache hits for a typical private workstation in the study.

[Figure 1 - Read and Write Traffic Over Time: reads and writes per hour (separate curves for writes and for all reads) over the seven days of the trace, Thursday 00:00 through the following Wednesday.]

5.2. File Sharing

One property observed in the DEC-SRC trace is the tendency of files that are used by multiple workstations to make up a significant proportion of read traffic but a very small proportion of write traffic. This has important implications for a caching strategy, since, when it is true, files that are cached at many places very rarely need to be invalidated. Although the Princeton computing facility does not have a single workstation per user, a similar metric is the degree to which files read by more than one user are read and written. In this respect, the Princeton trace is very similar to the DEC-SRC trace. Files read by more than one user make up more than 60% of read traffic, but less than 2% of write traffic. Files shared by more than ten users make up less than .2% of write traffic but still more than 30% of read traffic. Figure 4 plots the percentage of reads and writes that go to files previously read by more than n users, for increasing n.
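
This metric can be approximated with a single pass over time-sorted nfstrace records; the sketch below is an approximation for illustration, not the exact analysis used for the figure, and it counts, for each open, how many other users had previously read the file:

    import collections
    import sys

    readers = collections.defaultdict(set)      # file-id -> users seen reading it so far
    reads_by_n = collections.Counter()          # prior-reader count -> opens for read
    writes_by_n = collections.Counter()         # prior-reader count -> opens for write

    for line in sys.stdin:                      # assumes records sorted by timestamp
        f = [x.strip() for x in line.split("|")]
        if len(f) < 7:
            continue
        direction, file_id, client = f[2], f[3], f[4]
        n = len(readers[file_id] - {client})    # distinct *other* users who read it before
        if direction == "read":
            reads_by_n[n] += 1
            readers[file_id].add(client)
        else:
            writes_by_n[n] += 1

    total = sum(reads_by_n.values()) or 1
    shared = sum(c for n, c in reads_by_n.items() if n >= 1)
    print(f"reads of files previously read by another user: {100.0 * shared / total:.1f}%")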

5.3. File "Entropy"

Files in the DEC-SRC trace demonstrated a strong tendency to "become" read-only as they were read more and more often. That is, the probability that the next operation on a given file will overwrite the file drops off sharply in proportion to the number of times it has been read in the past. Like the sharing property, this has implications for a caching strategy, since the probability that cached data is valid influences the choice of a validation scheme. Again, we find this property to be very strong in the Princeton trace. For any file access in the trace, the probability that it is a write is about 27%. If the file has already been read at least once since it was last written to, the write probability drops to 10%. Once the file has been read at least five times, the write probability drops below 1%. Figure 5 plots the observed write probability against the number of reads since the last write.
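
The conditional write probability can be estimated from the same trace data; a sketch of the computation (again assuming time-sorted nfstrace records, and not the exact analysis code used for the figure):

    import collections
    import sys

    reads_since_write = collections.defaultdict(int)   # file-id -> reads since last write
    ops = collections.Counter()                        # (n, direction) -> count

    for line in sys.stdin:
        f = [x.strip() for x in line.split("|")]
        if len(f) < 7:
            continue
        direction, file_id = f[2], f[3]
        n = min(reads_since_write[file_id], 20)        # bucket the tail at 20 prior reads
        ops[(n, direction)] += 1
        if direction == "write":
            reads_since_write[file_id] = 0
        else:
            reads_since_write[file_id] += 1

    for n in range(21):
        r, w = ops[(n, "read")], ops[(n, "write")]
        if r + w:
            print(n, f"P(write) = {w / (r + w):.3f}")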

[Figure 2 - Cache Hits and Misses Over Time: total reads per hour over the trace week, divided into estimated cache hits and actual cache misses.]

6. Conclusions

Although filesystem traces are a useful tool for the analysis of current and proposed systems, the difficulty of collecting meaningful trace data makes such traces difficult to obtain. The performance degradation introduced by the trace software and the volume of raw data generated make traces over long time periods and outside of computing research facilities particularly hard to conduct.

Although not as accurate as direct, kernel-based tracing, a passive network monitor such as the one described in this paper can permit tracing of distributed systems relatively easily. The ability to limit the data collected to a high-level log of only the data required can make it practical to conduct traces over several months. Such a long-term trace is presently being conducted at Princeton as part of the author's research on filesystem caching. The non-intrusive nature of the data collection makes traces possible at facilities where kernel modification is impractical or unacceptable.

It is the author's hope that other sites (particularly those not doing computing research) will make use of this toolkit and will make the traces available to filesystem researchers.

7. Availability

The toolkit, consisting of rpcspy, nfstrace, and several support scripts, currently runs under several BSD-derived platforms, including ULTRIX 4.x, SunOS 4.x, and IBM-RT/AOS. It is available for anonymous ftp over the Internet from samadams.princeton.edu, in the compressed tar file nfstrace/nfstrace.tar.Z.

[Figure 3 - Cache Hits and Misses Over Time - Private Workstation: reads per hour for a typical single-user workstation over the trace week, divided into estimated cache hits and actual cache misses.]

[Figure 4 - Degree of Sharing for Reads and Writes: percentage of reads and writes to files used by more than n users, plotted for n from 0 to 20.]

[Figure 5 - Probability of Write Given >= n Previous Reads: probability that the next operation is a write, plotted against the number of reads since the last write (0 to 20).]

8. Acknowledgments

The author would like to gratefully acknowledge Jim Roberts and Steve Beck for their help in getting the trace machine up and running, Rafael Alonso for his helpful comments and direction, and the members of the program committee for their valuable suggestions. Jim Plank deserves special thanks for writing jgraph, the software which produced the figures in this paper.

9. References

[1] Sandberg, R., Goldberg, D., Kleiman, S., Walsh, D., & Lyon, B. "Design and Implementation of the Sun Network File System." Proc. USENIX, Summer 1985.

[2] Mogul, J., Rashid, R., & Accetta, M. "The Packet Filter: An Efficient Mechanism for User-Level Network Code." Proc. 11th ACM Symp. on Operating Systems Principles, 1987.

[3] Ousterhout, J., et al. "A Trace-Driven Analysis of the Unix 4.2 BSD File System." Proc. 10th ACM Symp. on Operating Systems Principles, 1985.

[4] Floyd, R. "Short-Term File Reference Patterns in a UNIX Environment." TR-177, Dept. of Computer Science, University of Rochester, 1986.

[5] Baker, M., et al. "Measurements of a Distributed File System." Proc. 13th ACM Symp. on Operating Systems Principles, 1991.

[6] Metcalfe, R. & Boggs, D. "Ethernet: Distributed Packet Switching for Local Computer Networks." CACM, July 1976.

[7] "Etherfind(8) Manual Page." SunOS Reference Manual, Sun Microsystems, 1988.

[8] Gusella, R. "Analysis of Diskless Workstation Traffic on an Ethernet." TR-UCB/CSD-87/379, University of California, Berkeley, 1987.

[9] "NIT(4) Manual Page." SunOS Reference Manual, Sun Microsystems, 1988.

[10] "XDR Protocol Specification." Networking on the Sun Workstation, Sun Microsystems, 1986.

[11] "RPC Protocol Specification." Networking on the Sun Workstation, Sun Microsystems, 1986.

[12] "NFS Protocol Specification." Networking on the Sun Workstation, Sun Microsystems, 1986.

[13] Postel, J. "User Datagram Protocol." RFC 768, Network Information Center, 1980.

[14] Blaze, M. and Alonso, R. "Long-Term Caching Strategies for Very Large Distributed File Systems." Proc. Summer 1991 USENIX, 1991.

Matt Blaze is a Ph.D. candidate in Computer Science at Princeton University, where he expects to receive his degree in the Spring of 1992. His research interests include distributed systems, operating systems, databases, and programming environments. His current research focuses on caching in very large distributed filesystems. In 1988 he received an M.S. in Computer Science from Columbia University and in 1986 a B.S. from Hunter College. He can be reached via email at mab@cs.princeton.edu or via US mail at Dept. of Computer Science, Princeton University, 35 Olden Street, Princeton NJ 08544.