NFS Tracing By Passive Network Monitoring

Matt Blaze

Department of Computer Science
Princeton University
mab@cs.princeton.edu

ABSTRACT
Traces of filesystem activity have proven to be useful for a wide variety of purposes, ranging from quantitative analysis of system behavior to trace-driven simulation of filesystem algorithms. Such traces can be difficult to obtain, however, usually entailing modification of the filesystems to be monitored and runtime overhead for the period of the trace. Largely because of these difficulties, a surprisingly small number of filesystem traces have been conducted, and few sample workloads are available to filesystem researchers.

This paper describes a portable toolkit for deriving approximate traces of NFS [1] activity by non-intrusively monitoring the Ethernet traffic to and from the file server. The toolkit uses a promiscuous Ethernet listener interface (such as the Packetfilter [2]) to read and reconstruct NFS-related RPC packets intended for the server. It produces traces of the NFS activity as well as a plausible set of corresponding client system calls. The tool is currently in use at Princeton and other sites, and is available via anonymous ftp.
1. Motivation
Traces of real workloads form an important part of virtually all analysis of computer system behavior, whether it is program hot spots, memory access patterns, or filesystem activity that is being studied. In the case of filesystem activity, obtaining useful traces is particularly challenging. Filesystem behavior can span long time periods, often making it necessary to collect huge traces over weeks or even months. Modification of the filesystem to collect trace data is often difficult, and may result in unacceptable runtime overhead. Distributed filesystems exacerbate these difficulties, especially when the network is composed of a large number of heterogeneous machines. As a result of these difficulties, only a relatively small number of traces of Unix filesystem workloads have been conducted, primarily in computing research environments. [3], [4] and [5] are examples of such traces.

Since distributed filesystems work by transmitting their activity over a network, it would seem reasonable to obtain traces of such systems by placing a "tap" on the network and collecting trace data based on the network traffic. Ethernet [6] based networks lend themselves to this approach particularly well, since traffic is broadcast to all machines connected to a given subnetwork. A number of general-purpose network monitoring tools are available that "promiscuously" listen to the Ethernet to which they are connected; Sun's etherfind [7] is an example of such a tool. While these tools are useful for observing (and collecting statistics on) specific types of packets, the information they provide is at too low a level to be useful for building filesystem traces. Filesystem operations may span several packets, and may be meaningful only in the context of other, previous operations.

Some work has been done on characterizing the impact of NFS traffic on network load. In [8], for example, the results of a study are reported in which Ethernet traffic was monitored and statistics gathered on NFS activity. While useful for understanding traffic patterns and developing a queueing model of NFS loads, these previous studies do not use the network traffic to analyze the file access traffic patterns of the system, focusing instead on developing a statistical model of the individual packet sources, destinations, and types.

This paper describes a toolkit for collecting traces of NFS file access activity by monitoring Ethernet traffic. A "spy" machine with a promiscuous Ethernet interface is connected to the same network as the file server. Each NFS-related packet is analyzed and a trace is produced at an appropriate level of detail. The tool can record the low-level NFS calls themselves or an approximation of the user-level system calls (open, close, etc.) that triggered the activity.

We partition the problem of deriving NFS activity from raw network traffic into two fairly distinct subproblems: that of decoding the low-level NFS operations from the packets on the network, and that of translating these low-level commands back into user-level system calls. Hence, the toolkit consists of two basic parts, an "RPC decoder" (rpcspy) and the "NFS analyzer" (nfstrace). rpcspy communicates with a low-level network monitoring facility (such as Sun's NIT [9] or the Packetfilter [2]) to read and reconstruct the RPC transactions (call and reply) that make up each NFS command. nfstrace takes the output of rpcspy and reconstructs the system calls that occurred as well as other interesting data it can derive about the structure of the filesystem, such as the mappings between NFS file handles and Unix file names. Since there is not a clean one-to-one mapping between system calls and lower-level NFS commands, nfstrace uses some simple heuristics to guess a reasonable approximation of what really occurred.
1.1. A Spy's View of the NFS Protocols
It is well beyond the scope of this paper to describe the protocols used by NFS; for a detailed description of how NFS works, the reader is referred to [10], [11], and [12]. What follows is a very brief overview of how NFS activity translates into Ethernet packets.

An NFS network consists of servers, to which filesystems are physically connected, and clients, which perform operations on remote server filesystems as if the disks were locally connected. A particular machine can be a client or a server or both. Clients mount remote server filesystems in their local hierarchy just as they do local filesystems; from the user's perspective, files on NFS and local filesystems are (for the most part) indistinguishable, and can be manipulated with the usual filesystem calls.

The interface between client and server is defined in terms of 17 remote procedure call (RPC) operations. Remote files (and directories) are referred to by a file handle that uniquely identifies the file to the server. There are operations to read and write bytes of a file (read, write), obtain a file's attributes (getattr), obtain the contents of directories (lookup, readdir), create files (create), and so forth. While most of these operations are direct analogs of Unix system calls, notably absent are open and close operations; no client state information is maintained at the server, so there is no need to inform the server explicitly when a file is in use. Clients can maintain buffer cache entries for NFS files, but must verify that the blocks are still valid (by checking the last write time with the getattr operation) before using the cached data.

An RPC transaction consists of a call message (with arguments) from the client to the server and a reply message (with return data) from the server to the client. NFS RPC calls are transmitted using the UDP/IP connectionless unreliable datagram protocol [13]. The call message contains a unique transaction identifier which is included in the reply message to enable the client to match the reply with its call. The data in both messages is encoded in an "external data representation" (XDR), which provides a machine-independent standard for byte order, etc.

Note that the NFS server maintains no state information about its clients, and knows nothing about the context of each operation outside of the arguments to the operation itself.
2. The rpcspy Program
rpcspy is the interface to the system-dependent Ethernet monitoring facility; it produces a trace of the RPC calls issued between a given set of clients and servers. At present, there are versions of rpcspy for a number of BSD-derived systems, including ULTRIX (with the Packetfilter [2]), SunOS (with NIT [9]), and the IBM RT running AOS (with the Stanford enet filter).

For each RPC transaction monitored, rpcspy produces an ASCII record containing a timestamp, the name of the server, the client, the length of time the command took to execute, the name of the RPC command executed, and the command-specific arguments and return data. Currently, rpcspy understands and can decode the 17 NFS RPC commands, and there are hooks to allow other RPC services (for example, NIS) to be added reasonably easily.

The output may be read directly or piped into another program (such as nfstrace) for further analysis; the format is designed to be reasonably friendly to both the human reader and other programs (such as nfstrace or awk).
Since each RPC transaction consists of two messages, a call and a reply, rpcspy waits until it receives both of these components and emits a single record for the entire transaction. The basic output format is seven vertical-bar-separated fields:

timestamp | execution-time | server | client | command-name | arguments | reply-data
where timestamp is the time the reply message was received, execution-time is the time (in microseconds) that elapsed between the call and reply, server is the name (or IP address) of the server, client is the name (or IP address) of the client followed by the userid that issued the command, command-name is the name of the particular program invoked (read, write, getattr, etc.), and arguments and reply-data are the command-dependent arguments and return values passed to and from the RPC program, respectively.

The exact format of the argument and reply data is dependent on the specific command issued and the level of detail the user wants logged. For example, a typical NFS command is recorded as follows:
690529992.167140 | 11717 | paramount | merckx.321 | read | {"7b1f00000000083c", 0, 8192} | ok, 1871
In this example, uid 321 at client "merckx" issued an NFS read command to server "paramount". The reply was issued at (Unix time) 690529992.167140 seconds; the call command occurred 11717 microseconds earlier. Three arguments are logged for the read call: the file handle from which to read (represented as a hexadecimal string), the offset from the beginning of the file, and the number of bytes to read. In this example, 8192 bytes are requested starting at the beginning (byte 0) of the file whose handle is "7b1f00000000083c". The command completed successfully (status "ok"), and 1871 bytes were returned. Of course, the reply message also included the 1871 bytes of data from the file, but that field of the reply is not logged by rpcspy.

rpcspy has a number of configuration options to control which hosts and RPC commands are traced, which call and reply fields are printed, which Ethernet interfaces are tapped, how long to wait for reply messages, how long to run, etc. While its primary function is to provide input for the nfstrace program (see Section 3), judicious use of these options (as well as such programs as grep, awk, etc.) permits its use as a simple NFS diagnostic and performance monitoring tool. A few screens of output give a surprisingly informative snapshot of current NFS activity; using the program, we have quickly identified several problems that were otherwise difficult to pinpoint. Similarly, a short awk script can provide a breakdown of the most active clients, servers, and hosts over a sampled time period.
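As an illustration of how easily the record format lends itself to this kind of ad hoc analysis, the following is a minimal sketch of such a post-processor. It is not part of the toolkit: it is written in Python rather than awk, the field handling assumes the seven-field layout shown above, and the name summarize.py is invented for the example.

    import sys
    from collections import Counter

    def parse_record(line):
        """Split one rpcspy record into its vertical-bar-separated fields."""
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 7:
            return None                          # skip malformed or truncated records
        timestamp, exec_time, server, client, command = fields[:5]
        return {
            "timestamp": float(timestamp),
            "exec_usec": int(exec_time),
            "server": server,
            "client": client.rsplit(".", 1)[0],  # strip the trailing ".uid"
            "command": command,
        }

    def summarize(stream):
        by_client, by_command = Counter(), Counter()
        for line in stream:
            rec = parse_record(line)
            if rec:
                by_client[rec["client"]] += 1
                by_command[rec["command"]] += 1
        return by_client, by_command

    if __name__ == "__main__":
        clients, commands = summarize(sys.stdin)
        print("most active clients:", clients.most_common(5))
        print("most common commands:", commands.most_common(5))

Piping rpcspy output into a script of this sort (rpcspy ... | python summarize.py) gives the same kind of quick snapshot of current NFS activity described above.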
2.1. Implementation Issues
The basic function of rpcspy is to monitor the network, extract those packets containing NFS data, and print the data in a useful format. Since each RPC transaction consists of a call and a reply, rpcspy maintains a table of pending call packets that are removed and emitted when the matching reply arrives. In normal operation on a reasonably fast workstation, this rarely requires more than about two megabytes of memory, even on a busy network with unusually slow file servers. Should a server go down, however, the queue of pending call messages (which are never matched with a reply) can quickly become a memory hog; the user can specify a maximum size the table is allowed to reach before these "orphaned" calls are searched out and reclaimed.
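The call/reply matching can be sketched as follows. This is not the rpcspy source, just an illustration of the bookkeeping described above; the class name, table size, and age-based reclamation policy are all invented for the example.

    import time

    class PendingCallTable:
        """Match RPC reply packets to their calls by transaction id (xid)."""

        def __init__(self, max_entries=50000):
            self.calls = {}                  # xid -> (arrival time, decoded call)
            self.max_entries = max_entries

        def add_call(self, xid, call):
            self.calls[xid] = (time.time(), call)
            if len(self.calls) > self.max_entries:
                self._reclaim_orphans()

        def match_reply(self, xid, reply):
            """Return (call, reply) if the call was seen, else None."""
            entry = self.calls.pop(xid, None)
            if entry is None:
                return None                  # reply with no recorded call; ignore it
            _, call = entry
            return call, reply

        def _reclaim_orphans(self, max_age=60.0):
            # Calls that never got a reply (e.g., the server went down) are
            # discarded once they are older than max_age seconds.
            cutoff = time.time() - max_age
            for xid in [x for x, (t, _) in self.calls.items() if t < cutoff]:
                del self.calls[xid]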
File handles pose special problems. While all NFS file handles are a fixed size, the number of significant bits varies from implementation to implementation; even within a vendor, two different releases of the same operating system might use a completely different internal handle format. In most Unix implementations, the handle contains a filesystem identifier and the inode number of the file; this is sometimes augmented by additional information, such as a version number. Since programs using rpcspy output generally will use the handle as a unique file identifier, it is important that there not appear to be more than one handle for the same file. Unfortunately, it is not sufficient to simply consider the handle as a bitstring of the maximum handle size, since many operating systems do not zero out the unused extra bits before assigning the handle. Fortunately, most servers are at least consistent in the sizes of the handles they assign. rpcspy allows the user to specify (on the command line or in a startup file) the handle size for each host to be monitored. The handles from that server are emitted as hexadecimal strings truncated at that length. If no size is specified, a guess is made based on a few common formats of a reasonable size.
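A sketch of that normalization, assuming the per-server sizes are known in bytes of significant handle data (the table and the default of eight bytes here are illustrative, not rpcspy's actual configuration syntax):

    # Illustrative per-server handle sizes, in bytes of significant data.
    HANDLE_BYTES = {"paramount": 8, "basso": 8}

    def normalize_handle(server, raw_handle):
        """Emit a raw file handle as a hex string truncated to the server's size."""
        nbytes = HANDLE_BYTES.get(server, 8)   # fall back to a guessed common size
        return raw_handle[:nbytes].hex()

    # e.g. normalize_handle("paramount", handle_bytes_from_packet) -> "7b1f00000000083c"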
It is usually desirable to emit IP addresses of clients and servers as their symbolic host names. An early version of the software simply did a nameserver lookup each time this was necessary; this quickly flooded the network with a nameserver request for each NFS transaction. The current version maintains a cache of host names; this requires only a modest amount of memory for typical networks of less than a few hundred hosts. For very large networks or those where NFS service is provided to a large number of remote hosts, this could still be a potential problem, but as a last resort remote name resolution could be disabled or rpcspy configured not to translate IP addresses.
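A minimal sketch of that caching, using the standard resolver and an ordinary dictionary (the real rpcspy keeps its own table; the fallback of emitting the raw address is just one plausible choice):

    import socket

    _name_cache = {}     # IP address string -> host name (or the address itself)

    def host_name(addr):
        """Resolve an IP address at most once and remember the answer."""
        if addr not in _name_cache:
            try:
                _name_cache[addr] = socket.gethostbyaddr(addr)[0]
            except OSError:
                _name_cache[addr] = addr   # leave unresolvable addresses as-is
        return _name_cache[addr]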
UDP/IP datagrams may be fragmented among several packets if the datagram is larger than the maximum size of a single Ethernet frame. rpcspy looks only at the first fragment; in practice, fragmentation occurs only for the data fields of NFS read and write transactions, which are ignored anyway.
3. nfstrace: The Filesystem Tracing Package
Although rpcspy provides a trace of the low-level NFS commands, it is not, in and of itself, sufficient for obtaining useful filesystem traces. The low-level commands do not by themselves reveal user-level activity. Furthermore, the volume of data that would need to be recorded is potentially enormous, on the order of megabytes per hour. More useful would be an abstraction of the user-level system calls underlying the NFS activity.

nfstrace is a filter for rpcspy that produces a log of a plausible set of user-level filesystem commands that could have triggered the monitored activity. A record is produced each time a file is opened, giving a summary of what occurred. This summary is detailed enough for analysis or for use as input to a filesystem simulator.
The output format of nfstrace consists of 7 fields:

timestamp | command-time | direction | file-id | client | transferred | size
where timestamp is the time the open occurred, command-time is the length of time between open and close, direction is either read or write (mkdir and readdir count as write and read, respectively), file-id identifies the server and the file handle, client is the client and user that performed the open, transferred is the number of bytes of the file actually read or written (cache hits have a 0 in this field), and size is the size of the file (in bytes).

An example record might be as follows:
690691919.593442 | 17734 | read | basso:7b1f00000000400f | frejus.321 | 0 | 24576
Here, userid 321 at client frejus read file 7b1f00000000400f on server basso. The file is 24576 bytes long and was read entirely from the client cache. The open occurred at (Unix time) 690691919.593442 and the command took 17734 microseconds from open to close.
Since it is sometimes useful to know the name corresponding to the handle and the mode information for each file, nfstrace optionally produces a map of file handles to file names and modes. When enough information (from lookup and readdir commands) is received, new names are added. Names can change over time (as files are deleted and renamed), so the times each mapping can be considered valid are recorded as well. The mapping information may not always be complete, however, depending on how much activity has already been observed. Also, hard links can confuse the name mapping, and it is not always possible to determine which of several possible names a file was opened under.
What nfstrace produces is only an approximation of the underlying user activity. Since there are no NFS open or close commands, the program must guess when these system calls occur. It does this by taking advantage of the observation that NFS is fairly consistent in what it does when a file is opened. If the file is in the local buffer cache, a getattr call is made on the file to verify that it has not changed since the file was cached. Otherwise, the actual bytes of the file are fetched as they are read by the user. (It is possible that part of the file is in the cache and part is not, in which case the getattr is performed and only the missing pieces are fetched. This occurs most often when a demand-paged executable is loaded.) nfstrace assumes that any sequence of NFS read calls on the same file issued by the same user at the same client is part of a single open for read. The close is assumed to have taken place when the last read in the sequence completes. The end of a read sequence is detected when the same client reads the beginning of the file again or when a timeout with no reading has elapsed. Writes are handled in a similar manner.
Reads that are entirely from the client cache are a bit harder; not every getattr command is caused by a cache read, and a few cache reads take place without a getattr. A user-level stat system call can sometimes trigger a getattr, as can an ls -l command. Fortunately, the attribute caching used by most implementations of NFS seems to eliminate many of these extraneous getattrs, and ls commands appear to trigger a lookup command most of the time. nfstrace assumes that a getattr on any file that the client has read within the past few hours represents a cache read; otherwise it is ignored. This simple heuristic seems to be fairly accurate in practice. Note also that a getattr might not be performed if a read occurs very soon after the last read, but the time threshold is generally short enough that this is rarely a problem. Still, the cached reads that nfstrace reports are, at best, an estimate (generally erring on the side of over-reporting). There is no way to determine the number of bytes actually read for cache hits.
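The two heuristics just described (read sequences become inferred opens, and a getattr on a recently read file counts as a fully cached read) might be sketched roughly as follows. The state layout, the timeout values, and the function names are invented for the illustration; nfstrace's actual thresholds and output differ in detail and are configurable.

    OPEN_TIMEOUT = 120          # seconds of inactivity that end an inferred open
    CACHE_WINDOW = 3 * 3600     # a getattr this soon after a read => cache hit

    opens = {}                  # (client_user, file_id) -> state of the inferred open
    recently_read = {}          # (client_user, file_id) -> time of the last read

    def nfs_read(key, offset, nbytes, now):
        """Fold an NFS read call into an inferred user-level open for read."""
        state = opens.get(key)
        # A new open starts if there is no active sequence, the sequence has gone
        # idle, or the client starts reading the file from the beginning again.
        if (state is None or now - state["last"] > OPEN_TIMEOUT
                or (offset == 0 and state["bytes"] > 0)):
            if state:
                emit_open(key, state)        # close out the previous inferred open
            state = {"start": now, "last": now, "bytes": 0}
            opens[key] = state
        state["last"] = now
        state["bytes"] += nbytes
        recently_read[key] = now

    def nfs_getattr(key, now):
        """Treat a getattr on a recently read file as a fully cached read."""
        if now - recently_read.get(key, float("-inf")) < CACHE_WINDOW:
            emit_open(key, {"start": now, "last": now, "bytes": 0})  # 0 bytes => cache hit

    def emit_open(key, state):
        print(state["start"], state["last"] - state["start"], key, state["bytes"])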
The output of nfstrace is necessarily produced out of chronological order, but may be sorted easily by a post-processor.

nfstrace has a host of options to control the level of detail of the trace, the lengths of the timeouts, and so on. To facilitate the production of very long traces, the output can be flushed and checkpointed at a specified interval, and can be automatically compressed.
4. Using rpcspy and nfstrace for Filesystem Tracing
Clearly, nfstrace is not suitable for producing highly accurate traces; cache hits are only estimated, the timing information is imprecise, and data from lost (and duplicated) network packets are not accounted for. When such a highly accurate trace is required, other approaches, such as modification of the client and server kernels, must be employed.

The main virtue of the passive-monitoring approach lies in its simplicity. In [5], Baker et al. describe a trace of a distributed filesystem which involved low-level modification of several different operating system kernels. In contrast, our entire filesystem trace package consists of less than 5000 lines of code written by a single programmer in a few weeks, involves no kernel modifications, and can be installed to monitor multiple heterogeneous servers and clients with no knowledge of even what operating systems they are running.

The most important parameter affecting the accuracy of the traces is the ability of the machine on which rpcspy is running to keep up with the network traffic. Although most modern RISC workstations with reasonable Ethernet interfaces are able to keep up with typical network loads, it is important to determine how much information was lost due to packet buffer overruns before relying upon the trace data. It is also important that the trace be, indeed, non-intrusive. It quickly became obvious, for example, that logging the traffic to an NFS filesystem can be problematic.

Another parameter affecting the usefulness of the traces is the validity of the heuristics used to translate from RPC calls into user-level system calls. To test this, a shell script was written that performed ls -l, touch, cp and wc commands randomly in a small directory hierarchy, keeping a record of which files were touched and read and at what time. After several hours, nfstrace was able to detect 100% of the writes, 100% of the uncached reads, and 99.4% of the cached reads. Cached reads were over-reported by 11%, even though ls commands (which cause the "phantom" reads) made up 50% of the test activity. While this test provides encouraging evidence of the accuracy of the traces, it is not by itself conclusive, since the particular workload being monitored may fool nfstrace in unanticipated ways.

As in any research where data are collected about the behavior of human subjects, the privacy of the individuals observed is a concern. Although the contents of files are not logged by the toolkit, it is still possible to learn something about individual users from examining what files they read and write. At a minimum, the users of a monitored system should be informed of the nature of the trace and the uses to which it will be put. In some cases, it may be necessary to disable the name translation from nfstrace when the data are being provided to others. Commercial sites where filenames might reveal something about proprietary projects can be particularly sensitive to such concerns.
5. A Trace of Filesystem Activity in the Princeton C.S. Department
A previous paper [14] analyzed a five-day long trace of filesystem activity conducted on 112 research workstations at DEC-SRC. The paper identified a number of file access properties that affect filesystem caching performance; it is difficult, however, to know whether these properties were unique artifacts of that particular environment or are more generally applicable. To help answer that question, it is necessary to look at similar traces from other computing environments.

It was relatively easy to use rpcspy and nfstrace to conduct a week-long trace of filesystem activity in the Princeton University Computer Science Department. The departmental computing facility serves a community of approximately 250 users, of which about 65% are researchers (faculty, graduate students, undergraduate researchers, postdoctoral staff, etc.), 5% office staff, 2% systems staff, and the rest guests and other "external" users. About 115 of the users work full-time in the building and use the system heavily for electronic mail, netnews, and other such communication services as well as other computer science research oriented tasks (editing, compiling, and executing programs, formatting documents, etc.).

The computing facility consists of a central Auspex file server (fs) (to which users do not ordinarily log in directly), four DEC 5000/200s (elan, hart, atomic and dynamic) used as shared cycle servers, and an assortment of dedicated workstations (NeXT machines, Sun workstations, IBM-RTs, Iris workstations, etc.) in individual offices and laboratories. Most users log in to one of the four cycle servers via X window terminals located in offices; the terminals are divided evenly among the four servers. There are a number of Ethernets throughout the building. The central file server is connected to a "machine room network" to which no user terminals are directly connected; traffic to the file server from outside the machine room is gatewayed via a Cisco router. Each of the four cycle servers has a local /, /bin and /tmp filesystem; other filesystems, including /usr, /usr/local, and users' home directories, are NFS mounted from fs. Mail sent from local machines is delivered locally to the (shared) fs:/usr/spool/mail; mail from outside is delivered directly on fs.

The trace was conducted by connecting a dedicated DEC 5000/200 with a local disk to the machine room network. This network carries NFS traffic for all home directory access and access to all non-local cycle-server files (including most of the actively-used programs). On a typical weekday, about 8 million packets are transmitted over this network. nfstrace was configured to record opens for read and write (but not directory accesses or individual reads or writes). After one week (Wednesday to Wednesday), 342,530 opens for read and 125,542 opens for write were recorded, occupying 8 MB of (compressed) disk space. Most of this traffic was from the four cycle servers.

No attempt was made to "normalize" the workload during the trace period. Although users were notified that file accesses were being recorded, and provided an opportunity to ask to be excluded from the data collection, most users seemed to simply continue with their normal work. Similarly, no correction is made for any anomalous user activity that may have occurred during the trace.
5.1. The Workload Over Time
Intuitively, the volume of traffic can be expected to vary with the time of day. Figure 1 shows the number of reads and writes per hour over the seven days of the trace; in particular, the volume of write traffic seems to mirror the general level of departmental activity fairly closely.

An important metric of NFS performance is the client buffer cache hit rate. Each of the four cycle servers allocates approximately 6 MB of memory for the buffer cache. The (estimated) aggregate hit rate (percentage of reads served by client caches) as seen at the file server was surprisingly low: 22.2% over the entire week. In any given hour, the hit rate never exceeded 40%. Figure 2 plots (actual) server reads and (estimated) cache hits per hour over the trace week; observe that the hit rate is at its worst during periods of the heaviest read activity.
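For example, an hourly hit-rate estimate of the kind plotted in Figure 2 can be derived from the nfstrace log with a few lines of post-processing. A rough sketch in Python, assuming the seven-field record format of Section 3 with a zero transferred field marking an estimated cache hit:

    import sys
    from collections import defaultdict

    def hourly_hit_rates(stream):
        reads = defaultdict(int)    # hour bucket -> opens for read
        hits = defaultdict(int)     # hour bucket -> estimated cache hits
        for line in stream:
            fields = [f.strip() for f in line.split("|")]
            if len(fields) != 7 or fields[2] != "read":
                continue
            hour = int(float(fields[0])) // 3600    # bucket by Unix hour
            reads[hour] += 1
            if int(fields[5]) == 0:                 # transferred == 0 => cache hit
                hits[hour] += 1
        return {h: hits[h] / reads[h] for h in reads}

    if __name__ == "__main__":
        for hour, rate in sorted(hourly_hit_rates(sys.stdin).items()):
            print(hour, round(100 * rate, 1))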
Past studies have predicted much higher hit rates than the aggregate observed here. It is probable that since most of the traffic is generated by the shared cycle servers, the low hit rate can be attributed to the large number of users competing for cache space. In fact, the hit rate was observed to be much higher on the single-user workstations monitored in the study, averaging above 52% overall. This suggests, somewhat counter-intuitively, that if more computers were added to the network (such that each user had a private workstation), the server load would decrease considerably. Figure 3 shows the actual cache misses and estimated cache hits for a typical private workstation in the study.

Figure 1 - Read and Write Traffic Over Time (reads and writes per hour over the trace week, Thursday 00:00 through Wednesday 18:00; separate curves for Writes and Reads (all))
5.2. File Sharing
One property observed in the DEC-SRC trace is the tendency of files that are used by multiple workstations to make up a significant proportion of read traffic but a very small proportion of write traffic. This has important implications for a caching strategy, since, when it is true, files that are cached at many places very rarely need to be invalidated. Although the Princeton computing facility does not have a single workstation per user, a similar metric is the degree to which files read by more than one user are read and written. In this respect, the Princeton trace is very similar to the DEC-SRC trace. Files read by more than one user make up more than 60% of read traffic, but less than 2% of write traffic. Files shared by more than ten users make up less than 0.2% of write traffic but still more than 30% of read traffic. Figure 4 plots the number of users who have previously read each file against the number of reads and writes.
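One plausible way to tally this measure from a chronologically sorted nfstrace log is sketched below. It assumes the seven-field record layout of Section 3 and treats the client.uid field as the user's identity, which is only an approximation of a true per-user count.

    import sys
    from collections import defaultdict

    def sharing_profile(stream, max_n=20):
        readers = defaultdict(set)    # file-id -> distinct users who have read it
        reads = defaultdict(int)      # n prior readers -> read opens seen
        writes = defaultdict(int)     # n prior readers -> write opens seen
        for line in stream:
            fields = [f.strip() for f in line.split("|")]
            if len(fields) != 7:
                continue
            direction, file_id, user = fields[2], fields[3], fields[4]
            n = min(len(readers[file_id]), max_n)
            if direction == "read":
                reads[n] += 1
                readers[file_id].add(user)
            else:
                writes[n] += 1
        total_r = max(sum(reads.values()), 1)
        total_w = max(sum(writes.values()), 1)
        # cumulative % of read and write traffic to files previously read by > n users
        for n in range(max_n):
            r = sum(v for k, v in reads.items() if k > n)
            w = sum(v for k, v in writes.items() if k > n)
            print(n, round(100 * r / total_r, 1), round(100 * w / total_w, 1))

    if __name__ == "__main__":
        sharing_profile(sys.stdin)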
5.3. File "Entropy"
Files in the DEC-SRC trace demonstrated a strong tendency to "become" read-only as they were read more and more often. That is, the probability that the next operation on a given file will overwrite the file drops off sharply in proportion to the number of times it has been read in the past. Like the sharing property, this has implications for a caching strategy, since the probability that cached data is valid influences the choice of a validation scheme. Again, we find this property to be very strong in the Princeton trace. For any file access in the trace, the probability that it is a write is about 27%. If the file has already been read at least once since it was last written to, the write probability drops to 10%. Once the file has been read at least five times, the write probability drops below 1%. Figure 5 plots the observed write probability against the number of reads since the last write.
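A sketch of how the curve in Figure 5 can be computed from a chronologically sorted nfstrace log, again assuming the seven-field layout of Section 3 (the read/write direction in the third field, the server:handle file identifier in the fourth):

    import sys
    from collections import defaultdict

    def write_probability(stream, max_n=20):
        reads_since_write = defaultdict(int)   # file-id -> reads since last write
        ops = defaultdict(int)                 # n -> accesses seen with n prior reads
        writes = defaultdict(int)              # n -> how many of those were writes
        for line in stream:
            fields = [f.strip() for f in line.split("|")]
            if len(fields) != 7:
                continue
            direction, file_id = fields[2], fields[3]
            n = min(reads_since_write[file_id], max_n)
            ops[n] += 1
            if direction == "write":
                writes[n] += 1
                reads_since_write[file_id] = 0
            else:
                reads_since_write[file_id] += 1
        return {n: writes[n] / ops[n] for n in ops}

    if __name__ == "__main__":
        for n, p in sorted(write_probability(sys.stdin).items()):
            print(n, round(p, 3))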
Figure 2 - Cache Hits and Misses Over Time (total reads per hour over the trace week; curves for estimated cache hits and actual cache misses)
6. Conclusions
Although filesystem traces are a useful tool for the analysis of current and proposed systems, the difficulty of collecting meaningful trace data makes such traces difficult to obtain. The performance degradation introduced by the trace software and the volume of raw data generated make traces over long time periods and outside of computing research facilities particularly hard to conduct.

Although not as accurate as direct, kernel-based tracing, a passive network monitor such as the one described in this paper can permit tracing of distributed systems relatively easily. The ability to limit the data collected to a high-level log of only the data required can make it practical to conduct traces over several months. Such a long-term trace is presently being conducted at Princeton as part of the author's research on filesystem caching. The non-intrusive nature of the data collection makes traces possible at facilities where kernel modification is impractical or unacceptable.

It is the author's hope that other sites (particularly those not doing computing research) will make use of this toolkit and will make the traces available to filesystem researchers.
7. Availability
The toolkit, consisting of rpcspy, nfstrace, and several support scripts, currently runs under several BSD-derived platforms, including ULTRIX 4.x, SunOS 4.x, and IBM-RT/AOS. It is available for anonymous ftp over the Internet from samadams.princeton.edu, in the compressed tar file nfstrace/nfstrace.tar.Z.
Figure 3 - Cache Hits and Misses Over Time - Private Workstation (reads per hour; curves for estimated cache hits and actual cache misses)

Figure 4 - Degree of Sharing for Reads and Writes (percentage of reads and writes to files used by more than n users, for n from 0 to 20)

Figure 5 - Probability of Write Given >= n Previous Reads (probability that the next operation is a write versus reads since the last write, for 0 to 20 reads)
8. Acknowledgments
The author would like to gratefully acknowledge Jim Roberts and Steve Beck for their help in getting the trace machine up and running, Rafael Alonso for his helpful comments and direction, and the members of the program committee for their valuable suggestions. Jim Plank deserves special thanks for writing jgraph, the software which produced the figures in this paper.
9. References
[1] Sandberg, R., Goldberg, D., Kleiman, S., Walsh, D., & Lyon, B. "Design and Implementation of the Sun Network File System." Proc. USENIX, Summer, 1985.

[2] Mogul, J., Rashid, R., & Accetta, M. "The Packet Filter: An Efficient Mechanism for User-Level Network Code." Proc. 11th ACM Symp. on Operating Systems Principles, 1987.

[3] Ousterhout, J., et al. "A Trace-Driven Analysis of the Unix 4.2 BSD File System." Proc. 10th ACM Symp. on Operating Systems Principles, 1985.

[4] Floyd, R. "Short-Term File Reference Patterns in a UNIX Environment." TR-177, Dept. Comp. Sci., U. of Rochester, 1986.

[5] Baker, M., et al. "Measurements of a Distributed File System." Proc. 13th ACM Symp. on Operating Systems Principles, 1991.

[6] Metcalfe, R. & Boggs, D. "Ethernet: Distributed Packet Switching for Local Computer Networks." CACM, July, 1976.

[7] "Etherfind(8) Manual Page." SunOS Reference Manual, Sun Microsystems, 1988.

[8] Gusella, R. "Analysis of Diskless Workstation Traffic on an Ethernet." TR-UCB/CSD-87/379, University of California, Berkeley, 1987.

[9] "NIT(4) Manual Page." SunOS Reference Manual, Sun Microsystems, 1988.

[10] "XDR Protocol Specification." Networking on the Sun Workstation, Sun Microsystems, 1986.

[11] "RPC Protocol Specification." Networking on the Sun Workstation, Sun Microsystems, 1986.

[12] "NFS Protocol Specification." Networking on the Sun Workstation, Sun Microsystems, 1986.

[13] Postel, J. "User Datagram Protocol." RFC 768, Network Information Center, 1980.

[14] Blaze, M. and Alonso, R. "Long-Term Caching Strategies for Very Large Distributed File Systems." Proc. Summer 1991 USENIX, 1991.
Matt Blaze is a Ph.D. candidate in Computer Science at Princeton University, where he expects to receive his degree in the Spring of 1992. His research interests include distributed systems, operating systems, databases, and programming environments. His current research focuses on caching in very large distributed filesystems. In 1988 he received an M.S. in Computer Science from Columbia University and in 1986 a B.S. from Hunter College. He can be reached via email at mab@cs.princeton.edu or via US mail at Dept. of Computer Science, Princeton University, 35 Olden Street, Princeton NJ 08544.