Archive-name:tcp-ip/FAQ
|
|
Last-modified: 1996/4/1
|
|
|
|
Internet Protocol Frequently Asked Questions
|
|
|
|
Maintained by: George V. Neville-Neil (gnn@wrs.com)
|
|
Contributions from:
|
|
Ran Atkinson
|
|
Mark Bergman
|
|
Stephane Bortzmeyer
|
|
Rodney Brown
|
|
Dr. Charles E. Campbell Jr.
|
|
Phill Conrad
|
|
Alan Cox
|
|
Rick Jones
|
|
Jon Kay
|
|
Jay Kreibrich
|
|
William Manning
|
|
Barry Margolin
|
|
Jim Muchow
|
|
Subu Rama
|
|
W. Richard Stevens
|
|
|
|
Version 3.3
|
|
|
|
|
|
************************************************************************
|
|
|
|
The following is a list of Frequently Asked Questions, and
|
|
their answers, for people interested in the Internet Protocols,
|
|
including TCP, UDP, ICMP and others. Please send all additions,
|
|
corrections, complaints and kudos to the above address. This FAQ will
|
|
be posted on or about the first of every month.
|
|
|
|
This FAQ is available for anonymous ftp from:
ftp.netcom.com:/pub/gnn/tcp-ip.faq . You may get it from my home page at
ftp://ftp.netcom.com/pub/gnn/gnn.html
You can read the FAQ in HTML format on Netcom or from the mirror
site http://web.cnam.fr/Network/TCP-IP/tcp-ip.html
|
|
|
|
************************************************************************
|
|
|
|
Table of Contents:
|
|
Glossary
|
|
1) Are there any good books on IP?
|
|
2) Where can I find example source code for TCP/UDP/IP?
|
|
3) Are there any public domain programs to check the performance of an
|
|
IP link?
|
|
4) Where do I find RFCs?
|
|
5) How can I detect that the other end of a TCP connection has
|
|
crashed? Can I use "keepalives" for this?
|
|
6) Can the keepalive timeouts be configured?
|
|
7) Can I set up a gateway to the Internet that translates IP
|
|
addresses, so that I don't have to change all our internal addresses
|
|
to an official network?
|
|
8) Are there object-oriented network programming tools?
|
|
9) What other FAQs are related to this one?
|
|
10) What newsgroups contain information on networks/protocols?
|
|
11) Van Jacobson explains TCP congestion avoidance.
|
|
12) Can I use a single bit subnet?
|
|
|
|
Glossary:
|
|
|
|
I felt this should be first given the plethora of acronyms used in the
|
|
rest of this FAQ.
|
|
|
|
IP: Internet Protocol. The lowest layer protocol defined in TCP/IP.
|
|
This is the base layer on which all other protocols mentioned herein
|
|
are built. IP is often referred to as TCP/IP as well.
|
|
|
|
UDP: User Datagram Protocol. This is a connectionless protocol layered
on top of IP. It does not provide any guarantees on the ordering or
delivery of messages.
|
|
|
|
TCP: Transmission Control Protocol. TCP is a connection oriented
|
|
protocol that guarantees that messages are delivered in the order in
|
|
which they were sent and that all messages are delivered. If a TCP
|
|
connection cannot deliver a message it closes the connection and
|
|
informs the entity that created it. This protocol is layered on top
|
|
of IP.
|
|
|
|
ICMP: Internet Control Message Protocol. ICMP is used for
|
|
diagnostics in the network. The Unix program, ping, uses ICMP
|
|
messages to detect the status of other hosts in the net. ICMP
|
|
messages can either be queries (in the case of ping) or error reports,
|
|
such as when a network is unreachable.
|
|
|
|
RFC: Request For Comment. RFCs are documents that define the
|
|
protocols used in the IP Internet. Some are only suggestions, some
|
|
are even jokes, and others are published standards. Several sites in
|
|
the Internet store RFCs and make them available for anonymous ftp.
|
|
|
|
SLIP: Serial Line IP. An implementation of IP for use over a serial
|
|
link (modem). CSLIP is an optimized (compressed) version of SLIP that
|
|
gives better throughput.
|
|
|
|
Bandwidth: The amount of data that can be pushed through a link in
|
|
unit time. Usually measured in bits or bytes per second.
|
|
|
|
Latency: The amount of time that a message spends in a network going
|
|
from point A to point B.
|
|
|
|
Jitter: The effect seen when latency is not constant, that is, when
messages experience different latencies between the same two points in
a network.
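
As a rough illustration of how latency and jitter differ, the sketch
below computes the mean latency of a set of samples and their spread.
The function names are hypothetical, and taking jitter as "slowest
sample minus fastest sample" is an assumption made for this example;
other definitions (such as mean deviation) are also in use.

```c
#include <assert.h>
#include <stddef.h>

/* Mean of n latency samples, in milliseconds. Returns 0 if n == 0. */
double mean_latency_ms(const double *samples, size_t n)
{
    double sum = 0.0;
    size_t i;

    assert(samples != NULL || n == 0);
    if (n == 0)
        return 0.0;
    for (i = 0; i < n; i++)
        sum += samples[i];
    return sum / (double)n;
}

/* Jitter, taken here (an assumption) as the spread between the
 * slowest and the fastest sample. Returns 0 if n == 0. */
double jitter_ms(const double *samples, size_t n)
{
    double lo, hi;
    size_t i;

    assert(samples != NULL || n == 0);
    if (n == 0)
        return 0.0;
    lo = hi = samples[0];
    for (i = 1; i < n; i++) {
        if (samples[i] < lo)
            lo = samples[i];
        if (samples[i] > hi)
            hi = samples[i];
    }
    return hi - lo;
}
```

A link whose samples are all equal has zero jitter by this measure,
however large its latency is.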
|
|
|
|
RPC: Remote Procedure Call. RPC is a method of making network access
to resources transparent to the application programmer by supplying a
"stub" routine that is called in the same way as a regular procedure
call. The stub actually performs the call across the network to
another computer.
|
|
|
|
Marshalling: The process of taking arbitrary data (characters,
|
|
integers, structures) and packing them up for transmission across a
|
|
network.
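
A minimal sketch of marshalling, assuming a made-up two-field record:
the fields are written byte by byte in network (big-endian) order so
that the receiver can rebuild them regardless of its own byte order.
The structure, the function names and the 6-byte layout are all
hypothetical and are not taken from any particular protocol.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record used only for this example. */
struct sample_msg {
    uint32_t id;
    uint16_t port;
};

#define SAMPLE_MSG_WIRE_SIZE 6  /* 4 bytes of id + 2 bytes of port */

/* Pack msg into buf in network (big-endian) byte order.
 * Returns the number of bytes written, or 0 if buf is too small. */
size_t marshal_sample_msg(const struct sample_msg *msg,
                          unsigned char *buf, size_t buflen)
{
    assert(msg != NULL && buf != NULL);
    if (buflen < SAMPLE_MSG_WIRE_SIZE)
        return 0;
    buf[0] = (unsigned char)(msg->id >> 24);
    buf[1] = (unsigned char)(msg->id >> 16);
    buf[2] = (unsigned char)(msg->id >> 8);
    buf[3] = (unsigned char)(msg->id);
    buf[4] = (unsigned char)(msg->port >> 8);
    buf[5] = (unsigned char)(msg->port);
    return SAMPLE_MSG_WIRE_SIZE;
}

/* Reverse of marshal_sample_msg.
 * Returns 1 on success, 0 if the buffer is too short. */
int unmarshal_sample_msg(const unsigned char *buf, size_t buflen,
                         struct sample_msg *msg)
{
    assert(msg != NULL && buf != NULL);
    if (buflen < SAMPLE_MSG_WIRE_SIZE)
        return 0;
    msg->id = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
              ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];
    msg->port = (uint16_t)(((unsigned)buf[4] << 8) | buf[5]);
    return 1;
}
```

Copying the structure to the wire with memcpy would be shorter, but
the result would depend on the sender's byte order and padding.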
|
|
|
|
MBONE: A virtual network that is a Multicast backBONE. It is still a
research prototype, but it extends through most of the core of the
Internet (including North America, Europe, and Australia). It uses IP
Multicasting, which is defined in RFC-1112. An MBONE FAQ is available
via anonymous ftp from ftp.isi.edu. There are frequent broadcasts of
multimedia programs (audio and low bandwidth video) over the MBONE.
Though the MBONE is used for multicasting, the long haul parts of the
MBONE use point-to-point connections through unicast tunnels to
connect the various multicast networks worldwide.
|
|
|
|
|
|
1) Are there any good books on IP?
|
|
|
|
A) Yes. Please see the following:
|
|
|
|
Internetworking with TCP/IP Volume I
|
|
(Principles, Protocols, and Architecture)
|
|
Douglas E. Comer
|
|
Prentice Hall 1991 ISBN 0-13-468505-9
|
|
|
|
This volume covers all of the protocols, including IP, UDP, TCP, and
|
|
the gateway protocols. It also includes discussions of higher level
|
|
protocols such as FTP, TELNET, and NFS.
|
|
|
|
Internetworking with TCP/IP Volume II
|
|
(Design, Implementation, and Internals)
|
|
Douglas E. Comer / David L. Stevens
|
|
Prentice Hall 1991 ISBN 0-13-472242-6
|
|
|
|
Discusses the implementation of the protocols and gives numerous code
|
|
examples.
|
|
|
|
Internetworking with TCP/IP Volume III (BSD Socket Version)
|
|
(Client - Server Programming and Applications)
|
|
Douglas E. Comer / David L. Stevens
|
|
Prentice Hall 1993 ISBN 0-13-474222-2
|
|
|
|
This book discusses programming applications that use the internet
|
|
protocols. It includes examples of telnet, ftp clients and servers.
|
|
Discusses RPC and XDR at length.
|
|
|
|
TCP/IP Illustrated, Volume 1: The Protocols,
|
|
W. Richard Stevens
|
|
(c) Addison-Wesley, 1994 ISBN 0-201-63346-9
|
|
|
|
An excellent introduction to the entire TCP/IP protocol suite,
|
|
covering all the major protocols, plus several important applications.
|
|
|
|
"TCP/IP Illustrated, Volume 2: The Implementation",
|
|
by Gary R. Wright and W. Richard Stevens
|
|
(c) Addison-Wesley, 1995
|
|
ISBN 0-201-63354-X
|
|
|
|
This is a complete, and lengthy, discussion of the internals of TCP/IP
based on the Net/2 release of BSD.
|
|
|
|
Unix Network Programming
|
|
W. Richard Stevens
|
|
Prentice Hall 1990 ISBN 0-13-949876
|
|
|
|
An excellent introduction to network programming under Unix.
|
|
|
|
The Design and Implementation of the 4.3 BSD Operating System
|
|
Samuel J. Leffler, Marshall Kirk McKusick, Michael J. Karels, John S.
|
|
Quarterman
|
|
Addison-Wesley 1989 ISBN 0-201-06196-1
|
|
|
|
Though this book is a reference for the entire operating system, the
|
|
eleventh and twelfth chapters completely explain how the networking
|
|
protocols are implemented in the kernel.
|
|
|
|
Stevens, W. Richard, Unix Network Programming. 1990, Prentice-Hall.
|
|
|
|
An excellent introduction to network programming under Unix. Widely
cited on the Usenet bulletin boards as the "best place to start" if you
want to actually learn how to write Unix programs that communicate over
a network.
|
|
|
|
Rago, Steven A. Unix System V. Network Programming. 1993, Addison-Wesley.
|
|
|
|
A book that covers the same kinds of topics as W. Richard Stevens' Unix
Network Programming, but is more specific to Unix System V Release 4
(SVR4), and so perhaps is more useful and up to date if you are
working specifically with that implementation. (Stevens' book covers
Unix System V release 3.x.) Rago's book has much more extensive coverage
of Streams: four chapters, where Stevens provides only a couple of
subsections. The design project at the end of the book is an
implementation of SLIP.
|
|
|
|
|
|
2) Where can I find example source code for TCP/UDP/IP?
|
|
|
|
A) Code from the Internetworking with TCP/IP Volume III is available
|
|
for anonymous ftp from:
|
|
|
|
arthur.cs.purdue.edu:/pub/dls
|
|
|
|
Code used in the Net-2 version of Berkeley Unix is available for
|
|
anonymous ftp from:
|
|
|
|
ftp.uu.net:systems/unix/bsd-sources/sys/netinet
|
|
|
|
and
|
|
|
|
gatekeeper.dec.com:/pub/BSD/net2/sys/netinet
|
|
|
|
Code from Richard Stevens' book is available on:
ftp.uu.net:/published/books/stevens.*
|
|
|
|
Example source code and libraries that make coding quicker are available
in the Simple Sockets Library written at NASA. The Simple Sockets
Library makes sockets easy to use, and it comes as source code. It
has been tested on: Unix (SGI, DecStation, AIX, Sun 3, Sparcstation;
version 2.02+: Solaris 2.1, SCO), VMS, and MSDOS (client only, since
MSDOS has no background processes). It sits atop Berkeley sockets and
tcp/ip.
|
|
|
|
You can order the "Simple Sockets Library" from
|
|
|
|
Austin Code Works
|
|
11100 Leafwood Lane
|
|
Austin, TX 78750-3464 USA
|
|
Phone (512) 258-0785
|
|
|
|
Ask for the "SSL - The Simple Sockets Library". Last I checked, they
|
|
were asking $20 US for it.
|
|
|
|
|
|
For DOS there is WATTCP.ZIP (numerous sites):
|
|
|
|
WATTCP is a DOS TCP/IP stack derived from the NCSA Telnet program and
|
|
much enhanced. It comes with some example programs and complete source
|
|
code. The interface isn't BSD sockets but is well suited to PC type
|
|
work. It is also written so that it can be used and memory
|
|
allocation).
|
|
|
|
3) Are there any public domain programs to check the performance of
|
|
an IP link?
|
|
|
|
A)
|
|
|
|
TTCP: Available for anonymous ftp from....
|
|
|
|
wuarchive.wustl.edu:/graphics/graphics/mirrors/sgi.com/sgi/src/ttcp
|
|
|
|
On ftp.sgi.com are netperf (from Rick Jones at HP) and nettest
(from Dave Borman at Cray). ttcp is also available at ftp.sgi.com.
|
|
|
|
You can get to the NetPerf home page via:
|
|
|
|
http://www.cup.hp.com/netperf/NetperfPage.html
|
|
|
|
|
|
There is a suite of bandwidth measuring programs from gnn@netcom.com,
available for anonymous ftp from ftp.netcom.com in
~ftp/gnn/bwmeas-0.3.tar.Z . These are several programs that measure
bandwidth and jitter over several kinds of IPC links, including TCP
and UDP.
|
|
|
|
|
|
4) Where do I find RFCs?
|
|
|
|
A) This is the latest info on obtaining RFCs:
|
|
Details on obtaining RFCs via FTP or EMAIL may be obtained by sending
|
|
an EMAIL message to rfc-info@ISI.EDU with the message body
|
|
help: ways_to_get_rfcs. For example:
|
|
|
|
To: rfc-info@ISI.EDU
|
|
Subject: getting rfcs
|
|
|
|
help: ways_to_get_rfcs
|
|
|
|
The response to this mail query is quite long and has been omitted.
|
|
|
|
RFCs can be obtained via FTP from DS.INTERNIC.NET, NIS.NSF.NET,
|
|
NISC.JVNC.NET, FTP.ISI.EDU, WUARCHIVE.WUSTL.EDU, SRC.DOC.IC.AC.UK,
|
|
FTP.CONCERT.NET, or FTP.SESQUI.NET.
|
|
|
|
|
|
Using Web, WAIS, and gopher:
|
|
|
|
Web:
|
|
|
|
http://web.nexor.co.uk/rfc-index/rfc-index-search-form.html
|
|
|
|
WAIS access by keyword:
|
|
|
|
wais://wais.cnam.fr/RFC
|
|
|
|
Excellent presentation with a full-text search too:
|
|
|
|
http://www.cis.ohio-state.edu/hypertext/information/rfc.html
|
|
|
|
With Gopher:
|
|
|
|
gopher://r2d2.jvnc.net/11/Internet%20Resources/RFC
|
|
gopher://muspin.gsfc.nasa.gov:4320/1g2go4%20ds.internic.net%2070%201%201/.ds/.internetdocs
|
|
|
|
|
|
|
|
5) How can I detect that the other end of a TCP connection has crashed?
|
|
Can I use "keepalives" for this?
|
|
|
|
A) Detecting crashed systems over TCP/IP is difficult. TCP doesn't require
any transmission over a connection if the application isn't sending
anything, and many of the media over which TCP/IP is used (e.g. Ethernet)
don't provide a reliable way to determine whether a particular host is up.
If a server doesn't hear from a client, it could be because the client has
nothing to say, some network between the server and client may be down, the
server's or client's network interface may be disconnected, or the client
may have crashed. Network failures are often temporary (a thin Ethernet
will appear down while someone is adding a link to the daisy chain, and it
often takes a few minutes for new routes to stabilize when a router goes
down), and TCP connections shouldn't be dropped as a result.
|
|
|
|
Keepalives are a feature of the sockets API that requests that an empty
|
|
packet be sent periodically over an idle connection; this should evoke an
|
|
acknowledgement from the remote system if it is still up, a reset if it has
|
|
rebooted, and a timeout if it is down. These are not normally sent until
|
|
the connection has been idle for a few hours. The purpose isn't to detect
|
|
a crash immediately, but to keep unnecessary resources from being allocated
|
|
forever.
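
As a minimal sketch, keepalives are requested per socket with the
SO_KEEPALIVE option of the sockets API. The helper names below are
hypothetical; setsockopt(), getsockopt() and SO_KEEPALIVE are the
standard calls. The probe timing is left at whatever the system
default is.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Ask the TCP implementation to probe this connection when it is idle.
 * fd must be a socket descriptor. Returns 0 on success, -1 on error
 * (errno is set by setsockopt). */
int enable_keepalive(int fd)
{
    int on = 1;

    assert(fd >= 0);
    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on);
}

/* Report whether keepalive is enabled: 1 yes, 0 no, -1 on error. */
int keepalive_enabled(int fd)
{
    int val = 0;
    socklen_t len = sizeof val;

    assert(fd >= 0);
    if (getsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &val, &len) < 0)
        return -1;
    return val != 0;
}
```

A server would typically call enable_keepalive() on each descriptor
returned by accept(); a later read on a dead connection then fails
with an error instead of blocking forever.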
|
|
|
|
If more rapid detection of remote failures is required, this should be
|
|
implemented in the application protocol. There is no standard mechanism
|
|
for this, but an example is requiring clients to send a "no-op" message
|
|
every minute or two. An example protocol that uses this is X Display
|
|
Manager Control Protocol (XDMCP), part of the X Window System, Version 11;
|
|
the XDM server managing a session periodically sends a Sync command to the
|
|
display server, which should evoke an application-level response, and
|
|
resets the session if it doesn't get a response (this is actually an
|
|
example of a poor implementation, as a timeout can occur if another client
|
|
"grabs" the server for too long).
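
The application-level approach can be sketched as follows, assuming
the server records the arrival time of every message (including
no-ops) and gives up on a client after a few missed intervals. The
structure and function names are hypothetical, and the tolerance of
several missed no-ops is a design choice for this example, not part
of any standard.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical per-client bookkeeping for this example. */
struct peer_state {
    time_t last_heard;  /* when any message, including a no-op, last arrived */
};

/* Record that something arrived from the peer at time now. */
void peer_heard(struct peer_state *p, time_t now)
{
    assert(p != NULL);
    p->last_heard = now;
}

/* Return 1 if the peer has been silent for more than
 * noop_interval * missed_allowed seconds, else 0. */
int peer_timed_out(const struct peer_state *p, time_t now,
                   int noop_interval, int missed_allowed)
{
    assert(p != NULL && noop_interval > 0 && missed_allowed > 0);
    return difftime(now, p->last_heard) >
           (double)noop_interval * (double)missed_allowed;
}
```

Allowing more than one missed no-op keeps a brief network outage from
being mistaken for a crash.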
|
|
|
|
6) Can the keepalive timeouts be configured?
|
|
|
|
A) This varies by operating system. There is a program that works on
|
|
many Unices (though not Linux or Solaris), called netconfig, that
|
|
allows one to do this and documents many of the variables. It is
|
|
available by anonymous FTP from
|
|
|
|
cs.ucsd.edu:pub/csl/Netconfig/netconfig2.2.tar.Z
|
|
|
|
In addition, Richard Stevens' TCP/IP Illustrated, Volume 1 includes a
|
|
good discussion of setting the most useful variables on many
|
|
platforms.
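
On some systems the keepalive timing can also be set per socket.
The sketch below assumes a platform whose <netinet/tcp.h> defines
TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT (Linux does; many other
systems do not, or use different names), and reports "not available"
elsewhere. The function name and return convention are hypothetical.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Set per-socket keepalive timing where the platform supports it.
 *   idle_secs:  idle time before the first probe is sent
 *   intvl_secs: time between unanswered probes
 *   count:      unanswered probes before the connection is dropped
 * Returns 0 on success, -1 on error, 1 if the options do not exist
 * on this platform. */
int set_keepalive_timing(int fd, int idle_secs, int intvl_secs, int count)
{
    assert(idle_secs > 0 && intvl_secs > 0 && count > 0);
#if defined(TCP_KEEPIDLE) && defined(TCP_KEEPINTVL) && defined(TCP_KEEPCNT)
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
                   &idle_secs, sizeof idle_secs) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
                   &intvl_secs, sizeof intvl_secs) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,
                   &count, sizeof count) < 0)
        return -1;
    return 0;
#else
    (void)fd;
    return 1;
#endif
}
```

These options only change the timing; SO_KEEPALIVE must still be
turned on for the socket before any probes are sent.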
|
|
|
|
|
|
7) Can I set up a gateway to the Internet that translates IP addresses, so
|
|
that I don't have to change all our internal addresses to an official
|
|
network?
|
|
|
|
A) There's no general solution to this. Many protocols include IP
|
|
addresses in the application-level data (FTP's "PORT" command is the most
|
|
notable), so it isn't simply a matter of translating addresses in the IP
|
|
header. Also, if the network number(s) you're using match those assigned
|
|
to another organization, your gateway won't be able to communicate with
|
|
that organization (RFC 1597 proposes network numbers that are reserved for
|
|
private use, to avoid such conflicts, but if you're already using a
|
|
different network number this won't help you).
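
To illustrate why header translation alone is not enough, the sketch
below parses the argument of an FTP PORT command, which carries the
sender's IP address and port as six comma-separated decimal bytes
("h1,h2,h3,h4,p1,p2"). A translating gateway would have to find and
rewrite this text inside the TCP data stream. The function name is
hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Parse the argument of an FTP PORT command, "h1,h2,h3,h4,p1,p2".
 * On success stores the four address bytes in addr[] and the port
 * number in *port and returns 1; returns 0 if the text is malformed. */
int parse_ftp_port_arg(const char *arg, unsigned char addr[4],
                       unsigned int *port)
{
    unsigned int v[6];
    int i;

    assert(arg != NULL && addr != NULL && port != NULL);
    if (sscanf(arg, "%u,%u,%u,%u,%u,%u",
               &v[0], &v[1], &v[2], &v[3], &v[4], &v[5]) != 6)
        return 0;
    for (i = 0; i < 6; i++)
        if (v[i] > 255)
            return 0;
    for (i = 0; i < 4; i++)
        addr[i] = (unsigned char)v[i];
    *port = v[4] * 256 + v[5];  /* p1 is the high byte, p2 the low byte */
    return 1;
}
```

For example, "PORT 10,0,0,5,4,1" announces address 10.0.0.5 and port
1025; if the gateway changes only the IP header, the remote server
still tries to connect back to 10.0.0.5.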
|
|
|
|
However, if you're willing to live with limited access to the Internet from
|
|
internal hosts, the "proxy" servers developed for firewalls can be used as
|
|
a substitute for an address-translating gateway. See the firewall FAQ.
|
|
|
|
8) Are there object-oriented network programming tools?
|
|
|
|
A) Yes, and one such system is called ACE (ADAPTIVE Communication
|
|
Environment). Here is how to get more information and the software:
|
|
|
|
OBTAINING ACE
|
|
|
|
An HTML version of this README file is available at URL
|
|
http://www.cs.wustl.edu/~schmidt/ACE.html. All software and
|
|
documentation is available via both anonymous ftp and the Web.
|
|
|
|
ACE is available for anonymous ftp from the ics.uci.edu (128.195.1.1)
host in the gnu/C++_wrappers.tar.Z file (approximately .5 meg
compressed). This release contains the source code,
documentation, and example test drivers for C++ wrapper libraries.
|
|
|
|
9) What other FAQs might you want to look in?
|
|
comp.protocols.tcp-ip.ibmpc
|
|
Aboba, Bernard D.(1994) "comp.protocols.tcp-ip.ibmpc Frequently
|
|
Asked Questions (FAQ)" Usenet news.answers, available via
|
|
file://ftp.netcom.com/pub/ma/mailcom/IBMTCP/ibmtcp.zip,
|
|
57 pages.
|
|
|
|
comp.protocols.ppp
|
|
Archive-name: ppp-faq/part[1-8]
|
|
URL: http://cs.uni-bonn.de/ppp/part[1-8].html
|
|
|
|
comp.dcom.lans.ethernet
|
|
ftp site: dorm.rutgers.edu, pub/novell/DOCS
|
|
Ethernet Network Questions and Answers
|
|
Summarized from UseNet group comp.dcom.lans.ethernet
|
|
|
|
10) What other newsgroups deal with networking?
|
|
|
|
comp.dcom.cabling                       Cabling selection, installation and use.
comp.dcom.isdn                          The Integrated Services Digital Network
                                        (ISDN).
comp.dcom.lans.ethernet                 Discussions of the Ethernet/IEEE 802.3
                                        protocols.
comp.dcom.lans.fddi                     Discussions of the FDDI protocol suite.
comp.dcom.lans.misc                     Local area network hardware and
                                        software.
comp.dcom.lans.token-ring               Installing and using token ring
                                        networks.
comp.dcom.servers                       Selecting and operating data
                                        communications servers.
comp.dcom.sys.cisco                     Info on Cisco routers and bridges.
comp.dcom.sys.wellfleet                 Wellfleet bridge & router systems
                                        hardware & software.
comp.protocols.ibm                      Networking with IBM mainframes.
comp.protocols.iso                      The ISO protocol stack.
comp.protocols.kerberos                 The Kerberos authentication server.
comp.protocols.misc                     Various forms and types of protocol.
comp.protocols.nfs                      Discussion about the Network File System
                                        protocol.
comp.protocols.ppp                      Discussion of the Internet Point to
                                        Point Protocol.
comp.protocols.smb                      SMB file sharing protocol and Samba SMB
                                        server/client.
comp.protocols.tcp-ip                   TCP and IP network protocols.
comp.protocols.tcp-ip.ibmpc             TCP/IP for IBM(-like) personal
                                        computers.
|
|
comp.security.misc                      Security issues of computers and
                                        networks.
|
|
comp.os.ms-windows.networking.misc      Windows and other networks.
comp.os.ms-windows.networking.tcp-ip    Windows and TCP/IP networking.
comp.os.ms-windows.networking.windows   Windows' built-in networking.
comp.os.os2.networking.misc             Miscellaneous networking issues of OS/2.
comp.os.os2.networking.tcp-ip           TCP/IP under OS/2.
comp.sys.novell                         Discussion of Novell Netware products.
|
|
|
|
11) Van Jacobson explains TCP congestion avoidance.
|
|
|
|
I've attached Van J's original posting on it (I seem to repost this every
|
|
6 months or so). If you want to see some real examples of this in action,
|
|
take a look at Chapter 21 of my "TCP/IP Illustrated, Volume 1".
|
|
|
|
Rich Stevens
|
|
|
|
---------------------------------------------------------------------------
|
|
>From van@helios.ee.lbl.gov Mon Apr 30 01:44:05 1990
|
|
To: end2end-interest@ISI.EDU
|
|
Subject: modified TCP congestion avoidance algorithm
|
|
Date: Mon, 30 Apr 90 01:40:59 PDT
|
|
From: Van Jacobson <van@helios.ee.lbl.gov>
|
|
Status: RO
|
|
|
|
This is a description of the modified TCP congestion avoidance
|
|
algorithm that I promised at the teleconference.
|
|
|
|
BTW, on re-reading, I noticed there were several errors in
|
|
Lixia's note besides the problem I noted at the teleconference.
|
|
I don't know whether that's because I mis-communicated the
|
|
algorithm at dinner (as I recall, I'd had some wine) or because
|
|
she's convinced that TCP is ultimately irrelevant :). Either
|
|
way, you will probably be disappointed if you experiment with
|
|
what's in that note.
|
|
|
|
First, I should point out once again that there are two
|
|
completely independent window adjustment algorithms running in
|
|
the sender: Slow-start is run when the pipe is empty (i.e.,
|
|
when first starting or re-starting after a timeout). Its goal
|
|
is to get the "ack clock" started so packets will be metered
|
|
into the network at a reasonable rate. The other algorithm,
|
|
congestion avoidance, is run any time *but* when (re-)starting
|
|
and is responsible for estimating the (dynamically varying)
|
|
pipesize. You will cause yourself, or me, no end of confusion
|
|
if you lump these separate algorithms (as Lixia's message did).
|
|
|
|
The modifications described here are only to the congestion
|
|
avoidance algorithm, not to slow-start, and they are intended to
|
|
apply to large bandwidth-delay product paths (though they don't
|
|
do any harm on other paths). Remember that with regular TCP (or
|
|
with slow-start/c-a TCP), throughput really starts to go to hell
|
|
when the probability of packet loss is on the order of the
|
|
bandwidth-delay product. E.g., you might expect a 1% packet
|
|
loss rate to translate into a 1% lower throughput but for, say,
|
|
a TCP connection with a 100 packet b-d p. (= window), it results
|
|
in a 50-75% throughput loss. To make TCP effective on fat
pipes, it would be nice if throughput degraded only as a function
of loss probability rather than as the product of the loss
probability and the b-d p. (Assuming, of course, that we can do
this without sacrificing congestion avoidance.)
|
|
|
|
These mods do two things: (1) prevent the pipe from going empty
|
|
after a loss (if the pipe doesn't go empty, you won't have to
|
|
waste round-trip times re-filling it) and (2) correctly account
|
|
for the amount of data actually in the pipe (since that's what
|
|
congestion avoidance is supposed to be estimating and adapting to).
|
|
|
|
For (1), remember that we use a packet loss as a signal that the
|
|
pipe is overfull (congested) and that packet loss can be
|
|
detected one of two different ways: (a) via a retransmit
|
|
timeout or (b) when some small number (3-4) of consecutive
|
|
duplicate acks has been received (the "fast retransmit"
|
|
algorithm). In case (a), the pipe is guaranteed to be empty so
|
|
we must slow-start. In case (b), if the duplicate ack
|
|
threshold is small compared to the bandwidth-delay product, we
will detect the loss with the pipe almost full. I.e., given a
threshold of 3 packets and an LBL-MIT bandwidth-delay of around
|
|
24KB or 16 packets (assuming 1500 byte MTUs), the pipe is 75%
|
|
full when fast-retransmit detects a loss (actually, until
|
|
gateways start doing some sort of congestion control, the pipe
|
|
is overfull when the loss is detected so *at least* 75% of the
|
|
packets needed for ack clocking are in transit when
|
|
fast-retransmit happens). Since the pipe is full, there's no
|
|
need to slow-start after a fast-retransmit.
|
|
|
|
For (2), consider what a duplicate ack means: either the
|
|
network duplicated a packet (i.e., the NSFNet braindead IBM
|
|
token ring adapters) or the receiver got an out-of-order packet.
|
|
The usual cause of out-of-order packets at the receiver is a
|
|
missing packet. I.e., if there are W packets in transit and one
|
|
is dropped, the receiver will get W-1 out-of-order and
|
|
(4.3-tahoe TCP will) generate W-1 duplicate acks. If the
|
|
`consecutive duplicates' threshold is set high enough, we can
|
|
reasonably assume that duplicate acks mean dropped packets.
|
|
|
|
But there's more information in the ack: The receiver can only
|
|
generate one in response to a packet arrival. I.e., a duplicate
|
|
ack means that a packet has left the network (it is now cached
|
|
at the receiver). If the sender is limited by the congestion
|
|
window, a packet can now be sent. (The congestion window is a
|
|
count of how many packets will fit in the pipe. The ack says a
|
|
packet has left the pipe so a new one can be added to take its
|
|
place.) To put this another way, say the current congestion
|
|
window is C (i.e, C packets will fit in the pipe) and D
|
|
duplicate acks have been received. Then only C-D packets are
|
|
actually in the pipe and the sender wants to use a window of C+D
|
|
packets to fill the pipe to its estimated capacity (C+D sent -
|
|
D received = C in pipe).
|
|
|
|
So, conceptually, the slow-start/cong.avoid/fast-rexmit changes
|
|
are:
|
|
|
|
- The sender's input routine is changed to set `cwnd' to `ssthresh'
  when the dup ack threshold is reached. [It used to set cwnd to
  mss to force a slow-start.] Everything else stays the same.

- The sender's output routine is changed to use an effective window
  of min(snd_wnd, cwnd + dupacks*mss) [the change is the addition
  of the `dupacks*mss' term.] `Dupacks' is zero until the rexmit
  threshold is reached and zero except when receiving a sequence
  of duplicate acks.
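
[Illustration, not part of the original posting.] The effective
window formula in the second item can be written as a small C
function. This is only a sketch of the conceptual rule; as the next
paragraph explains, the real implementation avoids the multiply. The
function name and the use of unsigned long byte counts are
assumptions.

```c
#include <assert.h>

/* Effective send window in bytes: min(snd_wnd, cwnd + dupacks * mss).
 * The caller passes dupacks = 0 until the retransmit threshold has
 * been reached and whenever no run of duplicate acks is in progress. */
unsigned long effective_window(unsigned long snd_wnd, unsigned long cwnd,
                               unsigned int dupacks, unsigned long mss)
{
    unsigned long inflated;

    assert(mss > 0);
    inflated = cwnd + (unsigned long)dupacks * mss;
    return snd_wnd < inflated ? snd_wnd : inflated;
}
```

With cwnd = 8192, mss = 1024 and three duplicate acks, the window
grows to 11264 bytes unless the receiver's offered window (snd_wnd)
is smaller.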
|
|
|
|
The actual implementation is slightly different than the above
|
|
because I wanted to avoid the multiply in the output routine
|
|
(multiplies are expensive on some risc machines). A diff of the
|
|
old and new fastrexmit code is attached (your line numbers will
|
|
vary).
|
|
|
|
Note that we still do congestion avoidance (i.e., the window is
|
|
reduced by 50% when we detect the packet loss). But, as long as
|
|
the receiver's offered window is large enough (it needs to be at
|
|
most twice the bandwidth-delay product), we continue sending
|
|
packets (at exactly half the rate we were sending before the
|
|
loss) even after the loss is detected so the pipe stays full at
|
|
exactly the level we want and a slow-start isn't necessary.
|
|
|
|
Some algebra might make this last clear: Say U is the sequence
|
|
number of the first un-acked packet and we are using a window
|
|
size of W when packet U is dropped. Packets [U..U+W) are in
|
|
transit. When the loss is detected, we send packet U and pull
|
|
the window back to W/2. But in the round-trip time it takes
|
|
the U retransmit to fill the receiver's hole and an ack to get
|
|
back, W-1 dup acks will arrive (one for each packet in transit).
|
|
The window is effectively inflated by one packet for each of
|
|
these acks so packets [U..U+W/2+W-1) are sent. But we don't
|
|
re-send packets unless we know they've been lost so the amount
|
|
actually sent between the loss detection and the recovery ack is
|
|
(U+W/2+W-1) - (U+W) = W/2-1 which is exactly the amount congestion
|
|
avoidance allows us to send (if we add in the rexmit of U). The
|
|
recovery ack is for packet U+W so when the effective window is
|
|
pulled back from W/2+W-1 to W/2 (which happens because the
|
|
recovery ack is `new' and sets dupack to zero), we are allowed
|
|
to send up to packet U+W+W/2 which is exactly the first packet
|
|
we haven't yet sent. (I.e., there is no sudden burst of packets
|
|
as the `hole' is filled.) Also, when sending packets between
|
|
the loss detection and the recovery ack, we do nothing for the
|
|
first W/2 dup acks (because they only allow us to send packets
|
|
we've already sent) and the bottleneck gateway is given W/2
|
|
packet times to clean out its backlog. Thus when we start
|
|
sending our W/2-1 new packets, the bottleneck queue is as empty
|
|
as it can be.
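
[Illustration, not part of the original posting.] The algebra above
can be checked with a small simulation of the simplified model in
this paragraph: the window is pulled back to W/2, then inflated by
one packet for each of the W-1 duplicate acks, and a new packet is
sent only when the inflated window reaches past what has already
been sent. Packets are counted in whole segments, W is assumed even,
and the function name is hypothetical; this is not the kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Simulate the dup acks that arrive after packet U is lost with a
 * window of w packets. Offsets are relative to U, so packets
 * [0..w) are already in transit when the loss is detected.
 * Returns the number of new packets sent before the recovery ack;
 * if idle_dupacks is not NULL, stores the number of dup acks that
 * allowed nothing new to be sent. */
unsigned int simulate_recovery(unsigned int w, unsigned int *idle_dupacks)
{
    unsigned int next_to_send = w;  /* first packet not yet sent */
    unsigned int sent = 0, idle = 0, d;

    assert(w >= 2 && w % 2 == 0);
    for (d = 1; d <= w - 1; d++) {
        /* window pulled back to w/2, inflated by d dup acks */
        unsigned int right_edge = w / 2 + d;

        if (right_edge > next_to_send) {
            next_to_send++;
            sent++;
        } else {
            idle++;
        }
    }
    if (idle_dupacks != NULL)
        *idle_dupacks = idle;
    return sent;
}
```

For W = 16 this gives 8 idle dup acks (W/2) followed by 7 new packets
(W/2-1), matching the figures in the text.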
|
|
|
|
[I don't know if you can get the flavor of what happens from
|
|
this description -- it's hard to see without a picture. But I
|
|
was delighted by how beautifully it worked -- it was like
|
|
watching the innards of an engine when all the separate motions
|
|
of crank, pistons and valves suddenly fit together and
|
|
everything appears in exactly the right place at just the right
|
|
time.]
|
|
|
|
Also note that this algorithm interoperates with old tcp's: Most
|
|
pre-tahoe tcp's don't generate the dup acks on out-of-order packets.
|
|
If we don't get the dup acks, fast retransmit never fires and the
|
|
window is never inflated so everything happens in the old way (via
|
|
timeouts). Everything works just as it did without the new algorithm
|
|
(and just as slow).
|
|
|
|
If you want to simulate this, the intended environment is:
|
|
|
|
- large bandwidth-delay product (say 20 or more packets)
|
|
|
|
- receiver advertising window of two b-d p (or, equivalently,
|
|
advertised window of the unloaded b-d p but two or more
|
|
connections simultaneously sharing the path).
|
|
|
|
- average loss rate (from congestion or other source) less than
|
|
one lost packet per round-trip-time per active connection.
|
|
(The algorithm works at higher loss rate but the TCP selective
|
|
ack option has to be implemented otherwise the pipe will go empty
|
|
waiting to fill the second hole and throughput will once again
|
|
degrade at the product of the loss rate and b-d p. With selective
|
|
ack, throughput is insensitive to b-d p at any loss rate.)
|
|
|
|
And, of course, we should always remember that good engineering
|
|
practise suggests a b-d p worth of buffer at each bottleneck --
|
|
less buffer and your simulation will exhibit the interesting
|
|
pathologies of a poorly engineered network but will probably
|
|
tell you little about the workings of the algorithm (unless the
|
|
algorithm misbehaves badly under these conditions but my
|
|
simulations and measurements say that it doesn't). In these
|
|
days of $100/megabyte memory, I dearly hope that this particular
|
|
example of bad engineering is of historical interest only.
|
|
|
|
- Van
|
|
|
|
-----------------
|
|
*** /tmp/,RCSt1a26717 Mon Apr 30 01:35:17 1990
--- tcp_input.c Mon Apr 30 01:33:30 1990
***************
*** 834,850 ****
                   * Kludge snd_nxt & the congestion
                   * window so we send only this one
!                  * packet. If this packet fills the
!                  * only hole in the receiver's seq.
!                  * space, the next real ack will fully
!                  * open our window. This means we
!                  * have to do the usual slow-start to
!                  * not overwhelm an intermediate gateway
!                  * with a burst of packets. Leave
!                  * here with the congestion window set
!                  * to allow 2 packets on the next real
!                  * ack and the exp-to-linear thresh
!                  * set for half the current window
!                  * size (since we know we're losing at
!                  * the current window size).
                   */
                  if (tp->t_timer[TCPT_REXMT] == 0 ||
--- 834,850 ----
                   * Kludge snd_nxt & the congestion
                   * window so we send only this one
!                  * packet.
!                  *
!                  * We know we're losing at the current
!                  * window size so do congestion avoidance
!                  * (set ssthresh to half the current window
!                  * and pull our congestion window back to
!                  * the new ssthresh).
!                  *
!                  * Dup acks mean that packets have left the
!                  * network (they're now cached at the receiver)
!                  * so bump cwnd by the amount in the receiver
!                  * to keep a constant cwnd packets in the
!                  * network.
                   */
                  if (tp->t_timer[TCPT_REXMT] == 0 ||
|
|
***************
|
|
*** 853,864 ****
|
|
else if (++tp->t_dupacks == tcprexmtthresh) {
|
|
tcp_seq onxt = tp->snd_nxt;
|
|
! u_int win =
|
|
! MIN(tp->snd_wnd, tp->snd_cwnd) / 2 /
|
|
! tp->t_maxseg;
|
|
|
|
if (win < 2)
|
|
win = 2;
|
|
tp->snd_ssthresh = win * tp->t_maxseg;
|
|
-
|
|
tp->t_timer[TCPT_REXMT] = 0;
|
|
tp->t_rtt = 0;
|
|
--- 853,864 ----
|
|
else if (++tp->t_dupacks == tcprexmtthresh) {
|
|
tcp_seq onxt = tp->snd_nxt;
|
|
! u_int win = MIN(tp->snd_wnd,
|
|
! tp->snd_cwnd);
|
|
|
|
+ win /= tp->t_maxseg;
|
|
+ win >>= 1;
|
|
if (win < 2)
|
|
win = 2;
|
|
tp->snd_ssthresh = win * tp->t_maxseg;
|
|
tp->t_timer[TCPT_REXMT] = 0;
|
|
tp->t_rtt = 0;
|
|
***************
*** 866,873 ****
          tp->snd_cwnd = tp->t_maxseg;
          (void) tcp_output(tp);
!
          if (SEQ_GT(onxt, tp->snd_nxt))
              tp->snd_nxt = onxt;
          goto drop;
      }
  } else
--- 866,879 ----
          tp->snd_cwnd = tp->t_maxseg;
          (void) tcp_output(tp);
!         tp->snd_cwnd = tp->snd_ssthresh +
!                        tp->t_maxseg *
!                        tp->t_dupacks;
          if (SEQ_GT(onxt, tp->snd_nxt))
              tp->snd_nxt = onxt;
          goto drop;
+     } else if (tp->t_dupacks > tcprexmtthresh) {
+         tp->snd_cwnd += tp->t_maxseg;
+         (void) tcp_output(tp);
+         goto drop;
      }
  } else
***************
*** 874,877 ****
--- 880,890 ----
          tp->t_dupacks = 0;
      break;
+ }
+ if (tp->t_dupacks) {
+     /*
+      * the congestion window was inflated to account for
+      * the other side's cached packets - retract it.
+      */
+     tp->snd_cwnd = tp->snd_ssthresh;
  }
  tp->t_dupacks = 0;
*** /tmp/,RCSt1a26725 Mon Apr 30 01:35:23 1990
--- tcp_timer.c Mon Apr 30 00:36:29 1990
***************
*** 223,226 ****
--- 223,227 ----
      tp->snd_cwnd = tp->t_maxseg;
      tp->snd_ssthresh = win * tp->t_maxseg;
+     tp->t_dupacks = 0;
  }
  (void) tcp_output(tp);
|
|
From van@helios.ee.lbl.gov Mon Apr 30 10:37:36 1990
To: end2end-interest@ISI.EDU
Subject: modified TCP congestion avoidance algorithm (correction)
Date: Mon, 30 Apr 90 10:36:12 PDT
From: Van Jacobson <van@helios.ee.lbl.gov>
Status: RO

I shouldn't make last minute 'fixes'. The code I sent out last
night had a small error:

*** t.c Mon Apr 30 10:28:52 1990
--- tcp_input.c Mon Apr 30 10:30:41 1990
***************
*** 885,893 ****
       * the congestion window was inflated to account for
       * the other side's cached packets - retract it.
       */
!     tp->snd_cwnd = tp->snd_ssthresh;
  }
- tp->t_dupacks = 0;
  if (SEQ_GT(ti->ti_ack, tp->snd_max)) {
      tcpstat.tcps_rcvacktoomuch++;
      goto dropafterack;
--- 885,894 ----
       * the congestion window was inflated to account for
       * the other side's cached packets - retract it.
       */
!     if (tp->snd_cwnd > tp->snd_ssthresh)
!         tp->snd_cwnd = tp->snd_ssthresh;
!     tp->t_dupacks = 0;
  }
  if (SEQ_GT(ti->ti_ack, tp->snd_max)) {
      tcpstat.tcps_rcvacktoomuch++;
      goto dropafterack;
|
|
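For readers who want to follow the window arithmetic without a kernel
source tree at hand, here is a minimal sketch of what the patch (with
the correction applied) does to snd_cwnd and snd_ssthresh. It is only a
model: struct cc_state and the two function names are hypothetical
stand-ins for the struct tcpcb fields and the code in tcp_input.c, and
it leaves out the retransmit timer, the retransmission itself, and the
brief moment where the real code sets snd_cwnd to one segment so that
only the missing packet goes out.

```c
#include <assert.h>
#include <stdint.h>

#define REXMT_THRESH 3  /* plays the role of tcprexmtthresh */

/* Hypothetical, simplified stand-in for the tcpcb fields the patch touches. */
struct cc_state {
    uint32_t snd_wnd;       /* window advertised by the peer, in bytes */
    uint32_t snd_cwnd;      /* congestion window, in bytes */
    uint32_t snd_ssthresh;  /* slow-start threshold, in bytes */
    uint32_t maxseg;        /* maximum segment size, in bytes */
    uint32_t dupacks;       /* consecutive duplicate acks seen so far */
};

/*
 * Process one duplicate ack.  Returns 1 when the missing segment should
 * be retransmitted (the dup ack count has just hit the threshold),
 * 0 otherwise.
 */
int on_dup_ack(struct cc_state *s)
{
    assert(s->maxseg > 0);
    s->dupacks++;
    if (s->dupacks == REXMT_THRESH) {
        /* Halve the current window, in whole segments, minimum 2. */
        uint32_t win = s->snd_wnd < s->snd_cwnd ? s->snd_wnd : s->snd_cwnd;

        win /= s->maxseg;
        win >>= 1;
        if (win < 2)
            win = 2;
        s->snd_ssthresh = win * s->maxseg;
        /* Inflate by the segments the receiver is known to be holding. */
        s->snd_cwnd = s->snd_ssthresh + s->maxseg * s->dupacks;
        return 1;
    }
    if (s->dupacks > REXMT_THRESH)
        s->snd_cwnd += s->maxseg;  /* one more packet has left the network */
    return 0;
}

/* Process an ack for new data: retract the inflated window. */
void on_new_ack(struct cc_state *s)
{
    if (s->dupacks) {
        if (s->snd_cwnd > s->snd_ssthresh)
            s->snd_cwnd = s->snd_ssthresh;
        s->dupacks = 0;
    }
}
```

Starting from a 512-byte segment size and an 8192-byte congestion
window, the third duplicate ack sets snd_ssthresh to 4096 and snd_cwnd
to 5632, each further duplicate ack adds 512, and the next ack for new
data pulls snd_cwnd back to 4096.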
12) Can I use a single bit subnet?

A) The consensus appears to be no. The best citable answer follows.

From RFC 1122:
"3.3.6 Broadcasts
   Section 3.2.1.3 defined the four standard IP broadcast address
   forms:
     Limited Broadcast: {-1, -1}
     Directed Broadcast: {<Network-number>,-1}
     Subnet Directed Broadcast:
       {<Network-number>,<Subnet-number>,-1}
     All-Subnets Directed Broadcast: {<Network-number>,-1,-1}"

All-Subnets Directed Broadcasts are being deprecated in favor of IP
multicast, but were very much defined at the time RFC 1122 was written.
With a single subnet bit there are only two subnets, all zeros and all
ones. A Subnet Directed Broadcast to the subnet of all ones is not
distinguishable from an All-Subnets Directed Broadcast.

For those old systems that used all zeros for broadcast in IP addresses,
a similar argument can be made against the subnet of all zeros.

Also, for old routing protocols like RIP, a route to subnet zero
is not distinguishable from the route to the entire network number
(except possibly by context).
|
|
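The two collisions described above can be checked with a few lines of
C. This is only an illustrative sketch: the function names are made up
for the example, and the network used in the note below (192.0.2.0 with
a 24-bit network prefix) is an assumed example, not something taken
from RFC 1122.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Address of subnet number `subnet` when `subnet_bits` bits are borrowed
 * directly below a network prefix of `net_len` bits.
 */
uint32_t subnet_address(uint32_t net, unsigned net_len,
                        unsigned subnet_bits, uint32_t subnet)
{
    assert(subnet_bits >= 1 && net_len + subnet_bits <= 30);
    assert(subnet < (1u << subnet_bits));
    return net | (subnet << (32 - net_len - subnet_bits));
}

/* Broadcast address for a prefix of `len` bits: all host bits set to one. */
uint32_t broadcast(uint32_t prefix, unsigned len)
{
    assert(len >= 1 && len <= 32);
    uint32_t mask = 0xFFFFFFFFu << (32 - len);
    return prefix | ~mask;
}
```

With net = 0xC0000200 (192.0.2.0), net_len = 24 and subnet_bits = 1,
the broadcast address of subnet 1 comes out as 192.0.2.255, the same
value as the broadcast for the whole network, and the address of
subnet 0 comes out as 192.0.2.0, the same value as the network number
itself.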
Most of today's systems don't support variable length subnet masks
(VLSMs), and for such systems the above is true. However, all the major
router vendors and *some* Unix systems (4.4BSD-based ones) support
VLSMs, and in that case the situation is more complicated :-)

With VLSMs (necessary to support CIDR, see RFC 1519), you can utilize the
address space more efficiently. Routing lookups are based on *longest*
match, and this means that you can for instance subnet a class C net
with a mask of 255.255.255.224 (27 bits) in addition to a subnet mask
of 255.255.255.192 (26 bits). You will then be able to use the
addresses x.x.x.33 through x.x.x.62 (first three bits of the last octet
001) and the addresses x.x.x.193 through x.x.x.222 (first three bits of
the last octet 110) with this new subnet mask. These ranges fall inside
the all-zeros and all-ones 26-bit subnets, which would otherwise go
unused. And you can continue with a subnet mask of 28 bits, etc.
(Note also, by the way, that non-contiguous subnet masks are deprecated.)
|
|
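The address ranges quoted above, and the longest-match rule that makes
them usable, can be reproduced with a short C sketch. Everything named
here (prefix_mask, host_range, struct route, lookup) is hypothetical
and written for this example only. A real router uses a far more
efficient structure than a linear scan over a table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Netmask for a contiguous prefix of `len` bits (0..32). */
uint32_t prefix_mask(unsigned len)
{
    assert(len <= 32);
    return len == 0 ? 0 : 0xFFFFFFFFu << (32 - len);
}

/*
 * First and last usable host addresses in a subnet.  The all-zeros and
 * all-ones host parts are left out (subnet number and broadcast).
 */
void host_range(uint32_t subnet, unsigned len,
                uint32_t *first, uint32_t *last)
{
    uint32_t mask;

    assert(len <= 30);
    mask = prefix_mask(len);
    *first = (subnet & mask) + 1;
    *last = ((subnet & mask) | ~mask) - 1;
}

/* Hypothetical routing table entry. */
struct route {
    uint32_t prefix;   /* network or subnet number */
    unsigned len;      /* number of mask bits */
};

/*
 * Longest-match lookup by linear scan: of all entries whose prefix
 * matches `dst`, return the one with the most mask bits, or NULL.
 */
const struct route *lookup(const struct route *tab, size_t n, uint32_t dst)
{
    const struct route *best = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        if ((dst & prefix_mask(tab[i].len)) != tab[i].prefix)
            continue;
        if (best == NULL || tab[i].len > best->len)
            best = &tab[i];
    }
    return best;
}
```

For a 27-bit subnet whose last octet is 32 (bits 001), host_range()
yields last octets 33 through 62, and for last octet 192 (bits 110) it
yields 193 through 222, matching the figures in the text. If the table
holds both the 24-bit network and one of these 27-bit subnets, lookup()
picks the 27-bit entry for an address inside the subnet and falls back
to the 24-bit entry for everything else.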
This is all very nicely covered in the paper by Havard Eidnes:

   Practical Considerations for Network Addressing using a CIDR Block
   Allocation
   Proceedings of INET '93

This paper is available with anonymous FTP from

   aun.uninett.no:/pub/misc/eidnes-cidr.ps

The same paper, with minor revisions, is one of the articles in the
special Internetworking issue of Communications of the ACM (last month,
I believe).

> I have been told that some network equipment (Cisco I think was the vendor
> named) will not correctly handle subnets that violate that standard.
As far as I know Cisco is one of the router vendors that *do* handle
VLSMs correctly. Could you substantiate this claim?

Steinar Haug, SINTEF RUNIT, University of Trondheim, NORWAY
Email: Steinar.Haug@runit.sintef.no

--
George V. Neville-Neil work: gnn@wrs.com home:gnn@netcom.com
NIC: GN82

This signature kept blank due to the CDA.