Path: senator-bedfellow.mit.edu!bloom-beacon.mit.edu!spool.mu.edu!agate!library.ucla.edu!news.mic.ucla.edu!magnesium.club.cc.cmu.edu!news.sei.cmu.edu!cert.org!netnews.upenn.edu!dsinc!gvls1!boojum!esr
From: esr@snark.thyrsus.com (Eric S. Raymond)
Newsgroups: comp.unix.sys5.r4,comp.unix.pc-clone.32bit,comp.bugs.sys5,news.answers
Subject: Known Bugs in the USL UNIX distribution
Message-ID: <1mMD8p#M7bHGP74mn36O7fG8Zm0smCcO=esr@boojum.thyrsus.com>
Date: 5 Aug 93 16:27:16 GMT
Expires: 4 Aug 93 23:30:00 GMT
Sender: esr@boojum.thyrsus.com (Eric S. Raymond)
Followup-To: comp.unix.pc-clone.32bit
Lines: 2102
Approved: news-answers-request@MIT.Edu
Xref: senator-bedfellow.mit.edu comp.unix.sys5.r4:4553 comp.unix.pc-clone.32bit:5794 comp.bugs.sys5:1881 news.answers:11155
Archive-name: usl-bugs
Last-update: 05 Aug 1993
Supersedes: <unknown>
Version: 17.0
Many FAQs, including this one, are available via FTP on the archive site
rtfm.mit.edu (aka pit-manager.mit.edu or 18.172.1.27) in the directory
pub/usenet/news.answers. The name under which this FAQ is archived appears in
the Archive-name line above. This FAQ is updated monthly; if you want the
latest version, please query the archive rather than emailing the overworked
maintainer.
What's new in this issue:
* New bug info (see below)
* Instructions for fixing the FUBYTE problem under Dell 2.2.
*** NEWS FLASH *** NEWS FLASH *** NEWS FLASH *** NEWS FLASH *** NEWS FLASH ***
May's new bug (II.43) is still *really serious*. Get after your
vendor to fix it ASAP!
*** NEWS FLASH *** NEWS FLASH *** NEWS FLASH *** NEWS FLASH *** NEWS FLASH ***
(In the table below, bugs new this issue are marked with a ** at the
left margin; old bugs for which information has been added are marked
with *)
0. Table of Contents
I. Introduction
II. General Bugs
1. UNIX kernel must lie below the 1024-cylinder mark
2. Suid programs dump core when signalled
3. DMAs on large ISA machines may fail
4. There is a cylinder limit on disk size
5. more(1) doesn't handle SIGWINCH
6. X performance problem
7. C shell background process termination logs you out
8. A security hole in login
9. COFF problems with long filenames
10. Flakeouts in the Wangtek device driver
11. A kernel declaration bug
12. Reading tar archives with cpio foos up on multiply-linked files
13. Process accounting is broken
14. tar(1) foos up in the presence of symbolic links
15. Symbolic links can interfere with shellscript execution
16. Piping a csh builtin causes the shell to hang.
17. tar(1) fails to restore adjacent symbolic links properly
18. COFF binaries linked with curses(3) and shared libc hang
19. shl hangs, sxt devices bad
20. num-lock prevents mouse from working properly
21. adjtime() doesn't work
23. cron mail doesn't go through aliasing
24. fragility in xterm
25. csh lossage due to bad optimization
26. Bug in cp(1)
27. tbl -me doesn't work
* 28. who -r fragility leads to boot-time problems
29. at(1) breaks here-documents in shell scripts
30. UHC mouse driver ignores the middle button.
31. mmap access doesn't update file mod times
32. AT&T select(2) is incompatible with BSD select(2)
33. (4.2) The login program requires its PPID to be 1
34. (4.2) Bad MAXMINOR values can make the system unbootable
35. Incompatible change in TZ interpretation
36. Nulls in pixmaps can crash X
37. Potential security hole in SVr4s using sendmail
38. Reporting bug in df on non-root filesystems
39. tar writes -v output to stdout, not stderr
40. SIGPIPE is delayed and not reliable
41. /usr/lib/acct/fwtmp doesn't work
42. whatis database is full of garbage.
43. mmap is seriously broken
** 44. a bug in xterm
** 45. DrawText16() bug in XWIN
** 46. output redirection with exec fails in sh
** 47. rm fails to reject . or .. arguments
III. Serial-port and tty administration problems
1. Dropout problems with tty devices
2. Quick port setup option in sysadm is broken
3. ttymon drops DTR when it shouldn't
4. ttymon doesn't drop DTR when it should
5. (4.2) Terminating cu to a direct line locks up the port
* 6. Hardware flow control bug breaks streaming data transfers
7. Bad interaction between ttymon and networking
IV. Networking and File-Sharing Bugs
1. NFS locking is unusably slow
2. UFS file system problems
3. Byte-order problem with NFS when accessing Sun disks
4. Under weird circumstances, lseek on UFS may cause corruption
5. FTP problems
6. A bug in the WD80x3 support
7. Security hole near fingerd
8. Fatal bug in priority-band message handling.
9. SVr4.0.4 TCP/IP routing is broken
10. df(1) on NFS volumes returns bad data
11. rsh hogs the processor
** 12. MTU for remote networks ignored
** 13. Bug in remote printing.
V. SCSI Support Problems
1. sar is confused by SCSI
2. A configuration problem
3. Synchronous SCSI hang problem
4. ps chokes on commands that do SCSI I/O
5. Transfer speed problems with Adaptec 1542B on 486s
6. df gives inaccurate values for large SCSI partitions
VI. Development Tools Problems
1. General UCB library brokenness
2. USL emulation of BSD signals doesn't work
3. Possible string library problems
4. USL's ndbm support is broken.
5. An include file is missing
6. sscanf(3) has a potential bug
7. shmat(2) vs. vfork(2)
8. FIONREAD fails on regular files
9. fread(3) does the wrong thing on pipes and FIFOs
10. putw appears to be broken
11. Compiler problems
12. getlogin() doesn't work
13. syslog routines don't work
14. Bogus `r' in xt driver configuration flags
15. ioctl for kernel symbol fetches fails (4.2)
** 16. Bug in cc optimizer (4.2.1)
** 17. /usr/ucb/install uses missing group "staff"
VII. The FUBYTE Problem *
VIII. Destiny and Dell
I. Introduction
This posting lists known bugs in System V Release 4 implementations, and known
fixes applied by various porting houses (there's also random bits of
information about SCO UNIX here and there). It was formerly part of the
386-buyers-faq issues 1.0 through 4.0, and is still best read in conjunction
with the pc-unix/software FAQ descended from that posting.
This document is maintained and periodically updated as a service to the net by
Eric S. Raymond <esr@snark.thyrsus.com>, who began it for the very best
self-interested reason that he was in the market and didn't believe in plonking
down several grand without doing his homework first (no, I don't get paid for
this, though I have had a bunch of free software and hardware dumped on me as a
result of it!). Corrections, updates, and all pertinent information are
welcomed at that address.
This posting is periodically broadcast to the USENET group comp.unix.sysv386
and to a list of vendor addresses. If you are a vendor representative, please
check to make sure the information on your company is current and correct. If
it is not, please email me a correction ASAP. If you are a knowledgeable user
of any of these products, please send me a precis of your experiences for the
improvement of future issues.
The bug descriptions often include indications of fixes by the various porting
houses to their current releases. These are:
Consensys UNIX Version 1.3 abbreviated as "Cons" below
Dell UNIX Issue 2.2 abbreviated as "Dell" below
Esix Revision A abbreviated as "Esix" below
Micro Station Technology SVr4 UNIX abbreviated as "MST" below
Microport System V Release 4.0 version 4 abbreviated as "uPort" below
UHC Version 3.6 abbreviated as "UHC" below
SCO Open DeskTop 1.1 abbreviated as "SCO" below
II. General Bugs
1. UNIX kernel must lie below the 1024-cylinder mark
Bela Lubkin says "SCO's boot filesystem must lie below 1024 cylinder mark;
anything else can be anywhere. This is more-or-less a limitation of the BIOS
interface that the bootstrap loader must use. Could be circumvented by going
directly to controller hardware in the bootstrap loader, but that would be
horrendously complex with all the controllers & host adapters to be supported."
Actually this is not quite right. It's the *kernel* that must lie below
the 1K-cylinder mark; the rest of the root partition could extend above it.
But since partition endpoints are the only way to control where physical
blocks get allocated, it comes to the same thing.
Roger Knopf <rogerk@sco.COM> adds: "The 1024 cylinder limit applies
not only to the kernel but also to /boot. Both are read in while we
are using the BIOS to talk to the hard disk. There are 10 bits set
aside in the register for cylinders in the INT 13 call, hence 1024
cylinders. There are a few controllers that allocate 2 more bits (they
are taken away from the space allocated for head bits, I recall). It
is trivial to modify all the relevant boot code to use these bits IF
YOU KNOW THAT THE CONTROLLER WILL USE THEM but I know of no way to
reliably determine that this is the case. Once the kernel is loaded
we use 16 bits everywhere to hold the cylinder number."
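The arithmetic behind the 1024-cylinder figure can be sketched in C. This
is only an illustration of the conventional INT 13h register layout (CH
holds the low eight cylinder bits; CL holds the sector number in bits 0-5
and cylinder bits 8-9 in bits 6-7), not code from any vendor's bootstrap.
The function and macro names are made up for the example.

```c
#include <stdint.h>

/* Illustrative names, not taken from any boot loader source. */
#define INT13_CYL_BITS 10
#define INT13_MAX_CYLS (1u << INT13_CYL_BITS)   /* 2^10 = 1024 cylinders */

/* Pack a cylinder and a 1-based sector number into the 16-bit CX value
   used by the classic BIOS disk calls.  Returns 0 on success, or -1 if
   the cylinder or sector cannot be expressed in the available bits. */
int int13_pack_cx(unsigned cylinder, unsigned sector, uint16_t *cx)
{
    uint8_t ch, cl;

    if (cylinder >= INT13_MAX_CYLS || sector < 1 || sector > 63)
        return -1;

    ch = (uint8_t)(cylinder & 0xff);                    /* cylinder bits 0-7 */
    cl = (uint8_t)((((cylinder >> 8) & 0x3) << 6)       /* cylinder bits 8-9 */
                   | (sector & 0x3f));                  /* sector bits 0-5   */
    *cx = (uint16_t)((ch << 8) | cl);
    return 0;
}
```

Cylinder 1023 is the last one that fits; anything from 1024 up is rejected,
which is why the kernel and /boot have to sit below that mark while the
BIOS is still doing the disk I/O.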
2. Suid programs dump core when signalled
Mark Snitily of SGCS says that under many SVr4s, signalling a
process that is running suid root will cause it to core-dump. He says
Dell and MST have fixed this, and SCO doesn't suffer from this.
3. DMAs on large ISA machines may fail
On ISA machines with more than 16MB of RAM, SVr4 may try to do DMA
from outside the bus's address space, causing serious problems. UNIX ought
to do an in-memory copy to within the low 16MB but the USL base code doesn't.
Dell says they've fixed this, and that's been confirmed by a user.
UHC says they've fixed this; they add that the special buffer-allocation
logic to handle the problem can be turned off with a tunable kernel parameter
if you've got less than 16M.
Microport says they've fixed this in their new 4.1 release, shipping early
March.
Esix offers a patch to correct this problem.
SCO used to have a similar bug but fixed it long ago.
John Sully <jms@mport.com> writes: "This was due to a bug in pre version 4
dma code. The USL code has always at least attempted to do a copy from low
memory to high memory on systems with more than 16Mb of RAM. By the way UHC is
wrong; the buffer allocation code only comes into play if you have more than
16Mb of memory. You can turn it off if you have a machine (ie. an EISA bus)
which will allow you to do DMA above 16Mb. You *must* have this tunable
(MAXDMAPAGE) turned on if you are using *ISA* bus masters in a system with more
than 16Mb of ram. Unfortunately doing this will affect all drivers which do
dma as there is no good way to do this on a per-driver basis."
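The test a driver (or the kernel on its behalf) has to make can be sketched
as follows. This is a minimal illustration, assuming the usual 24-bit ISA
address limit of 16MB; it is not the USL buffer-allocation code, and the
names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* ISA bus masters and the ISA DMA controller drive 24 address lines,
   so they can reach only the first 16MB of physical memory. */
#define ISA_DMA_LIMIT 0x1000000UL

/* Return nonzero if a transfer of len bytes starting at physical address
   phys touches memory that an ISA DMA device cannot address, meaning the
   data must be copied through a buffer below 16MB instead. */
int isa_dma_needs_bounce(uint32_t phys, size_t len)
{
    if (len == 0)
        return 0;
    return phys >= ISA_DMA_LIMIT || len > ISA_DMA_LIMIT - phys;
}
```

On an EISA machine whose bus masters can address all of memory the check
is unnecessary, which is the case in which the tunable can be turned off.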
4. There is a cylinder limit on disk size
Stock USL code is limited to 1,024 cylinders per Winchester, which
might cause problems with some disk drives.
Microport, Dell, Esix, MST, and UHC have fixed this.
5. more(1) doesn't handle SIGWINCH
It doesn't get its window size from the stty/termio structures, so it
doesn't cope with SIGWINCH properly.
6. X performance problem
Stock X11R4 and R5 (at least prior to 1.2E) is said to hog the
processor if you use the LOCALCONNECT option. Jan Brittenson
<bson@gnu.ai.mit.edu> posted the following workaround:
I don't know what causes the standard X server to hog the CPU, but
it can be avoided. Use the following program instead of xinit. Compile
it with `$CC -O -o xserv xserv.c -lX11' where CC is either
/usr/ccs/bin/cc or gcc. Set DISPLAY and XINITRC and run `xserv' from
your home directory. This is just a q&d hack, and not really a
substitute for xinit -- but it works.
/* xserv.c -- start X server
Start X server. Similar to xinit, but intended to
circumvent the X386 CPU Hog Mode
Jan Brittenson, June 2 1992 05:15 am
with corrections by Adam Donnison <adam@shinto.saki.com.au> Tue, 2 Mar 1993
*/
#include <stdio.h>
#include <sys/types.h>
#include <signal.h>
#include <setjmp.h>
#include <unistd.h>
#include <libgen.h>
#include <X11/Xlib.h>
#include <X11/Xos.h>
#include <X11/Xmu/SysUtil.h>
extern int errno;
/* This may need to be "/usr/X386/bin/X386" */
#define DEFAULT_XPATH "/usr/bin/X11/X"
/* Start X server. Fork-exec server, passing the DISPLAY environment
variable. Wait for server to get up and running (at which point it
passes back a SIGUSR1), at which point the user xinitrc file is run. */
#define XINITRC ".xinitrc"
#define DEFAULT_XCOMMAND "xterm -g +1+1 -n login -display :0"
extern void *malloc (), free ();
extern char *basename (), *getenv (), *strcpy ();
/* X stuff */
Display *top_display;
/* This is supposed to be in libgen.a... */
static char
*basename (s0)
char *s0;
{
register char *s1;
for (s1 = s0 + strlen (s0) - 1;
s1 > s0 && *s1 != '/'; s1--);
if (*s1 == '/')
return s1+1;
return s1;
}
jmp_buf sigusr1_frame;
static void
caught_sigusr1 (int dummy) { longjmp (sigusr1_frame, !0); }
static char
*dispname (s0)
char *s0;
{
register char *s1;
for (s1 = s0 + strlen (s0) - 1;
s1 > s0 && *s1 != ':'; s1--);
return s1;
}
/* No arguments */
int
main (argc, argv)
int argc;
char **argv;
{
char *xserver_file, *xinitrc_file, *home_path, *display, *display_X_arg;
int xserver_pid, orgmask;
/* Not that it really matters, just to avoid being used as a direct
replacement for xinit. */
if (argc != 1)
{
fprintf (stderr, "usage: %s\n", basename (*argv));
exit (1);
}
/* Resolve xinitrc path. This is done before the server is
started. */
if (!(home_path = getenv ("HOME")))
home_path = "/etc";
if (!(xinitrc_file = getenv ("XINITRC")))
{
xinitrc_file = malloc (strlen (home_path) + 1 + strlen (XINITRC) + 1);
sprintf (xinitrc_file, "%s/%s", home_path, XINITRC);
}
else
xinitrc_file = strdup (xinitrc_file);
/* Resolve display */
if (!(display = getenv ("DISPLAY")))
display = display_X_arg = ":0.0";
else
display_X_arg = dispname (display);
/* Tell server to notify us when up and running */
signal (SIGUSR1, SIG_IGN);
orgmask = sigblock (sigmask (SIGUSR1));
/* Start server */
if (!(xserver_pid = vfork ()))
{
xserver_file = DEFAULT_XPATH;
execl (xserver_file, xserver_file, display_X_arg, NULL);
fprintf (stderr, "%s: can't exec %s (errno = %d) -- start-up aborted\n",
basename (*argv), xserver_file, errno);
exit (1);
}
if (xserver_pid < 0)
{
fprintf (stderr, "%s: can't fork (errno = %d) -- start-up aborted\n",
basename (*argv), errno);
exit (1);
}
/* Await signal from server */
#if 0
/* Why the #@$*! doesn't this work?! */
sigsetmask (orgmask);
alarm (20);
sigpause (sigmask (SIGUSR1) | sigmask (SIGALRM));
#else
sleep (5);
#endif
/* Open display */
if (!(top_display = XOpenDisplay (display)))
{
fprintf (stderr, "%s: unable to open display '%s' -- start-up aborted\n",
basename (*argv), display);
exit (1);
}
/* Execute xinitrc file */
if (system (xinitrc_file) < 0)
system (DEFAULT_XCOMMAND);
/* Close display */
XCloseDisplay (top_display);
/* Terminate server */
kill (xserver_pid, SIGTERM);
/* Finished */
free (xinitrc_file);
}
7. C shell background process termination logs you out
In C shell, unless "ignoreeof" is set, termination of a background
process will log you out. With "ignoreeof" set, just the message
"Use logout to exit" will be printed.
8. A security hole in login
David Wexelblat <dwex@mtgzfs3.att.com> reports: "There is a HUGE security
hole in /bin/login in all USL derived SVR4s before 4.0.4. Refer to CERT
advisory CA-91:08, dated 5/23/91. This is known to be present in AT&T SVR4
2.1, and Microport SVR4 3.1. ESIX claims to have fixed it, Microport reports
that it is fixed in 4.1. I won't give any more details unless necessary.
Suffice to say that this bug allows any non-privileged user on an SVR4 system
to get read-write access to any file on the system."
9. COFF problems with long filenames
A source at Dell urges: "Our SVR4v2 did some stuff that USL didn't get
around to until SVR4v4. Try Dell UNIX 2.1 with a COFF program on a large UFS
filesystem in a directory with long names. Runs on Dell UNIX. Breaks on
others." I don't have more definite info yet.
10. Flakeouts in the Wangtek device driver
Dell reports that USL's Wangtek device driver is seriously flaky. "How'd
you like a multi volume backup where the second and subsequent volumes don't
follow on from the previous volumes?" UHC confirms this and is actively
working on the problem.
An anonymous SCOer says "The QIC02 tape controller `standard' is seriously
flaky. Our driver's in pretty good shape but nobody will ever have a truly
solid driver that supports every QIC02 controller you can find."
Gordon Ross <gwr@mc.com> reports: "Actually, the SCSI tape target driver
`st01' has a similar problem at version 4.0.3 which I corrected while I worked
on the SVR4 code. The correction was provided to the support group at USL.
The actual problem was that the SCSI tape would return a `check status'
completion code which was just trying to inform the driver of the arrival
of the `logical end of media' indication but the driver was treating it
as an error. The tape drive had in fact written the data, but the driver
incorrectly assumed that the "check status" return meant that it failed.
The result of this is that when you write into the end of the tape, you
can read back one more "chunk" than you wrote. Of course, cpio does not
like this at all when doing multi-volume backups..."
11. A kernel declaration bug
A botch in USL's /etc/conf/pack.d/kernel/space.c (which is present in
Consensys 1.3, Dell 2.1, Esix 4.0.3A, Microport 4.0.3 and 4.0.4 and may also be
present in other SVr4s) can step on the linesw[] table. The problem is that
the domain name array initialization is wrong and too short; thus, when it's
set, data past the end of the array can be stomped. To fix this, find the
following near line 247:
char srpc_domain[] = SRPC_DOMAIN;
and change it to
char srpc_domain[SYS_NMLN] = SRPC_DOMAIN;
then rebuild the kernel.
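Why the one-line change matters can be seen from the sizes the compiler
assigns to the two declarations. In this sketch the values of SRPC_DOMAIN
and SYS_NMLN are assumptions chosen for illustration (an empty default
domain and 257 bytes); check your own space.c and <sys/utsname.h> for the
real ones.

```c
#include <stddef.h>

#define SRPC_DOMAIN ""     /* assumption: empty default domain name   */
#define SYS_NMLN    257    /* assumption: size used for system names  */

/* Unsized array: the compiler reserves only enough room for the
   initializer, here a single NUL byte. */
char short_domain[] = SRPC_DOMAIN;

/* Sized array: room for a full-length domain name set at run time. */
char fixed_domain[SYS_NMLN] = SRPC_DOMAIN;

size_t short_domain_size(void) { return sizeof short_domain; }
size_t fixed_domain_size(void) { return sizeof fixed_domain; }
```

With the unsized form, any domain name longer than the initializer written
into the array later runs off the end and lands on whatever the linker
placed next, which in the reported case was linesw[].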
Microport officially knows about this bug and plans to fix it in a
near-future update release. It has been fixed in Dell 2.2.
12. Reading tar archives with cpio foos up on multiply-linked files
Paul De Bra <debra@info.win.tue.nl> reports the following:
In theory, cpio(1) is supposed to be able to read tar(1) archives. In
practice...don't try it. Multiply-linked files will be extracted from the
archive, whether or not they match the current pattern and whether or not
you have selected 'u'. This happens even if you use the `t' option, so
it's not even safe to list the archive files!
13. Process accounting is broken
In 4.0.3, process accounting doesn't work. From examining the accounting
scripts, it appears that /usr/lib/acct/accton is supposed to set a return code
depending on whether accounting was switched on already or not. However, it
always returns the same result - accounting switched off. This means that the
/usr/lib/acct/ckpacct script, which is run every hour to keep the process
accounting log in check, instead turns off accounting the first time it is run
after booting. The same happens with the nightly /usr/lib/acct/monacct
program.
I don't yet know whether this bug is present in 4.0.4. It is definitely
un-fixed in Dell 2.1 and Consensys 1.3. In Dell 2.2 the return bug is fixed,
but accounting isn't automatically enabled at boot time.
14. tar(1) foos up in the presence of symbolic links
Tar can get the names of symbolic links wrong when creating an archive.
This bug can be demonstrated by doing the following:
mkdir t
cd t
touch a 1234567890
ln -s 1234567890 b
ln -s a c
tar vcf ../t.tar .
The output generated by tar is:
a ./ 0 tape blocks
a ./a 0 tape blocks
a ./1234567890 0 tape blocks
a ./b symbolic link to 1234567890
a ./c symbolic link to a234567890
(Note the above commands should be done in the order shown and in a new
directory) This bug is nasty. Recommended solution: use GNU tar.
This is reported from Esix 4.0.3 and Consensys 1.3, but probably exists on
other SVr4s as well. It has been fixed in Dell 2.2.
15. Symbolic links can interfere with shellscript execution
There is a problem running #! scripts when symbolic links are involved.
Typing in the following from a command shell demonstrates the problem:
mkdir a b
ln -s a c
cd a
cat > script <<!
#!/bin/sh
echo Hello
!
chmod 755 script
cd ../b
ln -s ../c/script .
./script
The message generated from the last line is:
a/script: a/script: cannot open
This is reported from Esix 4.0.3, Consensys 1.3, and Dell 2.2, but
probably exists on other SVr4s as well.
16. Piping a csh builtin causes the shell to hang.
While running csh, this can be demonstrated by some of the following:
echo Hello | cat
history | more
(A solution to this one is use tcsh-6.02.)
This is reported from Esix 4.0.3 and Consensys 1.3, but probably exists on
other SVr4s as well. It is reported fixed in Dell 2.2.
17. tar(1) fails to restore adjacent symbolic links properly
Arthur Krewatt <...!rutgers!mcdhup!kilowatt!krewat> reports:
SVR4 tar has another strange bug. Seems if when restoring files, you
restore one file that is a link, say "a ->/a/b/c/d/e" and there is another
link just after it called "b ->/a/b/c" tar will restore it as "b ->/a/b/c/d/e"
This just seems to be a lack of the NULL at the end of the string, like
someone did a memmove or memcpy(dest,src,strlen(src)); where it should be
strlen(src)+1 to include the NULL.
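The suspected mistake is easy to reproduce in isolation. The sketch below
is not taken from the tar source (which I haven't seen); it only shows how
copying strlen(src) bytes into a buffer that still holds the previous link
target produces exactly the symptom described. The function names are
hypothetical.

```c
#include <string.h>

/* Suspected buggy form: copies the characters of src but not the
   terminating NUL, so whatever followed in dest stays visible. */
void copy_link_buggy(char *dest, const char *src)
{
    memcpy(dest, src, strlen(src));
}

/* Corrected form: the +1 copies the terminating NUL as well. */
void copy_link_fixed(char *dest, const char *src)
{
    memcpy(dest, src, strlen(src) + 1);
}
```

If dest already contains "/a/b/c/d/e" from the previous link, the buggy
copy of "/a/b/c" leaves "/a/b/c/d/e" in the buffer, while the fixed copy
leaves "/a/b/c".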
18. COFF binaries linked with curses(3) and shared libc hang
...eating the CPU. Cause unknown.
19. shl hangs, sxt devices bad
shl(1) does not work. Try creating a layer and doing an 'ls'. Your session
hangs. Bruce Momjian <root%candle.uucp@bts.com>, who reported this bug, says
he believes it is the sxt devices which are broken. It definitely exists in
Consensys 1.3.
20. num-lock prevents mouse from working properly
When using the Motif window manager, if your num lock is on, your mouse
clicks are not recognized by the window manager. The mouse still works in
xterm(1). This is allegedly fixed in Destiny (4.2).
Under Dell 2.2 if num lock is on there's no problem, but if scroll lock
is on then mouse clicks aren't recognised.
21. adjtime() doesn't work
Hugh Stearns <hoyt@isus.tnet.com> reports that in 4.0.3.6 adjtime() doesn't.
Calling `date -a' works to adjust the time slowly.
23. cron mail doesn't go through aliasing
Hugh Stearns <hoyt@isus.tnet.com> reports that in 4.0.3.6 cron mail to adm
doesn't get redirected by the aliases file.
24. fragility in xterm
Hugh Stearns <hoyt@isus.tnet.com> reports that in 4.0.3.6, doing ~! from
a cu in xterm kills xterm. This has been fixed in Dell 2.2.
25. csh lossage due to bad optimization
If a csh user sources a non-existent file in their .cshrc (eg, source .alias,
where .alias doesn't exist), then the system will hang for a couple of minutes.
Eventually the user gets an "Out of memory" error and the console logs "NOTICE:
out of swap space - Insufficient memory to allocate 2 pages - system call
failed".
This appears to be due to over-optimization of code surrounding a longjmp
call.
(There are numerous other reports of memory leak bugs in csh).
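Nobody has told me what the optimizer actually did to csh, so the
following is only a sketch of the general class of bug, not a diagnosis.
In C, a local variable that is changed between setjmp() and longjmp() has
an indeterminate value after the jump unless it is declared volatile; an
optimizer is free to keep it in a register that the jump restores. The
names here are made up for the example.

```c
#include <setjmp.h>

static jmp_buf retry_point;

static void give_up_and_retry(void)
{
    longjmp(retry_point, 1);
}

/* Jump back to the retry point until limit attempts have been made,
   then return the number of attempts. */
int count_retries(int limit)
{
    /* volatile is essential: attempts is modified between the setjmp
       and the longjmp, so without it the value seen after the jump is
       indeterminate and the loop might never terminate. */
    volatile int attempts = 0;

    (void)setjmp(retry_point);   /* execution resumes here after each longjmp */
    if (attempts < limit) {
        attempts++;
        give_up_and_retry();
    }
    return attempts;
}
```

A loop of this shape that never sees its counter advance would spin, and
if it allocates on each pass it would eventually run the machine out of
swap, which at least resembles the reported behavior.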
26. Bug in cp(1)
If ``copy'' encounters a directory before a file, it dumps core ...
--- cut ---
cd /tmp
mkdir copybug jnk
cd jnk
mkdir directory
>file
cp -r * /tmp/copybug
--- cut ---
This was reported from Consensys 4.0.3 but is probably a generic SVr4 bug.
It appears to have been fixed in ESIX SVR4.0.3A and Dell 2.2.
27. tbl -me doesn't work
Wolfgang Denk reports that trying to use "tbl -me" for any input file causes
tbl to quit. The problem is that newer tbl versions don't accept [nt]roff
control lines (".rm @W") after .TS.
28. who -r fragility leads to boot-time problems
It coredumps if the name of the timezone (TZ) is longer than three characters
and the length is a multiple of four. This can be a real problem for European
sites... and is potentially more hazardous than immediately apparent as _a
lot_ of the initialization scripts (rc1.d, rc2.d) use ``who -r'' to see if the
machine is in single- or multi-user mode. And when ``who'' bombs out, the
``set'' command is given an empty command-line and can't do much else than print
the shell variables, $1-$9 remain empty ... meaning that more or less all the
scripts fail in various ways and the system has an exceptionally hard time
coming up.
Peter Wemm <peter@DIALix.oz.au> reports that this bug was present in Dell
2.0, fixed in Dell 2.1, but reappeared in Dell 2.2. Dell says it's a generic
USL bug.
There is an easy workaround; make sure /etc/inittab is an odd number of
characters long. The bug is caused by an off-by-one in a buffer malloc.
29. at(1) breaks here-documents in shell scripts
at adds gratuitous empty lines to the job submitted by the user.
This prevents shell here-documents from working.
30. UHC mouse driver ignores the middle button
This may be a generic USL problem, but Dell (at least) has fixed it. UHC
says they have a patch for it, but I haven't seen the patch.
31. mmap access doesn't update file mod times
Peter Wemm <peter@DIALix.oz.au> reports that under SVr4, if one mmap()'s a
file, and writes to it via the mapped memory, when the disk is updated, the
modification time does not update.
32. AT&T select(2) is incompatible with BSD select(2)
Paul Eggert <eggert@twinsun.com>, as quoted by James Buster <bitbug@lynx.com>
reports:
The select() system call waits for read, write, or exception activity
on a set of file descriptors, and yields an integer telling you how
much activity it found.
BSD's select(N,&R,&W,&E,&T) can yield up to 3*N, because BSD's select()
counts the number of bits that it turns on in the R, W, and E
arguments, and R, W, and E each contain one bit per file descriptor.
However, System V Release 4 v2.1's select(N,&R,&W,&E,&T) yields at most N,
because SVR4's select() just counts the number of active file
descriptors, regardless of how many bits it turns on.
For example, the following code checks file descriptor 0. In BSD, this
code can set n to 2 if file descriptor 0 is ready for both reading and
writing. However, in SVR4, this code sets n to at most 1, because only
file descriptor 0 is active.
int n;
fd_set r, w;
FD_ZERO(&r); FD_SET(0, &r);
FD_ZERO(&w); FD_SET(0, &w);
n = select(1, &r, &w, (fd_set*)0, (struct timeval*)0);
At least one widely used piece of software depends on the BSD
behavior, namely X11R5 (see Xt/NextEvent.c). In this application, the
bug's symptoms are subtle and are rarely encountered, but they do
exist.
Most of X11R5's calls to select() don't care about this difference,
but the following files in the X11R5 distribution contain calls to
select() that may be affected by this bug:
contrib/lib/i18nXView2/lib/libxview/notify/ndetselect.c
contrib/lib/xview3/lib/libxview/notify/ndetselect.c
mit/fonts/server/os/waitfor.c
mit/lib/Xt/NextEvent.c
mit/server/os/WaitFor.c
(Note: this is a very old bug. Paul Eggert tells me that William Kucharski reported this bug to AT&T in 1989 when he ported X11R3!)
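Code that has to run on both kinds of system can avoid depending on what
select() returns beyond "greater than zero" and count for itself. The
sketch below is one way to do that, assuming only the standard FD_ISSET
macro; the two function names are hypothetical and nothing here comes
from the X11R5 sources.

```c
#include <stddef.h>
#include <sys/select.h>

/* BSD-style count: one for every bit set in any of the three sets. */
int count_ready_bits(int nfds, fd_set *r, fd_set *w, fd_set *e)
{
    int fd, n = 0;

    for (fd = 0; fd < nfds; fd++) {
        if (r != NULL && FD_ISSET(fd, r)) n++;
        if (w != NULL && FD_ISSET(fd, w)) n++;
        if (e != NULL && FD_ISSET(fd, e)) n++;
    }
    return n;
}

/* SVR4-style count: one for every descriptor set in at least one set. */
int count_ready_fds(int nfds, fd_set *r, fd_set *w, fd_set *e)
{
    int fd, n = 0;

    for (fd = 0; fd < nfds; fd++) {
        if ((r != NULL && FD_ISSET(fd, r)) ||
            (w != NULL && FD_ISSET(fd, w)) ||
            (e != NULL && FD_ISSET(fd, e)))
            n++;
    }
    return n;
}
```

Call whichever one matches the assumption your code makes on the sets that
select() hands back, and use that instead of the return value.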
33. (4.2) The login program requires its PPID to be 1
Rick Richardson reports: "The "/bin/login" program has been changed to be
hardwired to require its PPID to be "1". In all other versions of UNIX, it is
sufficient that there be an /etc/utmp entry. This bug was reported to USL, and
I did get a fixed "login" program from them, but the fix did not make it into
the release. I don't know how mere mortals get the fix at this point."
34. (4.2) Bad MAXMINOR values can make the system unbootable
Rick Richardson reports: "If MAXMINOR is stune'ed to the maximum value,
0x3fff (18 bits), then the kernel will refuse to boot, cycling up to driver
initialization and then doing a processor reset. Interestingly, this bug was
not in the beta release, but was in the final release."
35. (4.2) Incompatible change in TZ interpretation
Rick Richardson reports: "While not really a bug, this is a surprise. In
4.2, the TZ variable was given a new meaning. Rather than the traditional
CST6CDT type of value, it now looks like ":US/Central". This causes 3.2 and
4.0 binaries which use the date/time routines to report GMT time. I have no
idea why another variable name was not chosen. I've taken to aliasing the
binaries, e.g. "TZ=CST6CDT svr4binary"."
Mike "Ford" Ditto <ford@omnicron.com> corrects this. "This change
was made in 4.0, not 4.2, and 4.0 binaries should have no problem with
the new format. Some 4.0 systems use the new format by default. The
old format should be avoided unless SVR3 binaries are in use, since
the new features of the time conversion libraries are only available
if the new format is used."
Christoph Badura points out that the time functions still read the old
TZ format, so you can set TZ=CST6CDT or whatever and only the new features
will be disabled.
36. Nulls in pixmaps can crash X
Rick Richardson reports: "Displaying XPM2 pixmaps which have NULLS in them
will crash the X server. Admittedly, this is not much of a bug, since these
are ill-formed or corrupted pixmaps. But the server should stay up, even in
these conditions. A little error checking needed."
37. Potential security hole in SVr4s using sendmail
Christoph Badura writes: "/usr/ucblib/aliases contains an alias for
decode that feeds straight into uudecode. I don't know under what uid
uudecode gets invoked, but if it's root anyone can overwrite any file
on a SVR4 system running the stock sendmail. [Under Dell UNIX] it
appears that the files get created with a user-ID of "daemon". Not
nice but better than root."
38. Reporting bug in df on non-root filesystems
Paul De Bra <debra@win.tue.nl> discovered that if df(1) is run on a
filesystem other than root with an argument of `.', the file system
name is always reported as '/'. This does *not* happen if you give
it $PWD as argument.
This bug is present in Dell 2.2.
39. tar writes -v output to stdout, not stderr
This is an incompatible, undocumented change from earlier UNIXes and
royally screws up invocations like /bin/tar cvf - foo | /bin/tar tf - that
previously worked.
Observed in ESIX 4.0.3A and 4.0.4, Dell 2.2; probably generic. It
also existed in SCO ODT and Xenix before 2.0 and 3.2v4, but has been fixed in
these most recent versions.
40. SIGPIPE is delayed and not reliable
Wolfgang Denk reports a kernel bug in src/uts/i386/fs/fifofs/fifovnops.c
that results in SIGPIPE not getting raised immediately by failed writes.
You can reproduce this with the following program:
1 #include <stdio.h>
2 #include <signal.h>
3
4 extern int errno;
5
6 int sp();
7
8 int eop = 0;
9
10 char *line = "This is garbage.\n";
11
12 main () {
13 int i;
14 int l = strlen (line);
15
16 signal (SIGPIPE, sp);
17 for (;;) {
18 /*
19 for (i=0; i<10000; ++i) ;
20 */
21 if (write(1, line, l) != l) {
22 fprintf (stderr, "write error, errno=%d, eop=%d\n",
23 errno, eop);
24 fflush (stderr);
25 exit (errno);
26 }
27 }
28 }
29
30 int sp()
31 {
32 fprintf (stderr, "SIGPIPE\n");
33 fflush (stderr);
34 eop = 1;
35 }
To test this, pipe its output to ls.
He writes: "That is, you can't be sure that SIGPIPE will be raised when a pipe
breaks. Adding a short delay (for instance by uncommenting the for loop around
line 19) gives _always_ SIGPIPE -- but usually you don't want to have
additional delays in your program :-("
Bernard Fouche <bernard@cpio1.fr.mugnet.org> observes that this is
not necessarily a bug. He writes: "Compile your example with the
following change :
- do not include your delay loop.
- add a line between line 24 and 25. This line will be :
sleep(60);
This change will make a.out stay alive for 1 minute before
exiting.
- recompile, run with 'a.out|ls'.
- do 'ps -le |grep a.out'.
What you'll see is that a.out is now running in the background and its
father is init(1)! So the return value of write(2) (EIO) can now be
understood.
The only thing that I can tell is that pipes, that are now based on
streams in SVR4, have a more complex behavior than in SVR3.2 but I
would not call problem #40 a 'bug'. It can be related to the shell
that ran the command and/or the scheduler and/or the stream subsystem."
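Whichever view you take, a program that must not depend on when (or
whether) SIGPIPE arrives can ignore the signal and rely on the return
value of write(2) alone. The following is a sketch of that approach under
ordinary POSIX semantics, where a write to a broken pipe fails with EPIPE
once SIGPIPE is ignored; given the EIO reported above, it is safest to
treat any write error as fatal rather than test for one errno value. The
function names are hypothetical.

```c
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <unistd.h>

/* Stop the kernel from killing or interrupting us with SIGPIPE.
   Returns 0 on success, -1 on failure. */
int ignore_sigpipe(void)
{
    return signal(SIGPIPE, SIG_IGN) == SIG_ERR ? -1 : 0;
}

/* Write all len bytes of buf to fd, retrying short writes and EINTR.
   Returns 0 on success, or -1 with errno set by the failing write. */
int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;        /* interrupted before any data moved */
            return -1;           /* broken pipe or other real error  */
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}
```

Call ignore_sigpipe() once at start-up, then stop producing output as soon
as write_all() returns -1.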
41. /usr/lib/acct/fwtmp doesn't work
John F. Haugh reports that under Dell UNIX the /usr/lib/acct/fwtmp command
does not work as described in the man page; the output contains no line
feeds and appears to be garbage. I have verified this.
This is probably a generic SVr4 bug.
42. whatis database is full of garbage.
Raymond Nijssen <raymond@woensel.es.ele.tue.nl> reports: "Both under ESIX
4.0.3 and 4.0.4, whatis database contains an awful lot of garbage, such as
nroff macros. In addition, quite a lot of man pages mentioned are missing, and
several available man pages are not mentioned. Since makewhatis is broken (at
least under 4.0.3A), this cannot be repaired easily. ESIX blamed USL for
this."
43. mmap is seriously broken
(thanks to Peter Wemm <peter@zeus.dialix.oz.au> for a detailed report.)
ALL SVR4.0s have/had a nasty kernel bug that causes seemingly random executable
and shared library corruption, and also unleashes a SERIOUS security bug. The
"Copy-on-Write" mechanism within the kernel has bugs. It is sufficient to say
that the security related bug allows any user with shell and compiler access to
WRITE to any file that they can read.
SVR4.2 has been fixed for some time. ICL apparently fixed it in their sparc
reference port (and x86 port), which means that Solaris2.x do not have the
bugs.
The most common symptom of shared library corruption is that programs
simply core dump when you attempt to access a non existing file.
$ more /notexisting
Segmentation Fault (core dumped).
To recover from this, restore /usr/lib/libc.so.1 from the distribution media.
The security bugs have no known workaround, other than crippling the mmap()
function in the kernel.
Dell has produced a fix for their release 2.2 systems. The patch is
available from dell1.dell.com:/support2.2/CoW.t
Although it has not been tested, it is very unlikely that Dell's patch will
work on any other SVR4/386, as it replaces two kernel modules, and Dell's
kernel has autoconfiguration extensions that are not present in other systems.
Dell 2.2 has got a STREAMS optimizer function enabled in the system that joins
together small adjacent streams messages. There were bugs in the early USL
versions of this, but for 2.2, Dell enabled it after applying a fix from USL.
It seems that in some rare circumstances, some machines are quite unstable with
this enabled as default. support2.2/CoW.t also disables the optimization to
improve stability. This brings Dell 2.2 into line with the other SVR4.0.4
systems.
44. a bug in xterm
Nickolay Saukh <nms@ussr.eu.net> reports "xterm strips off the eighth bit of
first character in line. This bug was present in x11r5 but fixed by some
patch. I have no exact info under my thumb."
(Can anyone else confirm this bug?)
45. DrawText16() bug in XWIN
Nickolay Saukh <nms@ussr.eu.net> reports "XWIN bug for DrawText16(). If one
tries to output text line with more than one font, then text segment with
second font (and subsequent segments) displayed shifted to left. This bug also
fixed by some patch to x11r5."
(Can anyone else confirm this bug?)
46. output redirection with exec fails in sh
Andreas Luik <luik@isa.de> reports: "In Bourne shell scripts, the output of
all following commands may be redirected using the "exec" builtin with an
output redirection, e.g.
exec > LOG
If such a construct is used in a for loop with a variable filename for the
redirection, e.g. exec > $f, only the first output redirection is executed in
the SVR4 /bin/sh. It works correctly in /bin/ksh as well as in the HPUX, SunOS
4.1 and AIX Bourne shells."
47. rm fails to reject . or .. arguments
Andreas Luik <luik@isa.de> reports: "rm does not check for `.' and `..'
arguments. The rm program should check for the arguments `.' and `..' (at
least if called with the -r option) and ignore these arguments with the message
"rm: cannot remove `.' or `..'". All implementations I'm aware of perform this
check. As far as I know, this check is also in the SVR4 sources but implemented
incorrectly. This bug should be fixed for security reasons."
III. Serial-port and tty administration problems
1. Dropout problems with tty devices
The most serious problem anyone has reported is that the USL asy driver is
flaky and occasionally drops characters at above 4800 baud.
Microport, Dell, Esix, and UHC say that they believe they've fixed this.
However, Dell, at least, was mistaken when they first made this claim; a more
detailed description of the problem is given below. I have been assured that
this is on the fix list for the next Dell release.
Bela Lubkin at SCO comments "386 interrupt latency vs. unbuffered UARTs.
This is a tough problem. Nobody's driver should drop characters with a
turned-on 16550. It's not so easy with a 16450. Anyone with 16450s or lower
should be able to solve their problems by dropping in a 16550."
2. Quick port setup option in sysadm is broken
In 4.0.3 sysadm, the quick port setup option, which is used to add and
delete terminal ports, is seriously broken. The script modifies /etc/conf/*
files, and has incorrect minor numbers, sets the 5th field of sdevice.d to Y
when it should be N, and is missing columns for node.d. See
/usr/sadm/sysadm/bin/q-add. This bug is present in USL 4.2 as well
(certainly in Consensys V.4.2).
3. ttymon drops DTR when it shouldn't
Hugh Stearns <hoyt@isus.tnet.com> reports that in 4.0.3.6 the ttymon(1)
utility for HDB uucp drops DTR every few weeks. The workaround is to disable
and re-enable it.
The SVr4.2 ttymon is even more broken; it *never* raises DTR after the
first outgoing call. Jeremy Chatfield at IF has confirmed that this is a
real bug in the USL sources and is on his urgent-fix list.
In the May 10, 1993 issue of Open Systems Today, page 70, Jason Levitt
describes some of his ttymon problems. He has a file posted on ftp.uu.net
under /published/open-systems-today/other/svr42uucp.tar. This tar file
contains a fixed ttymon program along with a text file describing setting
up ttymon and uucp so that it works pretty well.
4. ttymon doesn't drop DTR when it should
Stephen Hebditch <steveh@orbital.demon.co.uk> reports from a Dell
2.2 system:
"When a user logs out, ttymon does not appear to lower the DTR line for
a sufficiently long enough time to always cause the modem to drop
carrier. The WorldBlazer modem here is set to its default of 50ms DTR
detection time - the minimum time allowable - but around 2 times out of
10, when a user logs out it will not drop carrier although the DTR
light on its front panel can be seen to blink momentarily.
Disabling service for a particular device (e.g. using 'pmadm -d -p
ttymon3 -s 00') will only work if ttymon hasn't spawned a child process
for that port.
According to the manual "ttymon should exit if no one types anything in
<timeout> seconds after the prompt is sent". Occasionally when hanging
up an outgoing connection, spurious characters can trigger ttymon into
thinking that there is a new user wanting to log in. Because it has
seen these characters, ttymon will then not time-out, locking up that
port until the controlling ttymon child process is killed."
See the fix note attached to III.3.
5. (4.2) Terminating cu to a direct line locks up the port
The problem is the C2 security mechanisms. Terminating cu with ~.
doesn't tear them down correctly. Subsequently, another cu(1) will be
able to get at the port, but utilities which try to get at it directly (i.e.,
cat or stty) won't be.
Rick Richardson <rick@digibd.com> adds: "The "cu" problem where ports
can't be used by stty, seyon, or other programs once "cu" has had its way
with them: This problem apparently affects any program (cu, uucp) that uses
the DIAL(3) routines. Those routines have been modified to use the "cs"
connection server daemon to open the port and/or dial a phone number on behalf
of the client (though you'd hardly realize this from reading the manual page).
The "cs" daemon does *something*, where *something* is not known yet, which
causes all subsequent termio type ioctl's to fail. This bug has been reported
to USL and Univel, but no fix has been forthcoming."
He continues: "I had our streams device driver guy put in a version of one
of our serial port drivers with debugging turned on, and he said that it looked
like the driver "close" routine was never getting called - possibly because the
device close call only happens on the last close of a device, and the
connection server has still got the port open. This theory would seem to
indicate that "cu" and "uucp" are fine, but that the connection server is
broken. We don't really know, though -- it's just a theory."
See the fix note attached to III.3.
6. Hardware flow control bug breaks streaming data transfers
Stephen Hebditch <steveh@orbital.demon.co.uk> reports from a Dell
2.2 system:
"There is a definite problem with hardware flow control. If
characters are being continually sent to the modem with no break, then
after around 40K or so the asy driver will ignore the fact that the
modem has lowered the CTS line and will keep on sending. Up to that
point it will correctly stall when the CTS line is lowered. If there
is a break in sending, then flow control will work correctly once
more. This means that streaming protocols such as Z-Modem will break
but simpler protocols like UUCP g which don't fill up the modem buffer
will work correctly."
Your editor has seen this one himself while attempting to use rz
for uploads to his friendly Internet site, as was his wont under SVr3.
I now get around this by using ymodem protocol for uploads.
This is probably a generic bug in 4.0.4 serial handling.
7. Bad interaction between ttymon and networking
Stephen Hebditch <steveh@orbital.demon.co.uk> reports from a Dell
2.2 system:
"A problem with ttymon, in.telnetd and in.rlogind. When a user logs out,
wrong entries are written to utmp and wtmp. This results in utmp and
wtmp containing a new record for that user for a session starting at
the time that they logged out. This results in some programs (finger
for example) showing that users are logged in when they are not and
means that login accounting is not possible."
See the fix note attached to III.3.
IV. Networking and File Sharing Bugs
1. NFS locking is unusably slow
Randy Terbush <randy@dsndata.dsndata.com> has posted code which
demonstrates a serious bug in the SVr4 NFS locking daemon.
In his own words: "The symptoms are ~30% cpu usage by 'lockd' and
severe slowing of the machines on the network. This program
demonstrates that it takes ~20 seconds to obtain locks from an ailing
'lockd'. We have verified that this bug does not exist in HPUX 8.0x."
Randy's code is too large to be included here. He is, quite
rightly, exercised at USL's exceedingly slow response to this problem.
The comment in his makefile reads, in part:
# USL has admitted to the existence of this bug in version 4.0, 4.1,
# and 4.2 of their distributed and yet to be released sources. This is
# a network crippling problem that they have refused to fix until
# release 4.3, which will be OVER 1 YEAR from today. (29 Oct 1992)
# If your version of 'lockd' exhibits this same problem, I would
# strongly urge you to contact your vendor and ask them to put some
# pressure on USL to fix this problem. SVR4 is virtually useless in a
# network of shared resources while this problem exists.
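Randy's program is not reproduced here, but the measurement it makes can be
approximated in a few lines. The sketch below is a much smaller stand-in and
should not be mistaken for his code: it times how long a single blocking
fcntl(2) write lock takes on a file whose path the caller supplies. The
function name time_lock is hypothetical. Pointing it at a file on an
NFS-mounted directory is what would exercise lockd; on a local file it only
measures the local lock path.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

/* Take and release a whole-file write lock on path.
 * Returns the seconds spent waiting for the lock, or -1.0 on error. */
double time_lock(const char *path)
{
    struct flock fl;
    struct timeval t0, t1;
    int fd = open(path, O_RDWR | O_CREAT, 0644);

    if (fd == -1)
        return -1.0;

    memset(&fl, 0, sizeof fl);
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* 0 means "to end of file" */

    gettimeofday(&t0, NULL);
    if (fcntl(fd, F_SETLKW, &fl) == -1) {
        close(fd);
        return -1.0;
    }
    gettimeofday(&t1, NULL);

    fl.l_type = F_UNLCK;
    fcntl(fd, F_SETLK, &fl);
    close(fd);
    return (double) (t1.tv_sec - t0.tv_sec)
        + (double) (t1.tv_usec - t0.tv_usec) / 1e6;
}
```

A result measured in seconds rather than milliseconds for an uncontended lock
would match the behaviour Randy describes.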
2. UFS file system problems
In stock USL 4.0.3, you can't use a UFS file system as the root; the system
hangs if you try. Consensys, Dell, Esix, Microport, MST, and UHC all
appear to have fixed this.
David Aitken, the UNIX product manager at UHC, writes "The ufs as root file
system [problem] was not really a bug, just a little oversight on USL's part -
we have fixed it completely by adding one line to the /stand/boot script:
rootfstype=ufs!" He adds that they've been using ufs on their lab machines for
over 10 months with no trouble, and the latest UHC release defaults to ufs if
you have more than 120MB of disk.
3. Byte-order problem with NFS when accessing Sun disks
Christoph Badura <bad@generics.ka.sub.org> notes that the stock USL resolver
library suffers from serious confusion about the byte order in the
sockaddr_in structure. This bug is acknowledged by USL for the 4.0.4
release. A symptom of this bug is that Sun disks will not mount correctly over
NFS. As a workaround, try removing the references to /usr/lib/resolv.so from
/etc/netconfig and rebooting your system. Unfortunately, this will mean
you can't use nameservers.
Alan Batie <batie@agora.rain.com> writes: "Actually, you don't have to
remove resolv.so, just put tcpip.so first and have a hosts file with the names
of hosts you want to do NFS mounts from. This way you can use nameservers for
most things."
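For reference, the convention the resolver library is supposed to follow is
that the sin_port and sin_addr members of a sockaddr_in are stored in network
(big-endian) byte order regardless of the host CPU. The fragment below is an
illustrative sketch of filling one in correctly; it is not USL's code. It uses
inet_pton(3), which is newer than the SVR4.0 libraries discussed here, and the
function name fill_sockaddr_in is made up.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Fill in sa for the given dotted-quad address and host-order port.
 * Returns 0 on success, -1 if the address string does not parse. */
int fill_sockaddr_in(struct sockaddr_in *sa, const char *dotted,
                     unsigned short port)
{
    memset(sa, 0, sizeof *sa);
    sa->sin_family = AF_INET;
    sa->sin_port = htons(port);         /* host order -> network order */
    /* inet_pton stores the address in network order already. */
    return inet_pton(AF_INET, dotted, &sa->sin_addr) == 1 ? 0 : -1;
}
```

On a little-endian 386, forgetting the htons() call puts the port bytes in
the wrong order, which is the general kind of confusion Christoph describes.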
4. Under weird circumstances, lseek on UFS may cause corruption
Christoph Badura <bad@generics.ka.sub.org> reports that a UFS lseek() to an
offset which is a multiple of 4096 but not a multiple of 8192, followed by a
write(), may corrupt the file being written. The bug shows up only if the
file has no pages in the page pool associated with it at the seek offset and at
4k before the seek offset. He has sent USL a kernel fix for this, which was
included in 4.0.4.
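The access pattern is easy to write down as a probe. The sketch below is an
assumption-laden illustration rather than Christoph's test case: it creates a
fresh file (so no pages should be cached for it), seeks to 12288, which is a
multiple of 4096 but not of 8192, writes a short payload, and reads it back
along with the byte just before it. The name check_seek_write and the choice
of offset are mine, and a clean result does not prove a kernel is free of the
bug.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define PROBE_OFFSET 12288L     /* 3 * 4096: not a multiple of 8192 */

/* Returns 0 if the data reads back intact, 1 if it does not,
 * and -1 on an I/O error. */
int check_seek_write(const char *path)
{
    static const char payload[] = "seek-write probe";
    char back[sizeof payload];
    char before = 1;
    int result = -1;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd == -1)
        return -1;
    if (lseek(fd, PROBE_OFFSET, SEEK_SET) == (off_t) -1)
        goto out;
    if (write(fd, payload, sizeof payload) != (ssize_t) sizeof payload)
        goto out;
    /* Re-read the last byte of the hole, then the payload itself. */
    if (lseek(fd, PROBE_OFFSET - 1, SEEK_SET) == (off_t) -1)
        goto out;
    if (read(fd, &before, 1) != 1)
        goto out;
    if (read(fd, back, sizeof back) != (ssize_t) sizeof back)
        goto out;
    result = (before == 0
              && memcmp(back, payload, sizeof payload) == 0) ? 0 : 1;
out:
    close(fd);
    return result;
}
```

The path should name a scratch file on the UFS file system under test.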
5. FTP problems
The in.ftpd on SVR4.0.3 does not support all the commands listed in RFC 959.
When recent SCO UNIX/ODT versions ftp to SVR4.0.3, the SVR4 side will refuse,
drop the connection, and core dump after you authenticate. This is because the
SCO end sends the 'SYST' command ala RFC 959, and the SVR4.0.3 end doesn't
recognise it. Some ports have fixed this.
Christoph Badura adds: "The bug is due to a longjmp(3) on a sigjmp_buf obtained
by sigsetjmp(3). ARGH. Testing led to a bug in the original BSD sources, which
is still present in the NET/2 ftpd. "
6. A bug in the WD80x3 support
MST reports a serious bug in the SVr4 kernel support for this card. Here's
how to reproduce it:
server: init 3 and share (export) /usr for example.
client: mount -F nfs server:/usr /mnt
cd /mnt
find . -print | cpio -ocBuv > /dev/null
what happens:
server and client will "hang" together.
"cue":
hit keys on server and/or client, hang will go away
for 10-20 seconds temporarily. Yank BNC connectors
do the same trick.
They say they've heard from customers that this happens on Dell, UHC as well
as USL 4.0.4. PCNFS/BWNFS network xcopy suffers this as well. Client can be a
Sun Sparc for that matter.
7. Security hole near fingerd
Jerry Whelan <guru@stasi.bradley.edu> reports:
We encountered a cute security hole in AT&T SVR4 2.1 (which I believe
translates to USL 4.0.2). It apparently was fixed in AT&T SVR4 3.0. The
hole related to the finger daemon. If a user set his .plan to a symbolic
link pointing to a protected file (such as /etc/shadow, or somebody's
mail file) then fingering the user would cause the finger daemon to read
that file and display it.
I don't know if the bug exists in any other vendor's versions of 4.0.2.
We replaced our fingerd with gnu finger, only to find the same problem.
I sent the changes back to the gnu finger developer, but I don't think a
newer fixed version has been officially released yet.
Steve Peltz <peltz@cerl.uiuc.edu> writes: "The fix to the fingerd problem
(pointing a .plan file to a protected file and thus getting read access to it)
can be fixed by changing inetd.conf to not give root privileges to the fingerd
process. It seems like overkill to have fingerd set to the user id of the
person you're fingering to see if you should have access to the file."
8. Fatal bug in priority-band message handling.
Douglas C. Schmidt <schmidt@liege.ICS.UCI.EDU> reports:
There is a bug with handling priority-band messages that causes several
System V Release 4 versions (particularly Solaris 2.1) to crash. The following
code replicates the problem. Sun has been notified and claims they will fix
this problem in the next release (2.2?).
/* This program causes System V Release 4 to crash! */
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <stropts.h>

#define FIFO "/tmp/foo"
#define BIGFILE "/usr/dict/words"

/* Reader: pull messages off the stream until a zero-length one arrives. */
static int
do_child (int fifo_fd)
{
  struct strbuf msg;
  char buf[BUFSIZ];

  msg.maxlen = sizeof buf;
  msg.buf = buf;
  do
    {
      int flags = 0;

      if (getmsg (fifo_fd, 0, &msg, &flags) != -1)
        (void) printf ("(%2d) (%2d): %s",
                       (int) (msg.len - sizeof (int)),
                       *(int *) msg.buf, msg.buf + sizeof (int));
      else
        return -1;
    }
  while (msg.len != 0);
  return 0;
}

/* Writer: send each line of BIGFILE in a randomly chosen priority band. */
static int
do_parent (int fifo_fd)
{
  FILE *fp;
  char buf[BUFSIZ];

  (void) srand ((unsigned) time (0));
  if ((fp = fopen (BIGFILE, "r")) == 0)
    return -1;
  /* Leave room for the band number stored at the front of buf. */
  while (fgets (buf + sizeof (int), sizeof buf - sizeof (int), fp) != 0)
    {
      struct strbuf msg;
      int band = rand () % 11;

      msg.buf = buf;
      msg.len = strlen (buf + sizeof (int)) + 1 + sizeof (int);
      *(int *) buf = band;
      if (putpmsg (fifo_fd, 0, &msg, band, MSG_BAND) == -1)
        return -1;
    }
  return 0;
}

int
main (void)
{
  int fd;

#if defined (TEST_FIFO)
  (void) unlink (FIFO);
  if (mkfifo (FIFO, 0666) == -1)
    perror ("mkfifo"), exit (1);
#else
  int pipe_fds[2];

  if (pipe (pipe_fds) == -1)
    perror ("pipe"), exit (1);
#endif
  switch (fork ())
    {
    case -1:
      perror ("fork"), exit (1);
      /* NOTREACHED */
    case 0:
#if defined (TEST_FIFO)
      if ((fd = open (FIFO, O_RDONLY)) == -1)
        return -1;
#else
      fd = pipe_fds[0];
      close (pipe_fds[1]);
#endif
      if (do_child (fd) == -1)
        perror ("do_child"), exit (1);
      break;
    default:
#if defined (TEST_FIFO)
      if ((fd = open (FIFO, O_WRONLY)) == -1)
        return -1;
#else
      fd = pipe_fds[1];
      close (pipe_fds[0]);
#endif
      if (do_parent (fd) == -1)
        perror ("do_parent"), exit (1);
      break;
    }
  return 0;
}
9. SVr4.0.4 TCP/IP routing is broken
Raymond Nijssen <raymond@woensel.es.ele.tue.nl> reports:
"I found a problem with ESIX 4.0.4 TCP/IP routing. I'm not sure if it's also
present in other SVR4 flavors. The problem is that once a system has received
an ICMP route redirect message, it is supposed to store the new route in its
routing tables. This does not work properly, which is revealed by ping(1)ing
to a host through a gateway in a more complex network configuration. Almost
every packet is sent to a gateway other than the one which corresponds to the
network of the destination. This in turn leads to an enormous number of ICMP
messages, which leads to bad network throughput. We also had some mysterious
crashes until we decided to change the network configuration to circumvent this
problem."
(This seems very likely to be a generic SVr4 problem).
10. df(1) on NFS volumes returns bad data
Raymond Nijssen reports from Esix 4.0.3A and 4.0.4: " Diskspace figures of
NFS mounted filesystems reported by both /bin/df and /usr/ucb/df are 4 times
too big."
11. rsh hogs the processor
Raymond Nijssen <raymond@woensel.es.ele.tue.nl> reports from Esix 4.0.3A and
4.0.4: "The rsh command hogs the CPU. On an empty system, `rsh foo -n bar'
takes 1 second kernel-mode CPU per second elapsed."
12. MTU for remote networks ignored
Nathan D. Lane <nathan@seldon.foundation.tricon.com> reports: "Esix 4.0.4
ignores the MTU for remote networks. I have PPP setup on my RS/6000 and the
Esix box connects via ethernet to the RS/6000. Packets are always sent out
"full size" by the Esix machine, no matter where their destination. It is my
understanding that, when routing to a remote network where the MTU is a)
unknown or b) set to something lower than 1536, the originating machine should
make the packets smaller. Instead, when the Esix box blasts out its packets
across the PPP link, it sends them full size, making the other end do *a lot*
of packet reassembly."
This has not been confirmed on other ports, but seems likely to be
a generic SVR4 problem.
13. Bug in remote printing.
A couple of USENETters have reported that the remote-printing support for
lpr (the System V print spooler) is broken in SVr4.0. Printing is done
correctly, but the job is not then removed from the print queue on either
system.
V. SCSI Support Problems
1. sar is confused by SCSI
Sar -d doesn't work on SCSI drives. Dell fixed this in 2.1 and it's
reported to work OK in Esix 4.0.3A; no report of any other SVr4 having fixed
this yet. SCO fixed it in 3.2.4. Appears to be fixed in USL 4.2.
2. A configuration problem
Stock USL 4.0 requires you to jumper your SCSI devices to fixed IDs
during installation (it can be changed to any other ID after). Specifically,
the tape must be ID 6.
Dell says they've fixed this. The requirement is definitely still present
in Esix and Consensys 1.3. UHC thinks they've fixed this, but their 4.0.3.6
release still seems to demand ID 1 to install.
I've seen an email report that USL 4.2 still has this problem. But after
publishing this, I got a request for more info from Mike Drangula
<miked@usl.com> at USL. He wrote:
> As far as
> I know ( and I wrote the SCSI configuration tools for 4.2 ), there is only
> one case where a device is required to be at a particular SCSI ID, unless
> you count the requirement that the HBA be at ID 7.
>
> The only requirement for a given SCSI id is that, on a SCSI-based MCA
> machine that uses IBM's SCSI Host Adapters, the boot disk must be at ID 6
> if there is more than one disk installed on the HBA.
>
> The old requirement that the tape be set to SCSI ID 6 is no longer in effect.
> If your HBA will support booting from it, there is not even a requirement
> that the boot SCSI disk be at SCSI ID 0. The only requirement for disks is
> that the boot disk must have the lowest SCSI ID of any DISKS on the system
> ( except in the already noted case of MCA SCSI )
Give Mike a hand for actually reading this bug list.
3. Synchronous SCSI hang problem
David Wexelblat <dwex@mtgzfs3.att.com> reports: "Stock SVR4.0.3 will hang
the SCSI bus with a 1542 in synchronous mode. Dell fixed this, and this has
been given to Microport [ed note: Microport 4.0.4 and Consensys 4.0.3 have
fixed the problem; MST UNIX and Esix 4.0.3 still have this problem; I have not
yet been able to determine if ESIX 4.0.4 does]. In the file /sbin/bcheckrc,
change the line:
echo MARK > /dev/rswap
to
echo MARK | dd of=/dev/rswap bs=512 conv=sync > /dev/null 2>&1
The magic is apparently the conv=sync, which forces a 512 byte block
to be written. The original echo writes 4 bytes, which apparently causes
synchronous SCSI to go out to lunch.
Now, you ask, how can I fix this, since the system won't boot? There are
a couple of methods. First, if possible, disable synchronous negotiation
(1542 jumper J5-1 removed, plus whatever you may need to do to your drive).
Then boot up, edit /sbin/bcheckrc, then shutdown, restrap for synchronous,
then reboot. Everything should be OK.
That's the easy way. Unfortunately, some hard drives will only work
in synchronous mode. Well, you can still recover from this phenomenon.
Here's how:
1) Install on your hard drive
2) Boot from the first boot floppy. When it tells you to, insert
the second boot floppy. At the first prompt, hit <DEL> to
break out to a shell.
3) Mount your hard drive under /mnt with the following command
(replace FS-TYPE with s5, s52, or ufs, whichever you used for
your root partition):
/etc/fs/FS-TYPE/mount /dev/dsk/c0t0d0s1 /mnt
4) Now edit /mnt/sbin/bcheckrc:
ed /mnt/sbin/bcheckrc
You may want the 'ed' man page handy (I barely remember how to
use 'ed' :->). For simplicity, you can delete/comment out
the offending line, then replace it with the correct line later.
5) Unmount the hard drive:
umount /mnt
6) Reboot from the hard drive. Everything should come up OK, and
you can finish editing /sbin/bcheckrc, if necessary.
Note that you perform these actions at your own risk. The first version was
performed by me on Microport SVR4, and the second was performed by someone
else (on my suggestion) on ESIX SVR4."
This problem appears to be fixed on Consensys 1.3 and Dell 2.1; also
(pace David's remark) in ESIX 4.0.4, which has
echo MARK | /sbin/dd.arch conv=sync > /dev/rswap 2> /dev/null
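What conv=sync contributes is padding: a short input is filled out with NUL
bytes to a full block before it is written. For anyone who needs to do the
same thing from a program, here is a small sketch of the idea. It is only an
illustration of the padding step, with invented names (pad_block,
write_padded) and a 512-byte block size taken from the dd command above; it
does not touch /dev/rswap itself.

```c
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 512

/* Copy up to BLOCK_SIZE bytes of data into block and fill the rest
 * with NULs.  Returns the number of data bytes actually copied. */
size_t pad_block(char block[BLOCK_SIZE], const char *data, size_t len)
{
    if (len > BLOCK_SIZE)
        len = BLOCK_SIZE;
    memset(block, 0, BLOCK_SIZE);
    memcpy(block, data, len);
    return len;
}

/* Write data to fd as exactly one full, NUL-padded block.
 * Returns 0 on success, -1 on a failed or short write. */
int write_padded(int fd, const char *data, size_t len)
{
    char block[BLOCK_SIZE];

    pad_block(block, data, len);
    return write(fd, block, BLOCK_SIZE) == BLOCK_SIZE ? 0 : -1;
}
```

The point is that the device only ever sees whole-block transfers, which is
what the modified bcheckrc line arranges.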
4. ps chokes on commands that do SCSI I/O
Hugh Stearns <hoyt@isus.tnet.com> reports that in 4.0.3.6, ps
doesn't work when a SCSI command is in progress. It stops printing at the
process executing the scsi command.
This is still broken in Dell 2.2 and ESIX 4.0.3.
5. Transfer speed problems with Adaptec 1542B on 486s
If a system mount or install fails, try setting the DMA speed to 5MB/s,
rather than the default 5.7MB/s. This is accomplished by removing the jumper
shorting the 12th pin pair of jumper block 5.
6. df gives inaccurate values for large SCSI partitions
Derek Terveer <derek.terveer@stpaul.gov> reports "I was on a Esix 4.0.4
system recently with a >1024 cylinder (i.e., ~1.05 GB disk) and the df command
was giving wildly inaccurate values. I presume that this has something to do
with the size of the partitions, because it works just fine on a system with
smaller drives and partitions."
VI. Development Tools Problems
1. General UCB library brokenness
The BSD compatibility libraries were badly broken in USL code. A Dell
source adds "That meant that almost all the apps derived from them were broken
too. Most stuff like automount will die when you send a SIGHUP, instead of
rereading the map file. You can get a system into very strange states when
that happens."
John Sully <jms@mport> of Microport opines: "This is a bug in automount
itself rather than BSD compatibility, since the automount which comes with SVR4
is not compiled with the BSD libraries. (isn't this comforting?? :-()."
Peter Wemm <peter@DIALix.oz.au> reports "There is a very simple and reliable
cure for this sort of thing: Using your favourite hex editor, change all
instances of "signal" in the binary file to "sigset". Most BSD code assumes
that signal() auto-rearms after handling a signal. On SVR4, signal() does not,
but sigset() is argument compatible, and has BSD semantics."
Esix and UHC's BSD libraries are USL stock. I don't yet know
the status of other ports. Microport has run into things they think may be
symptoms of this but have no fix yet.
John Sully <jms@mport> of Microport counters with: "One common thread I find
on reading of these problems is that the BSD compatibility libraries are
*misused*. [...] The problem is that BSD and SYSV have similarly named .h files
which sometimes contain different definitions for objects with the same name.
This has been known to cause all sorts of problems because the SYSV headers are
picked up and then the calls are satisfied from the BSD library rather than the
shared object library. I have found that if you use /usr/ucb/cc that the BSD
compatibility is much less broken than it would seem at first because it
ensures that the correct headers are picked up."
However, note that there is at least one *real* bug known --- as of 4.0.4
the signal emulation cannot explicitly set a handler to SIG_DFL or SIG_IGN.
Developers should be very careful that if they use -L/usr/lib/ucb -lucb
the cc used is also the Berkeley cc.
2. USL emulation of BSD signals doesn't work
A different source reports that the USL implementation of BSD signals
is broken in both 4.0.3 and 4.0.4; in particular, the sigvec() family doesn't
work properly. It is possible to make minor tweaks to source to make such apps
work properly with the native USL signals implementation.
Here's more on the signals problem, thanks to Richard <rc@siesoft.co.uk>:
------------------------------------------------------------------------------
The problem is to do with the signal() function that is within the BSD
compatibility libc.
To reproduce the problem do the following:
    #include <stdio.h>
    #include <sys/types.h>
    #include <signal.h>
    #include <sys/siginfo.h>

    main()
    {
        signal(SIGPIPE,SIG_IGN);
        pause();
    }
and compile it with cc xx.c -o xx /usr/ucblib/libucb.a
(John Sully observes that this is definitely wrong; /usr/ucb/cc should have
been used rather than "cc ... -L/usr/ucblib -lucb" or the equivalent "cc ...
/usr/ucblib/libucb.a".)
If you run the program and then signal it with a SIGPIPE, the program
will die, even though you've told it to ignore SIGPIPE.
The fix is difficult unless you've got source because there's a missing 'else'
clause from the signal() code. This is the only signal fault I've found in
the BSD signal functions, details of the rumoured sigvec problem would be
useful?
If you're trying to compile an application you could change the application
code to do the following, this does work..
    void
    catch(s)
    int s;
    {
        /* DO NOTHING */
        ;
    }

    main()
    {
        signal(SIGPIPE,catch);
        pause();
    }
SUMMARY
You can only change a signal handler to a function handler, any number of
times. Any attempt to set the handler to SIG_DFL, or SIG_IGN will fail.
This bug has given some people working with X11R5 aggro, causing the X server
to die when you close a client.
Christoph Badura <bad@flatlin.ka.sub.org> confirms this bug.
He has sent USL a source fix. It appears already to have been fixed in Dell
2.2.
------------------------------------------------------------------------------
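A program can check at run time whether the signal() it was linked with
honours SIG_IGN. The following sketch does the test in a child process so
that a broken library kills only the child. It is an illustration under
assumptions: the function name sig_ign_is_honored is invented, and the result
says something about the bug only when the program is built against the BSD
compatibility library in question.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if a child that sets signo to SIG_IGN survives that
 * signal, 0 if the signal kills it anyway, -1 on any other failure. */
int sig_ign_is_honored(int signo)
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;
    if (pid == 0) {
        /* Child: ignore the signal, send it to ourselves, then exit
         * normally if we are still alive. */
        signal(signo, SIG_IGN);
        kill(getpid(), signo);
        _exit(0);
    }
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    if (WIFSIGNALED(status) && WTERMSIG(status) == signo)
        return 0;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 1 : -1;
}
```

Calling sig_ign_is_honored(SIGPIPE) at startup lets an application fall back
to the do-nothing handler workaround shown above when the answer is 0.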
3. Possible string library problems
There are also persistent rumors of problems in the BSD-emulation string
libraries. I have not been able to pin down specifics on this.
4. USL's ndbm support is broken.
Christoph Badura <bad@generics.ka.sub.org> reports "The ndbm functions in
the ucb library are broken [apparently due to a compiler or optimizer bug in cc
-- ed.]. Try making the whatis data base for /usr/share/man with Tom
Christiansen's perl rewrite of man.
The easiest way to fix this is to compile GNU's replacement ndbm.c with gcc
-fpcc-struct-return -traditional (gcc1.40 or 2.2 will do nicely) and install it
in your C library. Source is available for FTP from prep.ai.mit.edu.
5. An include file is missing
Both 4.0.3 and 4.0.4 USL versions are missing the documented dial.h
file from their /usr/include directory. Dell 2.[12] has it.
6. sscanf(3) has a potential bug
Anthony Shipman <als@bohra.cpg.oz.au> reports: " I found the following bug
in SCO Unix 3.2.* and I think it may be common to many AT&T derived Unixes.
sscanf() calls _doscan() to read from a pretend file. The file
uses the string as a buffer and a fake file descriptor of 60 (=_NFILE).
Since _NFILE (for SCO UNIX) is 60 it assumes that fd 60 can never be open.
Then when fscanf() hits the end of the string it calls _filbuf() to read
into the buffer (which is the string) from fd 60. This should fail with
an errno=9 and then _filbuf() sets EOF and it all terminates.
However in SCO Unix you can reconfigure the kernel to increase the number
of files per process to a recommended maximum of 150. If you do this then
your program might have fd 60 open one day. Then sscanf() will read from this
file overwriting your string. The byte count to the read() in _filbuf()
is some undefined but large value so a lot of memory will be overwritten. In
my case the string was on the stack so my stack was wiped.
In short if you configure your kernel to have NOFILES > _NFILE ie more than
the default then sscanf() is a time bomb in your code."
This is alleged to have been fixed in SVr4, but I haven't been able to
confirm the fix. Bob Tinsmamn of SCO support writes: "We're fixing it
too, in a maintenance supplement to the Development System that will
come out at the end of this year or the beginning of 1993, known as
Development System Maintenance Supplement 4.2 or MSD 4.2."
7. shmat(2) vs. vfork(2)
The shmat(2) call is known to interact badly with vfork(2). Specifically,
if you attach a shared-memory segment, vfork(), and then the child releases
the segment, the parent loses it too! Workaround: use fork(2).
UHC and Microport both suspect that they still have this bug and opine that
anyone who uses vfork deserves to lose. Dell has no plans to fix it.
John Sully <jms@mport.com> writes: "This is not a bug. It is completely
consistent with the semantics of a change to the address space of the child.
Think about it: any change to the address space of a child process created by
vfork(2) is reflected in the parent since the child is actually executing in
the parent's address space. Therefore if the child changes the address space
(in this case by releasing the shared memory segment) what should happen?
Right, the parent should have the same change happen. And what does happen?
The segment is released in the parent. One can argue about the braindead
semantics of vfork(2) all day, but the fact remains that this is exactly what
one would expect to happen. To quote from the manual page:
[...] vfork differs from fork in
that the child borrows the parent's *memory* and thread of
control until a call to execve or an exit (either by a call
to exit or abnormally.) [ emphasis added ]
and later:
It does not work, however, to return while
running in the child's context from the procedure which
called vfork since the eventual return from vfork would then
return to a no longer existent stack frame.
Please note that the entire address space of the parent is used by
the child created by vfork(2). The manual page also points out
several other caveats involved in doing anything to the parent's
address space except successfully calling an exec family function or
_exit (note it specifically says *not* to call exit(2)). I do not believe
that having a shared memory segment disappear from the parent's address
space is out of line after reading the man page for vfork(2).
It is interesting to note that Sun after implementing its new VM system in
SunOS 4.0 initially had no plans to support vfork, since they felt that the COW
semantics of the new fork would provide the necessary efficiency gain. Indeed
they found that most programs which used vfork worked just fine by doing
-Dvfork=fork. All that is, except for a certain popular command interpreter
[ed: can you say C shell?]. So we are stuck with the legacy of this braindead
system call.
BTW, Microport has no plans to fix this :-)."
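The fork(2) workaround recommended above can be sketched as follows. This is only an illustration, not code from any vendor: the function name segment_survives_fork() is made up, and the segment size and permissions are arbitrary. With fork() the child gets its own address space, so its shmdt() does not disturb the parent's attachment.

```c
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: attach a private segment, fork(), let the child
 * detach it, then check that the parent can still use it.
 * Returns 1 on success, 0 if the child failed or the data is wrong,
 * and -1 if the segment could not be set up at all.
 */
int segment_survives_fork(void)
{
    int id, status, ok;
    char *seg;
    pid_t pid;

    id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (id == -1)
        return -1;
    seg = (char *)shmat(id, (void *)0, 0);
    if (seg == (char *)-1) {
        shmctl(id, IPC_RMID, (struct shmid_ds *)0);
        return -1;
    }
    strcpy(seg, "before");

    pid = fork();                       /* fork(), not vfork() */
    if (pid == -1) {
        shmdt(seg);
        shmctl(id, IPC_RMID, (struct shmid_ds *)0);
        return -1;
    }
    if (pid == 0) {
        /* Child: this detach affects only the child's address space. */
        _exit(shmdt(seg) == 0 ? 0 : 1);
    }

    ok = 0;
    if (waitpid(pid, &status, 0) == pid &&
        WIFEXITED(status) && WEXITSTATUS(status) == 0) {
        /* Parent: the attachment should still be valid and writable. */
        strcpy(seg, "after");
        ok = (strcmp(seg, "after") == 0);
    }
    shmdt(seg);
    shmctl(id, IPC_RMID, (struct shmid_ds *)0);
    return ok;
}
```

If the child had been created with vfork() instead, the parent's write to the segment is exactly the access that would be expected to fault on the systems described above.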
8. FIONREAD fails on regular files
Christoph Badura <bad@generics.ka.sub.org> reports that the FIONREAD ioctl()
fails on regular (disk) files. He has sent USL a one-line kernel fix.
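Until a fixed kernel is available, a program can sidestep the problem for regular files by computing the answer itself. The following is a minimal sketch under stated assumptions: the name bytes_readable() is invented here, and FIONREAD is assumed to be visible through <sys/ioctl.h> (some SVR4 ports may want <sys/filio.h> instead).

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>  /* assumed home of FIONREAD; may differ per port */
#include <unistd.h>

/*
 * Hypothetical replacement for a bare FIONREAD call.
 * For regular files, report file size minus current offset;
 * for everything else, fall back to the ioctl.
 * Returns the byte count, or -1 on error.
 */
long bytes_readable(int fd)
{
    struct stat st;
    off_t pos;
    int n;

    if (fstat(fd, &st) == -1)
        return -1;
    if (S_ISREG(st.st_mode)) {
        pos = lseek(fd, (off_t)0, SEEK_CUR);
        if (pos == (off_t)-1)
            return -1;
        return (st.st_size > pos) ? (long)(st.st_size - pos) : 0L;
    }
    if (ioctl(fd, FIONREAD, &n) == -1)
        return -1;
    return (long)n;
}
```

This never issues the ioctl on a regular file, so it does not depend on whether the one-line kernel fix has been applied.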
9. fread(3) does the wrong thing on pipes and FIFOs
Ed Hall <edhall@rand.org> writes: "Unlike the raw read() system call,
fread() is supposed to be able to make several partial reads to satisfy the
data requested by its arguments. The exceptions are an EOF or an error on the
stream. This characteristic is quite useful when moving data through pipes or
over network connections, since partial reads are quite common in these cases.
Well, the version of fread() in ESIX 4.0.3 (and likely other Sys5R4's) only
does a single physical read, and if it only satisfies part of the requested
number of bytes, that's all you get. This can sting you even if you carefully
check the value returned by fread(), since the value returned is rounded down
to the number of complete "nitems" read, although your position in the stream
can be up to size-1 bytes beyond that point. Neither ferror() nor feof()
indicate anything is wrong when this happens."
This bug (which is also present in 4.0.4) is serious and nasty and should
be high on every porting house's list to fix. It appears to be peculiar to
USL 4.0.3 and 4.0.4; 4.0.2 does *not* have it, nor does SCO.
A USL source claims it has been fixed in 4.1.
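On affected releases, portable code can protect itself by wrapping fread() in a retry loop. The sketch below is an assumption about how one might do that, not a vendor-supplied fix; the name fread_fully() is made up. It always asks for items of size 1, so the "rounded down to complete items" problem described above cannot hide bytes that were actually consumed.

```c
#include <stdio.h>

/*
 * Hypothetical wrapper: keep calling fread() until nbytes bytes have
 * arrived or fread() returns nothing more (EOF or error).
 * Returns the number of bytes actually stored in buf; the caller can
 * use feof()/ferror() to find out why a short count was returned.
 */
size_t fread_fully(void *buf, size_t nbytes, FILE *fp)
{
    char *p = (char *)buf;
    size_t got = 0;
    size_t n;

    while (got < nbytes) {
        /* size 1: the return value is an exact byte count */
        n = fread(p + got, 1, nbytes - got, fp);
        if (n == 0)
            break;              /* EOF or error: stop retrying */
        got += n;
    }
    return got;
}
```

A caller reading fixed-size records from a pipe would replace fread(rec, sizeof rec, 1, fp) with fread_fully(rec, sizeof rec, fp) == sizeof rec.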
10. putw appears to be broken
There is a bug in the ESIX SVR4.0.3A putw() routine in the C shared
library which is probably USL's. The following program demonstrates
it:
/* compile with: cc -o file file.c */
#include <stdio.h>
main()
{
int i;
for (i=0; i<1022; ++i) {
putchar('1');
}
putw(-11, stdout);
for (i=0; i<1022; ++i) {
putchar('1');
}
}
The putw() routine does not output 4 bytes, as it should. It may be
there is some interaction with buffer flushing that is causing the
problem. Also, note that if you change the sign of the first argument
to putw(), the program works fine.
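A possible way around the problem is to avoid putw() and write the word with fwrite() instead. This is only a sketch: put_word() is a made-up name, and it is an assumption (not confirmed by any report here) that fwrite() in the same library is free of the bug.

```c
#include <stdio.h>

/*
 * Hypothetical stand-in for putw(): write one int to the stream in
 * the machine's native byte order.
 * Returns 0 on success, -1 on failure.
 */
int put_word(int w, FILE *fp)
{
    return (fwrite((char *)&w, sizeof w, 1, fp) == 1) ? 0 : -1;
}
```

In the test program above, putw(-11, stdout) would become put_word(-11, stdout); a correct run then emits 1022 + sizeof(int) + 1022 bytes.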
11. Compiler problems
Ronald Guilmette <rfg@ncd.com> also reports the following:
------------------------------------------------------------------------------
/* Here is a bug in the original SVR4 C compiler (aka C Issue 5) which
effectively prevents you from making good use of the `const' and
`volatile' qualifiers defined by ANSI C in conjunction with pointer
types and typedef statements. Compile this code and you will get:
"qualifiers.c", line 23: left operand must be modifiable lvalue: op "="
...if your copy of the svr4 C compiler still has the bug. Note that
given these declarations, the ANSI C standard says that the thing pointed
to by the variable `pci' should be considered to be constant... not the
variable `pci' itself. (The GCC compiler, either version 1.x or version
2.x, correctly compiles this example without complaint.)
*/
typedef const int *ptr_to_const_int;
ptr_to_const_int pci;
int i;
void main ()
{
pci = &i;
}
------------------------------------------------------------------------------
/* Here is a subtle bug in the original SVR4 C compiler (aka C Issue 5)
which prevents you from first declaring a tagged type (i.e. a struct
type or a union type) in a parameter list, and then defining that tagged
type later on within the same scope. (Note that according to the ANSI C
standard, the scope in which parameters get declared and the outermost
block of a function body are one and the same scope. Thus, this really
is legal ANSI C code!)
Try compiling this with your C compiler on SVR4. If your compiler still
has the bug, you will get:
"tagged_type.c", line 24: warning: dubious tag declaration: struct S
"tagged_type.c", line 28: warning: improper member use: i
"tagged_type.c", line 28: warning: improper member use: i
"tagged_type.c", line 31: warning: dubious tag declaration: struct S
"tagged_type.c", line 35: warning: improper member use: i
"tagged_type.c", line 35: warning: improper member use: i
(The GCC compiler also had this bug in version 1.x, but it has been fixed
in version 2.x.)
*/
void foobar1 (arg) /* use old-style without prototypes */
struct S *arg;
{
struct S { int i; }; /* define the type `struct S' */
arg->i = arg->i; /* legal according to ANSI C rules! */
}
void foobar2 (struct S *arg) /* use new-style with prototypes */
{
struct S { int i; }; /* define the type `struct S' */
arg->i = arg->i; /* legal according to ANSI C rules! */
}
------------------------------------------------------------------------------
/* Here is a serious bug in the original SVR4 `dump' program which dumps
out parts of object files in either plain hex form or symbolically.
To see the `dump' program get a segfault and die, save this code under
the name `dump-bug.c' and then do:
cc -g -c dump-bug.c
dump -v -D dump-bug.o
The bug arises whenever `dump' tries to read Dwarf debugging information
for an array of pointers to any "user defined" type (e.g. `struct S' in
this example). Past that point, `dump' is totally confused, so further
Dwarf debugging information finally causes it to go belly-up.
*/
struct S { int i; };
struct S *array[10];
int j;
------------------------------------------------------------------------------
It appears that the svr4 C compiler (for x86 machines) doesn't conform real
well to either the letter or the spirit of the IEEE 754 floating-point
standard. In particular, "unordered comparisons" and other operations on
NaNs don't always produce the result that the IEEE 754 standard calls
for.
An AT&T source comments: "This is documented in the SVID as a future direction.
We do not support NaNs in -Xa and -Xt modes, only in -Xc. Try
isnan(sqrt(-1.0)) to determine which modes support it."
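For readers who want to check their own compiler and mode, here is a small sketch of what IEEE 754 requires of comparisons involving a NaN: every ordered comparison is false, and only != is true. The function name is invented for this example, and the NaN is produced by a run-time 0.0/0.0 rather than by sqrt() so that no math library is needed.

```c
/*
 * Hypothetical self-check: returns 1 if comparisons involving a NaN
 * behave as IEEE 754 specifies, 0 otherwise.
 */
int nan_compares_unordered(void)
{
    volatile double zero = 0.0;     /* volatile: force a run-time divide */
    double nan = zero / zero;

    return (nan != nan)
        && !(nan == nan)
        && !(nan < 1.0)
        && !(nan > 1.0)
        && !(nan <= nan)
        && !(nan >= nan);
}
```

Per the AT&T comment above, one would expect a result of 1 only in -Xc mode on the affected compilers.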
------------------------------------------------------------------------------
The compiler fails to issue diagnostics in cases where a typedef name is
reused to declare a formal parameter, as in:
-----------------------------------------------------------------------
typedef int FOO;
void bar (FOO)
int FOO;
{
}
-----------------------------------------------------------------------
The compiler crashes on the following invalid input:
-----------------------------------------------------------------------
int i;
volatile void *pvv;
void pvv_test ()
{
(i ? *pvv : *pvv); /* ERROR */
}
-----------------------------------------------------------------------
The compiler fails to issue diagnostics for cases where an attempt is
made to "forward declare" an enum type (without also defining it), as
in:
-----------------------------------------------------------------------
enum enum0 *ep; /* ERROR */
-----------------------------------------------------------------------
The compiler rejects the following code with an error, although there
seems to be no good reason why it should (because no object is being
declared).
-----------------------------------------------------------------------
#include <limits.h>
typedef char array_type[ULONG_MAX];
-----------------------------------------------------------------------
12. getlogin() doesn't work
Robert Withrow <witr@rwwa.com> reports "The posix function
getlogin() doesn't work on most svr4s (at least up to SVR4.0.3.0)...
cuserid() *does* work, but it makes porting a pain. Try it some time
and perhaps add it to your list."
Raymond Nijssen <raymond@woensel.es.ele.tue.nl> confirms this and
adds that this bug (due to utmp and wtmp file corruptions [possibly
caused by ttymon bugs described above --- ed.]) breaks executables such
as talk(1).
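Programs that only need a user name can guard against a failing getlogin() with a fallback. The sketch below is one assumed approach, with a made-up function name; note that the password-file lookup returns the name for the real uid, which is not always the name the user logged in under (for instance after su).

```c
#include <sys/types.h>
#include <pwd.h>
#include <unistd.h>

/*
 * Hypothetical helper: try getlogin() first, then fall back to the
 * password entry for the real uid.
 * Returns a pointer to static storage, or NULL if both lookups fail.
 */
const char *login_name(void)
{
    const char *name;
    struct passwd *pw;

    name = getlogin();
    if (name != NULL && name[0] != '\0')
        return name;

    pw = getpwuid(getuid());
    if (pw != NULL && pw->pw_name != NULL && pw->pw_name[0] != '\0')
        return pw->pw_name;

    return NULL;
}
```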
13. syslog routines don't work
Raymond Nijssen <raymond@woensel.es.ele.tue.nl> reports: "Under ESIX 4.0.3,
syslog routines are unusable. They are slightly better under 4.0.4, but still
severely broken."
"In addition, replacing the syslogd executable that comes with Esix with the
one provided by Marc Boucher (marc@cam.org) shows that the syslog() call itself
is sane. It's available from ftp.cam.org."
14. Bogus `r' in xt driver configuration flags
Raymond Nijssen <raymond@woensel.es.ele.tue.nl> reports: "Both under ESIX
4.0.3 and 4.0.4, the `r' flag is present in the third column of
/etc/conf/cf.d/mdevice for the [n][s]xt drivers, suggesting that these drivers
would be required for relinking the kernel. This is not the case. I saw at
least one release of Dell SVR4 in which this was ok." (Making this change
reduces the kernel's size somewhat.)
15. ioctl for kernel symbol fetches fails
Trying to obtain kernel values of certain symbols fails. The
two symbols from the kernel that are quite useful are "avenrun" and
"total" which as far as I can tell are defined in the "mm" driver.
This bug manifests itself in applications like "top", "u386mon" ...
One used to use the nlist() function call, but according to the man page
for nlist() it should not be used due to the dynamic loading and unloading
of drivers that can happen at any time in the "life" of a V.4.2 kernel.
Try the sample hack below to see if your system has the same problem.
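#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/ksym.h>
int main()
{
    int fd;
    long ar[3];
    struct mioc_rksym k;
    fd = open("/dev/kmem", O_RDONLY);
    if (fd == -1) {
        /* usually means we are not running as root */
        perror("/dev/kmem");
        exit(1);
    }
    k.mirk_buflen = sizeof(ar);
    k.mirk_buf = (void *)ar;
    k.mirk_symname = "avenrun";
    /* ask the kernel to copy the symbol's value into ar[] */
    if (ioctl(fd, MIOC_READKSYM, &k) == -1) {
        perror("ioctl");
        exit(1);
    }
    printf("%ld %ld %ld\n", ar[0], ar[1], ar[2]);
    close(fd);
    return 0;
}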
Thanks to David P. Cutter <dpc@shady.grail.com> for reporting this.
16. Bug in cc optimizer (4.2.1)
Nickolay Saukh <nms@ussr.eu.net> reports a bug in
cc, the Optimizing C Compilation System (CCS) 2.0 07/24/92
If you have global (external) structure/union with name 'tr'
commands to access very first member (with zero offset) are
garbled. Simple text to reproduce the bug
struct _tr {
int aa;
int bb;
} tr;
void
foo(int zz) {
tr.aa = zz;
}
Here is the result of cc -O -S foo.c
.file "ccbug.c"
.version "01.01"
.type foo,@function
.text
.globl foo
.align 4
.nopsets "cc"
.align 16
foo:
movl 4(%esp),%eax
movl %eax,&r
^------------- <<<< THE BUG
ret
.align 16,7,4
.size foo,.-foo
.ident "acomp: (CCS) 2.0 07/24/92 "
.data
.comm tr,8,4
.text
.ident "optim: (CCS) 2.0 07/24/92 "
** 17. /usr/ucb/install uses missing group "staff"
/usr/ucb/install uses the group name "staff" as the default group to install
programs. As this group does not exist in /etc/group, the installation will
fail. I would suggest changing the /etc/group file, as in Solaris, as follows:
nuucp::9:root,nuucp
staff::10:
VII. The FUBYTE Problem
(Thanks to Christoph Badura <bad@flatlin.ka.sub.org> for this info)
The kernel function fubyte() is documented to return a positive value when
given a valid user space address and -1 otherwise. In the latter case u.u_error
is set to EFAULT. USL SysV R4.0.3 has a sign extension bug in the
implementation of fubyte() for local file descriptors (i.e. not opened via
RFS), which causes fubyte() to return negative values if the byte fetched has
its high bit set. This bug doesn't affect STREAMS drivers, as they don't call
(and in fact are normally unable to call) fubyte(). Thus writing a byte with
the high bit set to certain character device drivers returns with -1 and errno
set to EFAULT.
The bug may affect any character device driver that calls fubyte(). It's not
limited to serial card drivers. The bug is noticed most often with serial card
drivers, since uucp uses byte values > 127 very early during g-protocol setup
and drivers for serial cards tend to use fubyte() quite often.
Note also that the bug's effect is different if the driver checks for a -1
return value of fubyte() or just a negative one. In the former case it is
possible to pass bytes with the 8th bit set through fubyte(), except for 0xff
which is -1 in two's complement. That makes the bug more obscure.
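The difference between the two styles of driver check can be modelled in user space. Everything in this sketch is illustrative: the function names are invented, and sign_extend_byte() merely imitates the value the buggy fubyte() hands back; it is not kernel code.

```c
/* Value a sign-extending fetch returns for a given byte (0..255). */
int sign_extend_byte(unsigned char byte)
{
    return (byte & 0x80) ? (int)byte - 256 : (int)byte;
}

/* Driver that treats only -1 as a fault. */
int strict_driver_sees_error(int r)
{
    return r == -1;
}

/* Driver that treats any negative return as a fault. */
int loose_driver_sees_error(int r)
{
    return r < 0;
}
```

For the byte 0x80 the modelled return value is -128: the strict driver lets it through (as a wrong, negative value), while the loose driver reports EFAULT. For 0xff the value is -1 and both drivers report a fault.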
The fix is easy. First, make a backup copy of the kernel object file
/etc/conf/pack.d/kernel/vm.o! A disassembly of vm.o(lfubyte) should reveal
*exactly* one mov[s]bl (move byte to long w/sign extend). That one needs to be
patched into a movzbl (zero extend). The difference is one bit in the second
byte of the opcode.
The movsbl has the bit pattern 00001111 1011111w mod/rm-byte.
The movzbl has the bit pattern 00001111 1011011w mod/rm-byte.
The 'w' bit is 0 for the instruction in question. So the opcodes are 0f be and
0f b6. Here is the diff -c from dis -F lfubyte showing the patch applied to
the Dell 2.1 kernel:
*** vm.o Mon Mar 9 00:31:38 1992
--- vm.o.org Mon Mar 9 00:32:40 1992
***************
*** 22,28 ****
11c90: 85 c0 testl %eax,%eax
11c92: 75 09 jne 0x9 <11c9d>
11c94: 8b 45 08 movl 8(%ebp),%eax
! 11c97: 0f b6 00 movzbl (%eax),%eax
11c9a: 89 45 fc movl %eax,-4(%ebp)
11c9d: c7 05 d8 13 00 00 00 00 00 00 movl $0x0,0x13d8
11ca7: 83 3d dc 13 00 00 00 cmpl $0x0,0x13dc
--- 22,28 ----
11c90: 85 c0 testl %eax,%eax
11c92: 75 09 jne 0x9 <11c9d>
11c94: 8b 45 08 movl 8(%ebp),%eax
! 11c97: 0f be 00 movsbl (%eax),%eax
11c9a: 89 45 fc movl %eax,-4(%ebp)
11c9d: c7 05 d8 13 00 00 00 00 00 00 movl $0x0,0x13d8
11ca7: 83 3d dc 13 00 00 00 cmpl $0x0,0x13dc
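The one-bit difference between the two opcodes in the diff above can be expressed in a few lines of C. This is a sketch operating on a byte buffer in memory, with made-up names; it shows the opcode change only and is not a tool for editing vm.o.

```c
#include <stddef.h>

#define OPC_ESCAPE 0x0f     /* first byte of both instructions */
#define OPC_MOVSBL 0xbe     /* second byte, sign-extending form */
#define OPC_MOVZBL 0xb6     /* second byte, zero-extending form */

/*
 * Hypothetical helper: if code[off..off+1] is "0f be", clear the one
 * differing bit (0x08) so that it becomes "0f b6".
 * Returns 0 if the patch was made, -1 if the bytes did not match or
 * the offset is out of range.
 */
int movsbl_to_movzbl(unsigned char *code, size_t len, size_t off)
{
    if (off >= len || len - off < 2)
        return -1;
    if (code[off] != OPC_ESCAPE || code[off + 1] != OPC_MOVSBL)
        return -1;
    code[off + 1] = (unsigned char)(code[off + 1] & ~0x08);
    return 0;
}
```

Applied to the bytes 8b 45 08 0f be 00 from the Dell 2.1 listing at offset 3, this yields 8b 45 08 0f b6 00; the mod/rm byte that follows is left untouched.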
Of course there is a workaround at the driver level. Canonically, one would do
this by checking for fubyte() returning -1 *and* u.u_error being set to EFAULT
(u.u_error is cleared upon entering a system call). However, in R4.0.3
fubyte() does NOT set u.u_error. It *does* set u.u_fault_catch.fc_errno.
Christoph reports that Dell 2.1 can be object-patched successfully to fix this.
I'm told that the offending 11c97 is at exactly the same address in the
Consensys 1.3 kernel.
At vm.o:fa7d in Dell 2.2 there's a movzbl (%edx),%edx; same instruction,
different target register. Here's the relevant diff output:
*** vm.o-old Wed Jul 7 03:13:11 1993
--- vm.o Wed Jul 7 03:13:00 1993
***************
*** 25,31 ****
fa76: 85 c0 testl %eax,%eax
fa78: 75 09 jne 0x9 <fa83>
fa7a: 8b 55 08 movl 8(%ebp),%edx
! fa7d: 0f b6 12 movzbl (%edx),%edx
fa80: 89 55 fc movl %edx,-4(%ebp)
fa83: c7 05 d8 13 00 00 00 00 00 00 movl $0x0,0x13d8
fa8d: 83 3d dc 13 00 00 00 cmpl $0x0,0x13dc
--- 25,31 ----
fa76: 85 c0 testl %eax,%eax
fa78: 75 09 jne 0x9 <fa83>
fa7a: 8b 55 08 movl 8(%ebp),%edx
! fa7d: 0f be 12 movsbl (%edx),%edx
fa80: 89 55 fc movl %edx,-4(%ebp)
fa83: c7 05 d8 13 00 00 00 00 00 00 movl $0x0,0x13d8
fa8d: 83 3d dc 13 00 00 00 cmpl $0x0,0x13dc
Applying this patch produces a working kernel.
I do not know the status of the other ports.
Another poster (Marc Boucher <marc@cam.org>) adds:
On ESIX SVR4.0.3 Rev. A, the instruction movsbl in question can be changed to
movzbl (as described above) with a binary-editor on file
/etc/conf/pack.d/kernel/vm.o. At offset 0x11eb0, change 0xbe to 0xb6.
Before patching, verify that your /etc/conf/pack.d/kernel/vm.o is the same as
mine! On my system, the /bin/sum generated checksum of vm.o was "4440 222".
The problem results from a sign-extension bug. The function lfubyte(), which
is called by fubyte(), is declared as
int lfubyte(char *addr); /* actually caddr_t */
The byte is fetched with
val = *addr;
which triggers sign extension. Casting addr to an unsigned char * or declaring
it as such solves the problem.
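The following user-space sketch models the two variants of the fetch. The function names are made up, and signed char is used explicitly to stand in for plain char on the x86 ports, where char is signed; this is an illustration of the C semantics, not the USL source.

```c
/* Models the buggy fetch: the byte is sign-extended into the int. */
int lfubyte_buggy(const signed char *addr)
{
    int val;

    val = *addr;
    return val;
}

/* Models the fix: fetch through an unsigned char pointer instead. */
int lfubyte_fixed(const signed char *addr)
{
    int val;

    val = *(const unsigned char *)addr;
    return val;
}
```

For a byte holding 0x80 the first version returns -128 and the second returns 128, which corresponds to the movsbl/movzbl difference in the object code.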
This bug is still present in stock USL 4.0.4. However, it has been fixed in
Dell 2.2.
Raymond Nijssen contributes the following:
---- README --------------------------------------------------------------->8--
This shell script was written to help out people who are less experienced in
patching kernel binaries.
This version can be used to fix the fubyte bug in the following SVR4 flavors:
ESIX 4.0.3A
ESIX 4.0.4
Dell 2.1
Consensys 1.3
You need sdb and your system has to be able to rebuild the kernel.
After the patch is applied, you have to rebuild the kernel by running
/etc/conf/bin/idbuild and /etc/conf/bin/idreboot for the patch to take effect.
You have to be root to do all this.
The program will ask for your confirmation before it changes anything.
Please do make a backup first, and remember that you can select the old kernel
(/stand/unix.old) at boot time by pressing the space bar at the 'Booting the
ESIX system....' prompt, in case the system fails to boot from the patched
kernel, though this is highly unlikely.
Systems to which this patch was applied have been running flawlessly
for several months, in case you have doubts...
Happy patching!
--------------------------------------------------------------------------->8--
----- fbfix --------------------------------------------------------------->8--
#!/bin/sh
#
# Copyright (c) 1993 Raymond X.T. Nijssen (raymond@woensel.es.ele.tue.nl)
# All Rights Reserved
#
# the bug...
#
b=fubyte
# offsets according to flakey USL sdb. gdb and dis say something different
esix403_o=0x11eb0
esix404_o=0x11683
dell21_o=0x11c98 #dell 2.1
cons13_o=$dell21_o #consensys 1.3
# data
v=0x458900be #old
r=0x458900b6 #new
# file
f=/etc/conf/pack.d/kernel/vm.o
# progs
s=/usr/ccs/bin/sdb
i=/etc/conf/bin/idbuild
c='\c';t='\t';n='\n';N=/dev/null
# aux
pe() if [ -n "$e" ];then echo ${n}ERROR: $e $n;e="";fi
yn() { while :;do echo $n$1 [$2] $c;read a;if [ -z "$a" ];then a=$2;fi
case "$a" in y*)return 0;;n*)return 1;;*)echo Answer 'y' or 'n';;esac;done;}
cr() if id|grep "^uid=0">$N;then return 0
else e="Only root may patch the kernel";return 1;fi
ab() { echo ${n}FATAL: $e$n;exit 1;}
ac() { pe;yn "Continue ?" "y";return;}
qu() { R="";if [ -n "$1" ];then d="[$1] :";else d=":";fi
while [ -z "$R" ];do echo ${n}Enter the $2 $d $c;read a
if [ "$a" ];then R=$a;elif [ -n "$1" ];then R=$1;
else e="No $2 entered";ac||exit 0;fi;done;}
# main
if [ ! -t 0 ];then e="This program must not be piped into a shell";ab;fi
if [ ! -f $s ];then e="$s not found";ab;fi
if [ ! -f $f ];then e="$f not found";ab;fi
if [ ! -f $i ];then e="$i not found";ab;fi
echo $n$n${t}YOU are responsible for running this program.$n$n${t}Clauses 9 and 10 of the GNU GENERAL PUBLIC LICENSE$n${t}apply to this program.$n$n${t}If you continue, you thereby agree that its author, $n${t}nor his employer, nor anybody else except yourself, has any $n${t}liability for any loss, damage etc. etc.$n
ac||exit 1
echo $n$n${t}Fixable versions with the $b bug$n$n$t$t[1]$t ESIX 4.0.3A$n$t$t[2]$t ESIX 4.0.4$n$t$t[3]$t DELL 2.1$n$t$t[4]$t Consensys 1.3$n
R=1;qu "$R" "SVR4 flavor this system is running"
case $R in 1)o=$esix403_o;; 2)o=$esix404_o;;3)o=$dell21_o;; 4)o=$cons13_o;;
*)e="Invalid answer";ab;;esac
echo $n${t}Looking for replacement target ... $c
if echo $o:?lx|$s -e $f 2>$N|grep $o/$v>$N;then echo found
if yn "Do you want to patch the kernel now?" "n";then
cr||ab
qu "$f.orig" "name of backup file"
if [ -f $R ];then e="File $R already exists";ab;fi
if cp $f $R;then echo $n${t}Copied $f to $R;else e="Failed to write $R";ab;fi
if echo $o!$r|$s -e -w $f>$N 2>&1;then
echo ${n}Fixed $b bug, you may now run $i and reboot$n;else e="$s failed";pe
if cp $R $f;then echo $n${t}Copied $R to $f;else e="Restore $f failed";pe;fi
e="Patch failed!!";ab;fi
fi
else echo not found;e="Replacement target not found at expected offset";ab;fi
--------------------------------------------------------------------------->8--
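A note on the magic numbers in the script: it compares and replaces a whole longword rather than a single byte. The sketch below (function name invented) shows how a longword is laid out in memory on a little-endian machine such as the 386, which is how the values v and r relate to the instruction bytes in the listings earlier in this section.

```c
/*
 * Hypothetical helper: split a 32-bit value into the four bytes it
 * occupies in memory on a little-endian machine, lowest address first.
 */
void longword_to_bytes(unsigned long w, unsigned char out[4])
{
    out[0] = (unsigned char)(w & 0xff);
    out[1] = (unsigned char)((w >> 8) & 0xff);
    out[2] = (unsigned char)((w >> 16) & 0xff);
    out[3] = (unsigned char)((w >> 24) & 0xff);
}
```

The old value 0x458900be therefore stands for the bytes be 00 89 45: the last two bytes of "0f be 00" (movsbl) followed by the first two of "89 45 fc" in the Dell 2.1 listing, which is why the Dell offset in the script is 0x11c98 rather than 0x11c97. The new value 0x458900b6 differs only in the first byte. By the same arithmetic, the Dell 2.2 instruction bytes shown earlier (be 12 89 55) would read as 0x558912be, so the pattern in this script would not match that kernel.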
VIII. Destiny and Dell
A source at UNIX System Labs Europe claims that `Destiny' (the new Release
4.2) incorporates all of Dell UNIX's fixes to 4.0.3; thus, any bug for which a
Dell fix is indicated above should be gone in Destiny.
--
Send your feedback to: Eric Raymond = esr@snark.thyrsus.com