This document was originally put together to show how Unix debugging tools can be used to track down problems even when you don't have the source code (i.e. commercial products). It may one day become a presentation if I can figure out how to present it without boring an audience senseless.
At work we use IBM Tivoli Remote Control. It allows a Controller to take remote control of a Target (both typically Windows boxes). It is functionally similar to products such as pcANYWHERE, Carbon Copy or VNC. It also supports functions such as chat and file transfer.
In order to support environments where there is no route from Controller to Target, Remote Control has an optional component known as a Remote Control Gateway. The RC Gateway functions as a TCP-level relay between Controller and Target. After connection establishment, the RC Gateway just passes the data straight through.
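(If you haven't seen one before: conceptually, a TCP-level relay's inner loop just polls both sockets and copies whatever arrives on one side to the other. The sketch below is my own illustration of that idea, not pcgateway's actual source; the names relay, ctl and tgt are made up.)

#include <poll.h>
#include <unistd.h>

/* Conceptual sketch only: relay data between an already-connected
 * Controller socket (ctl) and Target socket (tgt). */
static void relay(int ctl, int tgt)
{
    struct pollfd fds[2] = {
        { .fd = ctl, .events = POLLIN },
        { .fd = tgt, .events = POLLIN },
    };
    char buf[8192];

    for (;;) {
        if (poll(fds, 2, -1) <= 0)          /* wait for data on either side */
            break;
        for (int i = 0; i < 2; i++) {
            if (fds[i].revents & (POLLIN | POLLHUP)) {
                ssize_t n = read(fds[i].fd, buf, sizeof(buf));
                if (n <= 0)
                    return;                 /* one side closed; stop relaying */
                /* copy straight through to the other side
                 * (error handling and partial writes omitted) */
                write(fds[1 - i].fd, buf, (size_t)n);
            }
        }
    }
}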
We had been having problems with our Remote Control. Occasionally, new connections would fail, and when they did, existing connections would freeze. Other Tivoli operations would slow down noticeably.
Some time ago, in another customer environment, we had had similar problems. When Tivoli Support was unable to resolve the intermittent but crippling problem, we reconfigured our environment to eliminate the need for the RC Gateway. This time, with the problem still unresolved, our network architecture and those of our customers required the use of the gateway. So I waded in.
top
The first stop was top. Top has been around for half of forever. Running it showed the following.
last pid: 20676;  load averages: 1.02, 1.05, 1.07                    20:59:46
69 processes: 67 sleeping, 1 running, 1 on cpu
CPU states:  0.0% idle, 98.3% user,  1.6% kernel,  0.0% iowait,  0.0% swap
Memory: 512M real, 413M swap in use, 934M swap free

  PID USERNAME THR PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
 6160 root       1  59    0 6016K 2592K run      9:06 98.06% pcgateway
14906 nobody     1  48    0   19M 5184K sleep  174:02  0.95% Collector_prog1
20673 dtucker    1  58    0 2592K 1792K cpu      0:03  0.66% top
  373 root       1  58    0   76M   13M sleep   21.4H  0.55% oserv
29562 root       1  58    0 2600K 1264K sleep    3:53  0.30% sshd
pcgateway is the aforementioned RC Gateway. Now all it does is pass a bit of traffic from Controllers to Targets and back. Only one such process is ever present on a given Tivoli node, so it must simultaneously handle all concurrent sessions.
OK, so it looks like our problem is in the RC Gateway. But what the hell is it doing?
truss
On most Unix (and similar) platforms it's possible to trace a process's system calls. On Solaris, the tool to do this is truss. Trussing our misbehaving pcgateway showed:
$ truss -p 6160
poll(0x00075B50, 1, 0)                          = 0
poll(0x00075B50, 1, 0)                          = 0
poll(0x00075B50, 1, 0)                          = 0
poll(0x00075B50, 1, 0)                          = 0
[ad infinitum]
So we now know that our pcgateway is stuck in a loop calling poll(). poll() is a system call that tells the process which file descriptors are ready for reading and writing.
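(If poll() is new to you, here's a minimal sketch of a single-descriptor call like the one in the truss output above. It's my own illustration, not pcgateway code; sock and ready_to_write are made-up names.)

#include <poll.h>

/* Ask the kernel whether sock can be written without blocking.
 * Returns >0 if ready, 0 if not, <0 on error. */
int ready_to_write(int sock)
{
    struct pollfd pfd;

    pfd.fd = sock;          /* file descriptor to watch       */
    pfd.events = POLLOUT;   /* we're interested in "writable" */
    pfd.revents = 0;        /* filled in by the kernel        */

    /* A timeout of 0 means "check and return immediately",
     * which is exactly what pcgateway was doing over and over. */
    return poll(&pfd, 1, 0);
}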
We'll need to know which file descriptors are involved. Fortunately, Vic Abell wrote a tool called lsof.
lsof
Lsof ("list open files") can show, among other things, all open files and network file descriptors. It's a handy tool to have around.
$ lsof -p 6160
COMMAND    PID USER  FD  TYPE        DEVICE SIZE/OFF NODE NAME
[snip]
pcgateway 6160 root   5u inet 0x30002782238      0t0  TCP *:2501 (LISTEN)
pcgateway 6160 root   6u inet 0x300021e1110      0t0  TCP rcgateway:*->203.2.41.147:* (IDLE)
pcgateway 6160 root   7u inet 0x300027839f8      0t0  TCP rcgateway:47607->10.2.15.17:2501 (ESTABLISHED)
pcgateway 6160 root   8u inet 0x30002782198      0t0  TCP rcgateway:*->203.2.41.147:* (IDLE)
pcgateway 6160 root   9u inet 0x30002783318      0t0  TCP rcgateway:*->10.2.15.17:* (IDLE)
At this point, we'd gained as much information as we could and the service was still down. We forced a core dump with kill -ABRT for post-mortem analysis and restarted the service.
gdb
The poll man page shows that the arguments are an array of structures (of type "pollfd"), the number of structures in the array (1 in this case) and the timeout. The number of structures and the timeout are both integers, but the array is effectively a pointer, so we can't see directly what's in it. A struct pollfd looks like this (from the /usr/include/sys/poll.h header file):
typedef struct pollfd {
        int     fd;         /* file desc to poll */
        short   events;     /* events of interest on fd */
        short   revents;    /* events that occurred on fd */
} pollfd_t;
On a SPARC in 32-bit mode, the file descriptor number is an integer (the first four bytes), the next two bytes are the flags for the events we care about, and the last two bytes are the returned events.
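(To double-check that layout on a particular platform, a throwaway program like the one below, my own and not part of the original investigation, prints the size and member offsets of struct pollfd.)

#include <poll.h>
#include <stddef.h>
#include <stdio.h>

int main(void)
{
    /* Print where each member of struct pollfd lives in memory. */
    printf("sizeof(struct pollfd) = %zu\n", sizeof(struct pollfd));
    printf("fd      at offset %zu (size %zu)\n",
           offsetof(struct pollfd, fd), sizeof(int));
    printf("events  at offset %zu (size %zu)\n",
           offsetof(struct pollfd, events), sizeof(short));
    printf("revents at offset %zu (size %zu)\n",
           offsetof(struct pollfd, revents), sizeof(short));
    return 0;
}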
Now that we know what we're looking for (from the man page and header files) and where to look for it (from the lsof output), we can feed the core dump into a debugger to peek inside the process as it was when it was killed:
$ gdb -q $BINDIR/TME/PCREMOTE/pcgateway pcgateway.core
[snip]
#0  0xfed15d0c in _poll () from /usr/lib/libc.so.1
(gdb) backtrace
#0  0xfed15d0c in _poll () from /usr/lib/libc.so.1
#1  0xfecce6dc in select () from /usr/lib/libc.so.1
#2  0x162e8 in start_gateway ()
#3  0xfef70128 in run_mtc () from /opt/Tivoli/lib/solaris2/libtmf.so
#4  0x19644 in cthread_body ()
(gdb) print *(0x75b50)
$1 = 9
(gdb) x/xh 0x75b50+4
0x75b54:        0x0004
(gdb) x/xh 0x75b50+6
0x75b56:        0x0000
(gdb) quit
$ grep 0x0004 /usr/include/sys/poll.h
#define POLLOUT         0x0004          /* fd is writeable (won't block) */
The "backtrace" shows the sequence of function calls leading up to the current location. The gateway application is actually using select() function call, which is implemented using poll() by the standard C library (libc.so) on Solaris. Since the stack trace shows it was killed in a function called start_gateway, we might infer that it was trying to start a connection.
The "print" shows us the file descriptor and the first "x/xh" shows us a hex dump of the next half-word (ie short integer). A quick look in the header file shows that the flag is POLLOUT.
Referring to our lsof output, file descriptor 9 is a TCP socket connecting to one of the RC targets.
The other interesting thing that we can see is that the socket is in the IDLE state and that it is not bound to any port (compare with fd#7).
So the Tivoli RC Gateway is continuously trying to write to a socket that's no longer connected, and failing. Why is it trying to write to an unconnected socket? We don't know yet. It might be a race condition where it's sensitive to the timing of connections coming down, or it might be due to a connection being torn down from the target side.
At this point, we have a few ideas about what might trigger the problem. What we really need is the syscall trace from just before the problem occurs. We ended up running truss in a screen session and logging all of its output to a file, clearing the log (which grew by about 100MB per day) every couple of days.
After a couple of weeks, we caught it in the act!
accept(5, 0x000572A0, 0x00057288, 1)            = 6
# Accepts connection from RC controller, tcp connection is fd#6
getitimer(ITIMER_REAL, 0x00057100)              = 0
recv(6, " T C P : 1 7 2 . 2 0 . 1".., 128, 0)   = 21
# Receives IP address of target host to connect to from controller
so_socket(2, 2, 6, "", 1)                       = 7
# Allocates a new socket for outgoing connection (fd#7)
setsockopt(7, 65535, 4, 0x0004CB38, 4, 1)       = 0
ioctl(7, FIONBIO, 0x0004CB34)                   = 0
# Sets fd#7 for non-blocking IO
connect(7, 0x0004CCD0, 16, 1)                   Err#150 EINPROGRESS
# Initiates TCP connection to RC target
poll(0x0004CA68, 1, 0)                          = 0
getitimer(ITIMER_REAL, 0x0004C9B0)              = 0
poll(0x00035C90, 1, 988)                        = 0
    Received signal #14, SIGALRM, in poll() [caught]
poll(0x00035C90, 1, 988)                        Err#4 EINTR
setcontext(0x0003D798)
poll(0x00035C90, 1, 0)                          = 0
setitimer(ITIMER_REAL, 0x00035AF8, 0x00000000)  = 0
poll(0x0005DC38, 1, 0)                          = 1
recv(9, "\b\0\012\001 '\0", 8192, 0)            = 8
# Receives data on unrelated connection
poll(0x0005DC40, 1, 0)                          = 1
send(8, "\b\0\012\001 '\0", 8, 0)               = 8
# Forwards data
poll(0x0005DC40, 1, 0)                          = 0
getitimer(ITIMER_REAL, 0x0005DB90)              = 0
setitimer(ITIMER_REAL, 0x0005DB90, 0x00000000)  = 0
poll(0x00035C90, 1, 1)                          = 0
[snip]
    Received signal #14, SIGALRM, in poll() [caught]
setcontext(0x0003D798)
poll(0x00035C90, 1, 0)                          = 0
setitimer(ITIMER_REAL, 0x00035AF8, 0x00000000)  = 0
poll(0x0004CA68, 1, 0)                          = 1
# Polls (fd unknown, probably 7) and fd is ready
send(6, " G A T E W A Y _ C O N N".., 128, 0)   = 128
# Sends "GATEWAY_CONNECTED" message to controller on fd#6
getitimer(ITIMER_REAL, 0x0004CBE0)              = 0
setitimer(ITIMER_REAL, 0x0004CBE0, 0x00000000)  = 0
poll(0x00035C90, 1, 1)                          = 0
    Received signal #14, SIGALRM, in poll() [caught]
poll(0x00035C90, 1, 1)                          Err#4 EINTR
setcontext(0x0003D798)
poll(0x00035C90, 1, 0)                          = 0
setitimer(ITIMER_REAL, 0x00035AF8, 0x00000000)  = 0
poll(0x0004CC98, 1, 0)                          = 1
recv(7, 0x0004CE88, 8192, 0)                    Err#146 ECONNREFUSED
# Tries to read from fd#7 to target
# but connection is refused (by target)!
poll(0x0004CCA0, 1, 0)                          = 1
recv(6, "\v\0", 8192, 0)                        = 2
# Receives data from controller on fd#6
poll(0x0004CC98, 1, 0)                          = 0
    Received signal #14, SIGALRM [caught]
setcontext(0x0004C980)
poll(0x0004CC98, 1, 0)                          = 0
# Gets caught in poll loop, polling fd#7 to write.
It seems that there are two potential causes of the problem:
a) The data arriving on the unrelated connection while the new connection is pending causes pcgateway to think that the connection has completed when it hasn't, or

b) pcgateway doesn't check for a successful connection and tries to use the socket anyway.

Or maybe it's a combination of both.
We can't tell much more without access to the source code or a debug build of pcgateway, so we sent off the data and analysis to Support.
After some to-and-froing with the Support and Devel teams, we had the cause confirmed. The pcgateway process did not correctly check for errors writing to the target. Because it used non-blocking IO, it didn't check for connection failure after the initial attempt (which nearly always returns EINPROGRESS). Later, when it actually went to use the connection, it assumed the connection was up. This was true most of the time, but not always.
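(For comparison, the usual idiom for completing a non-blocking connect looks something like the sketch below. This is just the standard technique, not IBM's actual fix, and the function name finish_connect is made up: after EINPROGRESS you wait for the socket to become writable, then ask the kernel for the deferred result with getsockopt(SO_ERROR) before trusting the connection.)

#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

/* Returns 0 once the connection is up, -1 (with errno set) on failure. */
int finish_connect(int sock, int timeout_ms)
{
    struct pollfd pfd;
    int err = 0;
    socklen_t len = sizeof(err);

    pfd.fd = sock;
    pfd.events = POLLOUT;
    pfd.revents = 0;

    if (poll(&pfd, 1, timeout_ms) <= 0)
        return -1;                      /* timed out or poll() failed */

    /* POLLOUT alone doesn't mean success; fetch the deferred error. */
    if (getsockopt(sock, SOL_SOCKET, SO_ERROR, &err, &len) == -1)
        return -1;
    if (err != 0) {                     /* e.g. ECONNREFUSED */
        errno = err;
        return -1;
    }
    return 0;                           /* connection established */
}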
Eventually, the APAR was released with Remote Control 3.7 patch 11:
IY20447: When trying to establish a Tivoli Remote Control session using the Remote Control Gateway, the "pcgateway(.exe)" process may cause a CPU usage of 100%.
The problem turned out to be a bug in pcgateway, triggered by a race condition. We have fast access from the controller to the RC Gateway but slow access (via a WAN link) to the targets.
The Tivoli Framework invokes both the controller and the target at the same time. The problem occurs when the controller is able to start and connect to the target via the gateway before the target is ready to accept the connection. When the target refuses the connection, the gateway doesn't handle the error and is sent into a tailspin.
This shows the problems you'll get when trying to execute a synchronous operation (Remote Control) over a fundamentally asynchronous infrastructure (Tivoli Management Framework).
Collect information as you go. I find it useful to make a directory for each problem I'm working on and subdirectories for each day I work on it. I then copy all data gathered that day (e.g. process logs, core dumps and keystroke logs) into that day's subdirectory.
The keystroke logs were particularly useful. The Unix command script will create one for you. I started a script before each debugging session and saved its output in that day's directory. If, in the future, I wanted to compare a current instance of the problem with output from a previous one, it was easy. It also made reconstructing this story possible.
Top provides basic system performance information. It's been around for a long time.
truss is the native tool that provides system call tracing on Solaris. Most SysV Unixes have it, as does AIX 5. AIX 4.3 has a system-wide tool called trcrpt and there's a script to emulate truss using it. strace provides the same facility on Linux, SysV, Solaris and Irix.
HP-UX has trace (HP-UX 10.20 and below) or tusc (HP-UX 11.00 and above).

Lsof allows you to view a process's open file descriptors, current working directory and file maps. There's also a companion tool, lslk, for viewing locks.
gdb is the GNU debugger. It's a source-level debugger that runs on multiple platforms.
screen lets you run an interactive program in the background and attach and detach it at will.
Page copyright 2002 Darren Tucker. Unrestricted use and
reproduction permitted as long as this copyright notice is
retained.
Last modified: 2022-05-25