Friday 1 June 2012

libssh2 and exit signals

Imagine this scenario:

1. You run ssh2_exec to invoke a command on a remote host.
2. The command takes some time to run (even if it's just a loop, with ls and sleep).
3. You open a terminal session to the host, find the process id, and kill it.
4. ssh2_exec terminates.

Did you get an exit signal?

If not, here's the first obvious tip of the day: Run the same command on another host, especially one that you know has a different OS (e.g., Linux vs. Solaris vs. HP-UX).

Second obvious tip: libssh2_trace() is your friend. Be sure to build a debug version of libssh2; without one, the call has no effect. In my case, this meant that win32/GNUmakefile had this:

# must be equal to DEBUG or NDEBUG
ifndef DB
    # DB    = NDEBUG
    DB    = DEBUG
endif

Back to the missing exit signal. ssh2_exec.c has this check:

    if (exitsignal)
        printf("\nGot signal: %s\n", exitsignal);
    else
        printf("\nEXIT: %d bytecount: %d\n", exitcode, bytecount);
 
       
On my first run, I got this:

Got signal:

which was not exactly what I expected. Here's what RFC 4254 (section 6.10) says:

The remote command may also terminate violently due to a signal. Such a condition can be indicated by the following message.  A zero 'exit_status' usually means that the command terminated successfully.

      byte      SSH_MSG_CHANNEL_REQUEST
      uint32    recipient channel
      string    "exit-signal"
      boolean   FALSE
      string    signal name (without the "SIG" prefix)
      boolean   core dumped
      string    error message in ISO-10646 UTF-8 encoding
      string    language tag [RFC3066]
     
 
So the first thing to do was turn tracing on. The example has this:

#if 0
    libssh2_trace(session, ~0 );
#endif

so, I just changed it to "#if 1".

And this is what I was getting from the server:

0000: 62 00 00 00 00 00 00 00  0B 65 78 69 74 2D 73 69 : b........exit-si
0010: 67 6E 61 6C 00 00 00 00  0F 00 00 00 00 00 00 00 : gnal............
0020: 00 00                                            : ..

[libssh2] 16.161761 Conn: Channel 0 received request type exit-signal (wr 0)
[libssh2] 16.161761 Conn: Exit signal  received for channel 0/0

It became clear that the remote host wasn't sending the signal name. 0x62 is SSH_MSG_CHANNEL_REQUEST. The next 4 bytes are the recipient channel. The following 4 bytes are the length of the string "exit-signal", i.e., 0x0B, followed by the 11 bytes of the string itself. Then we have a boolean, which the RFC defines as a single byte; here it's 00, which represents "FALSE". Then comes another string, so we need 4 bytes for its length, which gives us 0x0F (15). And then... nothing but nine zero bytes. Just the proverbial crickets chirping away, no sign of our string with the signal name.
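The byte-by-byte walk above can be written down as a small parser. This is only a sketch for checking a dump by hand, not what libssh2 does internally; parse_exit_signal, read_u32, read_string and the parse_result codes are names I made up for illustration. It reads the message type, the channel, the "exit-signal" string and the boolean, then tries to read the signal name and reports a problem if the length field claims more bytes than the packet actually holds.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SSH_MSG_CHANNEL_REQUEST 98 /* 0x62 */

/* Result codes for this sketch (hypothetical names, not libssh2's). */
enum parse_result {
    PARSE_OK = 0,          /* signal name extracted */
    PARSE_NOT_EXIT_SIGNAL, /* not a channel request of type "exit-signal" */
    PARSE_TRUNCATED        /* a field claims more bytes than the packet has */
};

/* Read a big-endian uint32 at *pos and advance *pos. Returns 0 on success. */
static int read_u32(const unsigned char *buf, size_t len, size_t *pos,
                    uint32_t *out)
{
    if (*pos > len || len - *pos < 4)
        return -1;
    *out = ((uint32_t)buf[*pos] << 24) | ((uint32_t)buf[*pos + 1] << 16) |
           ((uint32_t)buf[*pos + 2] << 8) | (uint32_t)buf[*pos + 3];
    *pos += 4;
    return 0;
}

/* Read an SSH string: a uint32 length followed by that many bytes. */
static int read_string(const unsigned char *buf, size_t len, size_t *pos,
                       const unsigned char **str, uint32_t *str_len)
{
    if (read_u32(buf, len, pos, str_len) != 0)
        return -1;
    if (len - *pos < *str_len)
        return -1; /* length field points past the end of the packet */
    *str = buf + *pos;
    *pos += *str_len;
    return 0;
}

/* Copy the signal name (without the "SIG" prefix) into name. */
enum parse_result parse_exit_signal(const unsigned char *buf, size_t len,
                                    char *name, size_t name_size)
{
    static const char want[] = "exit-signal";
    const unsigned char *str;
    uint32_t channel, str_len;
    size_t pos = 1, n;

    if (name_size == 0)
        return PARSE_TRUNCATED; /* no room for even an empty name */
    name[0] = '\0';

    if (len < 1 || buf[0] != SSH_MSG_CHANNEL_REQUEST)
        return PARSE_NOT_EXIT_SIGNAL;
    if (read_u32(buf, len, &pos, &channel) != 0) /* recipient channel */
        return PARSE_TRUNCATED;
    if (read_string(buf, len, &pos, &str, &str_len) != 0) /* request type */
        return PARSE_TRUNCATED;
    if (str_len != sizeof(want) - 1 || memcmp(str, want, str_len) != 0)
        return PARSE_NOT_EXIT_SIGNAL;
    if (pos >= len) /* the boolean is one byte */
        return PARSE_TRUNCATED;
    pos++;
    if (read_string(buf, len, &pos, &str, &str_len) != 0) /* signal name */
        return PARSE_TRUNCATED;

    n = str_len < name_size - 1 ? str_len : name_size - 1;
    memcpy(name, str, n);
    name[n] = '\0';
    return PARSE_OK;
}
```

Fed the 34 bytes of the dump above, this stops at the signal name: the length field says 15, but only 9 bytes remain in the packet.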

Having a somewhat fertile imagination, I have a tendency to imagine all sorts of extraordinary causes for these problems, sometimes involving tiny green leprechauns misplacing bits and bytes and whatnot.

The truth, such as it is, is usually reached through simplicity. So I decided to follow my first obvious tip of the day, above. I ran the same command on another server. And got:

0000: 62 00 00 00 00 00 00 00  0B 65 78 69 74 2D 73 69 : b........exit-si
0010: 67 6E 61 6C 00 00 00 00  04 54 45 52 4D 00 00 00 : gnal.....TERM...
0020: 00 00 00 00 00 00                                : ......

[libssh2] 12.655970 Conn: Channel 0 received request type exit-signal (wr 0)
[libssh2] 12.655970 Conn: Exit signal TERM received for channel 0/0

So, if your remote process is terminated abruptly and you're not getting an exit signal name, there's a good chance the remote sshd may be at fault.
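If you have to live with such a server, the check in the example can at least tell the three cases apart. A minimal sketch, assuming the signal pointer comes from libssh2_channel_get_exit_signal() the way it does in ssh2_exec.c; describe_exit is a hypothetical helper of mine, not part of libssh2, and the wording of the messages is made up.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: describe how the remote command ended.
 * exitsignal is NULL when no exit-signal message was received, an empty
 * string when the message arrived without a usable signal name, and
 * otherwise the signal name without the "SIG" prefix.
 * Returns whatever snprintf returns. */
int describe_exit(const char *exitsignal, int exitcode,
                  char *out, size_t out_size)
{
    if (exitsignal == NULL)
        return snprintf(out, out_size, "EXIT: %d", exitcode);
    if (exitsignal[0] == '\0')
        return snprintf(out, out_size,
                        "Got signal: (name missing from server)");
    return snprintf(out, out_size, "Got signal: %s", exitsignal);
}
```

With this, the first server would have produced "Got signal: (name missing from server)" instead of a bare "Got signal:", which is a much better hint about where to look.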
