Monday, April 28, 2014

sysdig really is strace plus lsof


What does that mean, anyway? This morning it meant that I could see helpful internal file table information in the context of syscalls, so I could see more than just the file descriptors shown by strace.

The problem

A simple WSGI application which is managed by uWSGI is sitting behind either httpd or nginx, and I've been trying out different protocols, socket types, and both httpd and nginx with it. With httpd and FastCGI over TCP connecting to uWSGI, I get several I/O failures out of a thousand. With httpd and FastCGI over AF_UNIX I get perhaps 40 or 50. I don't see any errors with nginx, and I don't see any errors with HTTP or SCGI. What is going on?

The AF_UNIX scenario has the most frequent I/O errors, so I started debugging that. mod_proxy_fcgi sees EPIPE intermittently when writing 8 bytes to the application (after sending the FastCGI begin and parameters data).

I configured httpd (Event MPM) to use a single child process to serve requests, and configured uWSGI to do the same, to make data analysis easier.

I figured out a relatively short (i.e., limited data collection) ab invocation to trigger the failure (50 requests, concurrency 5 IIRC), and tried it several times until it failed. Each time I ran sudo sysdig -w trace.scap to capture data during the run, hoping for a failure. (The capture was started anew for every run to minimize the data collected.) After a short time, I hit 1 EPIPE scenario and had a matching capture. Now what?

22000 was the httpd child pid and 20881 was the uWSGI pid. Here are a few sysdig invocations used to start getting acquainted with the data. (No, I don't necessarily recommend egrep over sysdig expressions :) ).

$ sudo sysdig -r trace.scap "proc.pid=20881 or proc.pid=22000" |less
$ sudo sysdig -r trace.scap proc.pid=22000 | egrep '(connect|writev|close|read|PIPE)' |less
$ sudo sysdig -r trace.scap "(proc.pid=20881 or proc.pid=22000) and fd.type=unix" |less
(and more)

The view with both processes and just the operations on AF_UNIX sockets was the most helpful. Searching for EPIPE yields this information:

89254 07:12:54.259984241 6 httpd (22014) > writev fd=26(<u>ffff880108bcc440->ffff880108bcba80 /tmp/helloFastCGI.s) size=8 
89285 07:12:54.260023713 6 httpd (22014) < writev res=-32(EPIPE) data=........

The long hex addresses (ffff880108bcc440 and ffff880108bcba80) are the key to seeing what happens inside both processes, since the file descriptors won't match between processes.

$ sysdig -r trace.scap "(proc.pid=20881 or proc.pid=22000) and fd.type=unix"  | egrep '(ffff880108bcc440|EPIPE)'
87796 07:12:54.244173691 6 httpd (22014) < connect res=0 tuple=ffff880108bcc440->ffff880108bcba80 /tmp/helloFastCGI.s 
87797 07:12:54.244176698 6 httpd (22014) > fcntl fd=26(<u>ffff880108bcc440->ffff880108bcba80 /tmp/helloFastCGI.s) cmd=4(F_GETFL) 
87800 07:12:54.244178198 6 httpd (22014) > fcntl fd=26(<u>ffff880108bcc440->ffff880108bcba80 /tmp/helloFastCGI.s) cmd=5(F_SETFL) 
87802 07:12:54.244180563 6 httpd (22014) > writev fd=26(<u>ffff880108bcc440->ffff880108bcba80 /tmp/helloFastCGI.s) size=16 
88619 07:12:54.259281618 7 uwsgi (20882) < accept fd=10(<u>ffff880108bcc440->ffff880108bcba80 /tmp/helloFastCGI.s) tuple=ffff880108bcc440->ffff880108bcba80 /tmp/helloFastCGI.s queuepct=0 
88627 07:12:54.259287565 7 uwsgi (20882) > read fd=10(<u>ffff880108bcc440->ffff880108bcba80 /tmp/helloFastCGI.s) size=4096 
89021 07:12:54.259721201 6 httpd (22014) > writev fd=26(<u>ffff880108bcc440->ffff880108bcba80 /tmp/helloFastCGI.s) size=699 
89040 07:12:54.259743588 6 uwsgi (20882) > read fd=10(<u>ffff880108bcc440->ffff880108bcba80 /tmp/helloFastCGI.s) size=4096 
89083 07:12:54.259786716 6 httpd (22014) > writev fd=26(<u>ffff880108bcc440->ffff880108bcba80 /tmp/helloFastCGI.s) size=8 
89097 07:12:54.259802288 6 uwsgi (20882) > read fd=10(<u>ffff880108bcc440->ffff880108bcba80 /tmp/helloFastCGI.s) size=4096 
89148 07:12:54.259862936 6 uwsgi (20882) > write fd=10(<u>ffff880108bcc440->ffff880108bcba80 /tmp/helloFastCGI.s) size=8 
89158 07:12:54.259872030 6 uwsgi (20882) > write fd=10(<u>ffff880108bcc440->ffff880108bcba80 /tmp/helloFastCGI.s) size=43 
89161 07:12:54.259874552 6 uwsgi (20882) > write fd=10(<u>ffff880108bcc440->ffff880108bcba80 /tmp/helloFastCGI.s) size=8 
89165 07:12:54.259877021 6 uwsgi (20882) > write fd=10(<u>ffff880108bcc440->ffff880108bcba80 /tmp/helloFastCGI.s) size=11 
89184 07:12:54.259892476 6 uwsgi (20882) > write fd=10(<u>ffff880108bcc440->ffff880108bcba80 /tmp/helloFastCGI.s) size=24 
89188 07:12:54.259895312 6 uwsgi (20882) > close fd=10(<u>ffff880108bcc440->ffff880108bcba80 /tmp/helloFastCGI.s) 
89254 07:12:54.259984241 6 httpd (22014) > writev fd=26(<u>ffff880108bcc440->ffff880108bcba80 /tmp/helloFastCGI.s) size=8 
89285 07:12:54.260023713 6 httpd (22014) < writev res=-32(EPIPE) data=........ 
89314 07:12:54.260054390 6 httpd (22014) > close fd=26(<u>ffff880108bcc440->ffff880108bcba80 /tmp/helloFastCGI.s)

The buffer at ffff880108bcc440 was also used for other socket connections, though not concurrently. I omitted the data for connections that didn't have the EPIPE error.

So httpd writes 16 then 699 then 8 then dies trying to write 8 more. uWSGI reads several times with a 4096-byte buffer, but we have to dig into the trace to find out how much it really read (following the "< read" entry). uWSGI then writes 8 then 43 then 8 then 11 then 24 bytes, then closes before httpd tried to write the last 8.

The missing data showing how much uWSGI actually received when asking for 4K is

88633 07:12:54.259291681 7 uwsgi (20882) < read res=16 data=................
89045 07:12:54.259748834 6 uwsgi (20882) < read res=699 data=..........HTTP_HOST127.0.0.1..HTTP_USER_AGENTApacheBench/2.3..HTTP_ACCEPT*/*.<PA
89104 07:12:54.259806481 6 uwsgi (20882) < read res=8 data=........

So uWSGI read the 16, the 699, and the 8 but never tried to read again before closing.

Looking at the mod_proxy_fcgi source code I see that the final 8, the one that occasionally fails with EPIPE, is for telling the application that there are 0 bytes of FastCGI STDIN (i.e., no request body). So it seems that if mod_proxy_fcgi can get those 8 bytes in the socket buffer before uWSGI closes the socket it won't get a write, and usually it is able to. Next I added a sleep for 15 milliseconds just before that last write. With that it fails every time, since those 8 bytes are never in the socket buffer before uWSGI closes.

So why does it work reliably with nginx? It has all the data (begin request plus params/envvars + end of stdin) in a single buffer, so it doesn't get caught. uWSGI doesn't read it anyway, but there's no EPIPE error.

I looked a bit further at uWSGI, in particular its FastCGI protocol handler. Some higher-level code calls into the protocol handler to perform different work; one call is to read the request body. I found by setting breakpoints in gdb that the function to read the request body is never called. For some protocols that makes sense, since you can tell from the request header fields for this request that there's no body. So perhaps the FastCGI protocol handler needs to be smarter and slurp up the last 8 byte record earlier in this scenario. More investigation is required, presumably with gdb, to understand the flows between the generic code and the FastCGI protocol handler. (It seems very simple, but initial testing with a POST request didn't show what I expected.)

While playing with this I created a workaround in mod_proxy_fcgi:

             rv = send_data(conn, vec, nvec, &len, 0);
             if (rv != APR_SUCCESS) {
-                break;
+                if (last_stdin && writebuflen == 0) {
+                    ap_log_rerror(APLOG_MARK, APLOG_ERR, rv, r, "ignoring error sending empty stdin");
+                }
+                else {
+                    break;
+                }

It isn't harmful, but it still looks like uWSGI needs a fix.

Sometimes you win, sometimes you lose. Recently I was trying to find whether a file was opened before or after a change in identity, but AFAICT sysdig doesn't have a filter for the syscall, and so it continues to report (if you ask for it in a -p format) the original uid/gid.

No comments: