
Subprocess Hanging: PIPE is your enemy

by anders pearson Thu 13 Mar 2008 20:35:31

Every once in a while, I run across a bug or a tricky problem where googling for a solution doesn't turn up much. When I come up with a solution, I like to write it up and put it online so the next person to come across it hopefully will have an easier time figuring it out. This is one of those posts.

One of the internal applications I wrote at work does a lot of work via external programs. It's basically glueing together a whole bunch of shell scripts and putting a nice interface on them.

Running an external program from Python isn't very hard in the simple case. There's actually a wealth of options available. The entry level is to use os.system() and give it the command as a single string. That gives you the return code but doesn't give you the output of the command.
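A minimal sketch (the ls command here is just a placeholder):

import os
# runs the command through the shell and returns its exit status,
# encoded the same way os.wait() encodes it
status = os.system("ls -l /tmp")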

For what I'm doing, I need to have access to the return code, STDOUT, and STDERR. Requirements like that lead to the os.popen* functions. Basically, something like:

import os
# popen3() hands back the child's stdin, stdout, and stderr as file objects
(c_stdin, c_stdout, c_stderr) = os.popen3(cmd)
out = c_stdout.read()
err = c_stderr.read()
c_stdin.close()
c_stdout.close()
c_stderr.close()
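One wrinkle: the file objects that os.popen3() hands back don't give you the exit status directly. A sketch of one way to get at it, using the companion popen2 module from the same era:

import popen2
# Popen3 exposes the child's pipes plus a wait() method
child = popen2.Popen3(cmd, capturestderr=True)
out = child.fromchild.read()
err = child.childerr.read()
status = child.wait()  # exit status, encoded as for os.waitpid()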

There are still problems with this approach. The environment that the child command runs in (variables, cwd, tty, etc) is the same environment that the parent is running in. So to, eg, set environment variables for the child, you have to put them into os.environ in the parent, and to set the cwd for the child command, you have to have the parent do an os.chdir(). That can be troublesome in some situations. Eg, if the parent is a CherryPy process, doing an os.chdir() makes it hopelessly lost and it will crash. So you have to fork() a child process, set up the environment there, do the above code, and then pass the results back to the parent process.
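A rough sketch of that dance (the directory and variable names here are placeholders):

import os

r, w = os.pipe()
pid = os.fork()
if pid == 0:     # child: adjust the environment, run the command,
    os.close(r)  # and ship the output back over a pipe
    os.chdir("/some/work/dir")
    os.environ["MYVAR"] = "value"
    (c_stdin, c_stdout, c_stderr) = os.popen3(cmd)
    os.write(w, c_stdout.read())
    os.close(w)
    os._exit(0)
os.close(w)  # parent: drain the pipe, then reap the child
chunks = []
while True:
    chunk = os.read(r, 4096)
    if not chunk:
        break
    chunks.append(chunk)
out = "".join(chunks)
os.waitpid(pid, 0)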

This has been enough of a pain point for Python programmers that Python 2.4 added the subprocess module. The code above can be replaced with:

from subprocess import Popen, PIPE
# communicate() waits for the command to exit and returns its output
p = Popen(cmd, stdout=PIPE, stderr=PIPE)
(out, err) = p.communicate()

Aside from being a little shorter, subprocess.Popen() also takes additional arguments like cwd and env that let you manipulate the environment of the child process (it does the fork() for you). It basically gives you one very nice interface for doing anything and everything related to spawning external commands. Life is generally better with subprocess around.
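For instance (the directory and variable here are placeholders):

from subprocess import Popen, PIPE
# the fork() and environment setup happen inside Popen; note that
# env replaces the child's entire environment, not just one variable
p = Popen(cmd, stdout=PIPE, stderr=PIPE,
          cwd="/some/work/dir", env={"MYVAR": "value"})
(out, err) = p.communicate()
code = p.returncode  # exit code, available once communicate() returns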

Unfortunately, there is a subtle, devious bug in that code. I know this after encountering it and spending many hours trying to figure out what was going on.

Where I encountered it was when the command being run was doing an svn checkout. The checkout would run for a while and then the svn command would hang at some point. It wouldn't use CPU and there would be no error messages. The process would still show up in ps or top. It would just stop, and the parent process would sit and wait for it to finish. Complete deadlock. Running the exact same svn command on the commandline, it would run with no problems. Doing an svn checkout of a different repository would work fine. Kill the hung svn process and the parent would complete, and STDOUT would show most of the expected output from the svn checkout. With the particular repository, it would always hang at exactly the same spot; completely repeatable.

How could an svn checkout of a particular repository hang, but only when run via subprocess?

After much frustrating debugging, searching, and experimentation, I narrowed it down to the output of the svn command on STDOUT. If I added a -q (quiet) flag, it would complete without hanging. I eventually noticed that the output that had been collected in STDOUT after killing the svn process was right around 65,000 characters. Since 2^16 is 65536, that seemed like a coincidence worth investigating. I wrote a test script that just wrote 2^16 characters to STDOUT and ran it via subprocess. It hung. I modified it to print 2^16 - 1 characters to STDOUT. No hanging. The troublesome svn repository happened to have a lot of files in it, so a lot of verbose output on the checkout.

A closer inspection of the subprocess.Popen docs revealed a warning "Note: The data read is buffered in memory, so do not use this method if the data size is large or unlimited." I'd probably read that before and assumed that it was a warning about possibly consuming a lot of memory and being inefficient if you try to pass a lot of data around. So I ignored it. The STDOUT chatter of shell scripts that I was collecting for logging purposes did not strike me as "large" (65K is positively tiny these days) and it isn't an application where I'm particularly concerned about memory consumption or efficiency.

Apparently, that warning actually means "Note: if there's any chance that the data read will be more than a couple pages, this will deadlock your code." What was happening was that the memory buffer was a fixed size. When it filled up, it simply stopped letting the child process write to it. The child would then sit and patiently wait to be able to write the rest of its output.
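If you want to see the failure mode in isolation, here's a minimal sketch (it uses wait() before reading, which makes the blocking easy to see; 70000 is just anything comfortably over 64K):

from subprocess import Popen, PIPE
# the child tries to write ~70K to a pipe that nobody is reading yet;
# it blocks once the pipe buffer fills, while the parent blocks in
# wait() -- deadlock
p = Popen(["python", "-c", "print 'x' * 70000"], stdout=PIPE)
p.wait()               # never returns
out = p.stdout.read()  # never reached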

Luckily the solution is fairly simple. Instead of setting stdout and stderr to PIPE, they need to be given proper file (or unix pipe) objects that will accept a reasonable amount of data. (A further hint for anyone who found this page because they encountered the same problem and are looking for a fix: Popen() needs real file objects with fileno() support, so StringIO-type fake file objects won't work; tempfile.TemporaryFile is your friend).
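Something along these lines (a sketch, not the exact code from my app):

from subprocess import Popen
from tempfile import TemporaryFile

# real file objects with a real fileno(), so the child can write
# as much output as it wants without filling a pipe buffer
outfile = TemporaryFile()
errfile = TemporaryFile()
p = Popen(cmd, stdout=outfile, stderr=errfile)
code = p.wait()
outfile.seek(0)  # rewind before reading the collected output back
out = outfile.read()
errfile.seek(0)
err = errfile.read()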

This strikes me as kind of a poor design. Subprocess is wonderful in most ways and a real improvement over the old alternatives. But with this kind of issue, the programmer using it will probably not encounter any problems in development and think everything is fine, but some day will find their production app mysteriously deadlocked and have almost no clues as to what's causing it. That seems like something that deserves a big flashing red warning in the docs every time PIPE is mentioned.

comments

I like the fact that you added the page because you could not find it in Google, that was nice of you to think about the next guy!

Using your solution is a really great way to solve this problem, I just hope the next person who needs and finds this info leaves a thank you on your discovery should they be heading in the same direction.

consider pexpect

http://news.ycombinator.com/item?id=146816

What about deadlocks when invoking a shell script which itself forks a child process? From the real shell, it returns right away as expected; but, from Subprocess, it deadlocks until the child process dies. Before you ask, data output does NOT exceed 64K. :)

Thanks.

It's kind of a different issue entirely, but basically, don't do that.

One of my pet peeves is command line apps or scripts that background themselves and don't have a "foreground" mode. I consider that a major unix faux pas. So if you're writing the shell script that you're calling, just don't do it. If you didn't write it, can you fix it and add a foreground mode? If, for some reason, you can't fix it, you're going to be stuck with some kind of ugly workaround, probably involving pidof.

I've seen more of these command line apps that background themselves without a foreground mode cropping up lately, particularly on OS X. I suspect that it's programmers who are used to writing GUI apps and making polished "user friendly" interfaces who have been asked to put a command line frontend on their application. The command takes a long time to run and, coming from a GUI background, they think "the user shouldn't have to wait around while it works, let's do it in the background". Good instinct for a GUI app. Annoying as hell for a command line application, as all it really does is make it hard to put in a script.

Thanks Anders, I hit this recently and worked around it but didn't get to the bottom.

I was just debugging this very same issue. You have saved me quite some time. Thanks!

I have run into the same issue. Thanks to your post, now I have a solution!

One way to diagnose the issue - an strace of the hung child process shows that it is waiting on a write().

Is it still an issue? I've just tested this on python 2.6.2 and Linux and found no problems. Also I've experimented with different buffer sizes (bufsize parameter for Popen).

Holy Cow Thanks!

Thank you for your informative post!

Encountered the same problem: we had a log client running that suddenly stopped without explanation. Doing an strace revealed that it hung when writing to stdout. I totally agree that the note in the python documentation should warn more strongly about this behaviour.

Brilliant, thank you! Saved me quite a lot of trouble.

This problem had me stuck for the last few months, and I'd almost given up on it. That's why it was hanging on large sound files. Thanks a lot!

I found this myself, and wasn't sure if it is a buggish behavior. It took me a few hours to notice that the problem was subprocess.PIPE couldn't handle "large" output. By the way I am using python 2.7.1 and this is still an issue!

Thank you for posting on this issue. We were experiencing the same exact problem. Setting the stdout and stderr to a file worked perfectly.

Excellent article. I also ran into this issue and solved it by making the child process less chatty. Thanks for sharing your insight!

I think the fact that the subprocess waits until you read its output before it produces more isn't a problem. The problem is calling the communicate() method, which (at least now) is documented as "do not use this method if the data size is large or unlimited". What you should do is read from the .stdout of the object returned by Popen(), without expecting the process to terminate before you read its output.
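Something like this sketch (handle() is a stand-in for whatever you do with each line; the '' sentinel is for Python 2, use b'' on Python 3):

from subprocess import Popen, PIPE
p = Popen(cmd, stdout=PIPE)
# drain stdout as the child produces it, so the pipe buffer never fills
for line in iter(p.stdout.readline, ''):
    handle(line)
p.stdout.close()
code = p.wait()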

Buffering to a disk file is the same as buffering to memory, only slower and with a slightly larger size limit.

I just ran into this issue in Python 3 (so this might not apply to Python 2), but in my case the solution was to RTFM about file buffering and realize that 1) subprocesses will hang if their output buffer fills up, so you must make sure to read all of it as the subprocess runs; 2) Popen.stdout is an instance of FileIO, which means that its read() method returns at most the requested number of bytes but could also return less. You can wrap it in a BufferedReader object to get a read() method that returns all of the requested data.
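A sketch of point 2 (with bufsize=0 so stdout really is an unwrapped FileIO):

import io
from subprocess import Popen, PIPE
p = Popen(cmd, stdout=PIPE, bufsize=0)  # p.stdout is a raw FileIO
reader = io.BufferedReader(p.stdout)
# BufferedReader.read(n) keeps reading until it has n bytes or hits
# EOF, unlike FileIO.read(n), which may return fewer
data = reader.read()  # no argument: read everything to EOF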

I experienced a similar issue where I needed to run zcat and then head, tail, or wc -l on the output, all called from within Python 2.6.5. The output is much larger than 65K (a few hundred MB to a few GB). It puzzled me for a while why zcat always hung, until I read this wonderful post. For my case, I wanted to avoid creating a temp file as well, since it could be very big and I don't necessarily want to write out the entire file. I ended up wrapping the two commands, connected with a unix pipe ("|"), into one string, then passed that to subprocess.Popen with shell=True (which I usually try to avoid). This has been working fine for my case.
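Roughly like this (the file name is made up; wc -l keeps the final output tiny, so communicate() is safe here):

from subprocess import Popen, PIPE
# let the shell stream zcat's output straight into wc; only wc's
# one-line result ever comes back through the python-side pipe
p = Popen("zcat bigfile.gz | wc -l", shell=True, stdout=PIPE)
(out, err) = p.communicate()
line_count = int(out)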

While I do not yet need this solution, I DO work with the subprocess module fairly often and surely will need it sooner or later, so thank you very much for this page. Bookmarked and Captured. :)

Thank you!!

Thank you very much for your post. This is the reason my process was getting stuck. I wasted 1 day before reading your post.

Hey, I just found your post, it's great!

This helped me to write this blogpost and some workaround code which works great for me:

http://www.synsecblog.com/2013/01/python-subprocesspipe-is-your-enemy.html

Cheers!

Checked this on Python 3.3 on W32. Doesn't apply.

Wrote a program (in C) that just writes zero bytes into stdout and stderr in a loop (1 write into each, per iteration), and takes one argument (the number of bytes to write).

Wrote a Python script that runs that program via subprocess, with two pipes, and receives the output via communicate() method.

Works as intended - memory usage visibly grows, and length of the returned variables is correct. It also takes longer to run as the argument grows.

I stopped at 16 megabytes, didn't try more.

Thanks a lot for this information. I was also facing the same issue.

Like many others I ran into this problem as well. The issue seems to be in the way the Linux Kernel implements pipes, from the pipe(7) man page:

"In Linux versions before 2.6.11, the capacity of a pipe was the same as the system page size (e.g., 4096 bytes on i386). Since Linux 2.6.11, the pipe capacity is 65536 bytes."

Granted, Python could come up with a version of communicate() that lets you read from the child's stdout without waiting for the child process to complete.
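You can even measure the capacity from Python (a sketch; making the write end non-blocking turns the would-block condition into an error instead of a hang):

import os
import fcntl
# fill a pipe that nobody reads until the kernel refuses the write
r, w = os.pipe()
fcntl.fcntl(w, fcntl.F_SETFL, os.O_NONBLOCK)
total = 0
try:
    while True:
        total += os.write(w, "x" * 4096)
except OSError:  # EAGAIN: the pipe buffer is full
    pass
print total, "bytes fit in the pipe buffer"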

I honestly just spent all night scratching my head. This was it, and it all makes sense now. Fantastic explanation, and good catch, I may never have found it :)

Thank you VERY much. You just saved my web application from a bug I've been struggling with for days!

Holy freaking shite!

We are now in 2014 and this post is still relevant!

Many thanks!

P.s. Boo Python subprocess' 65k limit !

