Here’s how to read and print a subprocess’ stdout in “real time”, or in other words, how to capture the subprocess’ stdout as soon as bytes are written to it.
# parent_process.py
from subprocess import Popen, PIPE

with Popen(["python", "child_process.py"], stdout=PIPE) as p:
    while True:
        # Use read1() instead of read() or Popen.communicate() as both block until EOF
        # https://docs.python.org/3/library/io.html#io.BufferedIOBase.read1
        text = p.stdout.read1().decode("utf-8")
        print(text, end='', flush=True)
# child_process.py
from time import sleep

while True:
    # Make sure stdout writes are flushed to the stream
    print("Spam!", end=' ', flush=True)
    # Sleep to simulate some other work
    sleep(1)
If you’d like to learn more about Python’s I/O, buffer configuration, and the real-life problem that inspired this blog post, keep reading :)
In a Python group chat I read an interesting question; I’m reporting an edited version below:
I have a script that opens a program with Popen. stdout is redirected to a PIPE. The script reads a few lines from stdout to discover how to connect to the program using a socket. Unfortunately, at some point later the stdout pipe gets full as it isn’t read, and it blocks the subprocess.
That behaviour is expected; in fact, it’s mentioned in Python’s subprocess docs for Popen.wait():
This will deadlock when using stdout=PIPE or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data.
(Note: I have omitted the docs’ last sentence, about Popen.communicate(), as it’s not relevant to our case; I’ll come back to it in more detail later.)
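To see the deadlock for yourself, here’s a minimal sketch (the inline child command is mine, standing in for any chatty subprocess): the parent never reads the pipe, so once the OS pipe buffer fills up, the child blocks on its next write and wait() never returns.

# deadlock_demo.py -- a minimal sketch of the problem; don't leave it running
from subprocess import Popen, PIPE

# The child writes forever; the parent never reads the pipe
with Popen(["python", "-c", "while True: print('x' * 1024)"], stdout=PIPE) as p:
    p.wait()  # Deadlocks once the OS pipe buffer is full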
So, how can we read the first few lines written by a subprocess to its stdout, save them, and throw away the rest while the subprocess keeps running and writing, without stopping it?
TL;DR (part 2), take me to the solution
We could redirect our subprocess’ stdout to a file instead of a pipe, read the first few lines, and forget about the rest until the subprocess terminates, then delete the file.
That would possibly look something like this:
from subprocess import Popen
from time import sleep

max_lines_to_read = 10
lines_read = 0
with open("my_command.out", "w") as subprocess_out:
    with Popen(["my_command"], stdout=subprocess_out) as process:
        with open("my_command.out", "r") as subprocess_in:
            while True:
                text = subprocess_in.read()
                if not text:
                    sleep(1)
                    continue
                if lines_read < max_lines_to_read:
                    if text.endswith("\n"):
                        # TODO: Store or use the whole line
                        lines_read += 1
                else:
                    break
            # TODO: Write the rest of the logic here and terminate `process` if needed
Note: I didn’t use readline() or readlines() because they would behave just like read() if our subprocess doesn’t terminate each write with a newline (i.e. use one write per line), so it’s less confusing to simply use read() and look for \n ourselves.
This solution works, but if we’re only interested in a few lines there’s really no point in having that file on disk. What if it ends up being several gigabytes, with the subprocess running for days? You don’t want to be the person who forced IT to impose stricter quotas on your VMs’ mounts, do you? 😉
No, we’re dealing with a stream of data, and we should be coding accordingly.
So, how can we stream the output of a subprocess as it gets generated, rather than waiting for the subprocess to terminate and printing it all at once?
If we read Python’s official documentation (as all good Pythonistas always do) for the subprocess module, we’re strongly encouraged to use Popen.communicate() for writing to and reading from a piped subprocess’ stdin/stdout. That doesn’t quite work the way we expect, though; in fact, communicate() seems to block, and even calling it with a timeout, communicate(timeout=2), doesn’t seem to help, as no bytes are returned while the pipe is open and being written to. Bummer.
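Here’s a quick way to see that behaviour, reusing child_process.py from the TL;DR above (a minimal sketch; the kill() at the end is only there so the with block doesn’t wait forever):

from subprocess import Popen, PIPE, TimeoutExpired

with Popen(["python", "child_process.py"], stdout=PIPE) as p:
    try:
        out, _ = p.communicate(timeout=2)  # The child never exits...
    except TimeoutExpired:
        # ...so this raises after 2 seconds, without handing us any bytes
        print("communicate() timed out and returned nothing")
    finally:
        p.kill()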
Unfortunately Python’s official documentation doesn’t offer any alternative solution. “There should be one – and preferably only one – obvious way to do it,” says the Zen of Python, although using Popen.communicate() to read the stdout of a piped subprocess is all but obvious. Sorry, Zen of Python and official docs, but we have to find another way.
While trying to figure out why Popen.communicate() didn’t work as expected, I refreshed my knowledge of POSIX pipes and buffering strategies in libc.
There are essentially three kinds of streams:
- Unbuffered: characters are transmitted to the file as soon as they’re written.
- Line buffered: characters are transmitted in a block when a newline is encountered.
- Fully buffered: characters are accumulated and transmitted in blocks of arbitrary size.
Typically, POSIX pipes are fully buffered streams, while streams attached to a TTY are usually line buffered. It’s important to remember that, especially when redirecting stdout to a pipe or a file (instead of a terminal).
Python follows the same strategies when implementing its buffers, and it’s also worth remembering that an extra layer of internal buffering might occur on both reads and writes.
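For instance, Python’s built-in open() exposes all three strategies through its buffering parameter (a quick sketch of mine, not from the original discussion):

# buffering=0: unbuffered, allowed only in binary mode
raw = open("out.bin", "wb", buffering=0)
# buffering=1: line buffered, only meaningful in text mode
line = open("out.txt", "w", buffering=1)
# buffering>1: fully buffered, with a buffer of that many bytes
full = open("out2.txt", "w", buffering=65536)
for f in (raw, line, full):
    f.close()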
Lastly, it’s important to remember that stdout streams in Python can be handled by different io classes, depending on the type of stream/buffering strategy.
Consider the following program, which I’ve called spam_one_line.py:
# spam_one_line.py
import sys
from time import sleep

for _ in range(8):
    print("Spam!", end=' ')  # Printed string ends with a space instead of the default newline
    sleep(1)
print("Lovely Spam! Wonderful Spam!")
print("Line written", file=sys.stderr)
What do you think the output of this program will be on your terminal? Or more importantly, when do you think those characters will appear?
Spoiler alert: two lines will appear at the same time:
Spam! Spam! Spam! Spam! Spam! Spam! Spam! Spam! Lovely Spam! Wonderful Spam!
Line written
That’s because stdout and stderr are both attached to a TTY, which by default means sys.stdout and sys.stderr are instances of io.TextIOWrapper (the same type of instance returned by open() when opening a text file), but with line_buffering=True. Hence, characters are flushed onto the underlying binary buffer when a newline is encountered.
It’s easy to check whether a stream is attached to a TTY, as the io.IOBase class implements an isatty() method that can be invoked on all its subclasses; in this case sys.stdout.isatty() == True.
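For example, you can compare the two situations with a tiny script (call it check_tty.py; the name is mine), printing the checks to stderr so they stay visible even when stdout is piped:

# check_tty.py
import sys

# Try `python check_tty.py` vs `python check_tty.py | cat`
print("isatty:", sys.stdout.isatty(), file=sys.stderr)
print("line_buffering:", sys.stdout.line_buffering, file=sys.stderr)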
So what if we want to “print immediately” on stdout? Well, one way to do it is to call print() with flush=True:
print("Spam!", end=' ', flush=True)
From Python 3.7 onwards, another way is to reconfigure sys.stdout to disable the interpreter’s buffer and transmit all subsequent writes to the system buffer:
sys.stdout.reconfigure(write_through=True)
What buffering strategy and what type of stream does Python use when a Python process is invoked via subprocess.Popen and its stdout is redirected to a pipe instead of being attached to a TTY?
Let’s first refresh what a pipe is and how it works:
In very simple terms, a pipe is a mechanism for multiprocess communication provided by the OS. It has two separate ends, a writing and a reading one. The data is handled in a first-in, first-out (FIFO) order.
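You can poke at a raw pipe directly from Python with os.pipe() (a minimal sketch of mine, just to make the two ends concrete):

import os

read_fd, write_fd = os.pipe()  # reading end and writing end, as file descriptors
os.write(write_fd, b"Spam!")   # data goes in the writing end...
print(os.read(read_fd, 1024))  # ...and comes out the reading end: b'Spam!'
os.close(read_fd)
os.close(write_fd)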
So when we call subprocess.Popen and redirect the subprocess’ stdout to a pipe, Popen first creates the pipe, which means creating its two ends as two separate binary file descriptors pointing to the same file (one in reading mode and one in writing mode); it then forks the calling process (creating a child process which shares both file descriptors), redirects the child process’ stdout to the file descriptor pointing at the writing end of the pipe, and finally execs the program that should run as the child process.
For more info see libc’s Pipes and FIFOs, Creating a pipe and Pipe atomicity documentation.
So for example, if we instantiate a process object as:

with subprocess.Popen(["my_command"], stdout=subprocess.PIPE) as process:
    ...
then process.stdout will hold the reading end of the pipe, while (assuming “my_command” is another Python program) the child process’ sys.stdout will hold the writing end.
It’s important to notice that the reading end is handled by an instance of io.BufferedReader, as it’s open in binary reading mode, while the writing end will still be handled by an instance of io.TextIOWrapper (again, assuming the child process runs a Python program), but in this case both sys.stdout.isatty() and sys.stdout.line_buffering will evaluate to False.
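A quick sketch (mine) to verify both claims, with an inline Python one-liner standing in for the child program:

from subprocess import Popen, PIPE

child_code = "import sys; print(type(sys.stdout), sys.stdout.isatty(), sys.stdout.line_buffering)"
with Popen(["python", "-c", child_code], stdout=PIPE) as p:
    print(type(p.stdout))   # <class '_io.BufferedReader'>
    print(p.stdout.read())  # the child reports io.TextIOWrapper, False, False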
Okay, so using an io.BufferedReader instance in our use case isn’t great, because we basically want to read lines from stdout as if it were attached to a TTY. So, is there a way to reconfigure our piped subprocess’ stdout buffering strategy? Luckily, this time, the answer can be found by reading the Popen docs and their many, many options; in fact, setting bufsize=1 and universal_newlines=True when invoking Popen will change the type of the reading end of our pipe to io.TextIOWrapper, and its underlying buffer will be line buffered. Note that the wrapper’s own buffer, on top of the binary one, will instead have line_buffering=False (which is a bit confusing but coherent).
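Again, a small sketch (mine, reusing child_process.py from the TL;DR above) to confirm what those options do to the reading end:

from subprocess import Popen, PIPE

with Popen(["python", "child_process.py"], stdout=PIPE, bufsize=1, universal_newlines=True) as p:
    print(type(p.stdout))           # <class '_io.TextIOWrapper'>
    print(p.stdout.line_buffering)  # False (the wrapper itself isn't line buffered)
    p.terminate()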
So, does having io.TextIOWrapper instead of io.BufferedReader as our reading end make Popen.communicate() non-blocking and behave as we expect? Sadly, no. But we can read directly from the stdout stream of our subprocess, remember? And since our reading end (process.stdout) is an instance of io.TextIOWrapper and the buffering strategy is line buffered, we can call readline() on it and expect it to block until a full line is available on the buffer, then return it.
So now we should have all what we need to solve our initial problem in a better way.
Instead of dumping our subprocess’ output to a file, reading the first few lines and forgetting about the following ones, we can consume the subprocess’ output in a separate thread, send the first few lines back to the main thread using a queue, and then keep consuming (and discarding) the rest of the output in the thread. This way we use only the memory we need.
from subprocess import Popen, PIPE
from threading import Thread
from queue import SimpleQueue


def consume_output(p, q, max_lines):
    line_count = 0
    while p.poll() is None:
        line = p.stdout.readline()
        if line_count < max_lines:
            q.put(line)
            line_count += 1


def main(max_lines):
    program_output = []
    line_count = 0
    with Popen(["my_command"], stdout=PIPE, bufsize=1, universal_newlines=True) as p:
        q = SimpleQueue()
        t = Thread(target=consume_output, args=(p, q, max_lines))
        t.start()
        while True:
            line = q.get()
            program_output.append(line)
            line_count += 1
            if line_count == max_lines:
                break
        # TODO: Write the rest of the logic here and do what you need with `program_output`
        p.terminate()
        t.join()  # Blocks until t terminates


if __name__ == "__main__":
    main(max_lines=2)
Note: we’re using Popen.poll() to check whether the child process is running or has terminated; once it has terminated, the thread terminates too.
Also note that if “my_command” is a Python program as well, you’ll have to remember to flush the prints that you want transmitted immediately, because since sys.stdout has been redirected to a pipe, sys.stdout.line_buffering == False and the buffer will be flushed only when the underlying binary buffer is full.
# spam_many_lines.py
import sys
from time import sleep

while True:
    for _ in range(8):
        print("Spam!", end=' ', flush=True)
        sleep(1)
    print("Lovely Spam! Wonderful Spam!", flush=True)
    print("Line written", file=sys.stderr)
Problem solved.
That’s great, but the title of this blog post mentions capturing output in “real time”; so what if the child process doesn’t atomically write full lines? For example, what if we want to capture the output of one of those command-line programs that print their progress on a single line (e.g. with a progress bar)? In that case, reading line by line wouldn’t be of much use.
In an earlier section, we talked about our subprocess’ piped stdout being handled by an io.BufferedReader instance; that is the default mode in which subprocess.Popen instantiates our process.stdout.
By default, io.BufferedReader handles a fully buffered, binary stream, and implements a read() method, although that blocks until EOF if called with a negative or no parameter; if called with a positive integer n, it will block until n bytes are read.
I must say, io.BufferedReader.read()’s behaviour is not very clear from the official Python documentation, in my opinion, especially because other read methods like io.TextIOWrapper.read() or os.read() return “up to” n bytes when called with a positive integer, which means they won’t block.
So, is there a way to read bytes from a binary buffer as soon as they’re written, without having to wait for EOF (hence before the writing process closes the pipe), and without having to read byte by byte with read(1) (which is not very efficient)?
Luckily, that can be achieved by using a different read method: io.BufferedReader.read1() (even though, again, the official Python documentation is not super clear about it).
Another option would be calling os.read(), passing the subprocess’ stdout file descriptor and a positive n, in order to read at most n bytes per call.
Let’s write a simple program that just reads the subprocess stdout and prints it in real-time:
from subprocess import Popen, PIPE

with Popen(["my_command"], stdout=PIPE) as p:
    while True:
        text = p.stdout.read1().decode("utf-8")
        print(text, end='', flush=True)
        # TODO: Write the rest of the logic here and terminate `p` if needed
That’s it!
Note that, alternatively, you could also have used os.read() in case you wanted to read up to a fixed number of bytes, so you could have rewritten the read line as:

text = os.read(p.stdout.fileno(), 1024).decode("utf-8")

(Obviously you’d have to import os for that.)
The default buffer size of all concrete io.Buffered* classes is io.DEFAULT_BUFFER_SIZE, which is platform dependent.
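You can check the value on your system (on the CPython builds I’ve seen it’s typically 8192 bytes, but don’t rely on that):

import io
print(io.DEFAULT_BUFFER_SIZE)  # e.g. 8192, but platform dependent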
In case you want to change the buffering strategy for the stdout stream, you can re-initialise it. For example, in the above spam_many_lines.py you might want to set custom buffer settings when stdout is not attached to a TTY, but keep the default settings otherwise (and/or not have to pass flush=True to every print() call).
So you could first check whether sys.stdout is attached to a TTY, then disable the io.TextIOWrapper buffer and instead set a custom buffer size for the underlying binary buffer:
import io
import sys

if not sys.stdout.isatty():
    buff_size = 8
    sys.stdout = io.TextIOWrapper(
        open(sys.stdout.fileno(), 'wb', buff_size),
        write_through=True,
        encoding="utf-8",
    )
Note: the stdout buffer will now be flushed automatically every time 8 bytes have been written.
And that’s all I’ve got on Python buffers for now.
If you made it here, give yourself a pat on the back and see you soon!