Finding my first Python reference cycle

I’ve worked in Python for nearly six years now and, up to this point, I’ve not had to deal with the king of memory-related gotchas: the reference cycle. Simply put, a reference cycle happens when two or more objects form a cycle in the directed graph of their references (ok, not that simple). The simplest form is the “I point to you, you point to me, neither of our reference counts gets to zero” scenario.
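That simplest form can be sketched in a few lines (the class and attribute names here are illustrative, not from the code below):

```python
import gc


class Node:
    pass


a = Node()
b = Node()
a.partner = b  # a points to b...
b.partner = a  # ...and b points back to a

# Dropping our references doesn't free anything: each object still
# holds a reference to the other, so neither refcount reaches zero.
del a, b

# Only the cycle-detecting garbage collector can reclaim them now.
gc.collect()
```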

Usually, this isn’t a big deal. Although objects don’t get immediately collected when dereferenced, the garbage collector will eventually notice that there is some wasted memory, do the more expensive cycle-detection work, and free the objects. In many scenarios (single-run scripts, cron jobs, low-load applications) you’ll never notice; Python does its job, keeps things clean, and stays out of your way. If, however, you’re writing a high-performance server intended to handle the ever-increasing demands of the modern Internet, you just might be screwed. That GC work will cost you throughput and scalability. In the worst case, the GC won’t be invoked in time and you’ll hemorrhage memory (or at least gobble it up in competition with other things your system might need to do). In our case, initial numbers left us with a customer-detectable throughput degradation when a particular security feature was enabled.

Without further ado, here is the distilled pattern that ran us into trouble.

class Filter:
    def __init__(self, callback):
        self._callback = callback

    def update(self, data):
        """Override and call parent."""
        self._callback(data)


class AddLF(Filter):
    def update(self, data):
        data = data + '\n'
        Filter.update(self, data)


class AddCRLF(Filter):
    def update(self, data):
        data = data + '\r\n'
        Filter.update(self, data)


class Foo:
    def __init__(self, dos_mode=False):
        if dos_mode:
            self._filter = AddCRLF(self.the_callback)
        else:
            self._filter = AddLF(self.the_callback)
        self._filtered_data = list()

    def update(self, data):
        self._filter.update(data)

    def the_callback(self, data):
        self._filtered_data.append(data)

    def all_data(self):
        return ''.join(self._filtered_data)

At a glance, this isn’t an unreasonable design pattern. If you’re writing a cross platform tool, or something that needs to listen on a socket and account for variations and flaws in protocol implementations, passing the input through a filter to clean it up is reasonable. It’s even somewhat pythonic in that a call-back is used, allowing for various patterns of code re-use and subclassing in both the filters and the consumer.

The usage pattern is something like this: for each chunk of data you receive, call update(); the main class hands the data off to the filter, the filter does its thing, and the callback passes the data back up.

# Many of you will recognize this pattern from crypto digests
f = Foo()
f.update('asdf')
f.update('jkl;')

print f.all_data()
f = None
# At this point ref-counting could have reclaimed the object

Pay attention to that f = None assignment: under normal circumstances the reference count would hit zero and all of the objects would be collected immediately. Here that is not the case!
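You can watch this happen with a weakref. Here is a self-contained reduction of the classes above, with the data-handling details stripped out (run under Python 3 syntax; the original post predates it):

```python
import gc
import weakref


class Filter:
    def __init__(self, callback):
        self._callback = callback


class Foo:
    def __init__(self):
        self._filter = Filter(self.the_callback)  # bound method => cycle

    def the_callback(self, data):
        pass


gc.disable()            # make the demonstration deterministic
f = Foo()
probe = weakref.ref(f)  # observes f without keeping it alive

f = None
print(probe() is None)  # False: the cycle keeps the object alive

gc.collect()            # the cycle detector finds and frees it
print(probe() is None)  # True
gc.enable()
```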

Once I realized this region of code was the likely culprit, I started playing with the gc module and its debugging facilities. Most importantly, the gc.DEBUG_SAVEALL flag causes the garbage collector to store all unreachable objects in the list gc.garbage instead of completing collection; that list can then be examined for hints.

import gc, pprint

gc.set_debug(gc.DEBUG_SAVEALL)
print gc.get_count()
print gc.collect()
pprint.pprint(gc.garbage)

Here is the output from the usage example above.

(249, 5, 0)
6
[{'_filter': <__main__.AddLF instance at 0x7f23ec3261b8>,
  '_filtered_data': ['asdf\n', 'jkl;\n']},
 <__main__.Foo instance at 0x7f23ec326170>,
 <bound method Foo.the_callback of <__main__.Foo instance at 0x7f23ec326170>>,
 {'_callback': <bound method Foo.the_callback of <__main__.Foo instance at 0x7f23ec326170>>},
 <__main__.AddLF instance at 0x7f23ec3261b8>,
 ['asdf\n', 'jkl;\n']]

You might be able to see the loop already. We have, left over, a Foo instance, an AddLF instance (which we know is stored as a member variable of Foo), and (most importantly) a reference to a bound method on Foo. This bound method holds a reference back to the Foo instance, completing the cycle.
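The mechanics are easy to verify: a bound method object carries a strong reference to its instance, exposed as the standard __self__ attribute (im_self in older Pythons). A tiny demonstration, with a throwaway class:

```python
class Demo:
    def cb(self, data):
        pass


d = Demo()
m = d.cb                # a bound method object
print(m.__self__ is d)  # True: the method holds a reference back to d

# So storing self.the_callback inside an object that self also owns
# (Foo._filter._callback above) closes the loop:
#   Foo -> Filter -> bound method -> Foo
```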

What the original designer of this code probably didn’t realize is that passing that callback when instantiating a Filter would create that loop (via the bound instance method, which holds a reference to the instance). After a minor refactoring, I’ve gotten rid of the cycle, and reference counts hit zero when the last external reference to a Foo object is eliminated. Most importantly, all the related data is freed too; in this case Foo._filtered_data (which could grow to be very large) no longer lingers.
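The actual refactoring isn’t shown here, but one common way to break this kind of cycle is to have the filter hold its callback weakly. This sketch uses weakref.WeakMethod, which is Python 3.4+ only (on Python 2 you would hold a weakref to the instance itself instead); the structure mirrors the Filter/Foo pair above:

```python
import gc
import weakref


class Filter:
    def __init__(self, callback):
        # WeakMethod holds the bound method without keeping its
        # instance alive, so no reference cycle is formed.
        self._callback = weakref.WeakMethod(callback)

    def update(self, data):
        cb = self._callback()  # resolve the weak reference
        if cb is not None:     # the owner may already be gone
            cb(data)


class Foo:
    def __init__(self):
        self._filter = Filter(self.the_callback)
        self._filtered_data = []

    def update(self, data):
        self._filter.update(data)

    def the_callback(self, data):
        self._filtered_data.append(data)


gc.disable()            # prove refcounting alone suffices now
f = Foo()
f.update('asdf')
probe = weakref.ref(f)
f = None
print(probe() is None)  # True: freed by refcounting, no GC pass needed
gc.enable()
```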

This anti-pattern is only one step away from the simplest-case reference cycle, but it had real consequences. With it gone, memory is managed far more efficiently and this feature is usable in a demanding, high-performance environment.

Discuss this post at reddit.