python

Finding my first Python reference cycle

I’ve worked in Python for nearly six years now and, up to this point, I’ve not had to deal with the king of memory-related gotchas: the reference cycle. Simply put, a reference cycle happens when two or more objects create a cyclic directed graph in their path of references (ok, not that simple). The simplest form is the “I point to you, you point to me, neither of our reference counts get to zero” scenario.

Usually, this isn’t a big deal. Although objects don’t get immediately collected when dereferenced, the garbage collector will eventually notice that there is some wasted memory, do the more expensive cycle detection work, and free the objects. In many scenarios like single-run scripts, cron jobs, or low-load applications you’ll never notice; Python does it’s job, keeps things clean, and stays out of your way. If, however, you’re writing a high performance internet server intended to handle the ever increasing demands of the modern Internet, you just might be screwed. That GC work will cost you desired though-put and scalability. In the worst case, the GC won’t be invoked in time and you’ll hemorrhage memory (or at least gobble it up in competition with other things your system might need to do). I this case, initial numbers left us with a customer-detectable throughput degradation when a particular security feature was enabled.

Without further ado, here is the distilled pattern that ran us into trouble.

class Filter:
    def __init__(self, callback):
        self._callback = callback

    def update(self, data):
        """Override and call parent."""
        self._callback(data)

class AddLF(Filter):
    def update(self, data):
        data = data + '\n'
        Filter.update(self, data)

class AddCRLF(Filter):
    def update(self, data):
        data = data + '\r\n'
        Filter.update(self, data)

class Foo:
    def __init__(self, dos_mode=False):
        if dos_mode:
            self._filter = AddCRLF(self.the_callback)
        else:
            self._filter = AddLF(self.the_callback)
        self._filtered_data = list()

    def update(self, data):
        self._filter.update(data)

    def the_callback(self, data):
        self._filtered_data.append(data)

    def all_data(self):
        return ''.join(self._filtered_data)

At a glance, this isn’t an unreasonable design pattern. If you’re writing a cross platform tool, or something that needs to listen on a socket and account for variations and flaws in protocol implementations, passing the input through a filter to clean it up is reasonable. It’s even somewhat pythonic in that a call-back is used, allowing for various patterns of code re-use and subclassing in both the filters and the consumer.

The usage pattern is something like this: for each chunk of data you receive, call update(), the main class hands the data off to the filter, the filter does its thing, and uses the callback to pass the data back up.

# Many of you will recognize this pattern from crypto digests
f = Foo()
f.update('asdf')
f.update('jkl;')
print f.all_data()
f = None
# At this point ref-counting could have reclaimed the object

Pay attention to that last line: when we assign f = None, under normal circumstances the reference count would hit zero, and all objects should be collected. Here this is not the case!

Once I realized this region of code was the likely culprit, I started playing with the gc module and its debugging abilities. Most importantly, using the gc.DEBUG_SAVEALL setting causes the garbage collector to (instead of completing collection) store all unreachable objects in a list named gc.garbage. This can be examined for hints.

import gc, pprint

gc.set_debug(gc.DEBUG_SAVEALL)
print gc.get_count()
print gc.collect()
pprint.pprint(gc.garbage)

Here is the output from that single object creation.

(249, 5, 0)
6
[{'_filter': <__main__.AddLF instance at 0x7f23ec3261b8>,
  '_filtered_data': ['asdf\n', 'jkl;\n']},
 <__main__.Foo instance at 0x7f23ec326170>,
 <bound method Foo.the_callback of <__main__.Foo instance at 0x7f23ec326170>>,
 {'_callback': <bound method Foo.the_callback of <__main__.Foo instance at 0x7f23ec326170>>},
 <__main__.AddLF instance at 0x7f23ec3261b8>,
 ['asdf\n', 'jkl;\n']]

You might be able to see the loop already. We have, left over, a Foo instance, an AddLF instance (which we know is stored as a member variable of Foo), and (most importantly) a reference to a bound method on Foo. This bound method holds a reference back to the Foo instance, completing the cycle.

What the original designer of this code probably didn’t realize is that passing that callback when instantiating a Filter would create that loop (via the bound instance method, which has a reference to the instance). After a minor refactoring, I’ve gotten rid of the cycle and reference counts hit zero when the last external reference to a Foo object is eliminated. Most importantly all the related data is freed as well; in this case Foo._filtered_data (which could grow to be very large) will get freed up as well.

This anti-pattern is only one step away from the simplest case reference cycle, but it had real consequences. With it gone, memory is managed far more efficiently and this feature is usable in a demanding, high-performance environment.

I do not host comments due to spam and abuse. Discuss this post at reddit.

Tags:

Tuesday, November 3rd, 2009 Geekdom 3 Comments

Start your Python project with optparse and logging

Python continues to be my favorite language to hack in. It’s useful for tasks big and small and has the advantage of being more readable and maintainable than a lot of other scripting languages. I remember when, having learned the basics of Python, I decided to rewrite my home-built CD-to-MP3 script suite from its perl incarnation (perl was my previous favorite language for system programming projects). I knocked that project out in only about 2 hours, including a major redesign made easy by Python’s object syntax and built in “pickle” for serialization.

Since then, I’ve usually chosen Python for a variety of system programming tasks. This includes things like making backups, deploying software, configuration file templating, web site scraping, health monitoring, and some bigger data crunching/graphing. Time after time, these quick projects (often at work) turned into something bigger. Eventually they get rolled into a product or adopted by the production operations group as standard kit.

At some point I realized that there was something that I was doing that made this transition from “quick hack” to “standard tool” easy: I always start my projects with logging and optparse in place from day zero. During development, this means that I don’t have to scatter print-style debugging statements throughout the code, I can just use logging.debug() and turn them on and off at will (via command line flags). Once deployed or passed on, it means other people using it can immediately start interacting with my script just like other familiar Unix utilities.

So, here is a variation on what my typical python script starts out with.

#!/usr/bin/python

import logging

def foo():
    """These will only get output if you turn up verbosity."""
    logging.debug("This is debug.")
    logging.info("This is info.")

def bar():
    """These will all be output a default logging levels."""
    logging.warn("Warning!  Things are getting scary.")
    logging.error("Uh-oh, something is wrong.")
    try:
        raise Exception("ZOMG tacos.")
    except:
        logging.exception("Just like error, but with a traceback.")

if '__main__' == __name__:
    # Late import, in case this project becomes a library, never to be run as main again.
    import optparse

    # Populate our options, -h/--help is already there for you.
    optp = optparse.OptionParser()
    optp.add_option('-v', '--verbose', dest='verbose', action='count',
                    help="Increase verbosity (specify multiple times for more)")
    # Parse the arguments (defaults to parsing sys.argv).
    opts, args = optp.parse_args()

    # Here would be a good place to check what came in on the command line and
    # call optp.error("Useful message") to exit if all it not well.

    log_level = logging.WARNING # default
    if opts.verbose == 1:
        log_level = logging.INFO
    elif opts.verbose >= 2:
        log_level = logging.DEBUG

    # Set up basic configuration, out to stderr with a reasonable default format.
    logging.basicConfig(level=log_level)

    # Do some actual work.
    foo()
    bar()

This is obviously a pretty minimal setup, but it achieves what most people need. You get option parsing and checking along with usage information. You get warnings and errors spat out to sys.stderr (leaving sys.stdout for actual program output). You can specify -v to crank up verbosity.

More than once, I’ve had a pager-frazzled sysadmin ask me how the heck to use the new tool. I get to reply, “Just run it with --help, it has full usage instructions built in.” This makes sysadmins happy; really happy. It means they don’t have to remember odd semantics or dig around for the how-to page on the internal wiki at 3:00am when they need to use it. Culturally, there is almost nothing more valuable than a sysadmin who has had success with your code, and nothing worse than one that has had trouble with it. To me, the first step in granting a sysadmin that sort of success is to make your code act like everything else in /usr/bin and that’s why I’ll continue starting my projects in this manner.

Tags: ,

Wednesday, July 1st, 2009 Geekdom 1 Comment

A brief Python decorator primer

My last post looked at easing a common database idiom using a decorator. Interestingly, the brief discussion on reddit regarding the post had a number of people remarking that the coverage of decorators was more interesting than the SQL-related bits. Curious, that.

It seems I’m not alone in being occasionally puzzled by the way decorators work. It took me a little while to comprehend them. I’m still not a huge fan of how they work, but I can see their utility and do use them when appropriate. Almost without fail I end up heading over to PEP-318 and puzzling over the variations. The PEP isn’t terribly helpful since it is dominated by historical discussion rather than examples or instructions.

So let’s start with the basic no-argument decorator. It doesn’t take any declarative arguments when decorating a function; you just type @some_deco without parenthesis when using it.

def some_deco(f):
    def _inner(*args, **kwargs):
        print "Decorated!"
        return f(*args, **kwargs)

    return _inner

@some_deco
def some_func(a, b):
    return a + b

print some_func(1, 2)

This will print out:

Decorated!
3

According to the PEP, if you wrap some_func with @some_deco, it is the equivalent of some_func = some_deco(some_func). What it fails to call out is that this happens at load time. What we're returning from the decorator function is a new function or callable that will be bound to the name some_func when this module is loaded. Subsequent run-time calls to some_func are actually calling the _inner function we defined on the fly when the decorator was called. Thanks to Python's scoping rules and closures, we have access to the passed f parameter from the inner function, so we can call it.

Lovely. Now you have the power to tack on any sort of things you want to around a given function. Print a log message before or after, spawn a thread, add some exception handling. Whatever you want.

Things get a little more hairy if we want our decorators themselves to take parameters. According to the PEP, if was want parameters on the decorator to take parameters it's the equivalent of func = decomaker(argA, argB, ...)(func). Again, this happens at load time, so it's worth staring at for a moment. In plain English, this is saying that decomaker will be called with some arguments, and then the result of that will be called with a function reference. This is key! There are actually two invocations happening. The first will need to return a callable that accepts a function as it's argument. That will then be called, and the result bound to the name func, so it better be callable as well. Let's try it.

def log_wrap(message):
    """Print `message` each time the decorated function is called."""
    def _second(f):
        def _inner(*args, **kwargs):
            print message
            return f(*args, **kwargs) 

        # This is called second, so return a callable that takes arguments
        # like the first example.
        return _inner

    # This is called first, so we want to pass back our inner function that accepts
    # a function as argument
    return _second

@log_wrap("Called it!")
def func(a, b):
    return a + b

print func(1, 2)

This will output:

Called it!
3

Ok, ok, that's not terribly useful, but it illustrates the point (and expands on the first example by being a parameterized version). It requires that we construct two callable items inside our decorator, and that each will be called in turn during load time. Let's quick dissect what happens at load time. When python gets to the declaration of func with the decorator sitting on top of it, it calls the decorator function with its argument (in this case, the string "Called!"). What is returned is a function reference to _second. Next, that just-returned reference to _second is invoked with a single argument: a function pointer to func, which is what we're actually trying to decorate. The return value of _second is a reference to _inner, which will be bound to the name func, ready for run-time invocation.

Yeah, I'm a little dizzy, too. But if you practice this a couple times, it'll start to make sense. Again, the key to getting this second form working is to recognize that two calls are happening at load time and returning the appropriate callable references at the right time.

Once you have it down, use this new hammer when it serves you, remembering that not every problem is a nail.

Tags:

Thursday, May 28th, 2009 Geekdom 3 Comments

Python, decorators, and database idioms

Now that I’m a co-maintainer of the MySQL driver for Python I get to see a lot of confused people pop in on the support forums with some elementary problems. These seem to be mostly young developers who haven’t quite mastered the “start transaction, work with the database, commit” pattern that so many of us have become used to after a few years.

I still find the overhead of doing that sort of thing repeatedly to be tiresome. It doesn’t take long for your code to become a mess of try/except blocks, making code reading confusing and arduous. To avoid this, I’ve been using a Python decorator to wrap my SQL functions. It goes a little something like this (simplified form):

def dbwrap(func):
    """Wrap a function in an idomatic SQL transaction.  The wrapped function
    should take a cursor as its first argument; other arguments will be
    preserved.
    """
    def new_func(conn, *args, **kwargs):
        cursor = conn.cursor()
        try:
            cursor.execute("BEGIN")
            retval = func(cursor, *args, **kwargs)
            cursor.execute("COMMIT")
        except:
            cursor.execute("ROLLBACK")
            raise
        finally:
            cursor.close()

        return retval

    # Tidy up the help()-visible docstrings to be nice
    new_func.__name__ = func.__name__
    new_func.__doc__ = func.__doc__

    return new_func

@dbwrap
def do_something(cursor, val1=1, val2=2):
    """Do that database thing."""
    cursor.execute("SELECT %s, %s", (val1, val2))
    return cursor.fetchall()

class SomeClass(object):
    def __init__(self):
        conn = MySQLdb.connect(db='test')
        self.instance_var = SomeClass.get_stuff(conn)
        print self.instance_var
        conn.close()

    @staticmethod
    @dbwrap
    def get_stuff(cursor):
        """Load something meaningful from the database."""
        cursor.execute("SELECT 'blah'")
        return cursor.fetchall()[0][0]

if '__main__' == __name__:
    conn = MySQLdb.connect(db='test')
    print do_something(conn)
    print do_something(conn, 3, 4)
    print do_something(conn, 5)
    print do_something(conn, val2=6)
    #help(do_something)

    s = SomeClass()
    #help(s)

Now, there are obviously several additions you could make here. You could add retry/reconnect logic, be a little more careful when closing the cursor (which can throw exceptions), or even attempt some deadlock recovery. But I think this simple example shows how you can effectively use decorators to implement a useful DB interaction idiom on both functions and object static methods.

And, damn, if it doesn’t save a lot of typing.

Tags:

Friday, May 22nd, 2009 Geekdom No Comments

Django and app-relative static content

This one drove me a little nuts at first, but I finally realized how to do it.  In the process of trying to write a little application using Django and jQuery, I found it to be annoyingly non-obvious how I should serve up a self contained set of graphical assets (like images) and my own application-local jQuery distribution.  Sure, I could drop stuff in the same media root as the admin app, or tell people how to create a static mapping, but that kind of defeats modularity.  I really want people to be able to drop my little application into their existing Django project or installation, add a single include() entry into their project’s urls.py and go.

I tried a couple of things including a manually constructed RegexURLPattern, but it didn’t feel very clean or pythonic. Over the weekend I realized I could just write a simple delegating view function in my application’s views.py:

import os
import django.views.static

STATIC_ROOT = os.path.dirname(os.path.normpath(__file__)) + '/static'

def static(request, path):
    """Return app-relative static resources and collateral."""
    return django.views.static.serve(request, path, STATIC_ROOT)

With this trick and the app-relative templates loaded from the templates/ directory (if you don’t waste time on the doc bug I found) it is quite possible to write a fully self-contained, drop-in application.

Tags:

Tuesday, January 20th, 2009 Geekdom No Comments

Not a good start

This is going to be one of those days.  I decided to install a python package called Plex that we use in a few places.  It looked like a coworker was misusing it a little, so I wanted to understand more.  I took a quick look at the tarball contents and was instantly annoyed.

[kylev@kylev-dt tmp]$ tar tzf Plex-1.1.5.tar.gz
[ Bunch of stuff scrolls off]
tests/._test6.py
tests/test6.py
tests/._test7.in
tests/test7.in
tests/._test7.out
tests/test7.out
tests/._test7.py
tests/test7.py
tests/._test8.in
tests/test8.in
tests/._test8.out
tests/test8.out
tests/._test8.py
tests/test8.py
tests/._test9.in
tests/test9.in
tests/._test9.out
tests/test9.out
tests/._test9.py
tests/test9.py

What jumps out at me is the damn ._ files everywhere. Crud, the author did this on a Mac, which in certain cases (and versions of tar) will include these annoying extra empty files. No big deal, it’s just annoying. Maybe I’ll talk to the maintainer later and have him fix it.

Let’s get on with it and get this baby installed:

[kylev@kylev-dt tmp]$ tar xzf Plex-1.1.5.tar.gz
[kylev@kylev-dt tmp]$ cd Plex-1.1.5.tar.gz

Wait, what the hell? Why did tab-completion give me the tarball again? Oh, damnit! While being distracted and annoyed with the OSX dot-underscore files, I failed to notice that this tarball doesn’t have a top level container directory! Argh, it has just spewed files all over instead of being neatly contained. No big deal, I’m in my ~/tmp directory so I probably didn’t over-write anything important. Time to clean things up:

[kylev@kylev-dt tmp]$ tar tzf Plex-1.1.5.tar.gz | xargs rm
rm: cannot remove `./._Iconr': No such file or directory
rm: cannot remove `Iconr': No such file or directory
rm: cannot remove `Plex/': Is a directory
rm: cannot remove `doc/': Is a directory
rm: cannot remove `examples/': Is a directory
rm: cannot remove `tests/': Is a directory

Bah, I could have used rm -rf, but I didn’t want to trash the whole examples or doc directories in case I had one not created by this package. Let’s just clean up the last bits one by one.

[kylev@kylev-dt tmp]$ rmdir ._Icon^M

What did tab-completion just do with… facepalm! The tarball contained a directory with a carriage-return in the name! Time to fire up emacs dired to finish cleaning up.

That was one troublesome tarball. It can only get better from here, right?

Tags: ,

Tuesday, December 2nd, 2008 Day in the Life, Geekdom No Comments

Pylons on Fedora 8

I’ve started playing with Pylons and like it so far. It has the rails-ish routing system, but doesn’t have some of the nonsense. Additionally, it’s designed to be open and adaptable, letting you plug in your own ORM or eschew using one entirely (unlike Django which seems to really want you to use their badly crippled ORM).

To make things easier for people running Fedora 8 or other RPM-based Linux distros, I’ve created some spec files and F8 RPMs to let you install Pylons in an RPM-friendly way. The other dependencies (like python-simplejson) can be found already via yum. Try them, and let me know what you think; I may submit these for F9.

Update: These have been submitted for review.

Tags: , ,

Friday, April 11th, 2008 Geekdom No Comments

All in order

I swear I have rare moments of clarity. I was really happy when I finally realized how a non-recursive in-order binary tree traversal would work. It’s a piece of CS trivia that I continually forget. I’m also pleased to realize how easy it is to write in python as an iterator class. 16 lines, with comments and some white space.
› Continue reading

Tags:

Friday, January 7th, 2005 Geekdom Comments Off

Fun with python properties

I ran into an oddity of something I wanted to do with python properties while playing with my music server the yesterday. After a little futzing, I found a way around my problem. Read on to find out how to do python properties and inheritance.
› Continue reading

Tags:

Wednesday, October 13th, 2004 Geekdom Comments Off
 

Twitter Feed

- Twitter Goodies - Profile

Archives