Monday, December 24, 2012

Notes about Chicago

I had a chance to visit Chicago last week. The trip was devastating to me, but that's another story. This post is about the things to do in Chicago if you somehow find yourself there for the very first time.

  1. The transit system (metro/train and bus) in Chicago is run by CTA, short for Chicago Transit Authority. If you need to find your way to a metro station, watch for signs of CTA.
  2. They charge you a flat rate of 2.25 dollars a trip. So, from the O'Hare airport, you can take the Blue line to downtown Chicago for 2.25 dollars, one way.
  3. The fare card can be purchased with cash at the O'Hare station, at the vending machine. Once you have the fare card, you can top it up with either cash or credit/debit card.
  4. There is the GO Express shuttle service from the airport to downtown for 32 dollars. You can catch this van at the international terminals.
  5. The Field Museum is one tiring place to visit. Why? Because there are just too many things to see. You may need to spend one whole day to be here. The 3D movie is a let down.
  6. The Lincoln Park Zoo is one nice cozy environment. It is not big. Neither is it small. Just about the right size for one morning or afternoon trip.
  7. Lou Malnati's pizzeria serves fantastic Chicago deep dish pizza.
I hope some reader will find this post helpful to them.

Thursday, November 22, 2012

How to use UniBeast with InstallESD.dmg

Update 3: See below for a (supposed) workaround for UniBeast MountainLion 1.7 (I have not tried out v1.7 yet. Let me know if it works for you, though).

Update 2: See below for a workaround for UniBeast MountainLion 1.6.1.

Update: As commented below by osxtosh, this trick only works with UniBeast 1.5.3 and below.

UniBeast only needs InstallESD.dmg from the Install OS X ***.app. So, if you already have that file, you do not need to download the installer from Apple AppStore anymore. All you need to do is to create a directory structure (so called package) to satisfy UniBeast.

% cd /Applications
% mkdir -p "Install OS X Mountain"
% cd "Install OS X Mountain"
% cp path_to_InstallESD.dmg SharedSupport/
% mkdir -p "_MASReceipt"
% dd if=/dev/random count=$[RANDOM % 4000] bs=1 of=_MASReceipt/receipt
% echo '' >> _MASReceipt/receipt
% dd if=/dev/random count=$[3000 + RANDOM % 4000] bs=1 >> _MASReceipt/receipt

Remember to change Mountain Lion to the appropriate version that you took InstallESD.dmg from. Repeat the last three commands if the size of generated receipt is smaller than 4800 bytes.

Monday, September 24, 2012

Comparing multiple columns in MySQL

Few days ago, I needed to write an SQL query on combined unique key. The situation is that I have a table with one unique constraint on two columns. Let's call those columns A, and B. The table had these values:

| A | B |
| 1 | 1 |
| 1 | 2 |
| 2 | 3 |
| 2 | 4 |

At first, I tried this query below:

SELECT * FROM tbl WHERE (A, B) IN ((1, 1))

It worked beautifully. MySQL EXPLAIN command said it only needed to look at one row.

So I ventured further to put in one more value to search for:

SELECT * FROM tbl WHERE (A, B) IN ((1, 1), (1, 2))

This time, EXPLAIN told me it needed to scan the whole table (4 rows). I thought it was odd and tried to make a comparison with the regular query:

SELECT * FROM tbl WHERE (A=1 AND B=1) OR (A=1 AND B=2)

The third query turned out to scan only two rows!

Perhaps the IN operator has not been optimized in this case yet so for now I've got to do with a clumsier query. I could, of course, shorten the third query to A=1 AND B IN (1, 2) to make it less wieldy.

On a related note, SQLite does not even support tuple comparison.

Tuesday, September 11, 2012

Database connection pooling in Django made easy

Django does not pool its SQL connections. Every single web request is served with one new connection to the DBMS. If the web server and DB server are far apart, the cost of establishing new connection could be expensive.

Fortunately, it is easy to take advantage of SQL Alchemy's excellent pooling support in Django. What one needs to do is simply to call sqlalchemy.pool.manage on the Database name in the Django DB backend base module.

For example, the MySQL backend could be made to use connection pooling with these three lines:

from django.db.backends.mysql import base
from sqlalchemy import pool
base.Database = pool.manage(base.Database)

Extra parameters can be passed to manage to tune the pool.

This monkey patching should (or rather must) be done before any call to Django ORM. If you, for instance, use gunicorn_django to serve your application, you can place those lines in its configuration file. Alternatively, you can also stash them in your application's file.

Tuesday, September 4, 2012

Protecting Android application

Few days ago, Tim "diff" Strazzere gave a presentation about different ways that developers can protect their Android applications.

I am noting down here a few key points that he nailed down succintly. These are also the same points that I have always advocated for. I'm happy to find them resonating so well in the community.

  • Do not provoke the crackers. Instead, try to make peace with them. Most of the respected (the elite tier) crackers are actually pretty decent guys and can be reasoned with.
  • Limit the information crackers can get. This includes function names, debug logs, crash report, web service request parameters, and so on. Any descriptive name, hint, or keyword could give away the much needed information to the crackers.
  • Delay feedback to the crackers. Accept registration information easily early on. And fail when the feature is actually invoked.
  • Phone home. Or use server provided data as parameters in the client. A valid user would get sensible data; an invalid one would get crippled data.

These practices could be generalized into just a few simple, commonsensical principles. Basically, the compilation process is a lossy process. A compiler produces a "less documented" artifact from a "better documented" source code. And we should do our best to maintain that leverage against crackers. Then, the classic laws of locality come into effect. Disrupting locality of time and locality of space is the goal to aim for. And lastly, as Bruce Schneier put it, "amateurs attack machines; professionals target people." Battling against crackers needs to be a battle against the way people do things, against the crackers' way of thinking. Beside technological measures, other factors such as behavioral, and psychological aspects must be also considered.

Sunday, September 2, 2012

virtualenv and pip

Update: virtualenv 1.8.1 released on September 3 would fix this problem with --never-download.

pip 1.2 was released about 10 hours ago. It is so new and shiny that it breaks my current deployment process that has worked fine with pip 1.1.

Before today, I have a requirements.txt file that specifies all dependencies for my Python web app. My deployment script basically creates a virtual environment with virtualenv and then does a 2-pass pip'ing like this:

pip install --download=pipcache -r requirements.txt
pip install --find-links file://absolute/path/to/pipcache -r requirements2.txt

requirements2.txt is derived from requirements.txt to eliminate all URLs. For example, requirements.txt has this line,

while requirements2.txt has this:


The reason for two passes is to cache downloaded packages and ignore PyPI. The option --download-cache does help with the first goal but always need PyPI.

That deployment process was working fine, until 10 hours ago. Suddenly pip errors out in the second pass, saying PyMySQL cannot be found. Looking at the log file, I suspect that pip does a case sensitive string comparison between pymysql and PyMySQL. Where pip gets the string pymysql from, I do not know.

Then I immediately looked for a way to force virtualenv to pin pip 1.1. Of course, it's not possible. Which is kind of ironic because virtualenv encourages pip over setuptools, yet latest pip fails. A case of oversight, I guess.

So, the solution is to let virtualenv pick the latest pip as usual, and then use pip to degrade itself to 1.1.

Friday, August 31, 2012

First Kittendome meeting in PyCon 2013 Program Committee

Yesterday marked the first Kittendome meeting in preparation for US PyCon 2013.

I attended the meeting in #pycon-pc on Freenode. It went surprisingly well. This is all thanks to the hard work that the committee has put in over the years and the enormous energy that the chairs have spent ensuring the facilities are in good shape.

The pycon_bot was very helpful, too. In fact, it was kinda neat to have a bot tally all votes and organize the meeting. It even pestered voters who had not voted! I would absolutely consider using the IRC bot if I ever hold any meeting over IRC.

Sunday, August 26, 2012

Pair work in a classroom

Problem statement

You’re in charge of teaching a classroom of students and in the next assessment, each student has the option to work either in pairs or alone. As it turns out, the students are seated in an M x N rectangular grid, and each student can work with up to one adjacent classmate -- a student in the center of the classroom has up to 4 choices of partners, while those at the corners have up to 2.

We assume that the productivity of each student (when working alone) is represented by a positive integer. When two adjacent students work as a pair, their combined productivity is equal to the product of their individual productivities. For example, if two adjacent students have productivities of 3 and 5 each, then their total productivity would be 8 when working separately, but 15 when working together as a pair. Students with productivity 1 always prefer to work alone and should never be placed in a pair.

Task: Given the productivity for each individual in the class (as a grid of positive integers), design an algorithm that computes the maximum total productivity achievable by pairing students in the class. Note that the total productivity consists of the summed productivities of all pairs and students working alone; note also that productivity scores should only be counted once per pairing, not once for each individual in the pairing.


Based on a checker pattern, students can be divided into two groups: those on the white spots, and those on the black spots. Position (0, 0) is the first white spot; next position (0, 1) is the first black spot and so on, as illustrated in figure below.

(0, 0)(0, 1)(0, 2)
(1, 0)(1, 1)(1, 2)
(2, 0)(2, 1)(2, 2)

Observation: students on the white spots can only pair with students on the black spots and vice versa.

The maximum productivity value is attained when we can pair all students up in a way that the sum of their productivity values is greatest. And this problem can be easily solved with min-cost bipartie matching algorithm such as Hungarian algorithm.

With the input below, I got 1859 as the answer. What do you get?

inp = [
 '7 5 9 2 3 1 9 3 3 6 2 1 5 3 4',
 '1 2 7 1 5 2 8 6 1 3 4 3 4 7 3',
 '2 9 5 4 4 2 5 1 1 6 1 8 9 5 2',
 '1 8 2 2 4 3 8 8 8 9 6 7 3 6 7',
 '8 1 4 2 4 7 9 9 1 5 4 7 7 6 2',
 '4 3 6 3 2 3 8 9 5 2 7 4 7 1 1',
 '2 7 2 6 1 2 3 3 3 6 5 6 1 1 7',
 '4 6 2 3 6 2 6 4 3 2 1 1 5 7 7',
 '6 5 9 6 4 1 2 3 6 2 4 9 8 4 7',
 '5 9 6 8 5 2 3 5 1 7 2 8 5 7 5',

Tuesday, July 24, 2012

Introducing auto_argument

One of the subtle bugs that often bites new Python programmers is default argument values. People expect those default values to be (re-)created and supplied to the function at each invocation. And so, when it does not happen that way, developers scratch their heads.

It is easy, though, to understand why Python acts that way. As explained in the Python Tutorial:

The default values are evaluated at the point of function definition in the defining scope.

And that:

The default value is evaluated only once. This makes a difference when the default is a mutable object such as a list, dictionary, or instances of most classes.

To prevent the default from being shared between subsequent calls, one usually writes code like this:

def f(a, L=None):
    if L is None:
        L = []
    # do something with L

The if block creates a new actual default value to replace the None marker value.

To alleviate repeated typings and clutter if blocks, I wrote a simple module auto_argument that automatically inspects the function arguments on each invocation and replaces them with new values. The code above would look cleaner with callable_auto_argument like this:

@callable_auto_argument([('L', None, dict)])
def f(a, L=None):
    # do something with L

Of course, developers are very welcome to make their own decorator by subclassing one of these auto_argument classes to save redundant code. See auto_dict_argument and test_subclass in the source code for example.

And here is auto_argument in all its glory :-). The code is released under MIT license.

# Copyright 2012, Nam T. Nguyen
# Released under MIT license

'''Automatically replace a default argument with some other (potentially
dynamic) value.

The default argument is usually guarded like this::

    def func(arg=None):
        if arg is None:
            arg = dict()
        // use arg

With decorators provided in this module, one can write::

    __marker = object()

    @callable_auto_argument([('arg', __marker, dict)])
    def func(arg=__marker):
        // use arg

See class:`callable_auto_argument`.

Also, the standard Python behavior could be thought as::

    __marker = object()

    @passthru_auto_argument([('arg', __marker, {})])
    def func(arg=__marker):
        // ...

See class:`passthru_auto_argument`.

These classes can be used by themselves or serve as base classes for more
customizations. For example, to eliminate repeated typings, one can subclass
``callable_auto_argument`` like this::

    class auto_dict_argument(callable_auto_argument):

        def __init__(self, *names):
            names_markers_values = []
            for name in names:
                names_markers_values.append((name, None, dict))
            super(auto_dict_argument, self).__init__(names_markers_values)

And then apply this on methods like this::

    @auto_dict_argument('arg_1', 'arg_2', 'arg_3')
    def method(arg_1=None, arg_2=None, arg_3=None):
        # arg_1, arg_2, arg_3 are new dicts unless specified otherwise
        # and these lines are no longer required
        # if arg_1 is None: arg_1 = {}
        # if arg_2 is None: arg_2 = {}
        # if arg_3 is None: arg_3 = {}


import unittest

class auto_argument(object):
    '''The base class for automatic argument.

    Subclasses must implement method:`create` to create appropriate value for
    the argument.


    def __init__(self, names_markers_values):
        '''Construct an auto argument decorator with a collection of variable
        names, their markers, and their supposed values.

        The __supposed__ value objects are used in method:`create` to produce
        final value.


            names_markers_values (collection of 3-tuples): A collection of
                (string, object, object) tuples specifying (in that order) the
                names, the marker objects, and the supposed value objects


        self.names_markers_values = names_markers_values

    def create(self, name, current_value, value):
        '''Return a value based for the named argument and its current value.

        This final value will be used to replace what is currently passed in
        the invocation.

        Subclasses MUST override this method to provide more meaningful


            name (string): The argument's name
            current_value: Its current value in this invocation
            value: The supposed value passed in during construction time


            Final value for this argument


        raise NotImplementedError()

    def __call__(self, orig_func):
        def new_func(*args, **kw_args):
            for name, marker, value in self.names_markers_values:
                # check kw_args first
                    if kw_args[name] is marker:
                        kw_args[name] = self.create(name, kw_args[name], value)
                except KeyError:
                    # ignored

                # then check args
                # we need to instropect the arg names from orig_func
                co_obj = orig_func.func_code
                # number of required arguments
                nr_required = (co_obj.co_argcount -
                for i in range(nr_required, co_obj.co_argcount):
                    if co_obj.co_varnames[i] != name:
                    # is it supplied in args?
                    if i < len(args):
                        if args[i] is marker:
                            if isinstance(args, tuple):
                                args = list(args)
                            args[i] = self.create(name, args[i], value)
                    # it is not, so, check defaults
                        default = orig_func.func_defaults[i - nr_required]
                        if default is marker:
                            kw_args[name] = self.create(name, default, value)

            # invoke orig_func with new args
            return orig_func(*args, **kw_args)

        return new_func

class callable_auto_argument(auto_argument):

    def create(self, name, current_value, value):
        # call on value
        return value()

class passthru_auto_argument(auto_argument):

    def create(self, name, current_value, value):
        # just return it directly
        return value

class AutoArgumentTest(unittest.TestCase):

    def test_keyword_1(self):

        marker = 'replace_me'

        @passthru_auto_argument([('arg_name', marker, 'done')])
        def orig_func_0(arg_name=marker):
            return arg_name

        self.assertEqual('done', orig_func_0())
        self.assertEqual('done', orig_func_0(marker))
        self.assertEqual('not_replace', orig_func_0('not_replace'))
        self.assertEqual('done', orig_func_0(arg_name=marker))
        self.assertEqual('not_replace', orig_func_0(arg_name='not_replace'))

        @passthru_auto_argument([('arg_name', 'replace_me', 'done')])
        def orig_func_1(junk, arg_name='replace_me'):
            return junk, arg_name

        self.assertEqual(('ignore', 'done'), orig_func_1('ignore'))
        self.assertEqual(('ignore', 'done'),
                orig_func_1('ignore', marker))
        self.assertEqual(('ignore', 'not_replace'),
                orig_func_1('ignore', 'not_replace'))
        self.assertEqual(('ignore', 'done'),
                orig_func_1('ignore', arg_name=marker))
        self.assertEqual(('ignore', 'not_replace'),
                orig_func_1('ignore', arg_name='not_replace'))

    def test_keyword_2(self):

        marker_1 = 'replace_me'
        marker_2 = 'replace_too'

        @passthru_auto_argument([('arg_1', marker_1, 'done'),
                     ('arg_2', marker_2, 'enod')])
        def orig_func_0(arg_1=marker_1, arg_2=marker_2):
            return arg_1, arg_2

        self.assertEqual(('done', 'enod'), orig_func_0())
        self.assertEqual(('not_replace', 'enod'), orig_func_0('not_replace'))
        self.assertEqual(('done', 'not'), orig_func_0(marker_1, 'not'))
        self.assertEqual(('done', 'enod'), orig_func_0(marker_1, marker_2))
        self.assertEqual(('not_1', 'not_2'), orig_func_0('not_1', 'not_2'))

        @passthru_auto_argument([('arg_1', marker_1, 'done'),
                     ('arg_2', marker_2, 'enod')])
        def orig_func_1(junk, arg_1=marker_1, arg_2=marker_2):
            return junk, arg_1, arg_2

        self.assertEqual(('.', 'done', 'enod'), orig_func_1('.'))
        self.assertEqual(('.', 'not_replace', 'enod'),
                orig_func_1('.', 'not_replace'))
        self.assertEqual(('.', 'done', 'not'),
                orig_func_1('.', marker_1, 'not'))
        self.assertEqual(('.', 'done', 'enod'),
                orig_func_1('.', marker_1, marker_2))
        self.assertEqual(('.', 'not_1', 'not_2'),
                orig_func_1('.', 'not_1', 'not_2'))

    def test_subclass(self):

        class auto_dict_argument(callable_auto_argument):

            def __init__(self, *names):
                names_markers_values = []
                for name in names:
                    names_markers_values.append((name, None, dict))
                super(auto_dict_argument, self).__init__(names_markers_values)

        @auto_dict_argument('arg_1', 'arg_2')
        def test_func(arg_1=None, arg_2=None):
            arg_1['1'] = 'arg_1'
            arg_2['2'] = 'arg_2'
            return (arg_1, arg_2)

        self.assertEqual(({'1': 'arg_1'}, {'2': 'arg_2'}), test_func())
        arg_1, arg_2 = test_func()
        self.assertNotEqual(id(arg_1), id(arg_2))

if __name__ == '__main__':

Friday, July 20, 2012

Interface programming with intfprgm

Based on the blog post from two days ago, I continued to make two more additions to the set of decorators. The resulting module is called intfprgm, for interface programming. The repository is hosted at

A careful reader will notice that the LICENSE file is actually MIT license when the source is released to the public domain. I'd like to confirm that the code is indeed placed in the public domain. Enjoy!

Wednesday, July 18, 2012

Enforcing concrete class in Python

Interface-based programming is one of the two principles that Gang of Four book advocates for. This style of programming makes the contract between a provider and a consumer crystal clear while securely hides the implementation details and allows for easy modification.

Hence, we can find the concepts of interface and abstract class in virtually all object oriented programming languages. An interface provides a pure contract without any implementation. A concrete class implements all of them. And an abstract class lies somewhere in between with some unimplemented contracts, and some implemented ones.

Python, in its pureness, does not facilitate this separation of concerns. There is no interface nor abstract class in Python. And the usual way to define an interface or abstract class in Python is to raise NotImplementedError in the methods which subclasses need to implement.

Even though it usually works fine in practice, this does not enforce a concrete subclass to implement all unimplemented methods. When a concrete subclass forgets to implement some methods, the Python interpreter still happily runs the code until it hits the raise statement.

So I wrote a quick decorator to ensure that if a class is marked concrete then all its functions must not raise NotImplementedError. A SyntaxError would be thrown if any method (even inherited one) was left unimplemented. The check happens at class definition time.

To be clear, Python does have Abstract Base Class (ABC) but that is different from the regular interface/abstract class concept.

'''Set of (class) decorators to help with interface programming.


    class subclass(parent):

Released to the public domain.

Nam T. Nguyen, 2012


import dis
import types
import unittest

def concrete(orig_class):
    '''A decorator to ensure that a concrete class has no un-implemented

    An un-implemented method is defined similarly to::

        def method(...):
            raise NotImplementedError()

    This, when translated to bytecode, looks like::

        LOAD_GLOBAL       0 (NotImplementedError)
        CALL_FUNCTION     1
        RAISE_VARARGS     1

    Or when ``raise NotImplementedError``::

        LOAD_GLOBAL       0 (NotImplementedError)
        RAISE_VARARGS     1

    The check here is for such pattern.


    for name in dir(orig_class):
        func = getattr(orig_class, name)
        # correct type?
        if type(func) not in (types.FunctionType, types.MethodType):
        # check if first name is NotImplementedError
        if len(func.func_code.co_names) < 1 or \
                func.func_code.co_names[0] != 'NotImplementedError':
        # and RAISE_VARARGS somewhere after that
        for position in (3, 6):
            if len(func.func_code.co_code) < position:
            opcode = ord(func.func_code.co_code[position])
            if dis.opname[opcode] == 'RAISE_VARARGS':
                raise SyntaxError('Function %s.%s must be implemented.' % (
                        orig_class.__name__, func.func_name))

    return orig_class

class ConcreteTest(unittest.TestCase):

    def test_raise(self):
        class test1:
            def method(self):
                raise NotImplementedError
        self.assertRaises(SyntaxError, concrete, test1)

        class test2:
            def method(self):
                raise NotImplementedError()
        self.assertRaises(SyntaxError, concrete, test2)

    def test_not_raise(self):
        class test:
            def method(self):
                return NotImplementedError()

    def test_subclass(self):
        class parent(object):
            def override_me(self):
                raise NotImplementedError()
        class test(parent):
            def leave_it(self):
        self.assertRaises(SyntaxError, concrete, test)

if __name__ == '__main__':

Monday, June 11, 2012

Invalidating cookie-backed session from the server

I joined in a really interesting discussion about session management in web application this week. One of the issues raised in this discussion was how to invalidate a cookie-backed session. In this post, I am going to explain how I would tackle this problem. All comments are welcome.

Brief context

Every web developer knows that basically HTTP is a stateless protocol. The connection is terminated after one request has been served. Each request appears to the web server as a new request.

It is easy, however, to make requests related to each other, essentially establish a long session spanning many requests, via a mechanism called cookie. A cookie is a piece of information that the web server wants the user agent, the browser, to remember and pass back to the server in next requests. Therefore, a cookie is set by the server, but stored on the client. The value of a cookie (or the lack thereof) that is received by the server is not trustable.

To make a session more meaningful, web application usually associates more information than just the identifier to a session. These associated data are often stored on the server so that from a small session identifier, the web application can retrieve bigger session data. The browser would only know about the session identifier and nothing about those associated data. These session identifiers are generally random, and long enough and the associated data could be user objects, some binary blobs, or almost anything else.

Sometimes, web application embeds session data together with session identifier into the cookie itself. This is called cookie-backed session. The advantage of this is that no server side storage is required. The disadvantage is that the server can no longer trust session data because they are controlled by the client.


In order to ensure the integrity of cookie backed session, the server must have a way to identify whether a session has been invalidated before accepting its data. The challenge is in doing exactly that without requiring as much server side storage space as other session back ends.


And here's a solution that I derived.

The session cookie would have extra mandatory fields timestamp and checksum. The timestamp is the moment in time that this cookie was created or updated. The checksum is the MAC value of the whole cookie.

The cookie is considered invalid when one of these conditions are true:
  1. The elapsed time based on the timestamp is over the life time of the cookie.
  2. The MAC is not equal to the calculated value.
  3. The session identifier is in a list of invalidated session identifiers.
So the question boils down to if I can quickly work out the third condition without requiring too much space and time.

And I can, by following these steps:
  1. I will have a set called invalidated_sessions that stores the session identifiers.
  2. I will have a first in first out queue called logbook whose elements are tuples of (timestamp, session identifier).
  3. When a session is invalidated, the logbook is updated with a new tuple. Its timestamp value is of the session timestamp plus cookie life time plus some extra window to prevent race condition, and the session identifier of the session being invalidated. The invalidated_sessions set will also record the same session identifier.
  4. Before validating the session, the web application needs to prune expired entries in the logbook and invalidated_sessions. The application will peek into the first element of the queue to see if it has a timestamp that is older than the current time. If that is the case, the corresponding entry in invalidated_sessions is removed, this entry popped out of the queue, and the process repeated.
  5. After validating its timestamp and MAC value, the session is then matched against invalidated_sessions. The session is consider invalid if it is in the set.
In short, I am fundamentally implementing a limited memcache in the web application.


When the web server goes down, this in-memory memcache is lost, and hence all current, and untampered sessions are valid.

The set is not shared across all insances of the web application. This can cause a session to be wrongly accepted when it is served by a different instance than the one that it has been invalidated on.

After all, if this is just memcache, one might as well use the real memcache and take advantage of its cache expiration mechanism and sharing of data across all instances.

Friday, June 8, 2012

Speed of light is still the limit

The latest update from CERN has concluded that neutrinos respect the cosmic speed limit. This, I guess, puts an end to whatever imagination that project OPERA conjured.

Monday, May 28, 2012

One week with Riak

I spent a few days working on a Riak cluster and here are some helpful notes to help others:
  1. When setting up a cluster, the most important number to worry about is ring_creation_size. This number, at the moment, cannot be changed once the cluster is up and running. Therefore, you need to plan ahead and pick a proper number for your cluster. One node should preferably handle at least 10 partitions. For example, if you intend to run a 5-node cluster right now, and foresee it grows to 10 nodes later, 128, and 256 are good values.
  2. Just like other database systems out there, file descriptor limit is a critical factor in Riak. The conservative figure is about 20 times the number of partitions (which is the same as ring_creation_size). So if you choose to have 128 partitions, you need to at least set open file limit (ulimit -n) to 2560. This should be done with pam_limits.
  3. Always prefer noatime mount option, unless you really need to know last access time.
  4. Again, prefer SSD for low latency disk access. If you are running on EC2, try not to use its Elastic Block Store.
  5. Bitcask storage backend requires that there is enough memory to store all keys. LevelDB introduces more latency when more levels are used. To reduce latency with LevelDB, add more nodes to the cluster. Secondary indexes only work with LevelDB. Secondary indexes are slow, very slow, so try to favor direct access with keys, or usage of another system to store these indexes.
  6. A Riak node does not need to leave a cluster in order to conduct repair work. Just do riak stop, and start repairing it. When it's ready, riak start. Try to wait for the cluster to converge before working on another node.
  7. Lastly, the best thing about Riak is its fault tolerance, true to the spirit of Erland. The cluster will eventually self heal, and converge, even if it seems slow at times.
Time will tell if Riak can handle the tasks we throw at it. For now, I do appreciate the simplicity in setting up a Riak cluster and its reliability. Kudos to Basho!

PS: There is a good technical presentation given by the guys at Kiip about their use of Riak at The presentation discusses good tips and pain points in a real world deployment.

Friday, March 16, 2012

Python is going to have randomized hash function

Barry Warsaw committed a fix for "Randomize hashes of xml atributes in the hash table internal to the pyexpat module" to 2.6 on March 14. The patch was written by David Malcolm based on work by Victor Stinner and with modifications by the Expat project.

This is a collateral damage resulting from the much publicized hash function collision exploit. Expat is an XML parsing library that Python has a wrapper around. It was known to be vulnerable to the same O(N2) bug when one parses an XML element with many same-hash attributes such as wzzgDM51, ezzDlT7W, hzzfbyVV and so on.

Starting from Python 2.6.8rc2, there is a new switch -R, and an environment variable PYTHONHASHSEED that control this randomization.

The bottom line is: if you use Python 32-bit, do think of upgrading to 2.6.8 at the earliest convenience.

Duqu and Time and Space disruption

There has been much talk about Duqu (the worm) these last few days.

I'm not sure what is new here but the classic, foundational security idea: Time and Space disruption.

There is only one single pillar in optimization, and probably in everything we do in life too. It is called locality: Temporal locality and Spatial locality.

Temporal locality says that a program usually favors instructions that it has just recently executed. Think of a loop. A program runs through the same bunch of instructions again and again in that loop.

Spatial locality says that a program usually favors data that are close to each other. A loop, for example, will often iterate through data in the same array, one after another one in a sequential manner.

Locality of time and space are what make our habits and create order from chaos in this life. We wake up in the morning, eat breakfast and go to work every week day. That is locality of time. We usually go to the restaurant around the corner, the cinema a few blocks away, the toilet a few steps away.

Temporal and spatial localities help us understand and predict how a program or even a person would behave. So naturally, the best way, and may I so brazenly declare the one principle, to confuse all reverse engineering attempts and analysis is to disrupt this pattern. The more you can do that, the less predictable and understandable you become. As a figurative example, chasing after a crazy person is way more difficult than a sane one.

Duqu actively aimed for that. From what I can gather from public write ups, Duqu aggressively employs dynamic calls. This is certainly a great way to break spatial locality. Normally, you would have code similar to:

push 2
push 1
call func

Instead, in Duqu, you have:

push 2
push 1
call [mem_location]

What is the function being called? You can only find out at runtime. This makes static analysis virtually impossible. It breaks spatial locality.

Unsurprisingly, this is nothing new. This is exactly how virtual functions are implemented in C++. So, here's a tip for software writers in the battle against crackers: Use more virtual functions to better confuse crackers! That also suggests some object orientation implementation in the case of Duqu.

Beside using function pointers, Duqu employs event driven model. This disrupts temporal locality in architecture. Regular application follows a sequential model where we do everything step by step. Duqu does some small part, then moves on to do some other tiny unrelated part, then gets back to the previous work. It breaks the flow of analysis and requires great patience to trace the worm.

I may have missed many discussion but to me that's about it. I have not seen any active attempt to randomly place different code blocks at different locations, garbage control flows, and other techniques to actively disrupt localities.

But if with just this one technique (probably a result from some discipline way of writing code), Duqu has greatly confused security experts, you know that time and space disruption is at the core of all obfuscations.

PS: On a side note, I used to make a tool that follows exactly this simple principle.

Wednesday, March 7, 2012


I observed this very interesting MySQL issue. Basically, I have a huge tables with millions of rows in it. It has an ID column which is an auto-incremented integer primary key.

It took me 20 seconds to do SELECT COUNT(*) FROM tbl yet it only took me 4 seconds to do SELECT COUNT(*) FROM tbl WHEN id > 0.

Basically, these two queries return the same value. One would have expected the later to run a bit slower but the experiment proved otherwise. It seems quite counterintuitive. Why?

Tuesday, March 6, 2012

AMD CPU bug!

Does anyone remember Intel Pentium bug? Well, now we have one other CPU bug from AMD.

The description:

And the confirmation from AMD:

Basically, under very special conditions, the processor will incorrectly update the stack pointer. The most common symptom is a spurious segmentation fault.

This is too hot. Will there be recalls?

Tuesday, January 24, 2012

A simple event loop with PEP 380

On January 13, Nick Coghlan pushed up the C implementation of PEP 380, which officially supports yield from syntax, to Python 3. I have toyed around with yield from idea since P. J. Eby cooked up his fiendishly clever trampoline in pure Python. The official C implementation is a welcomed news because it alleviates the need for such a trampoline from app to app.

Here's a simple event loop that makes use of PEP 380.

import heapq
import inspect
import time

_processes = []

def sleep(delay):
    yield 'sleep', delay

def task(name, delay):
    for i in range(10):
        print('hello {0} {1} {2}'.format(time.time(), name, i))
        yield from sleep(delay)

def spawn(proc, *args, **kw_args):
    def generatorize(f, *args, **kw_args):
        yield 'ignore', f(*args, **kw_args)
    global _processes
    if not inspect.isgenerator(proc):
        if inspect.isgeneratorfunction(proc):
            proc = proc(*args, **kw_args)
            proc = generatorize(proc, *args, **kw_args)
    heapq.heappush(_processes, (time.time(), proc))

def run():
    global _processes
    while _processes:
        (start_time, proc) = heapq.heappop(_processes)
        if start_time > time.time():
            heapq.heappush(_processes, (start_time, proc))
            command, arg = next(proc)
        except StopIteration:
            if command == 'sleep':
                heapq.heappush(_processes, (time.time() + arg, proc))

def not_a_task():
    print('not a task')

spawn(task, 'task 0.5', 0.5)
spawn(task, 'task 1.0', 1.0)


The main functions are spawn and run. In spawn, a function object is converted into a generator if needed and then pushed into a heap queue of scheduled processes. These processes are given a chance to run when they are up in schedule. They can voluntarily yield execution to other processes by calling yield from sleep, or some other utility functions such as read or write.

Overall, I think the code is quite compact and readable.

Friday, January 20, 2012

How to build Python on Mac OS X Lion

Apparently this question comes up from time to time on python-dev@ (which, by the way, isn't the right list for these sorts of question).

So here's the short answer:

./configure CC=clang

There, enjoy!

For more information, one should refer to this bug report:

Tuesday, January 10, 2012

Google+ integrated into Google search

It's now official! Google has integrated its social network Google+ into Google search result. The feature is quite aptly called Google Plus Your World.

Back in August, I wrote a short prediction that this would happen. It certainly raises some eyebrows up. Facebook, for one, would be in serious disadvantage if "Your World" included popular celebrities that one isn't linked to but is "naturally" shown in search result nonetheless. A really strong cross promotion bridge here. And of course, regulators should be closely watching the development of this feature.

And I also am interested in seeing how Google avert future anti-trust on its monopoly in search. It would make for a very smart business maneuver.

Hello, 2012

Awesome! A new year has come! Let's kick it into another great one.

And may all your resolutions be fulfilled.