Saturday, December 17, 2011

Retrieving millions of rows from MySQL

There are times when your query returns a very large number of rows. If you use the default cursor, chances are your process will be killed while retrieving them. The reason is that, by default, MySQL clients (e.g. the Java connector, the Python driver) retrieve all rows and buffer them in memory before passing the result set to your code. If you run out of memory while doing that, your process will almost certainly be killed.
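
To make the problem concrete, the buffered pattern below is the one that gets a process killed. This is only a sketch; the connection parameters and table name are placeholders.

import MySQLdb

# Connection parameters and table name are placeholders.
conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="mydb")
cursor = conn.cursor()  # the default, buffered cursor
# By the time execute() returns, the driver has already pulled the entire
# result set into client memory; a huge table can kill the process right here.
cursor.execute("SELECT * FROM big_table")
for row in cursor:
    print(row)
cursor.close()
conn.close()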

The fix is to use a streaming result set. In Python, you can use MySQLdb.cursors.SSCursor for this purpose.

import MySQLdb
import MySQLdb.cursors

conn = MySQLdb.connect(...)
# Create an unbuffered (streaming) cursor on this connection.
cursor = MySQLdb.cursors.SSCursor(conn)
cursor.execute(...)
while True:
    row = cursor.fetchone()  # pull one row at a time from the stream
    if not row:
        break
    ...

There are two important things to remember here:
  1. Use an SSCursor instead of the default cursor. You can instantiate it directly as shown above, or pass the cursor class to the cursor() call, e.g. conn.cursor(MySQLdb.cursors.SSCursor); see the sketch after this list.
  2. Use fetchone to fetch rows from the result set, one row at a time. Do not use fetchall. You can use fetchmany, but it is just the equivalent of calling fetchone that many times.
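
To illustrate both points, here is a minimal sketch of the alternative style: the cursor class is passed to cursor(), and rows are pulled in batches with fetchmany. The connection parameters and query are placeholders.

import MySQLdb
import MySQLdb.cursors

# Connection parameters and query are placeholders.
conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="mydb")

# Equivalent to MySQLdb.cursors.SSCursor(conn): pass the cursor class to cursor().
cursor = conn.cursor(MySQLdb.cursors.SSCursor)
cursor.execute("SELECT * FROM big_table")

# fetchmany(n) on an SSCursor is just n fetchone() calls under the hood,
# but it gives the loop a convenient batch shape.
while True:
    batch = cursor.fetchmany(1000)
    if not batch:
        break
    for row in batch:
        print(row)

cursor.close()
conn.close()
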
One common misconception is to treat SSCursor as a server-side cursor. It is not! This class is in fact only an unbuffered cursor. It does not read the whole result set into memory the way the default cursor does (hence a buffered cursor). What it does is read the response stream in chunks and return records to you one by one. A more appropriate name for this is a streaming result set.

Because SSCursor is only an unbuffered cursor (I repeat, not a real server-side cursor), several restrictions apply to it:
  1. You must read ALL records. The rationale is that you send one query, and the server replies with one answer, albeit a really long one. Therefore, before you can do anything else on this connection, even a simple ping, you must completely finish reading this response.
  2. This brings another restriction: you must process each row quickly. If your processing takes even half a second per row, you will find your connection dropped unexpectedly with error 2013, "Lost connection to MySQL server during query." The reason is that by default MySQL waits 60 seconds for a socket write to finish. The server is trying to push a large amount of data down the wire, yet the client is taking its time processing chunk by chunk, so the server is likely to just give up. You can increase this timeout by issuing the query SET NET_WRITE_TIMEOUT = xx, where xx is the number of seconds MySQL will wait for a socket write to complete. But please do not rely on that as a workable remedy. Fix your processing instead. Or, if you cannot reduce processing time any further, quickly dump the rows somewhere local to finish the query first, and read them back later at a more leisurely pace; see the sketch after this list.
  3. The first restriction also means that your connection is completely tied up while you are retrieving the rows. There is no way around it. If you need to run another query in parallel, do it on another connection. Otherwise, you will get error 2014, "Commands out of sync; you can't run this command now."
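
For restriction 2, here is a minimal sketch of the dump-locally-first workaround. The connection parameters, query, timeout value, and file name are placeholders, not anything mandated by MySQLdb.

import MySQLdb
import MySQLdb.cursors

# Connection parameters, query, timeout, and file name are placeholders.
conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="mydb")

# Give the server more time to wait on slow socket writes (value in seconds).
setup = conn.cursor()
setup.execute("SET NET_WRITE_TIMEOUT = 600")
setup.close()

# Drain the whole result set to a local file as fast as possible...
cursor = conn.cursor(MySQLdb.cursors.SSCursor)
cursor.execute("SELECT id, payload FROM big_table")
with open("rows.tsv", "w") as out:
    while True:
        row = cursor.fetchone()
        if not row:
            break
        out.write("\t".join(str(col) for col in row) + "\n")
cursor.close()
conn.close()

# ...then do the slow per-row work offline, with no MySQL connection held open.
with open("rows.tsv") as f:
    for line in f:
        pass  # expensive processing goes here

Restriction 3 needs no sketch: simply open a second connection with another MySQLdb.connect(...) call and run the parallel query there.
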
I hope this post will help some of you.

5 comments:

  1. Your restriction number 2 for SSCursor does not hold true. I added a lag of one second in processing each row and the query has been going for minutes, and I haven't gotten the error.

    I tried fetching rows one by one, so I created an SSDictCursor. The variable rows is an object of my iterator class (it iterates over the MySQL cursor).

    rs = []
    for r in rows:
        rs.append(r)
        sleep(1)  # sleep for 1 second

  2. I agree. The second restriction is not a hard rule; it is more of a good practice. It's great that it works for you all the time. But when it doesn't, remember you've been informed.
