Friday, December 23, 2011

Pure message passing and fault tolerance

I just finished watching Joe Armstrong's talk on Systems That Never Stop. Around 38:00, he mentioned this general consensus:

198x: pure message passing (where all parameters are passed by value) was considered inefficient because one could pass a pointer instead of copying data.
200x: pure message passing is considered efficient because it allows massive parallelization.

That's a 180-degree flip! I think it makes sense. And it is very much in alignment with The Network Is The Computer.

His talk is highly recommended. Here are the six laws for keeping a system from failing, with the most important one bolded:


  1. Isolation: Processes must be totally separated from each other. A failure in one must not affect the others. (This is why pure message passing is required. You definitely don't want to pass a pointer from one process to another.)
  2. Concurrency: Spawn millions of processes!
  3. Failure detection: Failures must be remotely detectable.
  4. Failure identification: Failures must also be analyzable, often after the fact, to find root causes.
  5. Live code upgrade: The system must be able to roll forward or backward without shutting down.
  6. Stable storage: If you store something, it should be there forever. This implies multiple, distributed copies.
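
Here is a minimal sketch of laws 1 and 3 in Python rather than Erlang. It is not from the talk: the worker script and the call_worker helper are hypothetical names made up for illustration. Each message is copied by value (as JSON) into a separate OS process, so a crash in the worker cannot corrupt the caller, and the caller detects the failure from the outside through the exit code.

```python
import json
import subprocess
import sys

# Source of the isolated worker: read one JSON message, reply with one.
# (Hypothetical worker, used only to illustrate isolation.)
WORKER_SOURCE = '''
import json, sys
request = json.loads(sys.stdin.read())
if request["op"] == "crash":
    raise RuntimeError("worker failed")
print(json.dumps({"result": sum(request["values"])}))
'''


def call_worker(message):
    """Send a message, copied by value as JSON, to a fresh worker process."""
    completed = subprocess.run(
        [sys.executable, "-c", WORKER_SOURCE],
        input=json.dumps(message),
        capture_output=True,
        text=True,
    )
    if completed.returncode != 0:
        # Failure detection: the caller survives and sees the exit code.
        # The last stderr line is kept so the failure can be analyzed later.
        lines = completed.stderr.strip().splitlines()
        return {"error": lines[-1] if lines else "unknown failure"}
    return json.loads(completed.stdout)


if __name__ == "__main__":
    print(call_worker({"op": "sum", "values": [1, 2, 3]}))
    print(call_worker({"op": "crash"}))
```

Starting an OS process per message is far heavier than an Erlang process, so this only illustrates the isolation idea, not the "spawn millions" part.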

Wednesday, December 21, 2011

Find max and min with abs

A colleague of mine showed me two neat tricks he found on Stack Overflow.

To find the max and min of a and b, you can do:

max = ((a + b) + abs(a - b)) / 2
min = ((a + b) - abs(a - b)) / 2

However, I wouldn't recommend any code like this, unless you have no other choice. To quote Abelson and Sussman:
Programs must be written for people to read.
Those two lines aren't really meant for people to read. A little bit cryptic, don't you think? Just like a swap with exclusive ors.
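
For the curious, here is a small Python check of the two formulas. The function names max_abs and min_abs are mine, and I am assuming integer inputs: (a + b) + abs(a - b) is always twice the larger value, so the integer division by 2 is exact.

```python
def max_abs(a, b):
    """Larger of two integers, using only +, -, abs and division by 2."""
    # (a + b) + |a - b| equals 2 * max(a, b), so // 2 loses nothing.
    return ((a + b) + abs(a - b)) // 2


def min_abs(a, b):
    """Smaller of two integers, using only +, -, abs and division by 2."""
    # (a + b) - |a - b| equals 2 * min(a, b).
    return ((a + b) - abs(a - b)) // 2


if __name__ == "__main__":
    # Compare against the built-ins over a small range, negatives included.
    for a in range(-5, 6):
        for b in range(-5, 6):
            assert max_abs(a, b) == max(a, b)
            assert min_abs(a, b) == min(a, b)
    print("formulas agree with max() and min()")
```

Python integers never overflow, so this is safe here. In a language with fixed-width integers, a + b or a - b can overflow, which is one more reason to prefer the plain, readable version.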

Tuesday, December 20, 2011

Top ten reasons why talents leave their companies

Forbes published an article on Top Ten Reasons Why Large Companies Fail to Keep Their Best Talent last week. And I think it is a good one.

Not that I claim to be a talent, but some of the reasons hit close to home for me, and probably for everyone else, regardless of position. They are generic and commonsense, yet so hard to notice and fix.

Like bureaucracy. Darn. Everyone hates paperwork and dumb rules, yet they keep creeping into the organization, and they affect everyone. Except maybe the bosses: if they had to do this senseless stuff themselves, wouldn't they have noticed the problem? Oh, wait, bosses have personal assistants.

The second point, that great, high-impact projects don't go to the best people, is another good one, even though I don't quite agree with it. For me, a project need not be important. It should, however, be interesting enough to spark my passion to work on it. For example, I worked on a fully relocatable x86 disassembly engine and found it very interesting, while no one thought it would be useful. High-impact? Not at all. Interesting? Very much. I think people need a shot of dopamine (an interesting project) every now and then more than they need bragging rights (a high-impact project). Of course, it is best if the project is both interesting and impactful.

The seventh point, that great talent likes to be surrounded by other talent, is more of a nice-to-have. When in Rome, do as the Romans do, right? You definitely don't want to put a productive person on a team full of procrastinators. As in a three-legged race, the best team is one that can move or stop in sync. But since talent is hard to come by, it is really hard to build an all-star team. Then again, an all-star team may not function as well as one wishes (look at football, people!). And so, I don't think this point convincingly explains why organizations fail to keep their best employees.

Though I don't wholly agree with the article, I have felt and experienced most of the issues it summarizes. I urge anyone in management to read this piece right away so that you can make your company a little more talent-friendly.

Monday, December 19, 2011

Three online classes ended

Last week was the final week of the remaining two classes, Introduction to Artificial Intelligence (http://www.ai-class.com) and Machine Learning (http://www.ml-class.org).

So, all three pioneering classes have ended. And sixteen more just opened up for registration! I sure will take on some of them. Anatomy is definitely one. But this time, I'm gonna do a basic enrollment only, just for the knowledge, not the challenge.

You should enroll in the courses that interest you most, too! Please do. These courses are great! Awesome stuff.

Saturday, December 17, 2011

Retrieving millions of rows from MySQL

There are times when your query returns a very large number of rows. If you use the default cursor, chances are your process will be killed while retrieving the rows. The reason is that, by default, MySQL clients (e.g. the Java connector, the Python driver) retrieve all rows and buffer them in memory before passing the result set to your code. If you run out of memory while doing that, your process will most likely be killed.

The fix is to use a streaming result set. In Python, you can use MySQLdb.cursors.SSCursor for this purpose.

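import MySQLdb
import MySQLdb.cursors  # SSCursor lives in the cursors submodule

conn = MySQLdb.connect(...)
# Unbuffered cursor: rows are streamed instead of loaded into memory at once.
cursor = MySQLdb.cursors.SSCursor(conn)
cursor.execute(...)
while True:
    row = cursor.fetchone()
    if row is None:  # no more rows
        break
    ...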

There are two important things to remember here:
  1. Use an SSCursor instead of the default cursor. This can be done as shown above, or by passing the class to the cursor() call, as in conn.cursor(MySQLdb.cursors.SSCursor).
  2. Use fetchone to fetch rows from the result set, one row at a time. Do not use fetchall. You can use fetchmany, but it is the same as calling fetchone that many times.
One common misconception is to treat SSCursor as a server-side cursor. It is not! This class is in fact only an unbuffered cursor. It does not read the whole result set into memory like the default cursor does (hence a buffered cursor). What it does is read from the response stream in chunks and return the records to you one by one. There is a more appropriate name for this: a streaming result set.

Because SSCursor is only an unbuffered cursor (I repeat, not a real server-side cursor), several restrictions apply to it:
  1. You must read ALL records. The rationale is that you send one query, and the server replies with one answer, albeit a really long one. Therefore, before you can do anything else, even a simple ping, you must completely finish reading this response.
  2. This brings another restriction: you must process each row quickly. If your processing takes even half a second per row, you will find your connection dropped unexpectedly with error 2013, "Lost connection to MySQL server during query." The reason is that, by default, MySQL waits 60 seconds for a socket write to finish. The server is trying to dump a large amount of data down the wire, yet the client is taking its time to process it chunk by chunk. So, the server is likely to just give up. You can increase this timeout by issuing the query SET NET_WRITE_TIMEOUT = xx, where xx is the number of seconds that MySQL will wait for a socket write to complete. But please do not rely on that as a workable remedy. Fix your processing instead. Or, if you cannot reduce the processing time any further, you can quickly chuck the rows somewhere local to complete the query first, and then read them back later at a more leisurely pace.
  3. The first restriction also means that your connection is totally held up while you are retrieving the rows. There is no way around it. If you need to run another query in parallel, do it on another connection. Otherwise, you will get error 2014, "Commands out of sync; you can't run this command now."
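
The "chuck the rows somewhere local" workaround from the second restriction could look roughly like the sketch below. The helper names (spool_rows, read_spooled, drain_then_process) are hypothetical, and I am assuming the rows are plain tuples that pickle can handle. It only relies on fetchone returning None at the end of the result set, so the cursor can be an SSCursor or any other DB-API cursor.

```python
import pickle
import tempfile


def spool_rows(cursor, spool_file):
    """Drain every remaining row from the cursor into a local file."""
    count = 0
    while True:
        row = cursor.fetchone()
        if row is None:  # the result set is finished
            break
        pickle.dump(row, spool_file)  # cheap local write, no slow processing
        count += 1
    return count


def read_spooled(spool_file, count):
    """Yield the spooled rows back, one at a time, in their original order."""
    spool_file.seek(0)
    for _ in range(count):
        yield pickle.load(spool_file)


def drain_then_process(cursor, handle_row):
    """Finish the query first, then run the slow per-row work."""
    with tempfile.TemporaryFile() as spool_file:
        count = spool_rows(cursor, spool_file)  # the server is done after this
        for row in read_spooled(spool_file, count):
            handle_row(row)  # may take as long as it needs
    return count
```

With the streaming cursor from earlier, usage would be something like drain_then_process(cursor, my_slow_function) right after cursor.execute(...). The trade-off is local disk space in exchange for not keeping the server waiting.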
I hope this post will help some of you.

Friday, December 16, 2011

Links to various online classes

After three successful online classes, Stanford is opening up several more! Yeah!

Here are the links to these new classes.
  1. Lean Launchpad at http://www.launchpad-class.org/ starts in February
  2. Technology Entrepreneurship at http://www.venture-class.org/ starts in January
  3. Anatomy at http://www.anatomy-class.org/ starts in January
  4. Making Green Buildings at http://www.greenbuilding-class.org/ starts in January
  5. Information Theory at http://www.infotheory-class.org/ starts in March
  6. Model Thinking at http://www.modelthinker-class.org/ starts in January
  7. CS 101 at http://www.cs101-class.org/ starts in February
  8. Machine Learning at http://jan2012.ml-class.org/ starts in January
  9. Software Engineering for Software as a Service at http://www.saas-class.org/ starts in February
  10. Human Computer Interaction at http://www.hci-class.org/ starts in January
  11. Natural Language Processing at http://www.nlp-class.org/ starts in January
  12. Game Theory at http://www.game-theory-class.org/ starts in February
  13. Probabilistic Graphical Models at http://www.pgm-class.org/ starts in January
  14. Cryptography at http://www.crypto-class.org/ starts in January
  15. Design and Analysis of Algorithms I at http://www.algo-class.org/ starts in January
  16. Computer Security at http://www.security-class.org/ starts in February
Thank you very much, Professors! You're making the world a better place, one class at a time.

Thursday, December 15, 2011

Small role model open source Python projects

As recommended by a well-known Python developer, here they are:
  1. Itty at https://github.com/toastdriven/itty: A tiny web server and REST publisher (framework?). It makes good use of decorators and properties.
  2. Tornado at http://www.tornadoweb.org/ and a blog post on its core IO loop at http://golubenco.org/2009/09/19/understanding-the-code-inside-tornado-the-asynchronous-web-server-powering-friendfeed/: A stripped-down, essentials-only counterpart to Twisted. It is good material for getting into asynchronous/event-based programming.
  3. DEXML at https://github.com/rfk/dexml: A dead simple Object-XML mapper. Minimal model of ORM concepts and beautiful use of metaclasses.

Wednesday, December 14, 2011

Google search (redirect) breaks back button

Update (Jan 10, 2012): Apparently, this only happens when you have logged into your Google Account (such as Gmail, Google+).

What the F*CK is Google doing with their search result?

Apparently, Google muddles the original URL with some sort of a redirection. Instead of giving you a direct link to what you're searching for, Google hijacks your click so that you will visit a Google page. This page then redirects you to the original site.

The problem with this is when you click Back, you are back to that redirection page which then again shoves you to the site. Then you click Back again, and you're redirected to the site again.

This is total crap! Why would you do that, Google? Can't JavaScript work for you?

Crap, total crap.

Monday, December 12, 2011

DB class ended

Update: Sixteen more classes just opened!

Today marks the end of Prof. Widom's Database class. Woot! One down and two more to go.

Overall, to me, the class was extremely useful. There were concepts I had not known of or used before. I was glad that NoSQL was also covered, if only cursorily.

There were also sections that I wish Prof. Widom had explained more. Transactions, especially, were really difficult to visualize. The interaction between multiple transactions is the crux of the problem, but it was talked about very little. This is not anyone's fault because concurrency is indeed a hard concept to understand.

I would highly recommend that others take this class.

Saturday, December 10, 2011

Pro Git -- Git's companion

I just found out about the book Pro Git today. This post is a short note to myself that Pro Git is THE ONE GIT BOOK. Its Chapter 9 and Chapter 3 are enlightening and such a delight to read. Pro Git succeeded in making me use Git for the first time.

Thursday, December 8, 2011

Textual IRC client violates the GPL

Update (Feb 02, 2012): Dougal (see comment below) pointed out that Textual IRC actually was forked from a BSD-licensed LimeChat. So this post is invalid.

Textual IRC client is commercial software sold in the Apple App Store. It is priced at $4.99. Not too steep for a nice IRC client, and not too cheap either. The problem is that Codeux (the maker) is sort of shady about its license.

Textual IRC client derives from LimeChat, which is licensed under the GPL v2. Textual itself is licensed under the BSD license. This alone violates the GPL: derivatives of GPL code must stay under the GPL, so Codeux is not allowed to re-license GPL v2 code under the BSD license.

Without knowing about the incompatibility, I assume, Codeux further makes a "friendly request" that no binary distribution be made so that their app in the Apple App Store sells well. It's a good tactic to earn some bucks from the app, and it's perfectly fine to do so. However, threatening to close the source if someone goes ahead and releases a binary version is moronic and a violation of the GPL, as well as of the BSD license. It violates the GPL because as long as you derive from GPL code, your code must be licensed under a compatible license, which gives users permission to redistribute both source and binary at will. And even assuming that you could relicense LimeChat under the BSD license, the BSD license also gives users permission to distribute both source and binary versions. That is a granted permission, not an "abuse" as Codeux put it.

So, then, if you don't want to play fair, feel free to bring your toys home. Otherwise, look at X-Chat, and X-Chat WDK and learn a thing or two from them.

HipChat vs IRC

I've been using HipChat for a week now. Here are some short comparisons:

HipChat has a decent web interface. In fact, I think its interface is just sweet and just right. Not too cluttered, not too simple. It supports video and image embedding. It supports the hip memes such as the Rage FFFUUUUU face. These aren't supported in the IRC protocol per se but are supported by clients such as Colloquy. Besides the web interface, HipChat also provides desktop and mobile clients. If those are not enough, people can join HipChat with any Jabber/XMPP client such as Pidgin or Adium.

Functionality-wise, HipChat is also pretty full-featured. You can create group chats, and you can send private messages just as you could with IRC. On top of that, you will be notified via email if someone mentions you while you are away. This is something the IRC protocol doesn't support, and I don't know of any IRC server that supports such a feature either. At the same time, HipChat does not support some things that IRC does. One of them is moderated chat rooms.

So, my impression is that HipChat is a nice offering for a company's chat. I would love to have it integrated with a pastebin solution so that when you press Ctrl-C, if the buffer is long enough, it is automatically redirected to a pastebin. That'll help so much with technical discussions. Furthermore, integration with bug trackers would also be a nice feature to have.

Wednesday, December 7, 2011

F.lux, a nice utility to reduce eyestrain

I was introduced to F.lux today. So far, I've found it quite pleasant. In the morning, it brightens up the screen and in the evening, it dims the screen down. Please check it out! It runs on Mac OS, Windows, and Linux. Neat stuff.

Tuesday, December 6, 2011

Rethinking security

Okay, the last few weeks were too hectic for me. I changed jobs, moved to a new place, and came down with a terrible cold. On top of that, some random dude was trying to scam me on Craigslist ;-).

I am hopefully back on my feet now, or at least halfway there. And one thing occurred to me: the state of the security industry is broken. This may be big news to you, but breaking stuff isn't creating value.

Before the advent of information technology, people built stuff, actual stuff such as a house or a block of steel. That creates value. That is something that people in general can exchange for something else. You bring a goat to the market to exchange it for a chicken.

Security, however, does not build stuff. You don't build security by itself. Security must go hand-in-hand with a more concrete product. Therefore, the value that security creates is absorbed into the value of that actual product.

That brings me to the realization that the security industry is probably not functioning well at the moment. The economic model just does not support it. What is it that security sells? Information. You could probably sell it to one, two, or maybe ten people, but that's probably about it. The scarcity power diminishes drastically the more people you sell your secrets (exploits, bug details, etc.) to. And that's sad news. We have not yet found any alternative to secrecy in security.

Besides, the incentives to work in security are highly asymmetrical. Attackers are rewarded much more than defenders. I suppose this conflicts with general human nature: we are peaceful at heart, which would mean there are fewer attackers than defenders. Yet there is more work to be done in defending, more companies and more applications to be protected, than there are attackers.

My mind is not in a coherent state right now, so I'll let this thought simmer for a while.

But what do you think about the state of the security industry?