Friday, December 23, 2011

Pure message passing and fault tolerance

I just finished watching Joe Armstrong's talk on Systems That Never Stop. Around 38:00, he mentioned this general consensus:

198x: pure message passing (where all parameters are passed by values) was considered inefficient because one could pass a pointer instead of copying data.
200x: pure message passing is considered efficient because it allows massive parallelization.

That's a 180-degree flip! I think it makes sense. And it is very much in alignment with The Network Is The Computer.

His talk is highly recommended. Here are the six laws with the most important one bolded to keep a system from failing:


  1. Isolation: Processes must be totally separated from each other. Failure of one must not affect the others. (This is why pure message passing is required. You definitely don't want to pass a pointer from one process to another. A small sketch follows this list.)
  2. Concurrency: Spawn millions of processes!
  3. Failure detection: Failures must be remotely detectable.
  4. Failure identification: And failures must be analyzable, often after the fact, to find root causes.
  5. Live code upgrade: The system must be able to roll-forward, or backward, without shutting down.
  6. Stable storage: If you store something, it should be there forever. This implies multiple copies and distribution.
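As a small illustration of pass-by-value message passing (my own Python sketch, not from the talk), the two OS processes below share no memory and exchange data only by copying it through queues:

import multiprocessing

def worker(inbox, outbox):
    # The worker only ever sees copies of the data; it cannot touch the
    # parent's memory, so a crash here does not corrupt the parent.
    for item in iter(inbox.get, None):
        outbox.put(item * 2)

if __name__ == '__main__':
    inbox, outbox = multiprocessing.Queue(), multiprocessing.Queue()
    p = multiprocessing.Process(target=worker, args=(inbox, outbox))
    p.start()
    inbox.put([1, 2, 3])   # the list is pickled (copied), not shared
    print(outbox.get())    # [1, 2, 3, 1, 2, 3]
    inbox.put(None)        # tell the worker to stop
    p.join()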

Wednesday, December 21, 2011

Find max and min with abs

A colleague of mine showed me two neat tricks he found on Stack Overflow.

To find the max and min of a and b, you can do:

max = ((a + b) + abs(a - b)) / 2
min = ((a + b) - abs(a - b)) / 2
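A quick demonstration (my own sketch): in Python this works for any integers because ints have arbitrary precision, but note that in a fixed-width language like C the intermediate a + b can overflow.

def max_abs(a, b):
    # (a + b) + |a - b| is always exactly 2 * max, so floor division is exact
    return ((a + b) + abs(a - b)) // 2

def min_abs(a, b):
    return ((a + b) - abs(a - b)) // 2

print(max_abs(3, 7), min_abs(3, 7))    # 7 3
print(max_abs(-5, 2), min_abs(-5, 2))  # 2 -5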

However, I wouldn't recommend any code like this, unless you have no other choice. To quote Abelson and Sussman:
Programs must be written for people to read.
Those two lines aren't really meant for people to read. A little bit cryptic, don't you think? Just like swapping two variables with exclusive ORs.

Tuesday, December 20, 2011

Top ten reasons why talented people leave their companies

Forbes published an article on Top Ten Reasons Why Large Companies Fail to Keep Their Best Talent last week. And I think it is a good one.

Not that I claim to be a talent, but I do feel some of the reasons relate closely to me, and probably to everyone else, regardless of the positions they hold. They are so generic and commonsense, yet so hard to notice and fix.

Like bureaucracy. Darn. Everyone hates paperwork and dumb rules, yet they keep creeping into the organization. And they affect everyone. Maybe not the bosses; I wonder, if they had to do this senseless stuff themselves, would they notice the problem? Oh, wait, bosses have personal assistants.

The second point, that great, high-impact projects don't go to the best people, is another good one, even though I don't quite agree with it. For me, a project need not be an important one. It should, however, be interesting enough to spark the passion in me to work on it. For example, I worked on a fully relocatable x86 disassembly engine and found it very interesting, while no one thought it would be useful. High-impact? Not at all. Interesting? Very much. I think people just need a hit of dopamine (an interesting project) every now and then rather than bragging rights (a high-impact project). Of course, it is best if the project is both interesting and impactful.

The seventh point, that great talent likes to be surrounded by other talent, is somewhat of a nice-to-have. When in Rome, do as the Romans do, right? You definitely don't want to put a productive person in a team full of procrastinators. Like in a three-legged race, the best team is one that can move or stop in sync. But since talent is hard to come by, it is really hard to build an all-star team. Then again, an all-star team may not function as well as one wishes (look at football, people!). And so, I don't think this point is convincing enough to explain why organizations fail to keep their best employees.

Though I don't wholly agree with the article, I have felt and experienced most of the issues it summarizes. I urge anyone in management to read this piece right away so that you can make your company a little more talent-friendly.

Monday, December 19, 2011

Three online classes ended

Last week was the final week of the remaining two classes, Introduction to Artificial Intelligence (http://www.ai-class.com) and Machine Learning (http://www.ml-class.org).

So, all three pioneering classes have ended. And sixteen more just opened up for registration! I sure will take some of them. Anatomy is definitely one. But this time, I'm gonna do a basic enrollment only, just for the knowledge, not the challenge.

You should enroll in the courses that interest you most, too! Please do. These courses are great! Awesome stuff.

Saturday, December 17, 2011

Retrieving millions of rows from MySQL

There are times when your query returns a very large number of rows. If you use the default cursor, chances are your process will be killed while retrieving the rows. The reason is that, by default, MySQL clients (e.g. the Java connector, the Python driver) retrieve all rows and buffer them in memory before passing the result set to your code. If you run out of memory while doing that, your process is almost certainly killed.

The fix is to use a streaming result set. In Python, you can use MySQLdb.cursors.SSCursor for this purpose.

import MySQLdb

conn = MySQLdb.connect(...)  # connection parameters elided
# use the unbuffered (streaming) cursor class instead of the default one
cursor = MySQLdb.cursors.SSCursor(conn)
cursor.execute(...)          # your query here
while True:
    row = cursor.fetchone()  # fetch one row at a time
    if not row:
        break
    ...                      # process the row

There are two important things to remember here:
  1. You use an SSCursor instead of the default cursor. This can be done as shown above, or by passing the cursor class to the cursor() call, such as conn.cursor(MySQLdb.cursors.SSCursor).
  2. Use fetchone to fetch rows from the result set, one row at a time. Do not use fetchall. You can use fetchmany, but it is the same as calling fetchone that many times.
One common misconception is to treat SSCursor as a server-side cursor. It is not! This class is in fact only an unbuffered cursor. It does not read the whole result set into memory like the default (buffered) cursor does. What it does is read from the response stream in chunks and return records to you one by one. A more appropriate name for this is a streaming result set.

Because SSCursor is only an unbuffered cursor (I repeat, not a real server-side cursor), several restrictions apply to it:
  1. You must read ALL records. The rationale is that you send one query, and the server replies with one answer, albeit a really long one. Therefore, before you can do anything else, even a simple ping, you must completely finish reading this response.
  2. This brings another restriction: you must process each row quickly. If your processing takes even half a second per row, you will find your connection dropped unexpectedly with error 2013, "Lost connection to MySQL server during query." The reason is that by default MySQL will wait 60 seconds for a socket write to finish. The server is trying to dump a large amount of data down the wire, yet the client is taking its time to process it chunk by chunk, so the server is likely to just give up. You can increase this timeout by issuing the query SET NET_WRITE_TIMEOUT = xx, where xx is the number of seconds that MySQL will wait for a socket write to complete. But please do not rely on that as a workable remedy. Fix your processing instead. Or, if you cannot reduce processing time any further, quickly dump the rows somewhere local to finish the query first, and then read them back later at a more leisurely pace (a minimal sketch follows this list).
  3. The first restriction also means that your connection is totally held up while you are retrieving the rows. There is no way around it. If you need to run another query in parallel, do it in another connection. Otherwise, you will get error 2014, "Commands out of sync; you can't run this command now."
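Here is a minimal sketch of that last remedy (my own illustration; process_row is a hypothetical function and cursor is the SSCursor from the code above): spool the rows to a local temporary file as fast as possible, finish the query, then process at leisure.

import csv
import tempfile

spool = tempfile.TemporaryFile()     # Python 2: opened in 'w+b' mode
writer = csv.writer(spool)
while True:
    row = cursor.fetchone()
    if not row:
        break
    writer.writerow(row)             # cheap: just dump the row locally

spool.seek(0)
for row in csv.reader(spool):
    # note: values read back from the CSV come back as strings
    process_row(row)                 # the slow part runs after the query is done
spool.close()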
I hope this post will help some of you.

Friday, December 16, 2011

Links to various online classes

After three successful online classes, Stanford is opening up several more! Yeah!

Here are the links to these new classes.
  1. Lean Launchpad at http://www.launchpad-class.org/ starts in February
  2. Technology Entrepreneurship at http://www.venture-class.org/ starts in January
  3. Anatomy at http://www.anatomy-class.org/ starts in January
  4. Making Green Buildings at http://www.greenbuilding-class.org/ starts in January
  5. Information Theory at http://www.infotheory-class.org/ starts in March
  6. Model Thinking at http://www.modelthinker-class.org/ starts in January
  7. CS 101 at http://www.cs101-class.org/ starts in February
  8. Machine Learning at http://jan2012.ml-class.org/ starts in January
  9. Software Engineering for Software as a Service at http://www.saas-class.org/ starts in February
  10. Human Computer Interaction at http://www.hci-class.org/ starts in January
  11. Natural Language Processing at http://www.nlp-class.org/ starts in January
  12. Game Theory at http://www.game-theory-class.org/ starts in February
  13. Probabilistic Graphical Models at http://www.pgm-class.org/ starts in January
  14. Cryptography at http://www.crypto-class.org/ starts in January
  15. Design and Analysis of Algorithms I at http://www.algo-class.org/ starts in January
  16. Computer Security at http://www.security-class.org/ starts in February
Thank you very much, Professors! You're making the world a better place, one class at a time.

Thursday, December 15, 2011

Small role model open source Python projects

As recommended by a well-known Python developer, here they are:
  1. Itty at https://github.com/toastdriven/itty: A tiny web server and REST publisher (framework?). It has some good uses of decorators and properties.
  2. Tornado at http://www.tornadoweb.org/ and a blog on its core IO loop http://golubenco.org/2009/09/19/understanding-the-code-inside-tornado-the-asynchronous-web-server-powering-friendfeed/: A bare-essentials version of Twisted. It is good material for getting into asynchronous/event-based programming.
  3. DEXML at https://github.com/rfk/dexml: A dead simple Object-XML mapper. A minimal model of ORM concepts and a beautiful use of metaclasses.

Wednesday, December 14, 2011

Google search (redirect) breaks back button

Update (Jan 10, 2012): Apparently, this only happens when you have logged into your Google Account (such as Gmail, Google+).

What the F*CK is Google doing with their search result?

Apparently, Google muddles the original URL with some sort of a redirection. Instead of giving you a direct link to what you're searching for, Google hijacks your click so that you will visit a Google page. This page then redirects you to the original site.

The problem is that when you click Back, you land on that redirection page, which shoves you to the site again. You click Back once more, and you're redirected to the site yet again.

This is total crap! Why would you do that, Google? Can't JavaScript work for you?

Crap, total crap.

Monday, December 12, 2011

DB class ended

Update: Sixteen more classes just opened!

Today marks the end of Prof Widom's Database class. Woot! One down, two more to go.

Overall, the class was extremely useful to me. There were concepts I did not know of or had not used before. I was glad that NoSQL was also covered, if cursorily.

There were also sections that I wish Prof Widom had explained more. Transactions, especially, were really difficult to visualize. The interaction between multiple transactions is the crux of the problem, but it was covered very little. This is not anyone's fault; concurrency is indeed a hard concept to understand.

I would highly recommend this class to others.

Saturday, December 10, 2011

Pro Git -- Git's companion

I just found out about the book Pro Git today. This post is a short note to myself that Pro Git is THE ONE GIT BOOK. Its Chapter 9 and Chapter 3 are eye-openers and such a delight to read. Pro Git succeeded in getting me to use Git for the first time.

Thursday, December 8, 2011

Textual IRC client violates the GPL

Update (Feb 02, 2012): Dougal (see comment below) pointed out that Textual IRC was actually forked from the BSD-licensed LimeChat. So this post is invalid.

The Textual IRC client is commercial software sold in the Apple App Store. It is priced at $4.99: not too steep for a nice IRC client, and not too cheap either. The problem is that Codeux (the maker) is somewhat shady about its licensing.

The Textual IRC client derives from LimeChat, which is licensed under the GPL v2. Textual itself is licensed under the BSD license. This alone violates the GPL: Codeux is not allowed to re-license GPL v2 code under the more permissive BSD license.

Without knowing about the incompatibility, I assume, Codeux further "friendly requests" that no binary distribution be made so that their app in the Apple App Store will sell well. It's a good tactic to earn some bucks from the app, and it's perfectly fine to do so. However, threatening to close the source if someone goes ahead and releases a binary version is moronic and a violation of the GPL, as well as of the BSD license. It is a violation of the GPL because as long as you derive from GPL code, your code must be licensed under a compatible license that gives users permission to redistribute both source and binaries at will. And even assuming that you could relicense LimeChat under the BSD license, the BSD license also gives users permission to distribute both source and binary versions. That is a granted permission, not an "abuse" as Codeux put it.

So then, if you don't want to play fair, feel free to take your toys home. Otherwise, look at X-Chat and X-Chat WDK and learn a thing or two from them.

HipChat vs IRC

I've been using HipChat for a week now. Here are some short comparisons:

HipChat has a decent web interface. In fact, I think its interface is sweet and just right: not too cluttered, not too simple. It supports video and image embedding, and it supports hip memes such as the Rage FFFUUUUU face. These aren't supported in the IRC protocol per se but are supported by clients such as Colloquy. Besides the web interface, HipChat also provides desktop and mobile clients. If those are not enough, people can join HipChat with any Jabber/XMPP client such as Pidgin or Adium.

Functionality-wise, HipChat is also pretty full-featured. You can create group chats and send private messages just as you could with IRC. Additionally, you will be notified via email if someone mentions you while you are away; this is something IRC doesn't support, and I don't know of any IRC server that supports such a feature either. At the same time, HipChat does not support some things that IRC does, one of them being moderated chat rooms.

So, my impression is that HipChat is a nice offering for a company's chat. I would love to have it integrated with some pastebin solutions such that when you press Ctrl-C, if the buffer is long enough, it will be automatically redirected to a pastebin. That'll help so much with technical discussions. Furthermore, maybe integration with bug trackers would also be a nice feature to have.

Wednesday, December 7, 2011

F.lux, a nice utility to reduce eyestrain

I was introduced to F.lux today. So far, I've found it quite pleasant. In the morning, it brightens up the screen and in the evening, it dims the screen down. Please check it out! It runs on Mac OS, Windows, and Linux. Neat stuff.

Tuesday, December 6, 2011

Rethinking security

Okay, the last few weeks were too hectic for me. I changed jobs, moved to a new place, and came down with a terrible cold. On top of that, some random dude was trying to scam me on Craigslist ;-).

I am hopefully back on my feet now, or at least halfway there. And one thing occurred to me: the state of the security industry is broken. This may be big news to you, but breaking stuff isn't creating value.

Before the advent of information technology, people built stuff, actual stuff such as a house or a block of steel. That creates value. That is something people can exchange for something else. You bring a goat to the market to exchange it for a chicken.

Security, however, does not build stuff. You don't build security by itself. Security must go hand in hand with a more concrete product. Therefore, the value that security creates is absorbed into the value of that actual product.

That brings me to the realization that the security industry is probably not functioning well at the moment. The economic model just does not support it. What is it that security sells? Information. You could probably sell it to one, two, or maybe ten people, but that's about it. The scarcity value diminishes drastically the more people you sell your secrets (exploits, bug details, etc.) to. And that's sad news. We have not yet found any alternative to secrecy in security.

Besides, the incentives to work in security are highly asymmetrical. Attackers are rewarded much more than defenders. I suppose this conflicts with general human nature: we are peaceful at heart, which should mean there are fewer attackers than defenders. Yet there is more work to be done in defending, more companies and more applications to be protected, than there are attackers.

My mind is not in a coherent state right now, so I'll let this thought simmer for a while.

But what do you think about the state of the security industry?

Friday, November 4, 2011

Good news, Microsoft contributes code to Samba!

Original mail is archived at https://lists.samba.org/archive/samba-technical/2011-October/079780.html.

Samba, the open source implementation of the SMB/CIFS protocol, is the de facto infrastructure piece in any mixed Linux-Windows environment. Samba is a clean-room implementation in which most (if not all) technical details were obtained by observing the communication between Windows machines, not by having the protocol specification at hand or any proprietary code. In fact, the specification wasn't available until 2007, as a result of the Microsoft vs European Commission antitrust case.

That is to say, a patch submission from coders at Microsoft would have been amazing to the point of being unthinkable (Chris Hertel of the Samba team).

However, that changed on October 10th, 2011, when the Microsoft Open Source Technology Center contributed a proof of concept for extended protection (channel and service binding) for Firefox and Samba (Stephen Zarkos of MS OSTC). That is certainly a big deal (Jeremy Allison of Samba). Hooray!

This news was reported on ZDNet on November 2nd.

Thursday, November 3, 2011

Some tips to win an OpenTTD game

Okay, this is the fruit of four days of playing OpenTTD. I'm going to document a few important tricks that will hopefully help you win your game against AI players.

Firstly, be clear about the goals! You usually win this game by satisfying about ten conditions at once. See Company Rating in Game Mechanics. Even if you are much, much richer than your competitors, you may still lose the game because your company focuses on only a few cargo types, or because you only run a few airports. That means your company isn't diversified enough.

Now that the goal is clear, the way to win this game is to fulfill these conditions without worrying too much about cash. So, open up more stations, even intra-town stations. Carry more cargo types, even with just one little bus. Run more buses to fill the vehicle count requirement. Those are basically the three most often overlooked winning conditions.

Okay, so here're the juicier tips:
  1. The most important point: the longer the route, the more money you earn! Always favor longer routes over shorter ones. However, this must be balanced against opportunity cost; while your fleet is away, your competitors might be sucking the cargo dry.
  2. Airplanes earn the most, then trains and (oil) ships, and buses the least. So if you have about 200 thousand dollars, get yourself an airport as soon as possible. Favor the biggest airport (the Intercontinental) upfront because it will be very difficult to upgrade later on.
  3. Some cargo types are best transported by certain vehicle types. For example, oil is best served by ships, valuables by buses, and passengers by airplanes. A Dinger 200 airplane can fly 400 passengers off to another city in a jiffy; that one airplane could easily earn you half a million dollars in a year.
  4. It is best to favor the most abundant resource producers. If you want passengers, find big cities. If you want to get into the wood business, find the biggest forests. If you want to supply oil to refineries, find the richest oil rigs.
  5. Feeder service that transfers cargo from one station to another is very useful. Remember, that is a cargo transfer, not a cargo drop-off. Feeder service also helps keep the mines/farms/etc. producing resources. For example, suppose you have a train station near a coal mine and this train serves a remote power station. You can open one lorry (truck) station next to the mine, and another that is joint (either distant-joint or adjacent) with the train station. Then you run one bus to carry cargo from the non-joint lorry station to the joint lorry station: pick up the coal at the former, and transfer it at the latter. This is an extremely short feeder service.
    +------+ +------+  road  +--------+
    | Mine | | Lor1 |--------| Lor2 + |
    +------+ +------+        | Train  |
                             +--------+
  6. In a well-run system, waiting time is minimized. Favor more reliable and faster vehicles. Also, make sure to split your vehicles across separate stations, or across different stations of one joint station.
    For example, if the buses are entering the station from one horizontal direction, you can split them into two branches by placing two vertical stations at different road segments at the junction, joining them into one joint station.
            ^
            |
            S1
            |

    -->-->--+--> (bus direction)
            |
            S2
            |
            v

    The idea is to place the stations on different road segments so that the total route distance through either station is roughly the same, so buses don't favor S1 over S2, or vice versa. And remember, after passing through a station that is part of a joint station, buses will avoid the other stations in that same joint station. In the same vein of minimizing waiting time, maximum speed is, of course, a high priority. Nevertheless, breakdowns can drag down even the fastest airplane; supersonic planes that carry few passengers fly at a mere 300 km/h when broken down. Another thing to remember is to minimize sharp turns. Make them as wide as possible.
  7. If the previous points fail you, try this last-resort tactic. It is a defensive move that you can use to secure access to important resources and block competitors from reaching them. You can purchase all the tiles in the vicinity of the mines; then only you can build stations that serve those resources. And because roads are shared by all, you should use railways instead. Moreover, you can also purchase the tiles surrounding your competitors' stations so that they cannot build roads connecting to them. This is dirty play at work!
I am glad that I was finally able to beat the AI. That has freed me from the grip of this addictive game. Time to uninstall OpenTTD and get back to more productive stuff.

Wednesday, November 2, 2011

Three days into OpenTTD

Well, I didn't see this one coming. I have spent three days playing OpenTTD and forgotten about the world outside!

I started with OpenTTD because it reminded me of my younger years with one of my favorites, Transport Tycoon Deluxe. And I just couldn't let it go. The images got burned into my mind. My brain kept thinking about the best strategies to build roads, run trains, and earn money, even in short late-night dreams.

Yet, I have so far not been able to beat the return-on-investment ratios of the AI players. Damn. My bus earns about 10 grand each year while an AI bus earns about 30K. Why?

Friday, October 28, 2011

Three weeks into AI, ML and DB classes

It's been three weeks into the programs. I find them all very useful, some more than others.

The AI class is the least approachable one, at least for me. To date, I still haven't seen how to apply the knowledge in real life. The content is pretty dry and less practical. Most of the time, the two instructors taught us mathematical concepts such as probability and Bayes' rule, with lengthy calculations instead of usable examples. I hope that a few more units into the course, we'll see more concrete examples. Also, and this may be just me again, I can't find slides for review purposes! Must I watch all the videos again? On top of that, I occasionally failed to catch the words, both visually and acoustically.

On the other hand, the DB class is very understandable in general. It does throw in difficult challenges, sometimes utterly complicated ones, to pique the brain. These challenges are good, yet quite frustrating or even demotivating. Take, for example, the Relational Algebra quiz: things that can be done so easily in SQL, such as the MAX function, are undoubtedly hard in RA alone. Then again, I'm not an excellent student, so this may be just me. I only wish Prof Widom would speak a little more slowly.

The ML class is in between. It is approachable and practical. I can follow Prof Ng quite easily because he speaks slowly. I also find the quizzes okay, at an appropriate difficulty for me. And, best of all, I can use Octave to try out what he teaches right away. You have to agree that is much more hands-on.

Overall, I highly applaud this initiative of the professors to bring world-class courses to the masses.

Thursday, October 27, 2011

Con cop con, con cop cha, con cop

Có con cọp cắn con cọp con con của con cọp cạnh con cọp cha của  con con cọp có con cắn con cọp con của con con cọp cạnh con cọp cha.

(A Vietnamese tongue twister built almost entirely out of the words for "tiger", "cub", and "father".)

Question: how many tigers are there?

Wednesday, October 26, 2011

Notes about Python optimization

Yesterday, Dropbox posted a story detailing how they optimized a frequently used function in their code.

http://tech.dropbox.com/?p=89

The conclusions are:
  1. String concatenation is preferable to (faster than) string formatting (see the small timing sketch after this list).
  2. Built-in set and dict types are fast.
  3. Computation is better done in C than in Python. Try to minimize Python code with function inlining and implicit loops, and move as much computation to C as possible, not necessarily your own C extension. In fact, set and dict are implemented in C.
  4. Common wisdom such as optimizing inner loops (or better, unrolling the loop itself) and using local instead of global variables to take advantage of caching is helpful, but not by much in Python.
  5. Measurements are a must.
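As a small illustration of the first point (my own sketch, not from the Dropbox post; exact numbers vary by machine and interpreter):

import timeit

concat = timeit.timeit("s = 'value: ' + str(42)", number=1000000)
fmt = timeit.timeit("s = 'value: %d' % 42", number=1000000)
print('concatenation: %.3fs  formatting: %.3fs' % (concat, fmt))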
Quite a nice read. The comments are also very insightful.

Tuesday, October 25, 2011

Spiral evolution: The network is the computer

Decades ago, Sun Microsystems' tagline was The network is the computer.

At that time, I believe, Sun was aiming for a holistic computing infrastructure which could be treated wholly as one single entity, a computer.

That dream, I believe, failed somewhat. Surely Sun made great architectures, good hardware, and wonderful software, but the network part was mostly not realized. Computers were not able to work with each other as one. It was difficult to program applications that could work across different network segments, let alone different cities or countries. The software stack was not conducive; almost every product had to re-invent distributed computing primitives that were fragile and too low-level to build robust applications from. And most importantly, the link was slow. Before the dot-com bubble, the link was limited to ISDN, probably 128 kbps. This was the hard limiting factor preventing Sun from fully realizing its dream, and therefore the dominant architecture of that time was still very much centralized, with big boxes crunching numbers by themselves.

Nowadays, we are seeing distributed computing almost everywhere. Big data companies have big pipes connecting them to almost everywhere in the world. Their computing grids can easily span three different continents. All of these achievements are largely thanks to better networking technologies. At the same time, better software libraries have made it easier for developers to roll out distributed applications. Among them are great infrastructures such as Hadoop, Cassandra, and HBase, and lower-level nuts and bolts such as ZeroMQ. They have all enabled a more reliable stack that frees us from minute details and allows us to focus more on the business logic. Our applications can now be composed of literally hundreds of nodes, each doing different tasks at once in a coordinated manner, regardless of where they are physically hosted. They can grow or shrink as easily as flicking a palm. They enable "the network is the computer" thinking.

As an analogy, glues like ZeroMQ are the data buses; the nodes are the RAM, the peripheral devices, and the GPU; and the main application (the coordinator) is the CPU of this new machine. I think that makes a lot of sense.

Monday, October 24, 2011

Facebook vs Disqus

I am back from a long trip! Cool.

Today I realized that Facebook has something called Comments that powers sites like TechCrunch and others. This plugin replaces the self-made user comment "widget" usually shown at the bottom of a page. It provides more functions than a bland traditional commenting system, such as analytics for site owners, and badges, points, and games for readers. It entices more people to participate in commenting. And it is a hosted service managed by Facebook.

Then Disqus immediately sprang to mind. For a few years now, the Internet start-up Disqus has been providing the exact same service to many sites. This was a niche, an extremely niche, market. I did wonder then what would happen if bigger providers such as Google, Yahoo!, or Microsoft jumped in. Would Disqus be able to fend them off? With better infrastructure? With more enticing features? More engaging methods?

And the prediction has come true. Now Facebook is in, with its strongest asset: its huge user base. Will Google (Plus or not) follow? Maybe in a year's time?

I think this is the right move for Facebook, and a big blow to Disqus. But then, David beat Goliath, didn't he?


Wednesday, October 5, 2011

Second attempt at PyPy

Update (Oct 30): The fourth translation failed due to insufficient memory again. I wonder if the PyPy buildbot is a 32 GB RAM machine ;).

A few days ago I made my first attempt at running PyPy. It was unsuccessful for my needs due to bug #887. So today I gave it another try with a patch from Justin applied. To cut the story short, today was another failure. Read on for the longer-winded story.

I checked out the source from Bitbucket, then applied the patch that was provided in the same issue above.

So, my first attempt was to use a pre-built PyPy binary to translate this source. It took about an hour or so before the process was killed. I was away when this happened, so I did not know what caused the termination. Experience told me, though, that it might be due to the out-of-memory behavior of Linux systems.

The second time, I still used PyPy to do the translation. This time, I made sure to free as much memory as I could and to monitor its use. Another hour passed, and the translation process was once again killed halfway through. Luckily, this time I saw the cause clearly: the translation process used too much memory and was killed forcefully.

Then I used the less memory-hungry command to translate PyPy with the pre-built PyPy. It consumed a little over 3 GB of RAM, enough to give me hope that it would succeed. However, before launching make -j 4 to compile the generated C sources, translate.py did not release (or garbage collect) all of its unused objects. This left 3 GB wasted and only 1 GB (I was on 64-bit CentOS 5.6 with 4 GB of RAM) for GCC to do its work. Apparently, this wasn't enough, and the same forced-kill story repeated.

Now I am on the fourth translation run. This time I am doing it with vanilla Python 2.6 instead of the pre-built PyPy. The process is indeed much slower, but it also consumes much less memory. The translation has been running for an hour and a half and is still going. Wish me luck.

Tuesday, October 4, 2011

Cyberlympics and a chance to win big prizes

The Global CyberLympics is a not-for-profit initiative led and organized by EC-Council. Its goal is to raise awareness towards increased education and ethics in information security. The mission statement of the Global CyberLympics is Unifying Global Cyber Defense through the Games.

Some members of our local security group have already taken the lengthy online individual qualifying quiz and confirmed its similarity to EC-Council CEH materials. It is neither difficult nor in-depth. You should feel at home taking this quiz.

Monday, October 3, 2011

db-class.org and ml-class.org are live

A month ago there was ai-class.org. Then a few days ago, db-class.org and ml-class.org went live, too. This is good news for newbies like me.

I enrolled in all three classes (AI, DB, and ML) but have only tried out the DB class. They have some recorded videos for the first week even though the class has not started yet. My gripe is that my Internet connection is not fast enough for their videos, which are more than 800 pixels wide. That is probably twice the resolution such presentations require. The plus is that the content is really good. Excellent lectures, excellent questions and quizzes, just excellent.

So, if you are interested in introductory material on Artificial Intelligence, Databases, or Machine Learning, you've gotta enroll in these free online classes by Stanford.

Friday, September 30, 2011

Annoying select.error

I needed to move from Gevent back to wsgiref. The change was quite minimal thanks to the WSGI abstraction.

But to my surprise, and much dismay, the server kept on crashing with select.error on CentOS 5.6. I believe it also exhibits this behavior on other POSIX-compliant OSes. Some searching brought me to an issue filed back in February 2010.

http://bugs.python.org/issue7978

Among the messages, Antoine Pitrou said select was wrapped in a try/finally block in Python 2.6, presumably to support signals and interrupted system calls. Why, then, do I still see the interrupted syscall error?

In the end, I had to put a try/except clause around serve_forever, similar to what was suggested in the second message (copied from Twisted). The code looks quite funny with while True wrapped around serve_forever. You get what I mean?
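Here is a minimal sketch of the shape of that workaround, assuming a plain wsgiref server and a trivial app; it simply retries serve_forever() when select() is interrupted by a signal (EINTR):

import errno
import select
from wsgiref.simple_server import make_server

def app(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return ['Hello, world!\n']

httpd = make_server('', 8000, app)
while True:
    try:
        httpd.serve_forever()
    except select.error as e:
        # retry only when the select() call was interrupted by a signal
        if e.args[0] != errno.EINTR:
            raise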

Thursday, September 29, 2011

My first PyPy try-out

My first PyPy try-out was not so smooth.

I have a CentOS 5.6 64-bit server. Of course, the OS comes with a 64-bit Python. I, however, wanted to run some 32-bit modules, so I compiled a 32-bit Python 2.6 on my own.

I grabbed the PyPy 1.6 source and tried to translate it with the 32-bit Python installation. After about 12 hours, I had to press Ctrl-C to kill the process. But it wasn't terminated! A forceful kill -9 was required.

The source installation was a no-go for me, so I tried a pre-built package. It didn't work out of the box on CentOS due to different library naming; things such as libssl, libcrypto, and libbz2 needed new symlinks. With those shiny new symlinks, pypy breezed through simple interactive tests.

Then I tried running Hsalf with PyPy. And it crashed! A new bug was filed at https://bugs.pypy.org/issue887. My initial investigation showed that PyPy left out an opcode in its translation process. This opcode is used in the Python Imaging Library to bridge between its Python module and its C extension.

Now I'm waiting for a new PyPy build with that opcode plugged in. A cursory look at the PyPy source code shows a lot of promise. I surely hope it delivers.

Wednesday, September 28, 2011

Intel AMT KVM and free viewer

Intel's latest chips (most Core i5 and Core i7 chips) support remote KVM functionality based on the VNC protocol. Basically, the chip runs an embedded VNC server that can talk out-of-band, over an optional TLS layer, with a supported VNC client.

The problem is that supported VNC clients are usually costly. The one often recommended is RealVNC Viewer Plus, which costs $100 a license.

Intel, however, provides one free viewer in their AMT Software Development Kit. It is the Integrated Viewer Application sample.

Tuesday, September 27, 2011

Efika MX and H264 video tag

Just a quick post to say that I have managed to get Chromium to play the HTML5 video tag with an H.264 source.

The trick is to install extra codecs for Chromium.

apt-get install chromium-codecs-ffmpeg-extra

If you do that, it will ask for your permission to remove the chromium-codecs-ffmpeg package. It's okay to do so. That's all, folks.

Monday, September 26, 2011

Extracting Screen Video with Hsalf

The Hsalf repository has a new utility script called svout.py. You can use this script to extract Screen Video frames to separate image files or to dump raw RGB values to stdout.

First, you need to list out all Screen Video streams:

svout.py -i screen.swf
Found Screen Video stream ID 2

Then, extract that stream (2):

svout.py -i screen.swf -s 2 -o output_name

By default, the output format is PNG.

You will have many files whose names are output_name concatenated with a number and the proper extension (.png), such as output_name00001.png.

You can change the output format with the -f switch. Use jpg if you want to write JPEG files, or rgb if you want to dump raw RGB data.

svout.py -i screen.swf -s 2 -o output_name -f jpg



If the output file name is a lone hyphen (-), stdout will be used. This is useful if you want to pipe the pictures to video encoders such as FFmpeg or MEncoder.

svout.py -i screen.swf -s 2 -o - -f rgb | ffmpeg -r 24 -s 640x480 -f rawvideo -pix_fmt rgb24 -i - screen.mp4



The above command dumps Screen Video frames to raw RGB data and sends the output through a pipe to FFmpeg to produce an MPEG-4 video file.

Friday, September 23, 2011

Light is no longer impassable

Scientists at CERN have found that neutrinos may travel faster than light in a vacuum. This would break one of the pillars of modern physics!

http://www.telegraph.co.uk/science/science-news/8783011/Speed-of-light-broken-by-scientists.html

This is serious. Too serious. We will have to wait for confirmation from other scientists.

Wednesday, September 21, 2011

Anti-virus and Windows 8

As anyone could have guessed, Windows 8 promises to include Microsoft Security Essentials out of the box. This is a major step in the right direction for Microsoft because Windows is still the most popular target for viruses. This move will certainly help keep most malicious code from infecting Joe's PC.

Also, it raises the bar for other AV vendors. The question is: can your products beat something free, usable, common, built-in, and enabled by default? I reckon some AV vendors will have to move up the ladder and leave this ground to Microsoft. Maybe they'll shift their focus to "managing"/cooperating with Security Essentials instead of competing with it.

We shall see.

Tuesday, September 20, 2011

Fansipan & Python 3


Some inconsiderate soul pasted a damn sticker on this landmark. And some inconsiderate soul put this piece of metal in this peaceful natural setting. Could we not use some wood, or rock? Duh!

Python rocks, though.

Monday, September 19, 2011

Extracting MP3 from SWF with Hsalf

With revision 13a5495aa50b, Hsalf provides a tool to extract MP3 from SWF.

The gist of this tool is these few simple lines:
# assumed imports for this snippet; inp and outp are the input and output file names
from StringIO import StringIO  # Python 2 stdlib
from hsalf import swf          # Hsalf's SWF module (import path assumed)

# open swf file
fi = swf.SwfFile(inp)
# open output file
fo = open(outp, 'wb')
# iterate through all tags
for tag in fi.iter_body():
  # is this a SoundStreamBlock?
  if isinstance(tag, swf.SoundStreamBlockTag):
    data = StringIO(tag.sound_data)
    mp3 = swf.Mp3StreamSoundData().deserialize(data)
    data = mp3.sound_data.frames
    fo.write(data)
# done, close all files
fo.close()
fi.close()
Basically, we find all SoundStreamBlock tags. These tags hold the data of embedded sound streams. For each tag, we take the data, strip off the Flash-related headers, and write pure MP3 frames to the output file.

That's quite simple, yea?

Friday, September 16, 2011

Is Flash dying?

With Windows 8 going as HTML5-only as possible, will Flash be losing more ground?

Since Microsoft cannot compete against Adobe with Silverlight, they made the right move by removing both Flash and Silverlight and opting for HTML5.

Luckily, I've been working on Hsalf, which will hopefully make the transition from Flash to HTML5 less troublesome for me. I already have some ideas for how to play video with plain JavaScript. Maybe one day I'll make a demo.

Thursday, September 15, 2011

Game of Life Contest

The Python for Vietnamese Group is holding a contest taking Game of Life as its theme.

The best performing code will walk away with a prize equivalent to SGD 250!

Come join the fun!

Wednesday, September 14, 2011

Apathy

It is so sad to read http://dantri.com.vn/c728/s728-517309/hang-tram-nguoi-dan-dung-nhin-hiep-si-bat-trom.htm.

Granted, the press may exaggerate, downplay, or generally distort the truth, but this incident truly shows a despicable apathy.

The article reminds me of a story in SuperFreakonomics: http://en.wikipedia.org/wiki/Murder_of_Kitty_Genovese.

I wish everyone had the same sense of responsibility toward the people around them as they have toward themselves.

Tuesday, September 13, 2011

Python automatic int to long conversion

The language says that if a value fits within an int, it should be an int; otherwise, it is automatically promoted to a long.


Let's take a look at these examples on a 32-bit CPython (these could make great interview questions ;)):

>>> type(2**31)
<type 'long'>
>>> type(-2**31)
<type 'long'>
>>> type((-2)**31)
<type 'int'>
>>>

The first value is obviously out of range: two to the power of thirty-one is one greater than the largest value an int can represent. The second one seems like it should be an int at first, but operator precedence requires the power to be evaluated first, so we get the negation of a long, which stays a long. The final value is within bounds.

Just a side note: I wanted the minimum 32-bit integer value without using sys.maxint (because it may be different on a 64-bit interpreter) while working on Hsalf. And these were the experiments ;).
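For the record, here is one simple way to get those bounds (a small sketch, not necessarily what Hsalf actually does):

# 32-bit bounds computed directly, independent of sys.maxint.
# The arithmetic above yields longs, so int() is used to demote the results
# back to plain ints, which works because both values fit in 32 bits.
INT32_MIN = int(-(2 ** 31))    # -2147483648
INT32_MAX = int(2 ** 31 - 1)   # 2147483647

print(type(INT32_MIN), type(INT32_MAX))  # both <type 'int'> on a 32-bit CPython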

Monday, September 12, 2011

Introducing Hsalf, the reversed Flash

I am delighted to introduce a Python library to read and write Flash file formats: Hsalf.

This library is a result of my attempt at extracting Screen Video from an SWF file. Naturally, it supports the SWF format and is capable of dumping embedded video frames :-). More complete tag support will be added eventually.

Friday, September 9, 2011

Game of Life

I was asked an interesting question today: what do I understand about John Conway's Game of Life?

Game of Life is the name of a board simulation. The board is a matrix of NxM cells, each of which can be either on or off. Initially, some cells on the board are on and the rest are off. Each simulation step produces a new board state. The rules are (a minimal sketch of one step follows this list):
  1. A cell's neighbors are the cells surrounding it. There are at most eight of them.
  2. If a cell has exactly three live neighbors, it will be turned on.
  3. If a cell has exactly two live neighbors, it remains as it is.
  4. Otherwise, the cell will be turned off.
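Here is a minimal sketch of one simulation step under these rules (my own illustration; cells on the border simply have fewer neighbors):

def step(board):
    rows, cols = len(board), len(board[0])
    def live_neighbours(r, c):
        # count live cells in the 3x3 block around (r, c), excluding the cell itself
        return sum(board[i][j]
                   for i in range(max(r - 1, 0), min(r + 2, rows))
                   for j in range(max(c - 1, 0), min(c + 2, cols))
                   if (i, j) != (r, c))
    new = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            n = live_neighbours(r, c)
            if n == 3:
                new[r][c] = 1            # turned (or kept) on
            elif n == 2:
                new[r][c] = board[r][c]  # unchanged
            # otherwise off (already 0)
    return new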
Eric Raymond uses a pattern from this game called the Glider as the Hacker Emblem.
My answer was that this simulation truly is Life. When you live in an overcrowded place (more than 3), you get crushed; you die. When you live in a deserted place (fewer than 2), you also die. Only when you are surrounded by enough people (2 or 3) do you carry on. And when the community is just right (3), new life is born. Generation after generation. Things come and go.

How about you? What do you feel about the simulation?

Thursday, September 8, 2011

A fail case of optimization

I had a piece of code like this:

if (is_enemy(side, new_r, new_c)) {
    if (tolower(board[new_r][new_c]) == 'n') {
        return true;
    }
}

I optimized away the second if statement like this:

if (is_enemy(side, new_r, new_c)) {
    return tolower(board[new_r][new_c]) == 'n';
}

On the surface, this optimized version seems equivalent to the original. Integrated into the bigger picture, however, it fails terribly. The original code is like this:

for (int i = 0; i < 8; ++i) {
    if (is_enemy(side, new_r, new_c)) {
        if (tolower(board[new_r][new_c]) == 'n') {
            return true;
        }
    }
}

The optimized version forces a return too early in the loop: when the neighboring cell holds an enemy piece that is not a knight, it returns false immediately instead of continuing to check the remaining directions.

Wednesday, September 7, 2011

Screen Video in Flash SWF

I had a chance to play with Flash two days ago. I wanted to extract a frame from an SWF file, version 7.

http://klaus.geekserver.net/libflv/screen.swf

[03c]        10 DEFINEVIDEOSTREAM defines id 0001 (79 frames, 466x311 codec 0x03)
                -=> 01 00 4f 00 d2 01 37 01 00 03
[01a]         6 PLACEOBJECT2 places id 0001 at depth 0001
                -=> 06 01 00 01 00 00
[03d]    182562 VIDEOFRAME adds information to id 0001 (frame 0) 352x288 P-frame deblock 1  quant: 14 
                -=> 01 00 00 00 13 31 d2 31 37 0f ee 78 da d5 5a 59
                -=> 93 db c6 11 9e 1e 80 dc 55 25 cf a9 24 92 ac d5
                -=> 1e bc 00 82 20 0e 72 57 eb 54 1e 52 f9 ff bf 20

Dumping with swfdump -d produced something like the above.

After skimming through Adobe's SWF File Format Specification version 10, I got the meanings of the two related tags, DEFINEVIDEOSTREAM and VIDEOFRAME.

DEFINEVIDEOSTREAM says: I am going to create a video from embedded data, and the codec I am going to use is 0x03, which is the identifier of the Screen Video codec.

VIDEOFRAME augments the other tag with the actual data of a frame. This is what I am after.

The specification says a VIDEOFRAME consists of the tag, two bytes to identify the stream that was defined with DEFINEVIDEOSTREAM, two bytes to identify the frame in that stream, and then the payload. So, in the above dump, the first two bytes (01 00) identify stream 0001, the next two (00 00) identify frame 0 in this stream, and the rest, from 13 31 d2 31 37..., is the payload. The payload is to be interpreted according to the codec that was defined in DEFINEVIDEOSTREAM.

I then read up on the Screen Video codec (from page 239). A SCREENVIDEOPACKET consists of 4 bits for the block width, 12 bits for the image width, 4 bits for the block height, 12 bits for the image height, and the remaining data for the image blocks. Using that to dissect 13 31, we get a block width of 1, which means the actual block width is 32 pixels, and an image width of 0x331 (817) pixels.

Obviously, that value isn't right; the movie is not that wide. Since the interpretation follows the spec, there has got to be something wrong with the spec.

So I checked out SWF File Format version 7, where the Screen Video codec was introduced, to see if it had a better explanation. Sure enough, it had different descriptions and discrepancies (I'll come back to this later), but nothing related to the problem I was facing. The tag formats stay the same in the two documents.

So I reread and reread, searched and searched on Screen Video, hoping to find some light. And light I found. In the introductory paragraph, the spec says:
In a keyframe, every block is sent. In an interframe, one or more blocks will contain no data...
Ahh, keyframe and interframe. What are they? How do I determine whether a frame is a keyframe? A search for keyframe in the v7 spec brought me to the VIDEODATA tag. This tag belongs to the FLV format, not the SWF format. It says that the first nibble is a CodecID, the next nibble is a FrameType, and then comes the payload, which is a SCREENVIDEOPACKET if the CodecID is 3. This information is not mentioned in v10 of the spec.

Applying this interpretation to the five bytes 13 31 d2 31 37 did not exactly yield a desirable result: the first nibble (1) is not a known CodecID. However, switching the order of CodecID and FrameType gave these values a reasonable meaning. FrameType 1 is a keyframe, CodecID 3 is Screen Video. Then comes a block width of 3, an image width of 0x1d2 (466), a block height of 3, and an image height of 0x137 (311). Much more sensible. Continuing with that interpretation, I was able to decode the whole VIDEOFRAME packet.
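Here is a small sketch of that interpretation in Python (my own illustration), applied to those five payload bytes:

data = bytearray([0x13, 0x31, 0xd2, 0x31, 0x37])

frame_type = data[0] >> 4     # 1 = keyframe, 2 = interframe
codec_id = data[0] & 0x0F     # 3 = Screen Video

header = (data[1] << 24) | (data[2] << 16) | (data[3] << 8) | data[4]
block_width = (((header >> 28) & 0xF) + 1) * 16   # raw value 3 -> 64 pixels
image_width = (header >> 16) & 0xFFF              # 0x1D2 = 466
block_height = (((header >> 12) & 0xF) + 1) * 16  # raw value 3 -> 64 pixels
image_height = header & 0xFFF                     # 0x137 = 311

print('%d %d %dx%d' % (frame_type, codec_id, image_width, image_height))  # 1 3 466x311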

Ultimately, I needed to produce an image out of this raw data. Here comes the discrepancy between v7 and v10. In v7, the blocks are arranged from top left to bottom right, row by row, and within each block, pixels are arranged from top left to bottom right, row by row. In v10, the blocks are arranged from bottom left to top right, row by row, and within each block, pixels are arranged from bottom left to top right, row by row.

Funnily, I followed a mixed approach at first: blocks arranged from bottom left to top right, but pixels within each block arranged from top left to bottom right. The reason was that I read v10 first, then, while fixing the interpretation above, switched to v7 and continued with it. So half the idea came from v10 and the other half from v7. In the end, the correct arrangement is the one depicted in v10: bottom up, left to right, similar to a BMP file.

Here is the first frame in that SWF file.

First frame extracted from screen.swf
If this whole post is rather too long for you, here are the takeaways:
  1. If CodecID is 3 in DEFINEVIDEOSTREAM, it is Screen Video.
  2. If it is Screen Video, the VIDEOFRAME packet, as documented in the spec, is wrong. In reality, it has two extra nibbles right before the SCREENVIDEOPACKET payload: the first nibble is the FrameType (1 for a keyframe, 2 for an interframe), and the second is the CodecID (which is 3).
  3. A Screen Video frame is divided into blocks with the first one located at the bottom left of the frame, going left to right, bottom to top.
  4. In each block, pixels are arranged bottom up, left to right.
<sarcastic>So much thanks to Adobe for opening up SWF file format.</sarcastic>

Tuesday, September 6, 2011

Hacked Certificate Authority

DigiNotar, a root CA, was compromised last week. That led to many rogue certificates being issued for respected names such as Google and Yahoo. It is assumed that these certificates were issued for the sake of carrying out man-in-the-middle attacks against these companies. The list of rogue certificates can be found on the Tor project web site:

https://blog.torproject.org/files/rogue-certs-2011-09-04.csv

Of particular interest among the rogue certificates are two for *.*.com and *.*.org. RFC 2818 does not say whether these certificates are valid. Accordingly, the interpretation of whether they are allowed to pass host verification is up to the browsers. Fortunately, most browsers do the logical thing: they require at least two concrete domain name components. This is in alignment with RFC 2109 for cookie management, too.
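A tiny sketch of that sanity rule (my own illustration, not how any particular browser implements it):

def wildcard_is_sane(pattern):
    # require at least two concrete (non-wildcard) labels in the name
    labels = pattern.split('.')
    concrete = [label for label in labels if label != '*']
    return len(concrete) >= 2

print(wildcard_is_sane('*.google.com'))  # True
print(wildcard_is_sane('*.*.com'))       # False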

Let's hope all browsers are sane.

Monday, September 5, 2011

__repr__ and unicode

A __repr__ method returning a Unicode string does not work in Python 2.5, 2.6, or 2.7 when called implicitly or via the repr function.

class TestObject(object):
    def __repr__(self):
        return u'\u1234'
t = TestObject()
print (t.__repr__(), )
print (t, ) # same as print (repr(t), )

Running the above code produces

(u'\u1234',)
(Traceback (most recent call last):
  File "test_repr.py", line 6, in <module>
    print (t, )
UnicodeEncodeError: 'ascii' codec can't encode character u'\u1234' in position 0: ordinal not in range(128)

This bug was filed back in 2009: http://bugs.python.org/issue5876.
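A common workaround (my own note, not from the bug report) is to have __repr__ return a byte string by encoding the Unicode value explicitly:

class TestObject(object):
    def __repr__(self):
        # encode to a byte string so repr() and print work on Python 2
        return u'\u1234'.encode('utf-8')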

Thursday, September 1, 2011

Blogger has a new face

Sweet. Very Google-Plusy. Sweet.

Please, Google, could you integrate Blogger with Google Plus too?

Wednesday, August 31, 2011

Python IDE in Visual Studio

Microsoft released the first version of Python Tools for Visual Studio on August 29. It is a free and open source IDE built on top of the free Visual Studio Shell. Basically, you get the familiar (though not to me) Visual Studio interface to manage projects, plus Python language support with IntelliSense and refactoring. You also get an interactive Python console, a debugger, and a code profiler. Of course, you get IronPython support too.

In case you are wondering, this free tool works best with Visual Studio Ultimate. Thank you for your purchase.

Tuesday, August 30, 2011

Don't be Evil, anyone?

Ironically, an admired company that was once famous for its corporate philosophy, "You can make money without doing evil," has repeatedly carried out evil deeds.


http://feeds.betanews.com/~r/bn/~3/o4eoiGpeUag/

The willful infringement around Java and the illegal advertising of drugs are the most recent incidents. We might even see an antitrust case against Google for its integration of Google+ and Search.

What worries me the most is how similar Google's actions are to Jeffrey Skilling's.
My job as a businessman is to be a profit center and to maximize return to the shareholders. It's the government's job to step in if a product is dangerous.

Let's make money first, worry about getting caught later, right?

Apparently, keeping to one's word is too tough. <sarcastic>We can all sympathize with that.</sarcastic>

Friday, August 26, 2011

And the cloud is gone

This morning I attended a small conference on Cloud Computing. IaaS, PaaS, SaaS, and other marketing talks bore me. Antivirus vendors talked about obtaining signatures "from the cloud." Firewall vendors talked about managed security providers "from the cloud." People talked about "the cloud" as some sort of cure-all.

Then in the evening I saw this question http://www.quora.com/What-are-the-best-cloud-based-IDEs. Well, okay, this one is new. Now suddenly a website becomes a cloud.

To be honest, I don't know what a cloud is. Neither do I understand what cloud computing is about. Apparently, I am not alone in this regard. If you have some spare time to search for cloud computing definitions, you will have a hard time reconciling them into one.
Clearly, the term "cloud computing" has lost most of its meanings and core attributes. This occurred not by anybody redefining what it is, but by billions of marketing dollars that simply shout down the thought leaders in this space who call BS on all the cloud-washing.
But, apparently, "vendors who strive to be accurate, precise, real and relevant [in their cloud computing strategy] are winning deals right now and transcending the hype cycle to close sales." (http://softwarestrategiesblog.com/2011/07/27/gartner-releases-their-hype-cycle-for-cloud-computing-2011/.) Great! As it should be.

So, anyway, what is cloud computing? What do you use it for? How do you create one? How do you use one?

Thursday, August 25, 2011

EfikaMX -- a great product

Today I upgraded my Efika Smart Top device with the latest image and, boy, I was amazed.

The Smart Top is a small, light device with one RJ-45 port, one HDMI port, one headphone jack, one microphone jack, two USB ports, one SD slot, a built-in wireless chip, and an HD decoder.

I got this device for my mom. It hooked up perfectly to the big-screen TV we have in the living room. Running Ubuntu with Firefox and the Vietnamese language pack is perfect for my purpose.

There was a slight problem at first. My Efika MX came with an old image (February, I think), which defaulted to 1080p resolution. Many modern TVs do not, at least according to the device manufacturer Genesi, work right to the HDMI specs, and that causes trouble with the Smart Top. The fix is usually to press Ctrl-F1 to get to the console and manually switch to 720p resolution.

The next disappointment was that the February image did not have a 2D-accelerated X11 driver (Xv). Loading GUI screens was really slow; it could easily take 10 seconds to display the Firefox GUI. Certainly, playing a movie at 1 frame every few seconds isn't in anyone's interest.

No more annoyances, though. With the latest July image, the Smart Top box is nearly at its full capacity. Besides the lack of HD decoding and a Flash player, it is as sweet as honey. Movies are arguably playable.

For the price tag of 135 euros, this is surely a good buy. My wish is that it came with two network interfaces instead of one. That would make it the ultimate Linux box!

Wednesday, August 24, 2011

Temporary fix for Apache Killer

Update (September 07): Apache released version 2.2.20 to fix this issue.

Update (August 26): The Request-Range header needs to be blocked as well.

A few days ago KingCope published a small Perl script to launch a DoS attack against Apache HTTPD. The problem is that it is too efficient for its own good. I had a good time playing with it and came up with a few pointers that might help others.
  1. Make sure that your MPM settings are appropriate for your server resources. For example, you should not expect a 256MB RAM server to run 100 instances of Apache.
  2. Disable DEFLATE output filter with RemoveOutputFilter DEFLATE.
  3. Disable Partial Content with headers_module: RequestHeader unset Range (a config sketch follows below).
You will lose some features such as resumable downloads and gzip encoding, but this is definitely better than running iptables packet inspection on every single packet, as someone suggested.
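
For reference, here is roughly what items 2 and 3 (plus the August 26 update) might look like in the server config; exact placement and module availability depend on your setup, so treat this as a sketch rather than a drop-in fix.

# Item 2: drop on-the-fly compression
RemoveOutputFilter DEFLATE

# Item 3 and the Aug 26 update: ignore byte-range headers; requires mod_headers
<IfModule mod_headers.c>
    RequestHeader unset Range
    RequestHeader unset Request-Range
</IfModule>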

Tuesday, August 23, 2011

Detecting file types

This question pops up every now and then.

There is a utility called file on Unix platforms. This tool tells you what a particular file likely is. It works by sampling the file content and making an educated guess. For example, a ZIP file usually has the two bytes PK at the beginning, an EXE file would have MZ, a PDF file %PDF, and so on.

However logical it sounds, it is still a guess. And a guess can be wrong.

In Python, there is the built-in module mimetypes, which works with file extensions. If the file extension isn't available, the module filetypes on PyPI works similarly to file. Worst case, you can always sample a few bytes from the file and do a signature match against your own database, as described earlier.
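
A rough sketch of that do-it-yourself signature match in Python (the table below holds only the handful of examples mentioned above, not a real signature database):

SIGNATURES = {
    b'PK': 'zip',
    b'MZ': 'exe',
    b'%PDF': 'pdf',
}

def guess_type(path):
    with open(path, 'rb') as f:
        head = f.read(8)                  # sample the first few bytes
    for magic, kind in SIGNATURES.items():
        if head.startswith(magic):
            return kind
    return None                           # no match; remember, it is only ever a guess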

Monday, August 22, 2011

Simple job scheduling with Gevent

There are times when we need to run a function at repeated intervals. If we are lucky enough to already be using Gevent, this little function does the job neatly.

import gevent

def schedule(delay, func, *args, **kw_args):
    gevent.spawn_later(0, func, *args, **kw_args)                        # run func right away
    gevent.spawn_later(delay, schedule, delay, func, *args, **kw_args)   # and re-arm ourselves after delay

The idea is that we let Gevent schedule the schedule() function in its event loop. This is like a (flat) tail-recursive call.
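
For instance, a minimal usage sketch (Python 2 style, matching the era of this post; heartbeat is just a made-up callback):

import gevent

def heartbeat():
    print 'tick'

schedule(2, heartbeat)   # run heartbeat every 2 seconds
gevent.sleep(10)         # let the event loop run for a while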

If we are not using Gevent, however, the package apscheduler is an advanced job scheduler similar to Java's Quartz and worth a serious look.

Friday, August 19, 2011

Asymmetry in Security

This is one of many topics I find very interesting.

People often say that it is easier to break than to build. That is clearly an asymmetry. Think about it: an organization's IT group of, say, 10 people has to constantly fight against an unknown (and possibly large) number of attackers, day and night.

Additionally, one seemingly simple vulnerability can cause a collapse of the whole system, and perhaps of related external dependencies too. Think about the blackout in New York a few years back. That is another asymmetry. To build, one has to carefully check for every possible weak link. To break, an attacker only needs to find one.

However, the reverse is also true. Most of the time it is easier to fix a bug than to exploit it. For example, a cross-site scripting bug is easily fixed by encoding HTML output, but to take advantage of that bug, an attacker has to jump through many hoops. How about a buffer overflow? Fixing a buffer overflow is seldom a difficult task, but exploiting that vulnerability requires deep knowledge, multiple stages, and various tricks to bypass additional checks such as ASLR, non-executable stacks, and so on.

What is your thought about this?

Wednesday, August 17, 2011

Re: Fit or Future? Which is more important when hiring?

This article "Fit or Future? Which is more important when hiring?" appeared on Artima in June.

http://www.artima.com/articles/fit_or_future.html

Fit refers to candidates whose skills better match a job description, and Future refers to those who would quickly gain such skills provided they were supported. The question is which you would pick.

As you would expect, the article does not dictate whether Fit or Future is preferable. Obviously it is context dependent. The article does, however, list several factors that could affect your decision. Among them is Urgency.

To me, the Urgency factor alone is sufficient to determine whether to pick Fit or Future. Why? Because Future is always my default choice, unless urgency comes in. Even then, I would think only companies that did not plan for growth would have to resort to that. It is not an urgency until you make it one, right?

And here's why I pick Future. First of all, there is always a period for a new candidate to adapt to the new organizational culture, the new environment, and so on. Learning new skills in this period is also the norm. That is to say, there usually is a buffer of time for someone with a solid grounding to pick up new skills. It is not so urgent after all. Secondly, people do not spend four to five years at university learning "advanced" usage of Java Enterprise or PHP. They spend that time learning foundational knowledge, because it often matters more. It allows people to grow, organically. Thirdly, diversity is, most of the time, a good thing, in the same vein that a diversified investment portfolio is usually less risky. People with different backgrounds, cultures, and skill sets form a better team than people coming from one mold. And finally, you don't make hiring an urgency, do you?

Do you?

Tuesday, August 16, 2011

Google+ Games

Two words: Improvements Needed. Make it three: Much.

  1. Currently, the caching mechanism makes it very difficult to receive (help, invite, gift, etc.) requests from friends. This makes real-time game play unlikely.
  2. As far as I know, the Notifications area does not show game posts immediately. Similar to the above point, this is a downer for social gamers: they just don't know when someone does something to them.
  3. There needs to be a way to totally separate game streams from other streams, maybe a flag to say all posts from this Circle are game related. Google did the right thing to create Game Notifications for, well, games. But that does not prevent people from cross-posting to the regular streams with requests. Post-pollution, anyone?
  4. There needs to be a way to totally obliterate game(s) data. I want to delete my city!
So, have you got http://plus.google.com/games?

Monday, August 15, 2011

Spiral evolution of technologies

Nonsense alert: this post is pretty much nonsensical.

We are observing an interesting trend of moving towards NoSQL. Misconceptions aside, I am talking about the spiral evolution from dumb data, to structured/smart data, and back to dumb data.

Before SQL came about, people used to define their own structs and dump them to a binary file. Reading back was a simple matter of seeking to the correct position and grabbing a chunk. No query was possible besides the obvious query by position.
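
As a toy illustration of that style (Python's struct module standing in for the hand-rolled C structs of the time; the record layout here is made up):

import struct

RECORD = struct.Struct('20si')        # made-up layout: 20-byte name, 4-byte integer

def read_record(path, index):
    with open(path, 'rb') as f:
        f.seek(index * RECORD.size)   # the only "query": seek by position
        name, value = RECORD.unpack(f.read(RECORD.size))
    return name.rstrip(b'\x00'), value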

Then came SQL and all the query goodies. This was possible because there was a so-called schema to describe the data, in an almost universal way: this field is a string, that field is an integer. The schema allowed for intelligence. A SQL server is able to understand its data and make logical, relational queries on it.

Nowadays, we seem to be moving back to the earlier dumb model with NoSQL. This is what I mean by spiral evolution: we are back where we once were, but at a higher level. Apparently, people realize that some data do not need to be smart. They are just chunks of bits and bytes, and what matters more in dealing with them is how to store them efficiently. Hence NoSQL, whose main purpose (among others) is to be a very good way to persist data.

What is observable in NoSQL systems is that they tend to be schema-free. This usually translates into simple queries and primitive relations. However, it reduces the overhead on the server: the server does not have to perform extra work to make sense of the data it manages. This makes sense, right? After all, it is the application that ultimately knows about its data.

Looking into the future, we will probably see a trend back to schema-enabled data a few years or decades from now, won't we?

Friday, August 12, 2011

A queue whose elements know their positions

Recently I needed a data structure that supports FIFO access and in which every item knows its own position. Think about a queue in real life. The first person in the queue knows he is next. The second one knows he will be next after one person. You can walk up to anyone in the queue at random and ask "when is your turn," and he will be able to tell you "after X more persons."

Here is a simple algorithm to implement such a queue.

  1. Tag an increasing number, call it slotNumber, to each element in the queue. When the queue is initialized, set slotNumber to zero. When an element is pushed onto the queue, increase slotNumber by one and tag it to the element.
  2. The position of each element in the queue is therefore the difference between its slotNumber and the first element's slotNumber.
This algorithm was devised by an intern in our company.

I used a variant of this algorithm in which I kept two counters instead of just one, pushCounter and popCounter. Each time an element was pushed into the queue, I increased the pushCounter; when an element was popped, I increased the popCounter. The position of an element is then the difference between its slotNumber and popCounter.
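
Here is a minimal sketch of the two-counter variant in Python (the class and method names are my own):

from collections import deque

class PositionalQueue(object):
    def __init__(self):
        self.items = deque()
        self.push_counter = 0            # slot number for the next pushed element
        self.pop_counter = 0             # slot number of the element at the head

    def push(self, value):
        slot = self.push_counter         # tag the element with its slot number
        self.items.append((slot, value))
        self.push_counter += 1
        return slot                      # the element keeps this as its "ticket"

    def pop(self):
        slot, value = self.items.popleft()
        self.pop_counter += 1
        return value

    def position(self, slot):
        return slot - self.pop_counter   # 0 means "you are next"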

Thursday, August 11, 2011

Captain America, the movie

A so-so action flick. A super-super-hero type of movie.

The setting is unconvincing. I don't mean the city, the background, or the illogical laser guns. I'm talking about the inconsistency within the setting. So, our hero has a rare, extremely rare, shield made of some ultimate metal. Apparently this shield can deflect lasers and can be used as a hand weapon. Yet it looks like the shield weighs less than a book: when our hero straps it to his back, it flaps up, down, left, and right.

The hero is too powerful, even more powerful than Superman or The Hulk. There is basically no knot to be untied in the plot.

I would recommend this movie if you want to kill time.

Wednesday, August 10, 2011

IRC Flood Control

IRC flood control is a simple and elegant method to detect message floods. Simple, yet intelligent enough to accommodate random bursts of messages.

The main idea is to keep a timer, called, say, recvTimer, which is set to 0 initially. When a new message is received, recvTimer is adjusted by following these steps:

1. If the current time is higher than recvTimer, set recvTimer to current time.
2. Increase recvTimer by, say, 2 seconds.
3. If recvTimer is, say, more than 10 seconds ahead of current time, flood is detected.

With the hypothetical values above, a burst of five messages is supported, and anyone who sends more than 5 messages in 10 seconds is a flooder. Simple, yet very elegant.
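
A small sketch of those three steps in Python, using the hypothetical 2- and 10-second values from above:

import time

class FloodDetector(object):
    def __init__(self, penalty=2.0, limit=10.0):
        self.penalty = penalty       # seconds added to recvTimer per message
        self.limit = limit           # how far ahead recvTimer may run before it counts as a flood
        self.recv_timer = 0.0

    def on_message(self):
        now = time.time()
        if now > self.recv_timer:                    # step 1
            self.recv_timer = now
        self.recv_timer += self.penalty              # step 2
        return self.recv_timer - now > self.limit    # step 3: True means flood detected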

Tuesday, August 9, 2011

Using netsh to whitelist your application in Windows Firewall

It is quite easy to use netsh to whitelist your application in Windows Firewall.

netsh firewall set allowedprogram <absolute_path> <short_name> ENABLE

where:
<absolute_path> is the absolute path to the executable file
<short_name> is some descriptive short name so that users know what this entry in the firewall is for

That is all. Of course, this command must be run with administrator privileges.
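
For example, whitelisting a hypothetical application installed at C:\Program Files\MyApp\myapp.exe might look like:

netsh firewall set allowedprogram "C:\Program Files\MyApp\myapp.exe" MyApp ENABLE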

Monday, August 8, 2011

Problem 10137: The Trip

Original problem statement

Problem A: The Trip

A number of students are members of a club that travels annually to exotic locations. Their destinations in the past have included Indianapolis, Phoenix, Nashville, Philadelphia, San Jose, and Atlanta. This spring they are planning a trip to Eindhoven.

The group agrees in advance to share expenses equally, but it is not practical to have them share every expense as it occurs. So individuals in the group pay for particular things, like meals, hotels, taxi rides, plane tickets, etc. After the trip, each student's expenses are tallied and money is exchanged so that the net cost to each is the same, to within one cent. In the past, this money exchange has been tedious and time consuming. Your job is to compute, from a list of expenses, the minimum amount of money that must change hands in order to equalize (within a cent) all the students' costs.

The Input

Standard input will contain the information for several trips. The information for each trip consists of a line containing a positive integer, n, the number of students on the trip, followed by n lines of input, each containing the amount, in dollars and cents, spent by a student. There are no more than 1000 students and no student spent more than $10,000.00. A single line containing 0 follows the information for the last trip.

The Output

For each trip, output a line stating the total amount of money, in dollars and cents, that must be exchanged to equalize the students' costs.

Sample Input
3
10.00
20.00
30.00
4
15.00
15.01
3.00
3.01
0

Output for Sample Input
$10.00
$11.99

Restated problem

Given N integers representing the expenses, in cents, of N persons on a trip, find the least total amount of money to exchange among them so that everyone's final expense is within one cent of everyone else's. That is, if there are 3 persons, A, B, and C, then the difference between A's and B's expenses is at most one cent, between A's and C's is at most one cent, and between B's and C's is also at most one cent.

Reading input

Many people use float, double, or long double to read in the amount (given in dollars). This leads to rounding problems. Instead, we can use an integer to store the amount in cents. For example, if the input is 3.14, we can read in 3, the dot, and 14, and the amount is then 3*100 + 14.

int dollar, cent, amount;
char dot;
cin >> dollar >> dot >> cent;   // e.g. "3.14" -> dollar = 3, dot = '.', cent = 14
amount = dollar * 100 + cent;   // keep everything in cents

Solution

Of course, we need to take the average. The trip-goers are then divided into two groups: those who spent more than the average and those who spent less. Our job is to get everyone's spending as close to this average as possible.

Those who spent less must pay extra cash. This cash is used to offset those who spent more. Let's call this a virtual common fund, which a less-spender must deposit into and a more-spender can withdraw from. This fund can go below zero.

An important realization is that, since we are using integer division, the average expense is closer to the less-spenders. A more-spender, therefore, needs to withdraw from the common fund only as much as it takes to bring his spending down to the average plus one cent. A less-spender, however, must pay in full to bring his spending up to exactly the average. This way, we guarantee that everyone's expense is within one cent of everyone else's.

If, after everything is settled, the fund is non-negative, we say that the less-spenders have repaid the more-spenders so that both are within one cent of the average expense. If, on the other hand, the common fund is negative, the less-spenders need to pay that much more to balance the fund (to bring it back to zero).

With that reasoning, the answer is simply the sum that the less-spenders must pay to bring their expenses up to the average, plus the absolute value of any negative fund.

In the end, the algorithm is a two-pass traversal of the expense array. The first pass finds the total of all expenses in order to calculate the average. The second pass performs the fund adjustment depending on whether each person is a less-spender or a more-spender; at the same time, it accumulates the total that the less-spenders must pay (call it the top-up sum). If the fund ends up non-negative, the solution is this top-up sum; if it is negative, the solution is the top-up sum minus the fund (that is, plus the absolute value of the fund).
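
To tie it together, here is a sketch of that two-pass idea in Python rather than C++ (expenses are in cents, as suggested in the reading-input section; the function name is mine):

def min_exchange(expenses):
    total = sum(expenses)
    average = total // len(expenses)         # integer division: biased towards the less-spenders

    top_up = 0                               # what the less-spenders must put into the fund
    fund = 0                                 # the virtual common fund
    for spent in expenses:
        if spent < average:
            top_up += average - spent        # less-spender pays up to exactly the average
            fund += average - spent
        elif spent > average + 1:
            fund -= spent - (average + 1)    # more-spender withdraws down to average + 1

    return top_up if fund >= 0 else top_up - fund   # a negative fund must be covered as well

For the two sample trips above, this returns 1000 and 1199 cents, i.e. $10.00 and $11.99.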

Friday, August 5, 2011

Gevent pywsgi and UnicodeDecodeError

When switching from the libevent-based wsgi server to the Python-based pywsgi server, you might encounter a strange error like the one reported at Google Code: http://code.google.com/p/gevent/issues/detail?id=86.

The problem is that WSGI (or the underlying HTTP) does not understand Unicode. You might think that, because you can read other languages just fine on the Internet, HTTP must surely understand Unicode, right? Wrong! HTTP only transfers bytes. How to decode those bytes into characters is entirely up to the browser. There is a charset hint in the Content-Type header, but WSGI does not use that header to encode your Unicode response.

And so, a WSGI response must not contain unicode objects. All unicode objects must have been encoded into plain byte strings. This applies to everything: the status code, the status message, the headers, and the body.

This condition must hold at the WSGI server level. In WSGI, we can stack/wrap several WSGI applications (so-called middleware) on top of each other. The bottommost layer, the one nearest to the WSGI server, must ensure that all strings are byte strings. For example, it could happen that Beaker wraps your application. Your application does not return any unicode strings, but you might still encounter the UnicodeDecodeError. That is because Beaker may need to return some headers (to set or delete the session cookie), and if any attribute of that cookie (such as the path or domain) is a unicode string, the whole cookie header becomes a unicode string. And this violates the WSGI specification.
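
One way to enforce this at the bottom of the stack is a tiny middleware that encodes everything on its way out. A minimal sketch (Python 2, assuming UTF-8; the function names are mine):

def encode_middleware(app):
    def to_bytes(value):
        return value.encode('utf-8') if isinstance(value, unicode) else value

    def wrapped(environ, start_response):
        def encoding_start_response(status, headers, exc_info=None):
            headers = [(to_bytes(name), to_bytes(value)) for name, value in headers]
            return start_response(to_bytes(status), headers, exc_info)
        # encode each body chunk as well
        return (to_bytes(chunk) for chunk in app(environ, encoding_start_response))

    return wrapped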

Thursday, August 4, 2011

The Case of Socket Timeout in smtplib in Python 2.6.4

In Python 2.6.4 (and probably earlier versions, and maybe some later versions too), it is extremely important NOT to call connect again if you have passed the host and port into smtplib's SMTP (and, by virtue of inheritance, SMTP_SSL) constructor.

If host and port are supplied, the constructor makes a call to connect. In connect, self.socket is assigned a connected socket, and then self.getreply is called. Within getreply, if self.file is None, self.file is assigned a file object created from self.socket.

Therefore, when another call to connect is made, self.socket is reset to another (new) connected socket, but self.file is not. The call to self.getreply still reads from the old self.file (the old socket). And since the server has nothing more to send to the client (it already sent everything during the previous connect), getreply blocks until the old socket is disconnected by the SMTP server due to a timeout.

The fix is easy: reset self.file to None in connect.
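
If you really do need to reconnect, a small subclass along the lines of that fix might look like this (a sketch of the workaround described above, not the library's own code):

import smtplib

class ReconnectableSMTP(smtplib.SMTP):
    def connect(self, host='localhost', port=0):
        # Drop the stale file object so getreply() reads from the new socket.
        self.file = None
        return smtplib.SMTP.connect(self, host, port)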