Wednesday, August 31, 2011

Python IDE in Visual Studio

Microsoft released the first version of Python Tools for Visual Studio on August 29. It is a free and open source IDE built on top of the free Visual Studio Shell. Basically, you get the familiar (not to me) Visual Studio interface for managing projects, plus Python language support with IntelliSense and refactoring. You also get an interactive Python console, a debugger, and a code profiler. Of course, you get IronPython support too.

In case you are wondering, this free tool works best with Visual Studio Ultimate. Thank you for your purchase.

Tuesday, August 30, 2011

Don't be Evil, anyone?

Ironically, an admired company that was once famous for its corporate philosophy "You can make money without doing evil" has repeatedly committed evil deeds.

The willful violation of Java and the illegal advertisement of drugs are the most recent incidents. We might even see an anti-trust case against Google for its integration of Google+ and Search.

What worries me the most is how similar Google's actions are to Jeffrey Skilling's words:
"My job as a businessman is to be a profit center and to maximize return to the shareholders. It's the government's job to step in if a product is dangerous."

Let's make money first, worry about getting caught later, right?

Apparently, keeping one's word is too tough. <sarcastic>We all can sympathize with that.</sarcastic>

Friday, August 26, 2011

And the cloud is gone

This morning I attended a small conference on Cloud Computing. IaaS, PaaS, SaaS etc. marketing talks bore me. Antivirus vendors talked about obtaining signatures "from the cloud." Firewall vendors talked about managed security providers "from the cloud." People talked about "the cloud" as some sort of cure-all solution.

Then in the evening I saw this question. Well, okay, this one is new. Now suddenly a website becomes a cloud.

To be honest, I don't know what a cloud is. Neither do I understand what cloud computing is about. Apparently I am not alone in this regard. If you have some spare time to search for cloud computing definitions, you will have a hard time reconciling them into one.
Clearly, the term "cloud computing" has lost most of its meanings and core attributes. This occurred not by anybody redefining what it is, but by billions of marketing dollars that simply shout down the thought leaders in this space who call BS on all the cloud-washing.
But, apparently, "vendors who strive to be accurate, precise, real and relevant [in their cloud computing strategy] are winning deals right now and transcending the hype cycle to close sales." Great! As it should be.

So, anyway, what is cloud computing? What do you use it for? How do you create one? How do you use one?

Thursday, August 25, 2011

EfikaMX -- a great product

Today I upgraded my Efika Smart Top device with the latest image and, boy, I was amazed.

The Smart Top is a small, light device with one RJ-45 port, one HDMI port, one headphone jack, one microphone jack, two USB ports, one SD slot, a built-in wireless chip, and an HD decoder.

I got this device for my mom. It hooked up perfectly to the big screen TV we had in the living room. Running Ubuntu with Firefox and the Vietnamese language pack is perfect for my purpose.

There was a slight problem at first. My Efika MX came with an old image (February, I think), which defaulted to 1080p resolution. Many modern TVs, at least according to the device manufacturer Genesi, do not conform to the HDMI specs, and this causes trouble with the Smart Top. The fix is usually to press Ctrl-F1 to get to the console and manually switch to 720p resolution.

The next disappointment was that the February image did not have a 2D-accelerated X11 driver (Xv). Loading GUI screens was really slow. It could easily take 10 seconds to display the Firefox GUI. Certainly, playing a movie at 1 frame every few seconds isn't in anyone's interest.

No more annoyances, though. With the latest July image, the Smart Top box is nearly at its max capacity. Besides the lack of HD decoding and a Flash player, it is as sweet as honey. Movies are arguably playable.

For the price tag of 135 Euros, this is surely a good buy. My wish is it would come with two network interfaces instead of one. That will make it an ultimate Linux box!

Wednesday, August 24, 2011

Temporary fix for Apache Killer

Update (September 07): Apache released version 2.2.20 to fix this issue.

Update (August 26): The Request-Range header needs to be blocked as well.

A few days ago KingCope published a small Perl script to launch a DoS attack against Apache HTTPD. The problem is it is too efficient for its own good. I had a good time playing with it and came up with some pointers that might help others.
  1. Make sure that your MPM settings are appropriate for your server resources. For example, you should not expect a 256MB RAM server to run 100 instances of Apache.
  2. Disable DEFLATE output filter with RemoveOutputFilter DEFLATE.
  3. Disable Partial Content with the headers_module directive RequestHeader unset Range (and, per the update above, RequestHeader unset Request-Range).
You will lose some features such as resumable downloads or gzip encoding. But this is definitely better than iptables packet inspection on every single packet, as someone suggested.

Tuesday, August 23, 2011

Detecting file types

This question pops up every now and then.

There is a utility called file on Unix platforms. This tool tells you what a particular file likely is. It works by sampling the file content and making an educated guess. For example, a ZIP file usually has the two bytes PK at the beginning, an EXE file would have MZ, a PDF file %PDF, and so on.

However logical it sounds, it is still a guess. And a guess can be wrong.

In Python, there is a built-in module, mimetypes, that works with file extensions. If the file extension isn't available, there is the module filetypes on PyPI, which works similarly to file. Worst case, you can always sample a few bytes from a file and match them against your own signature database as described earlier.
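The worst-case approach can be sketched in a few lines of Python. This is only an illustration: the SIGNATURES table holds just the three examples mentioned above, and the function names guess_type_from_bytes and guess_type are made up for this sketch.

```python
# Hypothetical signature table; a real database would hold many more entries.
SIGNATURES = [
    (b"PK", "zip"),
    (b"MZ", "exe"),
    (b"%PDF", "pdf"),
]


def guess_type_from_bytes(head):
    """Return a guessed type for the leading bytes of a file, or None."""
    for magic, name in SIGNATURES:
        if head.startswith(magic):
            return name
    return None


def guess_type(path, sample_size=8):
    """Sample the first few bytes of the file at `path` and match them."""
    with open(path, "rb") as f:
        return guess_type_from_bytes(f.read(sample_size))
```

As with file, a match only says what the content looks like; a text file that happens to start with PK would still be reported as a ZIP.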

Monday, August 22, 2011

Simple job scheduling with Gevent

There are times when we need to run a function at repeated intervals. If we are lucky to already be using Gevent, these two lines make for a neat function to do that.

import gevent

def schedule(delay, func, *args, **kw_args):
    # Run func right away, then re-arm schedule() to fire again after `delay` seconds.
    gevent.spawn_later(0, func, *args, **kw_args)
    gevent.spawn_later(delay, schedule, delay, func, *args, **kw_args)

The idea is we let Gevent schedule the schedule() function in its event loop. This is like a (flat) tail-recursion call.

If we are not using Gevent, however, the package apscheduler is an advanced job scheduler similar to Java's Quartz and worth a serious look.

Friday, August 19, 2011

Asymmetricity in Security

This is one of many topics I find very interesting.

People often say that it is easier to break than to build. That is clearly an asymmetricity. Think about this: an organization's IT group of, say, 10 persons has to constantly fight against an unknown (possibly large) number of attackers, day and night.

Additionally, one seemingly simple vulnerability could cause a collapse of the whole system and perhaps related external dependencies too. Think about the blackout in New York a few years back. That is another asymmetricity. To build, one has to be very careful to check for all possible weak links. To break, an attacker only needs to find one weak link.

However, the reverse is also true. Most of the time it is easier to fix a bug than to exploit it. For example, a cross site scripting bug is easily fixed by encoding HTML output. However, to take advantage of that bug, an attacker will have to jump through many hoops. How about a buffer overflow? Fixing a buffer overflow is seldom a difficult task, but exploiting that vulnerability requires deep knowledge, multiple stages, and various tricks to bypass additional protections such as ASLR, a non-executable stack, and so on.

What are your thoughts on this?
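To illustrate how small the fix side can be, here is a minimal Python sketch of encoding HTML output with the standard html module. The function render_comment is a made-up example, not code from any particular application.

```python
import html


def render_comment(user_text):
    """Return an HTML fragment with the user-supplied text encoded for output."""
    # html.escape turns <, >, & and quotes into entities, so the browser
    # displays the text instead of interpreting it as markup.
    return "<p>" + html.escape(user_text) + "</p>"
```

With this in place, a comment such as <script>alert(1)</script> is displayed as text rather than executed.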

Wednesday, August 17, 2011

Re: Fit or Future? Which is more important when hiring?

This article "Fit or Future? Which is more important when hiring?" appeared on Artima in June.

Fit refers to candidates whose skills better match a job description, and Future refers to those who would quickly gain such skills provided that they were supported. The question is which you would pick.

As you would expect, the article does not dictate whether Fit or Future is preferred. Obviously it is context dependent. The article does, however, list several factors that could affect your decision. Among them is Urgency.

To me, the Urgency factor alone is sufficient to determine whether to pick Fit or Future. Why? Because Future is always my default choice, unless Urgency comes in. Nevertheless, I would think only companies that did not plan for growth would have to resort to that. It is not an urgency until you make it one, right?

And here's why I pick Future. First of all, there is always a period for a new hire to adapt to a new organizational culture, a new environment and so on. Learning new skills in this period is also the norm. That is to say there usually is a buffer of time for someone who has a solid grounding to attain new skills. It is not so urgent after all. Secondly, people do not spend four to five years in university to learn "advanced" usage of Java Enterprise, or PHP. They spend time learning foundational knowledge because it often matters more. It allows people to grow, organically. Thirdly, diversity is, most of the time, a good thing, in the same vein that a diversified investment portfolio usually is less risky. People with different backgrounds, cultures, and skill sets form a better team than people coming from one mold. And finally, you don't make hiring an urgency, do you?

Do you?

Tuesday, August 16, 2011

Google+ Games

Two words: Improvements Needed. Make it three: Much.

  1. Currently, the caching mechanism makes it very difficult to receive (help, invite, gift, etc.) requests from friends. This makes real-time game play less likely to happen.
  2. As far as I know, the Notification does not show Game posts immediately. Similar to the above point, this is a downer for social gamers. They just don't know when someone does something to them.
  3. There needs to be a way to totally separate game streams from other streams, maybe a flag to say all posts from this Circle are game related. Google did the right thing to create Game Notifications for, well, games. But that does not prevent people from cross-posting to the regular streams with requests. Post-pollution, anyone?
  4. There needs to be a way to totally obliterate game(s) data. I want to delete my city!
So, have you got

Monday, August 15, 2011

Spiral evolution of technologies

Nonsense alert: This post is pretty much nonsensical.

We are observing an interesting trend of moving towards NoSQL. Misconceptions aside, I am talking about the spiral evolution from dumb data, to structured/smart data, and back to dumb data.

Before SQL came about, people used to define their own structs and dump them to a binary file. Reading back was a simple matter of seeking to the correct position and grabbing a chunk. No query was possible besides the obvious query by position.

Then came SQL and all the query goodies. This was possible because there was a so-called schema to describe the data, stating that this field was a string and that field was an integer, in an almost universal way. The schema allowed for intelligence. A SQL server is able to understand its data and make logical relational queries on it.

Nowadays, we are seemingly moving back to the previous dumb model with NoSQL. This is what I meant by spiral evolution: we are back where we once were, but at a higher level. Apparently, people realize that some data does not need to be smart. It is just chunks of bits and bytes. What is more important in dealing with it is how to store it efficiently. And thus NoSQL was born, whose main purpose (among others) is to be a very good way to persist data.

What is observable in NoSQL stores is that they tend to be schema-free. This usually translates into simple queries and primitive relations. However, it reduces the overhead on the server. The server does not have to perform extra work to make sense of the data it manages. This makes sense, right? After all, it is the application that ultimately knows about its data.

Looking into the future, we will probably see a trend back to schema-enabled data a few years or decades from now, won't we?

Friday, August 12, 2011

A queue whose elements know their positions

Recently I needed a data structure that supports FIFO access and in which every item knows its position. Think about a queue in real life. The first person in the queue knows he is next. The second one knows he will be next after one person. You can come up to anyone in the queue at random and ask "when is your turn?" He will be able to tell you "after X more persons."

Here is a simple algorithm to implement such a queue.

  1. Tag each element in the queue with an increasing number; call this slotNumber. When the queue is initialized, set slotNumber to zero. When an element is pushed to the queue, increase slotNumber by one and tag the element with it.
  2. The position of each element in the queue is therefore the difference between its slotNumber and the first element's slotNumber.
This algorithm was devised by an intern in our company.

I used a variant of this algorithm where I kept two counters instead of just one. I called them pushCounter and popCounter. Each time an element was pushed into the queue, I tagged it with a slotNumber taken from pushCounter and increased pushCounter. When an element was popped, I increased popCounter. The position of an element is then the difference between its slotNumber and popCounter.
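The two-counter variant might look like the Python sketch below. The class name PositionQueue and its methods are made up for illustration, and one detail is an assumption: the element is tagged with pushCounter before the counter is increased, so that the position equals the number of elements still ahead (zero for the element that is next).

```python
from collections import deque


class PositionQueue:
    """FIFO queue in which every element can report how many are ahead of it."""

    def __init__(self):
        self._items = deque()   # (slot_number, value) pairs, oldest first
        self._push_counter = 0  # slot number the next pushed element gets
        self._pop_counter = 0   # how many elements have been popped so far

    def push(self, value):
        """Append value and return the slot number it was tagged with."""
        slot = self._push_counter
        self._push_counter += 1
        self._items.append((slot, value))
        return slot

    def pop(self):
        """Remove and return the oldest value."""
        _slot, value = self._items.popleft()
        self._pop_counter += 1
        return value

    def position(self, slot):
        """Number of elements ahead of the element tagged with `slot`."""
        if not (self._pop_counter <= slot < self._push_counter):
            raise KeyError(slot)  # already popped, or never pushed
        return slot - self._pop_counter
```

Answering "after X more persons" is a single subtraction, and nothing has to be renumbered when the head of the queue is popped.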

Thursday, August 11, 2011

Captain America, the movie

A so-so action flick. A super-super-hero type of movie.

The setting is unconvincing. I don't mean the city, the background, and the illogical laser guns. I'm talking about the inconsistency in the setting. So, our hero has a rare, extremely rare, shield made of some ultimate metal. And apparently this shield can deflect lasers, and can be used as a hand weapon. But it looks like the shield weighs less than a book. I mean, when our hero straps the shield to his back, it flaps up and down, left and right.

The hero is too powerful, even more powerful than Superman or The Hulk. There is basically no knot to be untied in the plot.

I would recommend this movie if you want to kill time.

Wednesday, August 10, 2011

IRC Flood Control

IRC flood control is a simple and elegant method to detect message floods. It is simple, yet intelligent enough to tolerate random bursts of messages.

The main idea is to keep a timer, called, say, recvTimer which is set to 0 initially. When a new message is received, this recvTimer is adjusted by following these few steps:

1. If the current time is higher than recvTimer, set recvTimer to current time.
2. Increase recvTimer by, say, 2 seconds.
3. If recvTimer is, say, more than 10 seconds ahead of current time, flood is detected.

With the hypothetical values above, a burst of five messages is tolerated, and anyone who sends more than five messages in 10 seconds is a flooder. Simple, yet very elegant.
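The three steps can be sketched in Python as follows. The class name FloodDetector and the parameter names penalty and window are made up for this illustration; the default values are the hypothetical 2 and 10 seconds used above.

```python
import time


class FloodDetector:
    """Tracks one sender's recvTimer and reports when a flood is detected."""

    def __init__(self, penalty=2.0, window=10.0):
        self.penalty = penalty  # seconds added to the timer per message
        self.window = window    # how far ahead of "now" the timer may run
        self.recv_timer = 0.0

    def on_message(self, now=None):
        """Update the timer for one received message; return True on flood."""
        if now is None:
            now = time.time()
        # Step 1: an idle sender's timer catches up with the clock.
        if now > self.recv_timer:
            self.recv_timer = now
        # Step 2: every message pushes the timer forward.
        self.recv_timer += self.penalty
        # Step 3: too far ahead of the clock means flooding.
        return self.recv_timer - now > self.window
```

Calling on_message six times with the same timestamp returns False for the first five calls and True for the sixth, while a sender posting one message every two seconds is never flagged.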

Tuesday, August 9, 2011

Using netsh to whitelist your application in Windows Firewall

It is quite easy to use netsh to whitelist your application in Windows Firewall.

netsh firewall set allowedprogram <absolute_path> <short_name> ENABLE

<absolute_path> is the absolute path to the executable file
<short_name> is some descriptive short name so that users know what this entry in the firewall is for

That is all. Of course, this command must be run with administrator privileges.
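If the whitelisting has to happen from an installer script, the same command can be launched from Python. The sketch below only assembles the command shown above; the function names and the example path are made up for illustration, and the call itself only works on Windows from an elevated process.

```python
import subprocess


def build_whitelist_command(absolute_path, short_name):
    """Build the argument list for the netsh command shown above."""
    return ["netsh", "firewall", "set", "allowedprogram",
            absolute_path, short_name, "ENABLE"]


def whitelist(absolute_path, short_name):
    """Run the command; raises CalledProcessError if netsh reports failure."""
    subprocess.check_call(build_whitelist_command(absolute_path, short_name))
```

For example, whitelist(r"C:\Apps\demo.exe", "Demo") would add an entry named Demo for that (hypothetical) executable.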

Monday, August 8, 2011

Problem 10137: The Trip

Original problem statement

Problem A: The Trip

A number of students are members of a club that travels annually to exotic locations. Their destinations in the past have included Indianapolis, Phoenix, Nashville, Philadelphia, San Jose, and Atlanta. This spring they are planning a trip to Eindhoven.

The group agrees in advance to share expenses equally, but it is not practical to have them share every expense as it occurs. So individuals in the group pay for particular things, like meals, hotels, taxi rides, plane tickets, etc. After the trip, each student's expenses are tallied and money is exchanged so that the net cost to each is the same, to within one cent. In the past, this money exchange has been tedious and time consuming. Your job is to compute, from a list of expenses, the minimum amount of money that must change hands in order to equalize (within a cent) all the students' costs.

The Input

Standard input will contain the information for several trips. The information for each trip consists of a line containing a positive integer, n, the number of students on the trip, followed by n lines of input, each containing the amount, in dollars and cents, spent by a student. There are no more than 1000 students and no student spent more than $10,000.00. A single line containing 0 follows the information for the last trip.

The Output

For each trip, output a line stating the total amount of money, in dollars and cents, that must be exchanged to equalize the students' costs.

Sample Input

Output for Sample Input

Restated problem

Given N integers representing the expenses in cents of N persons on a trip, find the least total amount of money to exchange among them so that the final expense of anyone is within a cent of all others. That is, if there are 3 persons, A, B, and C, then the difference between A's and B's expenses is at most one cent, between A's and C's is at most one cent, and between B's and C's is also at most one cent.

Reading input

Many people use float, double, or long double to read in the amount (given in dollars). This leads to many rounding problems. Instead, we can use an integer to store the amount in cents. For example, if the input is 3.14, we can read in 3, the dot, and 14, and the amount is then 3*100 + 14.

int dollar, cent, amount;
char dot;
cin >> dollar >> dot >> cent;
amount = dollar * 100 + cent;


Of course, we need to take the average. The trip goers are then divided into two groups: those who spent more than the average, and those who spent less. Our job is to get everyone's spending as close to this average as possible.

Those who spent less must pay extra cash. This cash is used to offset those who spent more. Let's call this a virtual common fund, which a less-spender must deposit into and a more-spender can withdraw from. This fund could go below zero.

An important realization is that since we are using integer division, the average expense is rounded down, toward the less-spenders. A more-spender, therefore, needs to withdraw from the common fund only as much as it takes to bring his spending down to the average plus one cent. A less-spender, however, must pay in full to bring his spending up to exactly the average. This way, we guarantee that everyone's expense is within one cent of all others'.

If the fund is non-negative after everything is settled, the less-spenders have repaid the more-spenders so that everyone is within one cent of the average expense. If the common fund is negative, however, the less-spenders need to pay that much more to balance the fund (to bring it back to zero).

With that reasoning, the answer is simply the sum that less-spenders must pay to make their expense up to the average, plus any amount of negative fund.

In the end, the algorithm is a two-pass traversal of the expense array. The first pass is to find out the total of all expenses to calculate their average. The second pass performs fund adjustment depending on whether the spender is a less-spender or a more-spender. At the same time, the second pass also accumulates total sum that less-spenders must pay for (called top-up sum). If the fund is non-negative, the solution is this top-up sum; if it is negative, the solution is this top-up sum minus the fund (that is, plus the absolute value of the fund).
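The two passes can be sketched in Python as follows. This is an illustration of the reasoning above rather than a submission-ready solution; the names parse_cents and min_exchange are made up, and the sketch assumes every amount is written with a dot and two decimal digits, as in the problem statement.

```python
def parse_cents(text):
    """Convert an amount such as '3.14' into an integer number of cents."""
    dollars, cents = text.strip().split(".")
    return int(dollars) * 100 + int(cents)


def min_exchange(expenses):
    """Return the minimum total number of cents that must change hands."""
    # First pass: total and integer (rounded-down) average.
    average = sum(expenses) // len(expenses)

    # Second pass: adjust the virtual common fund and the top-up sum.
    fund = 0
    top_up = 0
    for spent in expenses:
        if spent < average:
            # A less-spender pays up to exactly the average.
            top_up += average - spent
            fund += average - spent
        elif spent > average:
            # A more-spender withdraws down to average plus one cent.
            fund -= spent - (average + 1)

    # A negative fund must be covered by the less-spenders as well.
    if fund < 0:
        top_up -= fund
    return top_up
```

For three students who spent 10.00, 20.00 and 30.00, the average is 2000 cents, the top-up sum is 1000, the fund ends at +1, and the answer is 1000 cents.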

Friday, August 5, 2011

Gevent pywsgi and UnicodeDecodeError

When switching from the libevent-based wsgi to the Python-based pywsgi, you might encounter some strange error like the one reported at Google Code.

The problem is that WSGI (or the underlying HTTP) does not understand Unicode. You might think that because you can read other languages just fine on the Internet, certainly HTTP must understand Unicode, right? Wrong! HTTP only transfers bytes. How to decode these bytes into characters lies entirely with the browser. There is a charset hint in the Content-Type header, but WSGI does not use that header to encode your unicode response.

And so, a WSGI response must not contain unicode objects. All unicode objects must have been encoded into plain byte strings. This applies to everything: the status code, the status message, the headers, and the body.

This condition must hold at the WSGI server level. In WSGI, we can stack/wrap several WSGI applications (so-called middleware) on top of each other. The bottommost layer, the one nearest to the WSGI server, must ensure that all strings are byte strings. For example, it could happen that Beaker wraps your application. Your application does not return any unicode string, but you might still encounter a UnicodeDecodeError. That is because Beaker may need to return some headers (to set or delete the session cookie). If any attribute (such as the path or domain) of this session cookie is a unicode string, the whole cookie header becomes a unicode string. And this violates the WSGI specification.
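A guard at the layer nearest to the server can be sketched as a small piece of middleware. The sketch below is written for Python 3, where the rules differ slightly from the Python 2 situation described above: under PEP 3333 the status and header values are native str and only the body must be bytes, so only body chunks are encoded here. The names ensure_bytes_body and demo_app are made up for illustration, and the sketch does not forward close() to the wrapped iterable.

```python
def ensure_bytes_body(app, charset="utf-8"):
    """Wrap a WSGI app so that every body chunk reaching the server is bytes."""
    def middleware(environ, start_response):
        for chunk in app(environ, start_response):
            # Encode text chunks; pass byte chunks through untouched.
            yield chunk.encode(charset) if isinstance(chunk, str) else chunk
    return middleware


def demo_app(environ, start_response):
    """Hypothetical application that mixes text and byte chunks by mistake."""
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return ["xin ch\u00e0o", b" ok"]


# The outermost wrapper is the one the WSGI server talks to.
application = ensure_bytes_body(demo_app)
```

The charset passed to the wrapper should match the one announced in the Content-Type header, since the server will not reconcile the two.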

Thursday, August 4, 2011

The Case of Socket Timeout in smtplib in Python 2.6.4

In Python 2.6.4 (and probably earlier versions, and maybe some later versions too), it is critical NOT to call connect again if you have passed the host and port into smtplib's SMTP (and, by virtue of inheritance, SMTP_SSL) constructor.

If host and port are supplied, the constructor will make a call to connect. In connect, self.sock is assigned a connected socket. Then self.getreply is called. Within getreply, if self.file is None then self.file is assigned a file object created from self.sock.

Therefore, when another call to connect is made, self.sock is reset to another (new) connected socket. However, self.file is not. The call to self.getreply still reads from the old self.file (old socket). And since the server has nothing more to send to the client (it has already sent everything in the previous connect), getreply blocks until the (old) socket is disconnected by the SMTP server due to timeout.

The fix is easy: reset self.file to None in connect.
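Without patching the standard library, the same fix can be applied in a subclass. This is a minimal sketch written for Python 3 syntax; the class name PatchedSMTP is made up, and on an interpreter whose connect already resets the attribute the override is simply redundant.

```python
import smtplib


class PatchedSMTP(smtplib.SMTP):
    """SMTP client that forgets the old file object before reconnecting."""

    def connect(self, *args, **kwargs):
        # Drop the file object tied to the previous socket so that
        # getreply() builds a fresh one from the new socket.
        self.file = None
        return super().connect(*args, **kwargs)
```

The simpler alternative is to pass the host and port either to the constructor or to connect, never to both.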

Wednesday, August 3, 2011

Free Online Introduction to Artificial Intelligence Class

Taught by Sebastian Thrun and Peter Norvig the same way it is taught at Stanford University. By "the same way" I mean there is homework as well as midterm and final exams. The instructors advise participants to spend at least ten hours a week studying if they are to pass this class. There is a certificate from Stanford for those who pass ;-).

Sign up between now and September 10.

Tuesday, August 2, 2011

MongoDB in OpenVZ

When you run MongoDB in OpenVZ, you might see a warning that OpenVZ is not supported. Some web searching will tell you that OpenVZ manages guest memory differently from the host system, which leaves mongod unable to detect when it runs out of memory.

The temporary fix until this issue is worked out is to put a ulimit command in the mongod startup script (usually /etc/init.d/mongod).

ulimit -v kbytes

where kbytes is the number of kilobytes of virtual memory that mongod can use.

Copied from