Friday, October 28, 2011

Three weeks into AI, ML and DB classes

It has been three weeks since the programs started. I find them all very useful, some more than others.

The AI class is the least approachable one, at least for me. To date, I still haven't seen how to apply the knowledge in real life. The content is pretty dry and not very practical. Most of the time, the two instructors teach mathematical concepts such as probability, Bayes' rule, and lengthy calculations instead of usable examples. I hope that a few more units into the course we'll see more concrete examples. Also, and this may be only me again, I can't find slides for review purposes! Must I watch all the videos again? On top of that, I occasionally fail to catch the words, both on screen and by ear.

On the other hand, the DB class is very understandable in general. It does throw in difficult challenges, and sometimes utterly complicated ones, to exercise the brain. These challenges are good, yet quite frustrating or even demotivating. Take the Relational Algebra quiz, for example. Things that are done so easily in SQL, such as the MAX function, are decidedly hard in RA alone. Then again, I'm not an excellent student, so this may be only me. I only wish Prof Widom would speak a little more slowly.
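To illustrate why MAX is awkward: basic relational algebra has no aggregate operator, so a common workaround is to join the relation with itself, collect every value that is smaller than some other value, and subtract those from the original. Below is a minimal Python sketch of that idea using sets as relations. The function name `ra_max` is made up for illustration, and this is not taken from the class materials.

```python
def ra_max(values):
    """Find the maximum the relational-algebra way: R minus the non-maximal values."""
    r = set(values)  # a relation is a set, so duplicates disappear
    # Self-join R with a renamed copy of R, keep pairs where left < right,
    # then project the left column: these values cannot be the maximum.
    not_max = {a for a in r for b in r if a < b}
    # Set difference leaves only the value that nothing else exceeds.
    return r - not_max


if __name__ == "__main__":
    print(ra_max([3, 1, 4, 1, 5]))
```

The result is a set rather than a single number because relational algebra always returns a relation. Compare that with the one-line `SELECT MAX(x) FROM R` in SQL.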

The ML class is in between. It is approachable and practical. I can follow Prof Ng quite easily because he speaks slowly. I also find the quizzes okay, at an appropriate difficulty for me. And, best of all, I can use Octave to try out what he teaches right away. You have to agree that this is much more hands-on.

Overall, I highly applaud this initiative of the professors to bring world-class courses to the masses.

Thursday, October 27, 2011

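Tiger cub, father tiger, tiger (Con cọp con, con cọp cha, con cọp)

A tiger bites the tiger cub that is the child of the tiger next to the father tiger of the child of the tiger whose child bites the tiger cub of the child of the tiger next to the father tiger.

Question: how many tigers are there?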

Wednesday, October 26, 2011

Notes about Python optimization

Yesterday, Dropbox posted a story detailing how they optimized a frequently used function in their code.

http://tech.dropbox.com/?p=89

The conclusions are:
  1. String concatenation is faster than string formatting.
  2. Built-in set and dict types are fast.
  3. Computation is faster in C than in Python. Minimize Python code through function inlining and implicit loops, and move as much computation into C as possible. That does not necessarily mean writing your own C extension; in fact, set and dict are already in C.
  4. Common wisdom, such as optimizing inner loops (or better, unrolling the loop itself) and using local instead of global variables to take advantage of the cache, helps, but not by much in Python.
  5. Measurements are a must.
Quite a nice read. The comments are also very insightful.
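
In the spirit of point 5, here is a small sketch of how one might measure point 1 with the standard `timeit` module. The two helper functions and the sample inputs are made up for illustration and are not the code from the Dropbox post. The numbers will differ between machines and Python versions, so treat the output as something to check rather than a given.

```python
import timeit


def by_concat(name, value):
    # Build the string with the + operator.
    return "<" + name + ">" + value + "</" + name + ">"


def by_format(name, value):
    # Build the same string with %-formatting.
    return "<%s>%s</%s>" % (name, value, name)


def measure(func, number=100000):
    # Total seconds taken by `number` calls; smaller is faster.
    return timeit.timeit(lambda: func("b", "text"), number=number)


if __name__ == "__main__":
    print("concat:", measure(by_concat))
    print("format:", measure(by_format))
```

Both functions return the same string, so any difference in the timings comes from how the string is built.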

Tuesday, October 25, 2011

Spiral evolution: The network is the computer

Decades ago, Sun Microsystems' tagline was "The network is the computer."

At that time, I believe, Sun was aiming for a holistic computing infrastructure which could be treated wholly as one single entity, a computer.

That dream, I believe, failed somewhat. Surely Sun made great architectures, good hardware, and wonderful software, but the network part was mostly not realized. Computers were not able to work with each other as one. It was difficult to program applications that could work across different network segments, let alone different cities or countries. The software stack was not conducive: almost every product had to re-invent distributed computing primitives that were fragile and too low level to build robust applications on. And most importantly, the link was slow. Before the dotcom bubble, the link was limited to ISDN, probably 128 kbps. This was the hard limit that kept Sun from fully realizing its dream, and so the dominant architecture of that time was still very much centralized, with big boxes crunching numbers by themselves.

Nowadays, we see distributed computing almost everywhere. Big data companies have big pipes connecting them to almost everywhere in the world. Their computing grids can easily span three continents. These achievements are largely thanks to better networking technologies. At the same time, better software libraries have made it easier for developers to roll out distributed applications. Among them are great infrastructure projects such as Hadoop, Cassandra, and HBase, and lower-level nuts and bolts such as ZeroMQ. Together they provide a more reliable stack that frees us from minute details and lets us focus on the business logic. Our applications can now be composed of literally hundreds of nodes, each doing a different task at once in a coordinated manner, regardless of where they are physically hosted. They can grow or shrink with a flick of the wrist. They enable "the network is the computer" thinking.

As an analogy, glue like ZeroMQ is the data bus; the nodes are the RAM, the peripheral devices, and the GPU; and the main application (the coordinator) is the CPU of this new machine. I think that makes a lot of sense.
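
To make the analogy a little more concrete, here is a toy sketch in plain Python. It does not use ZeroMQ: the standard library's `queue.Queue` stands in for the bus, threads stand in for the nodes, and the function names `worker` and `coordinator` are invented for this illustration. A real deployment would put the nodes on separate machines behind a messaging library.

```python
import queue
import threading


def worker(name, tasks, results):
    # A "node": pull work off the bus until the None stop signal arrives.
    while True:
        item = tasks.get()
        if item is None:
            break
        results.put((name, item * item))


def coordinator(numbers, node_count=3):
    # The "CPU": hand out work and collect answers over two queues (the "bus").
    numbers = list(numbers)
    tasks, results = queue.Queue(), queue.Queue()
    nodes = [
        threading.Thread(target=worker, args=("node-%d" % i, tasks, results))
        for i in range(node_count)
    ]
    for node in nodes:
        node.start()
    for x in numbers:
        tasks.put(x)
    for _ in nodes:
        tasks.put(None)  # one stop signal per node
    for node in nodes:
        node.join()
    # Every task produced exactly one result; sort because arrival order varies.
    return sorted(results.get()[1] for _ in numbers)


if __name__ == "__main__":
    print(coordinator([1, 2, 3, 4]))
```

The coordinator never needs to know which node handled which number, which is the part of the analogy I find most appealing.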

Monday, October 24, 2011

Facebook vs Disqus

I am back from a long trip! Cool.

Today I realized that Facebook has something called Comments that powers sites like TechCrunch and others. This plugin replaces the home-grown comment "widget" usually shown at the bottom of a page. It offers more than a bland traditional commenting system: analytics for site owners, and badges, points, and games for readers. It entices more people to comment. And it is a hosted service managed by Facebook.

Then Disqus immediately sprang to mind. For a few years now, the Internet start-up Disqus has been providing the exact same service to many sites. This was a niche, extremely niche, market. I did wonder back then what would happen if bigger providers such as Google, Yahoo!, or Microsoft jumped in. Would Disqus be able to fend them off? With better infrastructure? With more enticing features? More engaging methods?

And the prediction has come true. Now Facebook is in, with its strongest asset: its huge user base. Will Google (Plus or not) follow? Maybe in a year's time?

I think this is the right move for Facebook, and a big blow to Disqus. But then, David did beat Goliath, didn't he?


Wednesday, October 5, 2011

Second attempt at PyPy

Update (Oct 30): The fourth translation failed due to insufficient memory again. I wonder if the PyPy buildbot is a machine with 32 GB of RAM ;).

A few days ago I made my first attempt at running PyPy. It did not work for my needs because of bug #887. So today I gave it another try with a patch from Justin applied. To cut the story short, today was another failure. Read on for the long-winded story.

I checked out the source from BitBucket, then applied the patch provided in the issue mentioned above.

My first attempt was to use a pre-built PyPy binary to translate this source. It ran for about an hour before the process was killed. I was away when this happened, so I did not know what caused the termination. Experience told me, though, that it might be due to the out-of-memory behavior of Linux systems.

The second time I still used PyPy to do the translation. This time, I made sure to free as much memory as I could and to monitor its use. Another hour passed, and the translation process was once again killed halfway. Luckily, this time I saw the cause clearly. Indeed, the translation process used too much memory and was forcefully killed.
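
For anyone who wants to watch memory the same way, here is one possible approach. It is not what I actually did, just a sketch built on the standard `resource` module, which is available on Unix-like systems only. The helper name `peak_child_rss_kb` and the stand-in command are hypothetical; substitute the real translation command.

```python
import resource
import subprocess
import sys


def peak_child_rss_kb(command):
    # Run the command to completion (or until it is killed), then ask the OS
    # for the largest resident set size among waited-for child processes.
    # On Linux, ru_maxrss is reported in kilobytes.
    subprocess.call(command)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss


if __name__ == "__main__":
    # Stand-in command that allocates a list; replace with the real one.
    cmd = [sys.executable, "-c", "x = [0] * 1000000"]
    print("peak RSS (kB):", peak_child_rss_kb(cmd))
```

The number is only available after the child exits, so this tells you how close a run came to the limit rather than warning you in advance.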

Then I used the less memory-hungry command to translate PyPy with the pre-built PyPy. It consumed a little over 3 GB of RAM, enough to give me hope that it would succeed. However, before launching make -j 4 to compile the generated C sources, translate.py did not release (or garbage collect) all of its unused objects. This wasted 3 GB and left only 1 GB (I was on 64-bit CentOS 5.6 with 4 GB of RAM) for GCC to do its work. Apparently this wasn't enough, and the same forced-kill story repeated.

Now I am on the fourth translation run. This time I am doing it with vanilla Python 2.6 instead of the pre-built PyPy. The process is indeed much slower, but it also consumes much less memory. The translation has been running for an hour and a half and is still going. Wish me luck.

Tuesday, October 4, 2011

Cyberlympics and a chance to win big prizes

The Global CyberLympics is a not-for-profit initiative led and organized by EC-Council. Its goal is to raise awareness of the need for more education and ethics in information security. The mission statement of the Global CyberLympics is "Unifying Global Cyber Defense through the Games."

Some members of our local security group have already taken the lengthy online individual qualifying quiz and confirmed its similarity to EC-Council CEH materials. It is neither difficult nor in-depth. You should feel at home taking this quiz.

Monday, October 3, 2011

db-class.org and ml-class.org are live

A month ago there was ai-class.org. Then a few days ago, db-class.org and ml-class.org went live, too. This is good news for newbies like me.

I enrolled in all three classes (AI, DB, and ML) but have only tried out the DB class. They have some recorded videos for the first week even though the class has not started yet. My gripe is that my Internet connection is not fast enough for their videos. The videos are more than 800 pixels wide, which is probably twice the resolution such presentations need. On the plus side, the content is really good. Excellent lectures, excellent questions and quizzes, just excellent.

So, if you are interested in introductory material on Artificial Intelligence, Databases, or Machine Learning, you've gotta enroll in these free online classes by Stanford.