Tuesday, August 23, 2011

Detecting file types

This question pops up every now and then.

There is a utility called file on Unix platforms that tells you what a particular file likely is. It works by sampling the file's content and making an educated guess. For example, a ZIP file usually starts with the two bytes PK, an EXE file with MZ, a PDF file with %PDF, and so on.
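To make the idea concrete, here is a minimal sketch of signature matching in Python. It is not how file itself is implemented. The table holds only the three signatures mentioned above, and the names SIGNATURES, sniff and sniff_file are made up for illustration.

```python
# Hypothetical signature table: leading bytes -> label.
# Only the three examples from the text; a real database is far larger.
SIGNATURES = {
    b"PK": "zip",
    b"MZ": "exe",
    b"%PDF": "pdf",
}


def sniff(data):
    """Return a guessed label for the given leading bytes, or None."""
    # Try longer signatures first so a short one cannot shadow a longer match.
    for magic in sorted(SIGNATURES, key=len, reverse=True):
        if data.startswith(magic):
            return SIGNATURES[magic]
    return None


def sniff_file(path, sample_size=8):
    """Read the first few bytes of a file and guess its type."""
    with open(path, "rb") as f:
        return sniff(f.read(sample_size))
```

Calling sniff(b"%PDF-1.4") gives "pdf", while bytes that match nothing give None. Note that any file that happens to begin with PK would be reported as a ZIP, whatever it really contains.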

However logical that sounds, it is still a guess. And a guess can be wrong.

In Python, the standard library module mimetypes works with file extensions. If a file extension isn't available, the module filetypes on PyPI works similarly to file. In the worst case, you can always read a few bytes from the file and match them against your own signature database, as described earlier.
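For the extension-based route, a small sketch using mimetypes.guess_type from the standard library might look like this. The wrapper name guess_from_name is my own. The exact answers depend on the type map available on your system, so treat the results as typical rather than guaranteed.

```python
import mimetypes


def guess_from_name(filename):
    """Guess a MIME type from the file name alone; the content is never read."""
    # guess_type returns a (type, encoding) pair; type is None when the
    # extension is missing or unknown.
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type
```

With the default type map, guess_from_name("report.pdf") typically gives "application/pdf", and a name with no extension such as "README" gives None. Since only the name is inspected, a renamed file will fool it.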

