Wednesday, September 7, 2011

Screen Video in Flash SWF

I had a chance to play with Flash two days ago. I wanted to extract a frame from a version 7 SWF file.

http://klaus.geekserver.net/libflv/screen.swf

[03c]        10 DEFINEVIDEOSTREAM defines id 0001 (79 frames, 466x311 codec 0x03)
                -=> 01 00 4f 00 d2 01 37 01 00 03
[01a]         6 PLACEOBJECT2 places id 0001 at depth 0001
                -=> 06 01 00 01 00 00
[03d]    182562 VIDEOFRAME adds information to id 0001 (frame 0) 352x288 P-frame deblock 1  quant: 14 
                -=> 01 00 00 00 13 31 d2 31 37 0f ee 78 da d5 5a 59
                -=> 93 db c6 11 9e 1e 80 dc 55 25 cf a9 24 92 ac d5
                -=> 1e bc 00 82 20 0e 72 57 eb 54 1e 52 f9 ff bf 20

Dumping the file with swfdump -d produced something like the above.

After skimming through Adobe's SWF File Format Specification version 10, I understood the two relevant tags, DEFINEVIDEOSTREAM and VIDEOFRAME.

DEFINEVIDEOSTREAM says "I am going to create a video from embedded data." The codec it declares here is 0x03, which is the identifier for the Screen Video codec.

VIDEOFRAME augments the other tag with the actual data of a frame. This is what I was after.

The specification says VIDEOFRAME consists of the tag header, two bytes that identify the stream defined with DEFINEVIDEOSTREAM, two bytes that identify the frame in that stream, and then the payload. So, in the dump above, the first two bytes (01 00) identify stream 0001, the next two (00 00) identify frame 0 in this stream, and the rest, from 13 31 d2 31 37 onward, is the payload. The payload is to be interpreted according to the codec declared in DEFINEVIDEOSTREAM.
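The split described above can be sketched in a few lines of Python. This is only an illustration: the function name split_videoframe is made up, and it assumes the tag header has already been stripped, so the input starts at the stream ID as in the swfdump listing. SWF stores these 16-bit integers in little-endian order.

```python
import struct


def split_videoframe(body: bytes):
    """Split a VIDEOFRAME tag body into (stream_id, frame_num, payload).

    Hypothetical helper: `body` is the tag content without the tag header.
    """
    # Two little-endian unsigned 16-bit integers: stream ID, frame number.
    stream_id, frame_num = struct.unpack_from("<HH", body, 0)
    return stream_id, frame_num, body[4:]


# First bytes of the VIDEOFRAME body from the dump above.
body = bytes.fromhex("01 00 00 00 13 31 d2 31 37 0f ee 78 da")
stream_id, frame_num, payload = split_videoframe(body)
print(stream_id, frame_num, payload[:5].hex(" "))
```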

I then read up on the Screen Video codec (from page 239). A SCREENVIDEOPACKET consists of 4 bits for block width, 12 bits for image width, 4 bits for block height, 12 bits for image height, and then the image blocks. Using that to dissect 13 31, we get a block width field of 1, which means the actual block width is 32 pixels, and an image width of 0x331 (817) pixels.
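To make the arithmetic concrete, here is a small Python sketch of that reading of the header. The function name is hypothetical, and the (field + 1) * 16 rule for block sizes is my reading of the spec, consistent with the "1 means 32 pixels" example above.

```python
def read_screen_header(payload: bytes):
    """Read a SCREENVIDEOPACKET header as the spec lays it out.

    Hypothetical helper: returns (block_w, image_w, block_h, image_h).
    """
    first = int.from_bytes(payload[0:2], "big")
    second = int.from_bytes(payload[2:4], "big")
    block_w = ((first >> 12) + 1) * 16   # top 4 bits, in units of 16 pixels
    image_w = first & 0x0FFF             # low 12 bits
    block_h = ((second >> 12) + 1) * 16
    image_h = second & 0x0FFF
    return block_w, image_w, block_h, image_h


# Reading the payload from its very first byte, as the documented layout suggests.
print(read_screen_header(bytes.fromhex("13 31 d2 31")))  # (32, 817, 224, 561)
```

The height comes out just as implausible as the width.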

Obviously, that value cannot be right. The movie is not that wide: DEFINEVIDEOSTREAM reports it as 466x311. Since I had followed the documented layout faithfully, there had to be something wrong with the spec.

So I checked out SWF File Format version 7, the version in which the Screen Video codec was introduced, to see if it had a better explanation. Sure enough, the two versions differ in their descriptions (I'll come back to this later), but not in anything related to the problem I was facing. The tag formats are the same in both documents.

So I reread and reread, searched and searched on Screen Video, hoping to find some light. And light I found. In the introductory paragraph, the spec says:
In a keyframe, every block is sent. In an interframe, one or more blocks will contain no data...
Ahh, keyframe and interframe. What are they? How do you determine whether a frame is a keyframe? A search for "keyframe" in the v7 spec brought me to the VIDEODATA tag. This tag belongs to the FLV format, not the SWF format. It says that the first nibble is a CodecID, the next nibble is a FrameType, and then comes the payload, which is a SCREENVIDEOPACKET if the CodecID is 3. This information is not mentioned in v10 of the spec.

Applying this interpretation to the five bytes 13 31 d2 31 37 did not quite yield a sensible result. The first nibble (1) is not a known CodecID. However, switching the order of CodecID and FrameType gave these values a reasonable meaning: FrameType 1 is a keyframe, and CodecID 3 is Screen Video. Then come a block width field of 3, an image width of 0x1d2 (466), a block height field of 3, and an image height of 0x137 (311). Much more sensible. Continuing with that interpretation, I was able to decode the whole VIDEOFRAME packet.
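Here is the same dissection as a short Python sketch. The function name and the dictionary keys are made up for illustration, and the (field + 1) * 16 block size rule is again my reading of the spec.

```python
def read_screen_frame_header(payload: bytes):
    """Parse the 5 leading bytes of a Screen Video VIDEOFRAME payload.

    Hypothetical helper: assumes one byte holding FrameType (high nibble)
    and CodecID (low nibble), followed by the SCREENVIDEOPACKET header.
    """
    width_field = int.from_bytes(payload[1:3], "big")
    height_field = int.from_bytes(payload[3:5], "big")
    return {
        "frame_type": payload[0] >> 4,              # 1 = keyframe, 2 = interframe
        "codec_id": payload[0] & 0x0F,              # 3 = Screen Video
        "block_w": ((width_field >> 12) + 1) * 16,  # field value 3 -> 64 pixels
        "image_w": width_field & 0x0FFF,
        "block_h": ((height_field >> 12) + 1) * 16,
        "image_h": height_field & 0x0FFF,
    }


header = read_screen_frame_header(bytes.fromhex("13 31 d2 31 37"))
print(header)
```

For the bytes from the dump this gives frame type 1, codec 3, 64x64 blocks, and a 466x311 image, which matches what DEFINEVIDEOSTREAM reports.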

Ultimately, I needed to produce an image out of these raw data. Here comes the discrepancy between v7 and v10. In v7, the blocks are arranged from top left to bottom right row by row. In each block, pixels are arranged from top left to bottom right row by row. In v10, the blocks are arranged from bottom left to top right row by row. In each block, pixels are arranged from bottom left to top right row by row.

Funnily, I followed a mixed approach at first: blocks arranged from bottom left to top right, but pixels within each block arranged from top left to bottom right. The reason was that I read v10 first and then, while fixing the interpretation above, switched to v7 and continued with it. So half the idea came from v10 and the other half from v7. In the end, the correct arrangement is the one depicted in v10: bottom up, left to right, similar to a BMP file.
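The v10 arrangement can be sketched roughly like this in Python. Both function names are hypothetical, pixels are treated as opaque values, and extracting each block's pixel data from the stream is left out. I am also assuming that blocks on the right and top edges are simply cut off at the image boundary.

```python
def block_rects(width, height, block_w, block_h):
    """Yield (x, top, w, h) for each block in stream order.

    Coordinates are top-down (y = 0 is the top row of the image).
    """
    for bottom in range(height, 0, -block_h):   # bottom row of blocks first
        top = max(bottom - block_h, 0)          # top-edge blocks may be shorter
        for x in range(0, width, block_w):      # left to right
            w = min(block_w, width - x)         # right-edge blocks may be narrower
            yield x, top, w, bottom - top


def paste_block(image, rect, pixels):
    """Copy one block's pixels, stored bottom row first, into `image`.

    `image` is a list of rows, top row first; `pixels` is a flat list of
    w * h pixel values in the order they appear in the block.
    """
    x, top, w, h = rect
    for i in range(h):
        # i = 0 is the block's bottom row, so it lands on the lowest image row.
        image[top + h - 1 - i][x:x + w] = pixels[i * w:(i + 1) * w]


rects = list(block_rects(466, 311, 64, 64))
print(len(rects), rects[0], rects[-1])  # 40 (0, 247, 64, 64) (448, 0, 18, 55)
```

For the 466x311 frame with 64x64 blocks this yields 8 columns by 5 rows. The first block sits at the bottom left, and the last one is the 18x55 leftover at the top right.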

Here is the first frame in that SWF file.

First frame extracted from screen.swf
If this whole post is rather too long for you, here are the takeaways:
  1. If CodecID is 3 in DEFINEVIDEOSTREAM, it is Screen Video.
  2. If it is Screen Video, the VIDEOFRAME payload as documented in the spec is wrong. In reality, it has two extra nibbles right before the SCREENVIDEOPACKET payload. The first nibble is FrameType (1 for a keyframe, 2 for an interframe), the second is CodecID (which is 3).
  3. A Screen Video frame is divided into blocks with the first one located at the bottom left of the frame, going left to right, bottom to top.
  4. In each block, pixels are arranged bottom up, left to right.
<sarcastic>So much thanks to Adobe for opening up SWF file format.</sarcastic>
