We have two approaches for overlaying a payload on a video:
A first approach to encoding information into a frame's pixels is to encode one data bit into each available pixel. Using a black/white color scheme for 0 and 1, respectively, a 640×480 SD frame can hold about 38.4 KB of data.
This represents the maximum capacity of the scheme. However, once the video has been encoded and decoded by Skype at both ends, the resulting image exhibits considerable color blending between adjacent pixels: a pixel that sits between a black and a white one may come out gray, leaving us clueless about its true value.
To mitigate this ambiguity in deciding a pixel's color, another approach is to use a group of pixels to encode each bit. For instance, if a 2×2 pixel group encodes a single bit, averaging the pixels in the group yields a more accurate estimate of the bit's value.
The reliability of bit retrieval increases with the size of the pixel group, at the cost of less data per frame.
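The group-averaging decode can be sketched as follows. This is a minimal illustration under our own assumptions (an 8-bit grayscale frame stored as a list of rows; the function name is ours), not the actual implementation:

```python
def decode_bit_from_group(frame, top, left, size=2):
    """Decode one bit from a size x size pixel group by averaging.

    frame is a list of rows of 8-bit grayscale values. A group whose
    mean is above mid-gray is read as 1 (white), otherwise as 0 (black);
    averaging tolerates the gray pixels produced by codec blending.
    """
    total = 0
    for y in range(top, top + size):
        for x in range(left, left + size):
            total += frame[y][x]
    mean = total / (size * size)
    return 1 if mean >= 128 else 0

# A 2x2 group encoded as white, but blended toward gray on two pixels,
# still averages well above the threshold and decodes correctly:
frame = [[255, 200],
         [180, 255]]
print(decode_bit_from_group(frame, 0, 0))  # -> 1
```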
A second approach to increasing encoding efficiency is to have each group of pixels encode a symbol instead of a single bit.
The base64 alphabet represents binary data through a radix-64 encoding. This means that if we are able to map its 64 symbols to distinct colors, a single frame can carry 76,800 symbols, i.e. 76.8 KB of base64 text (57.6 KB of raw binary data), with each symbol represented by a 2×2 group of pixels.
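The capacity arithmetic and one possible symbol-to-color assignment can be sketched as follows. The mapping shown (4 intensity levels per RGB channel, giving 4³ = 64 colors) is a hypothetical assumption of ours, not the mapping used in the system:

```python
import string

# Capacity of a 640x480 frame with one base64 symbol per 2x2 pixel group:
groups = (640 // 2) * (480 // 2)   # 76_800 groups, one symbol each
raw_bytes = groups * 6 // 8        # 6 bits per base64 symbol -> 57_600 bytes
print(groups, raw_bytes)           # -> 76800 57600

# Hypothetical symbol -> color assignment: spread the 64 base64 symbols
# over 64 colors built from 4 levels per RGB channel (4**3 = 64).
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
LEVELS = (0, 85, 170, 255)

def symbol_to_color(sym):
    i = ALPHABET.index(sym)        # 0..63
    return (LEVELS[(i >> 4) & 3], LEVELS[(i >> 2) & 3], LEVELS[i & 3])

print(symbol_to_color("A"))  # index 0  -> (0, 0, 0)
print(symbol_to_color("/"))  # index 63 -> (255, 255, 255)
```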
Due to the color blending introduced by encoding/decoding, the insight is to encode symbols as RGB values that are as far apart as possible, to reduce the chance of overlap in the color measurements of adjacent colored cells.
In our experiments, however, quantization and the codec's block-prediction modes leave us unable to obtain enough well-separated color ranges to identify a given symbol.
The same approach applies with only 16 distinct color cells, the difference being the wider separation between colors that we can achieve.
The same holds again with 8 distinct color cells, which widens the separation further at the cost of fewer bits per cell.
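Decoding with a reduced, well-separated palette amounts to nearest-color matching. The sketch below uses a hypothetical 8-color palette (the corners of the RGB cube, which are pairwise maximally far apart); the palette and function names are our own illustration:

```python
def nearest_symbol(pixel, palette):
    """Map a measured RGB pixel to the index of the closest palette color.

    With fewer palette entries the minimum distance between entries grows,
    so a blended or quantized measurement is less likely to cross over
    into a neighboring symbol's color range.
    """
    best, best_d = 0, float("inf")
    for i, (r, g, b) in enumerate(palette):
        d = (pixel[0] - r) ** 2 + (pixel[1] - g) ** 2 + (pixel[2] - b) ** 2
        if d < best_d:
            best, best_d = i, d
    return best

# Hypothetical 8-color palette: the corners of the RGB cube.
PALETTE_8 = [(r, g, b) for r in (0, 255) for g in (0, 255) for b in (0, 255)]

# A blended measurement of pure red still decodes to the red entry:
print(nearest_symbol((200, 40, 60), PALETTE_8))  # -> 4, i.e. (255, 0, 0)
```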