We have two approaches for overlaying a payload on a video:
A first approach to encoding information into a frame's pixels is to encode one data bit into each available pixel. Using a black/white color scheme for 0 and 1, respectively, a 640×480 SD frame can hold about 38.4 KB of data.
This represents the maximum capacity of the scheme. However, once the video has been encoded and decoded by Skype at both ends, the resulting image exhibits considerable color blending between adjacent pixels: a pixel that sits between a black and a white one may come out gray, leaving us clueless about its true value.
To mitigate this ambiguity in deciding a pixel's color, another approach is to use a group of pixels to encode each bit. For instance, if a 2×2 pixel group encodes a single bit, averaging the pixels in the group yields a more accurate estimate of the bit's value.
The reliability of bit retrieval increases with the size of the pixel group, at the cost of less data per frame.
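The group-averaging decode can be sketched as follows. This is a minimal illustration under our own assumptions (an 8-bit grayscale frame stored as a list of rows; the function name is ours), not the actual implementation:

```python
def decode_bit_from_group(frame, top, left, size=2):
    """Decode one bit from a size x size pixel group by averaging.

    frame is a list of rows of 8-bit grayscale values. A group whose
    mean is above mid-gray is read as 1 (white), otherwise as 0 (black);
    averaging tolerates the gray pixels produced by codec blending.
    """
    total = 0
    for y in range(top, top + size):
        for x in range(left, left + size):
            total += frame[y][x]
    mean = total / (size * size)
    return 1 if mean >= 128 else 0

# A 2x2 group encoded as white, but blended toward gray on two pixels,
# still averages well above the threshold and decodes correctly:
frame = [[255, 200],
         [180, 255]]
print(decode_bit_from_group(frame, 0, 0))  # -> 1
```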
A second approach to increasing encoding efficiency is to have each group of pixels encode a symbol instead of a single bit.
The base64 alphabet represents binary data through a radix-64 encoding. This means that if we are able to map its 64 symbols to distinct colors, a single frame can carry 76,800 symbols, i.e. 76.8 KB of base64 text (57.6 KB of raw binary data), with each symbol represented by a 2×2 group of pixels.
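The capacity arithmetic and one possible symbol-to-color assignment can be sketched as follows. The mapping shown (4 intensity levels per RGB channel, giving 4³ = 64 colors) is a hypothetical assumption of ours, not the mapping used in the system:

```python
import string

# Capacity of a 640x480 frame with one base64 symbol per 2x2 pixel group:
groups = (640 // 2) * (480 // 2)   # 76_800 groups, one symbol each
raw_bytes = groups * 6 // 8        # 6 bits per base64 symbol -> 57_600 bytes
print(groups, raw_bytes)           # -> 76800 57600

# Hypothetical symbol -> color assignment: spread the 64 base64 symbols
# over 64 colors built from 4 levels per RGB channel (4**3 = 64).
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
LEVELS = (0, 85, 170, 255)

def symbol_to_color(sym):
    i = ALPHABET.index(sym)        # 0..63
    return (LEVELS[(i >> 4) & 3], LEVELS[(i >> 2) & 3], LEVELS[i & 3])

print(symbol_to_color("A"))  # index 0  -> (0, 0, 0)
print(symbol_to_color("/"))  # index 63 -> (255, 255, 255)
```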
Due to the color blending introduced by encoding/decoding, the insight is to encode symbols as RGB values that are as far apart as possible, to reduce the chance of overlap in the color measurements of adjacent colored cells.
In our experiments, however, quantization and the codec's block-prediction modes leave us unable to obtain enough well-separated color ranges to identify a given symbol.
The same approach applies with only 16 distinct color cells, the difference being the wider separation between colors that we can achieve.
The same holds again with 8 distinct color cells, which widens the separation further at the cost of fewer bits per cell.
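Decoding with a reduced, well-separated palette amounts to nearest-color matching. The sketch below uses a hypothetical 8-color palette (the corners of the RGB cube, which are pairwise maximally far apart); the palette and function names are our own illustration:

```python
def nearest_symbol(pixel, palette):
    """Map a measured RGB pixel to the index of the closest palette color.

    With fewer palette entries the minimum distance between entries grows,
    so a blended or quantized measurement is less likely to cross over
    into a neighboring symbol's color range.
    """
    best, best_d = 0, float("inf")
    for i, (r, g, b) in enumerate(palette):
        d = (pixel[0] - r) ** 2 + (pixel[1] - g) ** 2 + (pixel[2] - b) ** 2
        if d < best_d:
            best, best_d = i, d
    return best

# Hypothetical 8-color palette: the corners of the RGB cube.
PALETTE_8 = [(r, g, b) for r in (0, 255) for g in (0, 255) for b in (0, 255)]

# A blended measurement of pure red still decodes to the red entry:
print(nearest_symbol((200, 40, 60), PALETTE_8))  # -> 4, i.e. (255, 0, 0)
```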