MPEG-4 – Sony SNC-CS50P User Manual


P-VOPs and Motion Compensation

“Motion compensation” is the key to predicting
movement within an image and to forming the P-VOP data
that lets MPEG-4 compress video efficiently. This section
briefly introduces the technique.
As described in “Structure of MPEG-4,” P-VOP data encodes
the difference between the previous VOP (the reference VOP)
and the current image that is input from the camera. To
predict movement, the image is first divided into “blocks”
of 16 x 16 pixels, called macroblocks. Each macroblock of
the current image is then compared with the reference VOP
to find where its content was located in that VOP. The
resulting “shift” between the two positions is
represented as a motion vector.
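The comparison described above can be sketched as a full-search block match. This is only a minimal illustration, not the camera's encoder: the function names, the sum-of-absolute-differences (SAD) cost, the search range of 4 pixels, and the synthetic test pattern are all assumptions made for the example. Frames are plain lists of rows of luminance values.

```python
def sad(ref, cur, rx, ry, cx, cy, size):
    """Sum of absolute differences between a block in ref and a block in cur."""
    total = 0
    for j in range(size):
        for i in range(size):
            total += abs(ref[ry + j][rx + i] - cur[cy + j][cx + i])
    return total


def find_motion_vector(ref, cur, bx, by, size=16, search=4):
    """Full search around (bx, by); returns ((dx, dy), cost) of the best match.

    (dx, dy) points from the current macroblock to the position in the
    reference VOP where the same content was found.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = bx + dx, by + dy
            # Skip candidates that fall outside the reference VOP.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(ref, cur, rx, ry, bx, by, size)
            if best is None or cost < best[1]:
                best = ((dx, dy), cost)
    return best


# Hypothetical test pattern: the current image is the reference VOP with its
# content moved 2 pixels right and 1 pixel down.
ref = [[(3 * x * x + 5 * y * y + 7 * x * y) % 251 for x in range(32)]
       for y in range(32)]
cur = [[ref[max(y - 1, 0)][max(x - 2, 0)] for x in range(32)]
       for y in range(32)]
print(find_motion_vector(ref, cur, 8, 8))  # prints ((-2, -1), 0)
```

The vector (-2, -1) says that the content of the macroblock at (8, 8) was located 2 pixels to the left and 1 pixel up in the reference VOP, and the cost of 0 means the match is exact.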

In MPEG-4, sub-blocks consisting of 8 x 8 pixels within
the 16 x 16 macroblocks can also be used to predict the
current VOP (Fig. 3). The more finely the “frame” is
divided, the more accurately movement can be predicted,
which can result in an even higher compression ratio.
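The effect of the smaller sub-blocks can be illustrated with a small sketch. It assumes a SAD cost, a search range of 2 pixels, and a synthetic macroblock whose left and right halves move in different directions; all names are hypothetical. Note that a real encoder must also spend bits on the extra motion vectors, so a smaller residual does not automatically mean a smaller bitstream.

```python
def block_sad(ref, cur, bx, by, dx, dy, size):
    """SAD between the current block at (bx, by) and ref shifted by (dx, dy)."""
    return sum(abs(ref[by + dy + j][bx + dx + i] - cur[by + j][bx + i])
               for j in range(size) for i in range(size))


def best_vector(ref, cur, bx, by, size, search=2):
    """Return the (dx, dy) within the search range that minimizes the SAD."""
    height, width = len(ref), len(ref[0])
    candidates = [(dx, dy)
                  for dy in range(-search, search + 1)
                  for dx in range(-search, search + 1)
                  if 0 <= bx + dx and bx + dx + size <= width
                  and 0 <= by + dy and by + dy + size <= height]
    return min(candidates,
               key=lambda v: block_sad(ref, cur, bx, by, v[0], v[1], size))


def residual_cost(ref, cur, bx, by, size, sub):
    """Total SAD left over when a size x size macroblock is predicted
    in sub x sub pieces, each with its own motion vector."""
    total = 0
    for oy in range(0, size, sub):
        for ox in range(0, size, sub):
            dx, dy = best_vector(ref, cur, bx + ox, by + oy, sub)
            total += block_sad(ref, cur, bx + ox, by + oy, dx, dy, sub)
    return total


# Hypothetical scene: columns left of x = 12 shift horizontally, the rest
# shift vertically, so no single vector fits the whole macroblock at (4, 4).
ref = [[(3 * x * x + 5 * y * y + 7 * x * y) % 251 for x in range(24)]
       for y in range(24)]
cur = [[ref[y][min(x + 1, 23)] if x < 12 else ref[min(y + 1, 23)][x]
        for x in range(24)] for y in range(24)]

whole = residual_cost(ref, cur, 4, 4, 16, 16)  # one 16 x 16 vector
split = residual_cost(ref, cur, 4, 4, 16, 8)   # four 8 x 8 vectors
print(whole > split)  # prints True
```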

MPEG-4

Before looking at the MPEG-4 compression format
adopted by these cameras, it is important to clarify the
term “MPEG-4.” MPEG-4 is a series of standards
developed by ISO/IEC MPEG (Moving Picture Experts
Group) and has many “Parts,” “Profiles,” and “Levels”
related to multimedia content. Among these “Parts,”
“Profiles,” and “Levels,” the SNC-RX550/RZ50/CS50
Series of network cameras employs MPEG-4 Part 2
(ISO/IEC 14496-2) Simple Profile Level 3, and MPEG-4 Part
10 (ISO/IEC 14496-10), which is also called H.264 and
was jointly developed with ITU-T. In the following text,
“MPEG-4” refers to MPEG-4 Part 2 Simple Profile Level 3
and “H.264” refers to MPEG-4 Part 10.

Structure of MPEG-4

Let’s take a look at the structure of MPEG-4. A video
“frame” in MPEG-4 is referred to as a Video Object Plane
(VOP). There are two types of VOPs: an I-VOP (intra-coded)
and a P-VOP (predictive). A Group of VOPs (GOV) consists of
an I-VOP followed by several P-VOPs. In these cameras, a GOV
makes up one second of video (Fig. 2).
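As a rough illustration of the GOV layout, the sketch below lists the VOP types in one GOV. The function name is hypothetical, and the rate of 30 VOPs per second is an assumption inferred from the “1 I-VOP and 29 P-VOPs” label in Fig. 2; the limit of one to five seconds comes from the note on GOV length in this document.

```python
def gov_vop_types(frames_per_second=30, gov_seconds=1):
    """Return the VOP types in one GOV: a leading I-VOP followed by P-VOPs."""
    if not 1 <= gov_seconds <= 5:
        raise ValueError("GOV length must be between 1 and 5 seconds")
    total = frames_per_second * gov_seconds
    return ["I"] + ["P"] * (total - 1)


types = gov_vop_types()
print(types.count("I"), types.count("P"))  # prints "1 29"
```

A longer GOV contains proportionally more P-VOPs per I-VOP; for example, `gov_vop_types(30, 5)` yields one I-VOP followed by 149 P-VOPs.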

An I-VOP is compressed using the intra-frame compression
technique and is similar to a single JPEG image. This initial
“frame” of a GOV is often called an “anchor.” I-VOPs are
much larger in data size than P-VOPs; however, they are
essential to the GOV structure because they can be decoded
without reference to any other VOP, and they are required
when searching image data.
P-VOP data is generated by predicting the “current image”
from the previously encoded I-VOP or P-VOP (reference frame)
and encoding only the difference. This is called
inter-frame compression. As explained in the section on
“Basics of Video Compression,” this method of prediction
takes advantage of the video property that two consecutive
“frames” are very similar. Because P-VOP data contains
information related only to the difference between two
frames (i.e., VOPs) and not the image data itself, the data
size of a P-VOP is greatly reduced compared to an I-VOP.
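The idea of storing only differences can be sketched as follows. This is a deliberately simplified, lossless illustration with hypothetical names: it stores the first frame whole and each later frame as a pixel-wise difference from the previous one. An actual MPEG-4 encoder also applies motion compensation, a transform, and quantization, none of which are shown here.

```python
def encode_gov(frames):
    """Store the first frame whole ("I") and each later frame as a
    pixel-wise difference from the previous frame ("P")."""
    encoded = [("I", [row[:] for row in frames[0]])]
    for prev, cur in zip(frames, frames[1:]):
        diff = [[c - p for p, c in zip(prow, crow)]
                for prow, crow in zip(prev, cur)]
        encoded.append(("P", diff))
    return encoded


def decode_gov(encoded):
    """Rebuild every frame by adding each difference to the previous frame."""
    frames = []
    for kind, data in encoded:
        if kind == "I":
            frames.append([row[:] for row in data])
        else:
            prev = frames[-1]
            frames.append([[p + d for p, d in zip(prow, drow)]
                           for prow, drow in zip(prev, data)])
    return frames


def nonzero_count(plane):
    """Number of values that actually carry information."""
    return sum(1 for row in plane for value in row if value != 0)


# Three 4 x 4 frames in which a single pixel changes between frames.
first = [[100] * 4 for _ in range(4)]
second = [row[:] for row in first]
second[1][2] = 120
third = [row[:] for row in second]
third[3][0] = 90

encoded = encode_gov([first, second, third])
print([(kind, nonzero_count(data)) for kind, data in encoded])
# prints [('I', 16), ('P', 1), ('P', 1)]
print(decode_gov(encoded) == [first, second, third])  # prints True
```

Because consecutive frames are nearly identical, almost every value in the difference data is zero, which is what makes P-VOP data so much easier to compress than I-VOP data.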

Fig. 2 MPEG-4 GOV Structure (1 GOV, 1 sec = 1 I-VOP and 29 P-VOPs)

Fig. 3 MPEG-4 Motion Compensation Blocks (16 x 16 pixel macroblock divided into 8 x 8 pixel sub-blocks)

The default GOV setting of the SNC-RX550/RZ50/CS50 Series of network cameras is one second. The length of a GOV can be set between one and five
seconds.

The actual prediction process utilizes a number of feedback loops and complicated algorithms, including triggers that insert a new I-VOP when there are
extreme movement patterns. This method helps to produce accurate motion vectors. Further technical details are beyond the scope of this paper.
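The source gives no details of these triggers, so the following is purely a hypothetical sketch of the general idea, not the cameras' algorithm. It assumes that “extreme movement” is detected when the mean absolute difference between consecutive frames exceeds a threshold; the function names, the threshold value, and the detection rule itself are all assumptions.

```python
def mean_abs_diff(prev, cur):
    """Average absolute pixel difference between two frames of equal size."""
    total = sum(abs(c - p)
                for prow, crow in zip(prev, cur)
                for p, c in zip(prow, crow))
    return total / (len(cur) * len(cur[0]))


def choose_vop_types(frames, gov_length, threshold):
    """Assign "I" or "P" to each frame.

    An I-VOP starts every GOV of gov_length VOPs and is also forced
    whenever a frame differs strongly from the one before it.
    """
    types = []
    vops_in_gov = 0
    for index, frame in enumerate(frames):
        forced = (index > 0
                  and mean_abs_diff(frames[index - 1], frame) > threshold)
        if index == 0 or vops_in_gov >= gov_length or forced:
            types.append("I")
            vops_in_gov = 1
        else:
            types.append("P")
            vops_in_gov += 1
    return types


def flat(value):
    """A tiny 2 x 2 frame filled with one value."""
    return [[value, value], [value, value]]


frames = [flat(0), flat(0), flat(200), flat(200), flat(200)]
print(choose_vop_types(frames, gov_length=30, threshold=50))
# prints ['I', 'P', 'I', 'P', 'P']
```

In this sketch the abrupt change at the third frame forces a new I-VOP even though the GOV has not reached its configured length.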
