Difference between mvhd box timescale and mdhd box timescale in ISOBMFF format


What is the difference between the mvhd box timescale and the mdhd box timescale in the ISOBMFF format?

I found the definitions in the official document.

The movie box (mvhd) timescale is defined as:

timescale is an integer that specifies the time-scale for the entire presentation; this is the number of 
time units that pass in one second. For example, a time coordinate system that measures time in 
sixtieths of a second has a time scale of 60

The mdhd box timescale is defined as:

timescale is an integer that specifies the number of time units that pass in one second for this media. 
For example, a time coordinate system that measures time in sixtieths of a second has a time scale 
of 60

If the Movie Box (mvhd) timescale is 1000 and the video is 24 fps,
is the mdhd timescale of the video track then 24000?
(My thinking is that the video mdhd timescale is (fps * mvhd timescale) and
the audio mdhd timescale is the sampling rate (48000 Hz, etc.).)

I am also curious why, among fragmented video files, some files have an mvhd timescale of 30
while others have a value of 90000.

The first picture shows an mdhd timescale of 30: (image not included)
The second picture shows an mdhd timescale of 90000: (image not included)

Answer by VC.One:

MVHD = global/movie timescale

For Movie time : a frequency (usually set to 1000 ticks) that represents 1 second of real-world clock.

MDHD = media-specific timescale

For Video time : This specified frequency shall represent 1 second of real-world clock.

video note:
This is connected to (and affected by) the FPS and sample duration in the STTS atom/box.
video example:
If FPS is 24 and Sample Duration is 1000 then in mdhd we set: 24000 ticks per 1 second.

We are saying that 1 sample (frame) should last 1/24 of a second in real-clock time.
24 samples == 1 sec.

//## where FPS mode is Constant (not Variable)
//## eg: MDHD_timescale is 24000 and STTS_sample_duration is 1000
FPS = ( MDHD_timescale / STTS_sample_duration )
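
As a minimal sketch of that relationship in Python: the function name `constant_fps` is hypothetical, and it assumes the divisor is applied to the track's own mdhd timescale and that every sample in STTS has the same duration.

```python
from fractions import Fraction


def constant_fps(mdhd_timescale: int, stts_sample_delta: int) -> Fraction:
    """Frames per second of a constant-frame-rate track.

    mdhd_timescale    -- ticks per second of the track (mdhd box)
    stts_sample_delta -- duration of every sample, in those ticks (stts box)
    """
    if mdhd_timescale <= 0 or stts_sample_delta <= 0:
        raise ValueError("timescale and sample delta must be positive")
    # Exact rational result, so 24000/1001 is not rounded prematurely.
    return Fraction(mdhd_timescale, stts_sample_delta)


print(constant_fps(24000, 1000))          # exact frame rate as a fraction
print(float(constant_fps(24000, 1001)))   # NTSC-style rate as a float
```

Using `Fraction` keeps rates such as 24000/1001 exact until they are printed.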

For Audio time : This specified frequency shall represent 1 second of real-world clock.

audio note:
This is usually the Rate of PCM Samples-per-second (in hertz) of the audio data.
audio example:
48 kHz is 48000 PCM samples per second, so in mdhd we set: 48000 ticks per 1 second.

In the above example, the total number of expected PCM audio samples for one second is 48000.

Now imagine, for example, that we divide those 48000 samples into 24 audio frames. How many PCM samples does each audio frame hold?
It is 2000 because: ( 2000 samples x 24 frames ) = 48000 total samples.

  • Write in STTS a sample duration of: 2000 PCM samples per audio frame
  • Write in MDHD a timescale of: 48000 ticks per second (2000 ticks x 24 frames)
  • Write in MVHD a timescale of: 1000 ticks per second

In STTS the sample duration is a count of audio samples in the frame, not a count of audio time per frame.

At 24 audio frames-per-sec, one audio frame holds a count of 2000 samples, so it has 41.666 milliseconds worth of audio time.

( 48000 samples / 24 frames ) == 2000 samples length per audio frame
( 1000 ticks per sec / 24 frames ) == 41.666 milliseconds per audio frame

So you can calculate:

( ( frame_count * STTS_sample_duration * MVHD_timescale ) / MDHD_timescale ) = total audio time in MVHD ticks (milliseconds here)
( ( 24 * 2000 * 1000 ) / 48000 ) = 1000 milliseconds of audio per 24 frames

//# same as (up to the rounding of 41.666):
( 41.666 msecs * 24 frames ) = 999.984 milliseconds
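
The conversion between media ticks and movie ticks can be sketched in Python as follows. The function names are hypothetical, and the example values (48000 mdhd ticks per second, 1000 mvhd ticks per second, 2000 samples per audio frame) are taken from the audio example above.

```python
from fractions import Fraction


def media_ticks_to_seconds(ticks: int, mdhd_timescale: int) -> Fraction:
    """Convert a duration in media (mdhd) ticks to seconds."""
    return Fraction(ticks, mdhd_timescale)


def media_ticks_to_movie_ticks(ticks: int, mdhd_timescale: int,
                               mvhd_timescale: int) -> Fraction:
    """Rescale a duration from the media timescale to the movie timescale."""
    return Fraction(ticks * mvhd_timescale, mdhd_timescale)


# One audio frame of 2000 samples, expressed in milliseconds (mvhd = 1000).
one_frame_ms = media_ticks_to_movie_ticks(2000, 48000, 1000)
# 24 such frames add up to exactly one second of movie time.
one_second_ms = media_ticks_to_movie_ticks(24 * 2000, 48000, 1000)

print(float(one_frame_ms), one_second_ms)
```

Working in exact fractions avoids the small shortfall that appears when 41.666 is rounded before multiplying.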

Inside MP4, an audio frame will typically be an AAC frame,
which holds a different number of samples per frame.
At 44100 or 48000 Hz, an AAC frame holds 1024 samples (21.333 ms of sound/PCM data at 48000 Hz, or about 23.220 ms at 44100 Hz).

How many AAC frames (audio frames) each with 1024 PCM samples are needed to play the expected 48000 PCM audio samples in 1 second?

The answer is 46.875 frames. An audio decoder reads 47 whole AAC frames, though, and the surplus 128 PCM audio samples from those 47 frames are carried over into the next second of sound.

( 48000 samples / 46.875 AAC frames ) == 1024 samples length per audio frame
( 1000 ticks per sec / 46.875 AAC frames ) == 21.333 milliseconds per audio frame
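
The AAC arithmetic above can be checked with a short Python sketch. The function name is hypothetical, and the default of 1024 samples per frame is the figure used in this answer.

```python
import math
from fractions import Fraction


def aac_frames_per_second(sample_rate: int, samples_per_frame: int = 1024):
    """Return (exact frames per second, whole frames read, surplus samples)."""
    exact = Fraction(sample_rate, samples_per_frame)
    whole = math.ceil(exact)  # a decoder can only read whole frames
    surplus = whole * samples_per_frame - sample_rate
    return exact, whole, surplus


exact, whole, surplus = aac_frames_per_second(48000)
print(float(exact), whole, surplus)
```

The surplus is the number of PCM samples that spill over into the next second once the last whole frame has been decoded.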

Regarding the side questions...

"If Movie Box timescale is 1000, and fps 24. then mdhd timescale value is 24000 of video track. Is it Correct?"

That logic only works if your video uses a Constant Frame Rate.
Your STTS must then have a single entry that gives every video frame the same sample duration of 1000; with that in place you can set MDHD_timescale to 24000 and MVHD_timescale to 1000.
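
A hedged Python sketch of that check is shown here. It assumes the STTS table has already been parsed into a list of `(sample_count, sample_delta)` pairs; the function names are hypothetical, and the test is slightly more lenient than "exactly one entry" because it also accepts several entries that share the same delta.

```python
from fractions import Fraction


def is_constant_frame_rate(stts_entries) -> bool:
    """True if every sample in the table has the same duration."""
    deltas = {delta for count, delta in stts_entries if count > 0}
    return len(deltas) == 1


def track_duration_seconds(stts_entries, mdhd_timescale: int) -> Fraction:
    """Sum all sample durations and convert from mdhd ticks to seconds."""
    total_ticks = sum(count * delta for count, delta in stts_entries)
    return Fraction(total_ticks, mdhd_timescale)


cfr_table = [(240, 1000)]             # 240 frames, 1000 ticks each
vfr_table = [(100, 1000), (1, 2000)]  # one frame lasts twice as long

print(is_constant_frame_rate(cfr_table), is_constant_frame_rate(vfr_table))
print(track_duration_seconds(cfr_table, 24000))
```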

"My thought is: video mdhd timescale is (fps * mvhd timescale) and
audio mdhd timescale is Sampling Rate (48000 Hz, etc.)"

Both audio/video timescales in MDHD are for saying how many "ticks" are needed to make 1 second of media time. In STTS you are saying how much (ratio) of the MDHD timescale this current frame represents.

In video:
The MDHD is 24000 because each video sample (frame) in STTS has a 1000 ticks duration.
STTS tells us that 24 video frames are needed to match the MDHD value.

In audio:
The MDHD timescale is 48000, and each audio frame in STTS has a duration of 1024 ticks (one tick per PCM sample).
STTS tells us that about 47 audio frames (46.875 exactly) are needed to match the MDHD value.

"I am curious about some files of mvhd timescale value is 30, some file has 90000 value in case of video fragment files."

Those are ratios that are specific to whatever other numbers are used in MDHD and STTS entries.

90,000 is a good value for getting usable integers out of the many frame rates out there:

90000 / 3750 == 24.000 
90000 / 3600 == 25.000
90000 / 3003 == 29.970 
90000 / 3000 == 30.000
90000 / 1500 == 60.000

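In the case of 23.976 FPS, you could use a timescale of 24000 with a sample duration of 1001: ( 24000 / 1001 ) == 23.976
In the case of 59.940 FPS, you could use a timescale of 60000 with a sample duration of 1001: ( 60000 / 1001 ) == 59.940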
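
To see which timescale gives a whole number of ticks per frame for a given rate, a small Python sketch can help. The function name is hypothetical, and frame rates are passed as exact fractions (for example 30000/1001 for 29.970).

```python
from fractions import Fraction


def ticks_per_frame(timescale: int, fps):
    """Sample duration in ticks, or None if it is not a whole number."""
    ticks = Fraction(timescale) / fps
    return int(ticks) if ticks.denominator == 1 else None


ntsc_30 = Fraction(30000, 1001)   # 29.970 FPS
ntsc_24 = Fraction(24000, 1001)   # 23.976 FPS

print(ticks_per_frame(90000, 24))        # whole number of ticks
print(ticks_per_frame(90000, ntsc_30))   # whole number of ticks
print(ticks_per_frame(90000, ntsc_24))   # None: 3753.75 is not an integer
print(ticks_per_frame(24000, ntsc_24))   # whole number of ticks
```

A result of `None` means the chosen timescale cannot express that frame rate with one integer sample duration, which is why 24000 or 60000 is suggested for the 1001-based rates.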