Video Codec Testing and Quality Measurement

Summary: Needs a YES. Has 2 DISCUSSes.

Benjamin Kaduk Discuss

Discuss (2019-06-12 for -08)
I suspect I will end up balloting Abstain on this document, given how
far it is from something I could support publishing (e.g., a
freestanding clear description of test procedures), but I do think
there are some key issues that need to be resolved before publication.
Perhaps some of them stem from a misunderstanding of the intended goal
of the document -- I am reading this document as attempting to lay out
procedures that are of general utility in evaluating a codec or codecs,
but it is possible that (e.g.) it is intended as an informal summary of
some choices made in a specific operating environment to make a
specific decision.  Additional text to set the scope of the discussion
could go a long way.

Section 2

There are a lot of assertions here without any supporting evidence or
reasoning.  Why is subjective better than objective?  What if objective
gets a lot better in the future?  What if a test should be important but
the interested people don't have the qualifications and the qualified
people are too busy doing other things?

Section 2.1

Why is p<0.5 an appropriate criterion?  Even where p-values are still
used in the scientific literature (a practice that is decreasing in
popularity), the threshold is more often 0.05, or even 0.00001 (e.g.,
for high-energy physics).
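To make the criterion concrete: under the null hypothesis that neither codec is preferred, the number of testers preferring one codec follows a binomial PMF, and the one-sided p-value is its upper tail.  A minimal sketch of that computation (my own illustration, not text from the draft):

```python
from math import comb

def binomial_p_value(successes, trials, p_null=0.5):
    """One-sided p-value: probability of observing at least
    `successes` preferences under the null hypothesis that each
    tester picks either codec with probability p_null."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null)**(trials - k)
        for k in range(successes, trials + 1)
    )

# Example: 15 of 20 testers prefer codec A.
p = binomial_p_value(15, 20)  # ~0.0207, significant at the usual 0.05
```

With p<0.5 as the acceptance threshold, even 11 of 20 preferences (p ~ 0.41) would pass, which is barely better than a coin flip; this is exactly why the choice of threshold deserves justification.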

Section 3

Normative C code contained outside of the RFC being published is hardly
an archival way to describe an algorithm.  There isn't even a git commit
hash listed to ensure that the referenced material doesn't change!

Section 3.5, 3.6, 3.7

I don't see how MSSSIM, CIEDE2000, VMAF, etc. are not normative
references.  If you want to use the indicated metric, you have to follow
the reference.

Section 4.2

There is a dearth of references here.  This document alone is far from
sufficient to perform these calculations.

Section 4.3

There is a dearth of references here as well.  What are libaom and
libvpx?  What is the overlap "BD-Rate method" and where is it specified?

Section 5.2

This mention of "[a]ll current test sets" seems to imply that this
document is part of a broader set of work.  The Introduction should make
clear what broader context this document is to be interpreted within.
(I only note this once in the Discuss portion, but noted some other
examples in the Comment section.)
Comment (2019-06-12 for -08)
Section 1

Please give the reader a background reading list to get up to speed with
the general concepts, terminology, etc.  (E.g., I happen to know what
the "luma plane" is, but that's not the case for all consumers of the
RFC series.)

Section 2.1

It seems likely that we should note that the ordering of the algorithms
in question should be randomized (presented as left vs. right,
first vs. second, etc.)
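One possible sketch of such counterbalancing (names hypothetical, not taken from the draft): give each clip pair an independently randomized left/right assignment, and shuffle the trial order as well, so positional bias cannot systematically favor one codec.

```python
import random

def randomized_trials(pairs, seed=None):
    """Return each (reference, test) clip pair with the on-screen
    position (left/right) randomized independently per trial, and
    the overall trial order shuffled."""
    rng = random.Random(seed)
    trials = []
    for ref, test in pairs:
        if rng.random() < 0.5:
            trials.append({"left": ref, "right": test})
        else:
            trials.append({"left": test, "right": ref})
    rng.shuffle(trials)  # also randomize the order across the session
    return trials
```

Passing a seed makes a session reproducible for auditing while still decorrelating position from codec identity.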

Section 2.3

   A Mean Opinion Score (MOS) viewing test is the preferred method of
   evaluating the quality.  The subjective test should be performed as
   either consecutively showing the video sequences on one screen or on
   two screens located side-by-side.  The testing procedure should

When would it be appropriate to perform the test differently?

   normally follow rules described in [BT500] and be performed with non-
   expert test subjects.  The result of the test will be (depending on

(I couldn't follow the links to [BT500] and look; is this a
restricted-distribution document?)

Section 3.4

A forward reference or other expansion for BD-Rate would be helpful.

Section 3.7

   perception of video quality [VMAF].  This metric is focused on
   quality degradation due compression and rescaling.  VMAF estimates

nit: "due to"

Section 4.1

Decibel is a logarithmic scale that requires a fixed reference value in
order for numerical values to be defined (i.e., to "cancel out the
units" before the transcendental logarithmic function is applied).  I
assume this is intended to take the reference as the full-fidelity
unprocessed original signal, but it may be worth making that explicit.

Section 4.2

Why is it necessary to mandate the trapezoid rule for the numerical
integration?  There are fairly cheap, well-known numerical methods
available with superior accuracy.
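To illustrate the point about cheap alternatives: composite Simpson's rule uses the same number of function evaluations as the composite trapezoid rule but converges as O(n^-4) rather than O(n^-2) on smooth integrands such as a log-rate curve.  A standalone sketch (pure Python, not tied to any particular BD-rate implementation):

```python
from math import log

def trapezoid(f, a, b, n):
    """Composite trapezoid rule with n subintervals."""
    h = (b - a) / n
    s = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return s * h

def simpson(f, a, b, n):
    """Composite Simpson's rule; n must be even."""
    h = (b - a) / n
    s = f(a) + f(b)
    s += 4 * sum(f(a + i * h) for i in range(1, n, 2))
    s += 2 * sum(f(a + i * h) for i in range(2, n, 2))
    return s * h / 3

# Integrate log(x) over [1, 2] -- a smooth, rate-curve-like integrand.
exact = 2 * log(2) - 1  # antiderivative of log(x) is x*log(x) - x
err_trap = abs(trapezoid(log, 1, 2, 8) - exact)
err_simp = abs(simpson(log, 1, 2, 8) - exact)
```

At identical cost (9 function evaluations each), Simpson's error here is two to three orders of magnitude smaller than the trapezoid rule's.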

Section 5.2.x

How important is it to have what is effectively a directory listing in
the final RFC?

Section 5.2.2, 5.2.3

              This test set requires compiling with high bit depth

Compiling?  Compiling what?  Again, this needs to be set in the broader
context.

Section 5.3

Please expand CQP on first usage.  I don't think the broader scope in
which the "operating modes" are defined has been made clear.

Section 5.3.4, 5.3.5

   supported.  One parameter is provided to adjust bitrate, but the
   units are arbitrary.  Example configurations follow:

Example configurations *of what*?

Section 6.2

   Normally, the encoder should always be run at the slowest, highest
   quality speed setting (cpu-used=0 in the case of AV1 and VP9).
   However, in the case of computation time, both the reference and

What is "the case of computation time"?

   changed encoder can be built with some options disabled.  For AV1, -
   disable-ext_partition and -disable-ext_partition_types can be passed
   to the configure script to substantially speed up encoding, but the
   usage of these options must be reported in the test results.

Again, this is assuming some context of command-line tools that is not
clear from the document.

Roman Danyliw Discuss

Discuss (2019-06-13 for -08)
(1) There appear to be deep and implicit dependencies in the document to the references [DAALA-GIT] and [TESTSEQUENCES].  I applaud the intent to provide tangible advice on testing and evaluation to the community with them.  I have a few questions around their use.

(1.a) Why aren’t [DAALA-GIT] and [TESTSEQUENCES] normative references as they are needed to fully understand the testing approach and provide the test data?

(1.b) What should readers of the RFC do should these external references no longer be available?  How is the change control of these references handled?  

(1.c) In the case of [DAALA-GIT] which version of the code in the repo should be used?  Formally, what version of C is in that repo?   

(1.d) Per the observation that there are implicit assumptions made by the document about familiarity with [DAALA-GIT] and [TESTSEQUENCES], here are a few places where additional clarity is required:

-- Section 4.3, Per “For individual feature changes in libaom or libvpx , the overlap BD-Rate method with quantizers 20, 32, 43, and 55 must be used”, what are libaom and libvpx and what is their role?

-- Section 5.3.  Multiple subsections in 5.3.* list what look like settings for tools (e.g., “av1: -codec=av1 -ivf -frame-parallel=0 …”). What exactly are those?  How to read them/use them?

(2) The full details of some of the testing regimes need to be more fully specified (or cited as normative):
-- Section 3.1. The variable MAX is not explained in either equation.

-- Section 3.1.  This section doesn’t explain or provide a reference to calculate PSNR.  I’m not sure how to calculate or implement it.

-- Section 4.2.  Reference needed for Bjontegaard rate difference to explain its computation

-- The references [SSIM], [MSSIM], [CIEDE2000] and [VMAF] are needed to fully explain a given testing metric so they need to be normative
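Regarding the request above for Section 3.1 to explain how PSNR is calculated: for reference, the conventional definition, with MAX made explicit as the peak sample value (255 for 8-bit video), can be sketched as follows.  This is the textbook formula, not code from [DAALA-GIT]:

```python
from math import log10

def psnr(reference, distorted, max_value=255):
    """Peak signal-to-noise ratio in dB:
    PSNR = 10 * log10(MAX^2 / MSE), where MAX is the peak sample
    value and MSE is the mean squared error between the signals."""
    if len(reference) != len(distorted):
        raise ValueError("sample arrays must have equal length")
    mse = sum((r - d) ** 2
              for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return float("inf")  # identical signals
    return 10 * log10(max_value ** 2 / mse)
```

Note that MAX depends on bit depth (1023 for 10-bit content), which is presumably why the draft's equations need to define it explicitly.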

(3) An IANA Considerations section isn’t present in the document.

(4) A Security Considerations section isn’t present in the document.
Comment (2019-06-13 for -08)
A few comments:

(5) Consider qualifying the title to more accurately capture the substance of this draft “Video Codec Testing and Quality Measurement {using the Daala Tool Suite or Xiph Tools and Data}.”

(6) Sections 3.1 and 3.3.  Cite a reference for the source code file names in question – dump_psnr.c and dump_pnsrhvs.c (which are somewhere in the [DAALA-GIT] repo?)

(7) Editorial Nits
-- Section 2.1.  Expand PMF (Probability Mass Function) on first use.

-- Section 2.1. Explain floor.

-- Section 2.2.  Typo.  s/vidoes/videos/

-- Section 2.2. Typo. s/rewatched/re-watched/

-- Section 2.3.  Typo.  s/comparisions/comparisons/

-- Section 3.1.  Expand PSNR (Peak signal to noise ratio) on first use.

-- Section 3.1.  Typo.  s/drived/derived/

Alvaro Retana No Record

Erik Kline No Record

Francesca Palombini No Record

John Scudder No Record

Lars Eggert No Record

Martin Duke No Record

Martin Vigoureux No Record

Murray Kucherawy No Record

Robert Wilton No Record

Warren Kumari No Record

Zaheduzzaman Sarker No Record

Éric Vyncke No Record

(Adam Roach; former steering group member) Yes

Yes (for -08)

(Alissa Cooper; former steering group member) (was Discuss) No Objection

No Objection (2020-02-07)
Thank you for addressing my DISCUSS.

Please respond to the Gen-ART review.

(Deborah Brungard; former steering group member) No Objection

No Objection (for -08)

(Ignas Bagdonas; former steering group member) No Objection

No Objection (for -08)

(Mirja Kühlewind; former steering group member) Abstain

Abstain (2019-06-05 for -08)
Update: This document has no security considerations section, while having this section is required.

This document reads more like a user manual for the Daala tools repository (together with the test sequences). I wonder why this is not simply archived within the repo? What’s the benefit of having this in an RFC? I’m especially worried that this document is basically useless if the repo and test sequences disappear, and are therefore no longer available in the future, or change significantly. I understand that this is referenced by AOM and therefore publication is desired; however, I don't think that makes my concern about the standalone usefulness of this document invalid. If you really want to publish in the RFC series, I would recommend reducing the dependencies on these repos and trying to make this document more useful as a standalone test description (which would probably mean removing most of section 4 and adding some additional information to other parts).

Also, the shepherd write-up seems to indicate that this document has an IPR disclosure that was filed after WG last call. Is the wg aware of this? Has this been discussed in the wg?

Other more concrete comments:
1) Quick question on 2.1: Is the tester supposed to view one image after the other or both at the same time? And if one after the other, could the order impact the results (and should it therefore maybe be chosen randomly)?

2) Sec 2.3: Would it make sense to provide a (normative) reference to MOS? Or is that supposed to be so well known that one is not even necessary?

3) Sec 3.1: maybe spell out PSNR on first occurrence. And would it make sense to provide a reference for PSNR?

4) Sec 3.2: “The weights used by the dump_pnsrhvs.c tool in
   the Daala repository have been found to be the best match to real MOS …”
Maybe document these weights in this document as well…?

5) Sec 5.3: Maybe spell out CQP at first occurrence