Last Call Review of draft-ietf-codec-opus-
review-ietf-codec-opus-genart-lc-davies-2012-05-18-00

Request Review of draft-ietf-codec-opus
Requested rev. no specific revision (document currently at 16)
Type Last Call Review
Team General Area Review Team (Gen-ART) (genart)
Deadline 2012-05-10
Requested 2012-05-01
Draft last updated 2012-05-18
Completed reviews Genart Last Call review of -?? by Elwyn Davies
Genart Telechat review of -?? by Elwyn Davies
Assignment Reviewer Elwyn Davies
State Completed
Review review-ietf-codec-opus-genart-lc-davies-2012-05-18
Review completed: 2012-05-18

Review
review-ietf-codec-opus-genart-lc-davies-2012-05-18

I am the assigned Gen-ART reviewer for this draft. For background on 
Gen-ART, please see the FAQ at 
<

http://wiki.tools.ietf.org/area/gen/trac/wiki/GenArtfaq>. 

Please resolve these comments along with any other Last Call comments 
you may receive. 

Document: draft-ietf-codec-opus-12.txt
Reviewer:  Elwyn Davies
Review Date:  14 May 2012 (completed)
IETF LC End Date: 10 May 2012
IESG Telechat date: (if known) -

Summary: 
Before offering some views on the document, let me say that this piece
of work seems to be a tour de force on behalf of its developers.  It is
certainly one of the (if not the) most technically complex pieces of
work that has been presented to the IETF, and is far more mathematically
complex and rigorous than our usual run of standards.  It is also
perhaps a vindication of the rough concensus and running code
philosophy. So congratulations to the authors and the companies that
have supported this work.  But back to the review....

I came to this review with a very outdated view of what goes on inside
codecs, so it was rather a shock to the system.  Taking that into
account, I think that some additional high level, up front description
of the two parts (SILK and CELT) and the range encoding system would
make it easier for future (naive) readers and potential implementers to
understand what is going on.  For example, after struggling with the
description of CELT in the decoder (s4.3) I found that going off and
reading the slides on CELT presented at IETF 79 helped considerably.
Such additions might also help future dissemination of the technique by
giving potential users a quick overview.  It would also make it easier
to justify the decision to postpone the high level (and highly
mathematical/technical) view to section 5 after the detailed description
of the decoder.

I also found that the descriptions of the SILK and CELT decoders
appeared to be written with a different perspective (see detailed
comments) doubtless by different authors.  This is not ideal and,
despite the expected further increase in page count, some additional
text in s4.3 seems desirable.

By far the most difficult to parse section is 4.3.3 talking about bit
allocation in CELT.  Despite considerable study, I didn't feel I had
really got to grips with how this worked.  An example might help.

Finally, the most contentious aspect of this document is the requirement
to treat the code as normative.  There are just over 30,000 physical
lines of code and I haven't checked them hardly at all.  Slocount
reckons this represents 7.14 man years of effort.  As with the document,
there is a discrepancy between CELT and SILK with the latter code being
more heavily commented, especially as regards routine parameters.  A
proper validation of the claim that the code implements the description
would take several weeks of time: I guess that we have to take the
document shepherd's assurances on this.  One issue is that both code and
document are pretty much saturated with what we used to call 'magic
numbers'.  Although these are explained in the document, it does not
appear that this is always the case in the code.  I would also be
happier if the code contained Doxygen (or similar) dcoumentation
statements for all routines (rather than just the API).

So overall, the work is something that ought to be standardized but the
document needs further work - not quite ready for the IESG. 

Major issues: 
Can we accept the code as normative?  If not how do we proceed?

Minor issues:
Contrast between decoder descriptions of LP part and CELT part:  The
SILK descriptions go into gory detail on the values used in lots of
tables, etc., whereas the CELT part has a very limited treatment of the
numeric values used (assuming reliance on finding the values in the
reference implementation, either explictly or implicitly).  There are
things to be said for both techniques.  I was wondering (while reading
the SILK description) if the authors have any means of automatically
generating the tables from the code in the SILK part (or vice versa) to
avoid double maintenance. On the other hand, there are pieces of the
CELT decoder description (especially in s4.3.3 where knowing numbers of
bands, etc.) where some actual numbers would help comprehension.

s2 (and more generally):  The splitting of the signal in the frequency
domain into signal (components?) below and above 8kHz in the hybrid case
presumably requires that the output of the CELT encoder is windowed so
that only one of encoders gnerates output below 8kHZ.  I think something
is needed in s2 to explain how this is managed (presumably to do with
energy bands?).  I didn't find anything in s5 about what has to be done
for the encoder when running in hybrid mode which surprised me somewhat.

s4.1: Despite having found a copy of the range coding paper and tried to
read it, I found the description of range coding opaque (there are some
more words later in s5 but this is a bit late for understanding the
decoder).  Given that this is (I think) fairly novel, some additional
less opaque description could be very helpful to people trying to
understand what is going on. In particular knowledge of how entropy
coding works is pretty much a prerequisite. 

s4.2.5, para 3:
>    When switching from 20 ms to 10 ms, the 10 ms
>    20 ms Opus frame, potentially leaving a hole that needs to be
>    concealed from even a single packet loss.
How?

s4.2.5, para 4: 
> In order to properly produce LBRR frames under all conditions, an
>    encoder might need to buffer up to 60 ms of audio and re-encode it
>    during these transitions.  However, the reference implementation opts
>    to disable LBRR frames at the transition point for simplicity.
> 
Should this be phrased in RFC 2119 requiremenmts language?  The first
part sounds like a SHOULD with the second part being the get out , but
its not entirely clear what the consequences are other then simplicity.


======================================================================
            MINOR ISSUES ABOVE submitted as part 1 of review
======================================================================
general/s11.2:  Several references [Hadamard], [Viterbi}, etc., are to
Wikipaedia pages.  Whilst these are convenient (and only illustrative)
they are not guaranteed to be very stable.  Better (i.e., more stable)
references are desirable.

s1, para 2:
>    The decoder contains a great deal of integer and fixed-point
>    arithmetic which must be performed exactly, including all rounding
>    considerations,...
Is this a MUST?  There are instances in the text which might contradict
this (e.g., para 1 of s4.2.7.5 which has (capitalized) SHOULDs).

s4.3:
As a complete newcomer to CELT, I would have appreciated a more high
level understanding of what CELT is doing at this point.  I  tried
reading s4.3 without any additional input and found it very hard going.
Eventually I gave up and went looking for some additional input.  This
presentation seems to have a useful view 


http://www.ietf.org/proceedings/79/slides/codec-2.pdf



I think that it would be extremely helpful to have a description similar
to this at this point in the document, even though there is some
material in section 5.3 which could also be forward referenced.  Still
the material in s5.3 does not start from the basic principles that CELT
is using, and since these are essentially novel, it would be very good
to give prospective implementers/users an understanding of what is going
on.  Incidentally, I found the above IETF presentation more useful than


http://www.celt-codec.org/presentations/misc/lca-celt.pdf


Note that the SILK part seems rather less opaque.  It would also be
useful to indicate numerically how many bands are involved and what the
number of MDCT bins are in the various bands. 

s4.3, last para: Can the 'out of range error' occur in the LP decoder? (if not why not?)

======================================================================

Nits/editorial comments: 

global: bytes -> octets
global: The form/terminology Q<n> (e.g., Q13, Q15, Q16) ought to be
explained.

s1: Expand CELP

s1.1.6: Need to define floor().

s2: ? Expand SILK

s2: Reference for Vorbis.

s2.1.8: Expand AAC (and MP3).

s3: References for Ogg, Matroska, maybe RTP.

s3: Fig 1, Table 2 and intervening text:  Presumably SILK only (Table 2)
etc correspond to MODES 1-3 in Figure 1. This needs to be consistent.

s3.2.1: Make it clear which of the octets is len[0]/len[1].  To be
precise it might be better to say len0/len1 are the values of the two
length octets (in whichever order you intend). The form len[0] could be
misinterpreted as a function 'length of 0'.

s3.2.5: better s/figure below/Figure 5/

s3.2.5:
> In the CBR case, the compressed length of each frame in bytes is
>    equal to the number of remaining bytes in the packet after
>    subtracting the (optional) padding, (N-2-P), divided by M. This
>    number MUST be a non-negative integer multiple of M.
'This number' is not the  compressed length of each frame that is the
subject of the first sentence, but the number of remaining octets - this
needs rewording.

s3.2.5:
> The number of header bytes (TOC byte, frame
>    count byte, padding length bytes, and frame length bytes), plus the
>    length of the first M-1 frames themselves, plus the length of the
>    padding MUST be no larger than N, the total size of the packet.
Surely this is a non sequitur? This might be better phrased as 'The
total size of a well formed packet MUST be at least...' 

s3.3: The example diagrams ought to have figure numbers.

s3.4: I am not keen on duplicating normative requirements in this way
(double maintenance issue).  It would be better to put explicit numbered
requirements in the sections above an reference the resulting numbers
here. 

s4.1:
> The decoder initializes rng to 128 and initializes val to
>    127 minus the top 7 bits of the first input octet. 
How are the 'top seven bits' to be interpreted here? e.g. as the bottom
seven bits of a 8 bit integer field? an 8 bit integer with the lowest
bit zeroed out?


s4.1.1:  This is really a global point.  This section refers to
entdec.c.  Presumably (since we haven't reached the code yet) and it is
still compressed, there is some file structure.  I don't think this has
been said above.  It would be good to provide a list of the file
components (i.e., sectional structure of the code) at the start, maybe
even  giving line number positions within the decompressed code.


s4.1.1.1:
> Then it reads the next octet of the
>    payload and combines it with the left-over bit buffered from the
>    previous octet to form the 8-bit value sym.  It takes the left-over
>    bit as the high bit (bit 7) of sym, and the top 7 bits of the octet
>    it just read as the other 7 bits of sym.
This is not well phrased.  Better
     Then it reads the next octet of the payload [packet? payload hasn't
really been used before] and combines the left  over bit from the
previous octet (see Section 4.1 for starting this process) as the high
bit (bit 7)| of 'sym' and the top 7 bits of the octet as the other 7
bits of sym, leaving the remaining bit for the next iteration.  

s4.1.5.2:
Should r_Q15 = rng >> (l-16) be r_Q15 = rng >> (lg-16)?  There doesn't
seem to be an 'l' defined.

s4.2.1: Expand LTP earlier. It would also be useful to expand LPC again.
 
s4.2.2: acronym VAD is not expanded until the beginning of s4.2.3.

s4.2.7: acronym LSF needs to be expanded on first use.

s4.2.7.1: Explain briefly why Table 7 has values for indices 0 to 15
when wi0/1 are in range 0 to 14.

s4.2.7.4, para below Table 12:
> These 6 bits are combined to form a gain index between 0 and 63.

s/gain index/gain_index/ as this variable is used subsequently.

s4.2.7.4: The use of log_gain seems slightly confusing when combined
with gain_index.  One at least is presumably log scaled.  Maybe a bit
more explanation is needed.

======================================================================
            COMMENTS ABOVE submitted as part 1 of review
======================================================================
General: Define L1 and L2 (as in L1-norm, etc).

s4.2.7.2, last para:
> In that case, if this
>    flag is zero (indicating that there should be a side channel), then
>    Packet Loss Concealment (PLC, see Section 4.4) SHOULD be invoked to
>    recover a side channel signal.
What are the consequences (or what actions need to be taken) if it is
not invoked?

s4.2.7.5, para 1:
> These represent the interleaved zeros on the
>    unit circle between 0 and pi (hence "normalized") in the standard
>    decomposition of the LPC filter into a symmetric part and an anti-
>    symmetric part (P and Q in Section 4.2.7.5.6).
'on the unit circle between 0 and pi' might be clearer as 'on the upper
half of the unit circle' or 'on the half of the unit circle in the
positive imaginary area of the complex plane'.
'standard decomposition'?  Needs a reference.

s4.2.7.5, para 1: A reference for the use of LSF in LPC would be useful.

s4.2.7.5.x: There is inconsistent use of stage 1/stage 2 vs
stage-1/stage-2.  Please be consistent.

s4.2.7.5, para 1:
>    Because of non-linear
>    effects in the decoding process, an implementation SHOULD match the
>    fixed-point arithmetic described in this section exactly.  An encoder
>    SHOULD also use the same process.
- Does this contradict the 'must' in s1, para 2?
- What are the consequences of ignoring the SHOULD?  How bad would they
get?  Might it become unstable and how would one know?

s4.2.7.5.1, para 1: s/This indexes an element in a coarse codebook,
   selects the PDFs for the second stage of the VQ/This indexes an 
   element in a coarse codebook that selects the PDFs for the second stage 
   of the VQ/

s4.2.7.5.3, last para: 
> However, nothing in
>    either the reconstruction process or the quantization process in the
>    encoder thus far guarantees that the coefficients are monotonically
>    increasing and separated well enough to ensure a stable filter.
A reference that indicates why this requirement is needed would be desirable.
(and also for s4.2.5.7.8).

s4.2.7.5.4 and Table 25: Are the values in Table 25 NDeltaMin or NDelatMin_Q15?
The equations after Table 25 use both NDeltaMin and NDeltaMin_Q15.  Is this correct?
In particular the first two equations deliver _Q15 values but use raw NDeltaMin.

s4.1.1/s4.2.7.1 and other places:  The term 'exact integer division' is
used in various places.  My understanding was that this phrase implied
that it was known that the dividend was an exact multiple of the divisor
by some out-of-band means.  This doesn't seem to be the case generally
in Opus (e.g,, where both n/5 and n%5  are needed - clearly this doesn't
anticipate n%5 == 0 every time!)  So what does 'exact integer division'
imply?  A definition may be needed. 

s4.3, last para: s/described in the figure above./described in Table 55 above./

s4.3.1: 
> The "transient" flag encoded in the bitstream has a probability of 1/8. 
This statement appears out of the blue apparently.  Some more
explanation of what the transient flag actually implies and why we
should be so sure about its PDF would help.

s4.3.2.1: Arguably a reference is needed for the z-transform.

s4.3.2.1: Avoid the equation picture splitting across page boundaries.
in the current version it is unclear what the denominator is. (use
needLines processing direcrive in xml2rfc).  Same applies to the
equation below Table 57 in s4.3.4.3.

4.3.2.1, after the equations:
> The
>    prediction is clamped internally so that fixed point implementations
>    with limited dynamic range do not suffer desynchronization.
As a person with limited skills in the srt, I have no idea what
desynchronization implies here. 

4.3.2.1, ibid:
> We
>    approximate the ideal probability distribution of the prediction
>    error using a Laplace distribution with separate parameters for each
>    frame size in intra- and inter-frame modes.
I suspect this sentence belongs before the equation described the z-transform.
Where are the values of the parameters for the inter-frame mode defined
(the intra-frame ones are in the text)?

s4.3.2.3: Paragraph on decoding band boosts:  Might be improved by using
equations rather than the wordy descriptions used at present.

(global)s4.3.2.3, para above table 56: s/iff/if and only if/

s4.3.2.3: LOG2_FRAC_TABLE is missing.

s4.3.3: It would be helpful to explain either here, or at the outset of
s4.3 overall, how the concept of energy bands and MDCT bins applies to
the CELT part of the codec, and just how many bands and bins are used.
Some of this is contained in s5.3.2, but the magic number 17 appears
later in 4.3.3 which is presumably something to do with the point in the
frequency domain that CELT takes over from LP in the hybrid mode.  It
would make the very complex section 4.3.3 rather easier to understand
with this extra information - I have to say I struggled!  On reflection,
I think an example of what bits are allocated to a band and how thay rae
subsequently used would be quite helpful - Without going to delve into
the code I am really not clear that I understand just what bits are
allocated and what they then encode and I have read the text quite a few
times now.

s4.3.3: Be consistent between 'tone to noise' and 'tone-to-noise'.

s4.3.3:
>    The band-energy normalized structure of Opus MDCT mode ensures that a
>    constant bit allocation for the shape content of a band will result
>    in a roughly constant tone to noise ratio, which provides for fairly
>    consistent perceptual performance.  The effectiveness of this
>    approach is the result of two factors: that the band energy, which is
>    understood to be perceptually important on its own, is always
>    preserved regardless of the shape precision, and because the constant
>    tone-to-noise ratio implies a constant intra-band noise to masking
>    ratio.  Intra-band masking is the strongest of the perceptual masking
>    effects.  This structure means that the ideal allocation is more
>    consistent from frame to frame than it is for other codecs without an
>    equivalent structure.
This paragraph contains a number of interesting assertions:  Is there a
reference where one could see them justified (it may be that this is the
result of original research in the Opus team).
 
s4.3.3, paragraphs after the bullet points:  The concepts of 'shape' and
'shape encoding' is introcuced here without explicit definition.  Are we
talking about the shape windowing used in FFT/MDCT here? This should be
made clear.

s4.3.3, 5th para after bullet points: s/In the reference the maximums/In
the reference implementation the maximums/

s4.3.3, 6th para after the bullet points:  A table of bands per mode and
number of MDCT bins  covered would be helpful here in order to  get a
feeling for the scale of the problem.  Also the cache_caps50 table in
the code contains the magic number 168.  Where does this come from?

s4.3.3, 6th para after the bullet points:
> set LM to the shift value for the frame size (e.g. 0 for 120, 1 for
>    240, 3 for 480),  
Where do these frame sizes get specified? And what is the total set of
frame sizes? The text says 'e.g.' (which incidentally should be 'e.g.,')
implying that this is not the complete set.

s4.3.3, 6th para after the bullet points: Need to define 'truncating
integer division' to go with 'exact integer division'.

s4.3.3, 7th para after the bullet points: 
> The band boosts are represented by a series of binary symbols which
>    are coded with very low probability.
How many, at least, and what values?  Are these range encoded? I don't
see them in the table above or with a PDF specified.

s4.3.3, 7th para after the bullet points:
>    and every time
>    a band is boosted the initial cost is reduced (down to a minimum of
>    two).
Would that be a value of two or two bits? 
  
s4.3.3: Paragraph on decoding band boosts:  Might be improved by using
equations rather than the wordy descriptions used at present.

(global)s4.3.3, para above table 56: s/iff/if and only if/

s4.3.3, 2nd para after Table 56: 
> For stereo frames, bits are reserved for intensity stereo and for
>    dual stereo.  Intensity stereo requires ilog2(end-start) bits.
The terms 'intenmsity stereo' and 'dual stereo' don't appear to have
been defined.

s4.3.3: LOG2_FRAC_TABLE is missing.

4.3.4.3, last para:
>    If the decoded vector represents more than one time block, then the
>    following process is applied separately on each time block. 
Should this sentence come before the previous paragraph?  There  isn't
really a 'following process' in this section and I don't think it menas
the process in s4.3.4.4?


s4.3.4.3, last sentence: 
> This extra rotation is applied in an interleaved manner with a stride
>    equal to round(sqrt(N/nb_blocks))
I think this needs some more explanation for the uninitiated.

s4.3.4.4:
> Multiple levels of splitting may be
>    applied up to a frame size dependent limit. 
What this limit is does not appear to be defined.

s4.3.5: The 'collapse' phenomenon is not fully defined, and it would be
useful to mention why it happens.  Also s/min/minimum/.

s4.3.6: s/Just like/Just as/

s4.3.7, last para: I think 'power complementarity' requires further
explanation or a reference.

s4.5, para 3: s/To avoid or reduces glitches during these/To avoid or
reduce glitches during these/

s4.5.1.1, para 1: s/For for SILK-only/For SILK-only/

s4.5.1.4, para 2: s/redundant frame is as-is,/redundant frame as-is,/

s5, figure 16: The Optional High-pass Filter box has two spurious '+'
symbols on the vertical sides. 

s5, last para:  A reference for the Auto Regressive Moving Average
(ARMA) filter would be useful.

s5.2.3.4.2.1, title: s/Burgs method/Burg's Method/

s5.2.3.5, para 2: Expand 'R/D performance' (or probably specify it as
abbreviation for rate-distortion in para 1).

s5.3.5, para 1 below equation: Is E an abbreviation for 'extra'?

s5.3.6: The abbreviation RD for rate-distortion is defined here (see
comment on s5.2.3.5).

s6.1: This section is perhaps a little 'triumphalist' for the reference
implementation (this may of course be justified!.  The quality metric is
a 'relative quality metric' and presumably if someone does a *really*
good job of coding, it is entirely possible that the new algorithms
might get better than 100 on the quality score (i.e., provide a better
answer than the reference implementation).

s6.2: Just wondering, but is non-standard frame size the only option
offered by Opus Custom?  If not, probably more text is needed here.  Are
there any major changes to the algorithms implied by the use of Opus
Custom?

s7.  Referencing SECGUIDE (RFC 3552) seems inappropriate since it occurs
in such a security considerations section. Just omit it.

s11.2: A number of references are to Wikipaedia pages.  While these were
useful to me in refreshing or initializing my knowledge, they are not
usually considered adequately stable for use in RFCs.  I fear you may
have to provide more stable references.

Appendix A: I checked that the code extracted as inicated and was able
to be compiled under Ubuntu 10.04 LTS.