CLUE WG                                                       A. Romanow
Internet-Draft                                             Cisco Systems
Intended status: Informational                         M. Duckworth, Ed.
Expires: September 13, 2012                                      Polycom
                                                            A. Pepperell

                                                              B. Baldino
                                                           Cisco Systems
                                                          March 12, 2012

                Framework for Telepresence Multi-Streams
                    draft-ietf-clue-framework-04.txt

Abstract

   This memo offers a framework for a protocol that enables devices in a
   telepresence conference to interoperate by specifying the
   relationships between multiple media streams.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 13, 2012.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  4
   2.  Terminology  . . . . . . . . . . . . . . . . . . . . . . . . .  4
   3.  Definitions  . . . . . . . . . . . . . . . . . . . . . . . . .  4
   4.  Overview of the Framework/Model  . . . . . . . . . . . . . . .  7
   5.  Spatial Relationships  . . . . . . . . . . . . . . . . . . . .  8
   6.  Media Captures and Capture Scenes  . . . . . . . . . . . . . .  9
     6.1.  Media Captures . . . . . . . . . . . . . . . . . . . . . .  9
       6.1.1.  Media Capture Attributes . . . . . . . . . . . . . . . 10
     6.2.  Capture Scene  . . . . . . . . . . . . . . . . . . . . . . 12
       6.2.1.  Capture scene attributes . . . . . . . . . . . . . . . 14
     6.3.  Simultaneous Transmission Set Constraints  . . . . . . . . 14
   7.  Encodings  . . . . . . . . . . . . . . . . . . . . . . . . . . 15
     7.1.  Individual Encodings . . . . . . . . . . . . . . . . . . . 15
     7.2.  Encoding Group . . . . . . . . . . . . . . . . . . . . . . 16
   8.  Associating Media Captures with Encoding Groups  . . . . . . . 18
   9.  Consumer's Choice of Streams to Receive from the Provider  . . 18
     9.1.  Local preference . . . . . . . . . . . . . . . . . . . . . 19
     9.2.  Physical simultaneity restrictions . . . . . . . . . . . . 19
     9.3.  Encoding and encoding group limits . . . . . . . . . . . . 20
     9.4.  Message Flow . . . . . . . . . . . . . . . . . . . . . . . 20
   10. Extensibility  . . . . . . . . . . . . . . . . . . . . . . . . 21
   11. Examples - Using the Framework . . . . . . . . . . . . . . . . 21
     11.1. Three screen endpoint media provider . . . . . . . . . . . 22
     11.2. Encoding Group Example . . . . . . . . . . . . . . . . . . 28
     11.3. The MCU Case . . . . . . . . . . . . . . . . . . . . . . . 29
     11.4. Media Consumer Behavior  . . . . . . . . . . . . . . . . . 29
       11.4.1. One screen consumer  . . . . . . . . . . . . . . . . . 30
       11.4.2. Two screen consumer configuring the example  . . . . . 30
       11.4.3. Three screen consumer configuring the example  . . . . 31
   12. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 31
   13. IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 31
   14. Security Considerations  . . . . . . . . . . . . . . . . . . . 31
   15. Changes Since Last Version . . . . . . . . . . . . . . . . . . 31
   16. Informative References . . . . . . . . . . . . . . . . . . . . 32
   Appendix A.  Open Issues . . . . . . . . . . . . . . . . . . . . . 33
     A.1.  Video layout arrangements and centralized composition  . . 33
     A.2.  Source is selectable . . . . . . . . . . . . . . . . . . . 33
     A.3.  Media Source Selection . . . . . . . . . . . . . . . . . . 33
     A.4.  Endpoint requesting many streams from MCU  . . . . . . . . 33
     A.5.  VAD (voice activity detection) tagging of audio streams  . 34
     A.6.  Private Information  . . . . . . . . . . . . . . . . . . . 34
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 34

1.  Introduction

   Current telepresence systems, though based on open standards such as
   RTP [RFC3550] and SIP [RFC3261], cannot easily interoperate with each
   other.  A major factor limiting the interoperability of telepresence
   systems is the lack of a standardized way to describe and negotiate
   the use of the multiple streams of audio and video comprising the
   media flows.  This draft provides a framework for a protocol to
   enable interoperability by handling multiple streams in a
   standardized way.  It is intended to support the use cases described
   in draft-ietf-clue-telepresence-use-cases-02 and to meet the
   requirements in draft-ietf-clue-telepresence-requirements-01.

   The solution described here is strongly focused on what is being done
   today, rather than on a vision of future conferencing.  At the same
   time, the highest priority has been given to creating an extensible
   framework to make it easy to accommodate future conferencing
   functionality as it evolves.

   The purpose of this effort is to make it possible to handle multiple
   streams of media in such a way that a satisfactory user experience is
   possible even when participants are using different vendor equipment,
   and also when they are using devices with different types of
   communication capabilities.  Information about the relationship of
   media streams at the provider's end must be communicated so that
   streams can be chosen and audio/video rendering can be done in the
   best possible manner.

   There is no attempt here to dictate to the renderer what it should
   do.  What the renderer does is up to the renderer.

   After the following Definitions, a short section introduces key
   concepts.  The body of the text comprises several sections about the
   key elements of the framework, how a consumer chooses streams to
   receive, and some examples.  The appendix describes topics that are
   under discussion for addition to the document.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

3.  Definitions

   The definitions marked with an "*" are new; all the others are from
   draft-wenger-clue-definitions-00-01.txt.

   *Audio Capture: Media Capture for audio.  Denoted as ACn.

   Camera-Left and Right: For media captures, camera-left and camera-
   right are from the point of view of a person observing the rendered
   media.  They are the opposite of stage-left and stage-right.

   Capture Device: A device that converts audio and video input into an
   electrical signal, in most cases to be fed into a media encoder.
   Cameras and microphones are examples of capture devices.

   *Capture Scene: a structure representing the scene that is captured
   by a collection of capture devices.  A capture scene includes
   attributes and one or more capture scene entries, with each entry
   including one or more media captures.

   *Capture Scene Entry: a list of media captures of the same media type
   that together form one way to represent the capture scene.

   Conference: used as defined in [RFC4353], A Framework for
   Conferencing within the Session Initiation Protocol (SIP).

   *Individual Encoding: A variable with a set of attributes that
   describes the maximum values of a single audio or video capture
   encoding.  The attributes include maximum bandwidth and, for video,
   maximum macroblocks per second (for H.264), maximum width, maximum
   height, and maximum frame rate.

   *Encoding Group: A set of encoding parameters representing a media
   provider's encoding capabilities.  Media stream providers formed of
   multiple physical units, in each of which resides some encoding
   capability, would typically advertise themselves to the remote media
   stream consumer using multiple encoding groups.  Within each encoding
   group, multiple potential encodings are possible, with the sum of the
   chosen encodings' characteristics constrained to being less than or
   equal to the group-wide constraints.

   Endpoint: The logical point of final termination through receiving,
   decoding and rendering, and/or initiation through capturing,
   encoding, and sending of media streams.  An endpoint consists of one
   or more physical devices which source and sink media streams, and
   exactly one [RFC4353] Participant (which, in turn, includes exactly
   one SIP User Agent).  In contrast to an endpoint, an MCU may also
   send and receive media streams, but it is neither the initiator nor
   the final terminator in the sense that Media is Captured or Rendered.
   Endpoints can be anything from multiscreen/multicamera rooms to
   handheld devices.

   Front: the portion of the room closest to the cameras.  In going
   towards the back you move away from the cameras.

   MCU: Multipoint Control Unit (MCU) - a device that connects two or
   more endpoints together into one single multimedia conference
   [RFC5117].  An MCU includes an [RFC4353] Mixer.  [Edt. RFC4353 is
   tardy in requiring that media from the mixer be sent to EACH
   participant.  I think we have practical use cases where this is not
   the case.  But the bug (if it is one) is in 4353 and not herein.]

   Media: Any data that, after suitable encoding, can be conveyed over
   RTP, including audio, video or timed text.

   *Media Capture: a source of Media, such as from one or more Capture
   Devices.  A Media Capture (MC) may be the source of one or more Media
   streams.  A Media Capture may also be constructed from other Media
   streams.  A middle box can express Media Captures that it constructs
   from Media streams it receives.

   *Media Consumer: an Endpoint or middle box that receives media
   streams.

   *Media Provider: an Endpoint or middle box that sends Media streams.

   Model: a set of assumptions a telepresence system of a given vendor
   adheres to and expects the remote telepresence system(s) also to
   adhere to.

   *Plane of Interest: The spatial plane containing the most relevant
   subject matter.

   Render: the process of generating a representation from a media, such
   as displayed motion video or sound emitted from loudspeakers.

   *Simultaneous Transmission Set: a set of media captures that can be
   transmitted simultaneously from a Media Provider.

   Spatial Relation: The arrangement in space of two objects, in
   contrast to relation in time or other relationships.  See also
   Camera-Left and Right.

   Stage-Left and Right: For media captures, stage-left and stage-right
   are the opposite of camera-left and camera-right.  For the case of a
   person facing (and captured by) a camera, stage-left and stage-right
   are from the point of view of that person.

   *Stream: RTP stream as in [RFC3550].

   Stream Characteristics: the media stream attributes commonly used in
   non-CLUE SIP/SDP environments (such as: media codec, bit rate,
   resolution, profile/level etc.) as well as CLUE specific attributes,
   such as the ID of a capture or a spatial location.

   Telepresence: an environment that gives non co-located users or user
   groups a feeling of (co-located) presence - the feeling that a Local
   user is in the same room with other Local users and the Remote
   parties.  The inclusion of Remote parties is achieved through
   multimedia communication including at least audio and video signals
   of high fidelity.

   *Video Capture: Media Capture for video.  Denoted as VCn.

   Video composite: A single image that is formed from combining visual
   elements from separate sources.

4.  Overview of the Framework/Model

   The CLUE framework specifies how multiple media streams are to be
   handled in a telepresence conference.

   The main goals include:

   o  Interoperability

   o  Extensibility

   o  Flexibility

   Interoperability is achieved by the media provider describing the
   relationships between media streams in constructs that are understood
   by the consumer, who can then render the media.  Extensibility is
   achieved through abstractions and the generality of the model, making
   it easy to add new parameters.  Flexibility is achieved largely by
   having the consumer choose what content and format it wants to
   receive from what the provider is capable of sending.

   A transmitting endpoint or MCU describes specific aspects of the
   content of the media and the formatting of the media streams it can
   send (advertisement); and the receiving end responds to the provider
   by specifying which content and media streams it wants to receive
   (configuration).  The provider then transmits the requested content
   in the specified streams.

   This advertisement and configuration occurs at call initiation but
   may also happen at any time throughout the conference, whenever there
   is a change in what the consumer wants or the provider can send.

   An endpoint or MCU typically acts as both provider and consumer at
   the same time, sending advertisements and sending configurations in
   response to receiving advertisements.  (It is possible to be just one
   or the other.)

   The data model is based around two main concepts: a capture and an
   encoding.  A media capture (MC), such as audio or video, describes
   the content a provider can send.  Media captures are described in
   terms of CLUE-defined attributes, such as spatial relationships and
   purpose of the capture.  Providers tell consumers which media
   captures they can provide, described in terms of the media capture
   attributes.

   A provider organizes its media captures that represent the same scene
   into capture scenes.  A consumer chooses which media captures it
   wants to receive according to the capture scenes sent by the
   provider.

   In addition, the provider sends the consumer a description of the
   streams it can send in terms of the media attributes of the stream,
   in particular, well-known audio and video parameters such as
   bandwidth, frame rate, and macroblocks per second.

   The provider also specifies constraints on its ability to provide
   media, and the consumer must take these into account in choosing the
   content and streams it wants.  Some constraints are due to the
   physical limitations of devices - for example, a camera may not be
   able to provide zoom and non-zoom views simultaneously.  Other
   constraints are system based constraints, such as maximum bandwidth
   and maximum macroblocks/second.

   The following sections discuss these constructs and processes in
   detail, followed by use cases showing how the framework specification
   can be used.

5.  Spatial Relationships

   In order for a consumer to perform a proper rendering, it is often
   necessary to provide spatial information about the streams it is
   receiving.  CLUE defines a coordinate system that allows media
   providers to describe the spatial relationships of their media
   captures to enable proper scaling and spatial rendering of their
   streams.  The coordinate system is based on a few principles:

   o  Simple systems which do not have multiple Media Captures to
      associate spatially need not use the coordinate model.

   o  Coordinates can either be in real, physical units (millimeters),
      have an unknown scale or have no physical scale.  Systems which
      know their physical dimensions should always provide those real-
      world measurements.  Systems which don't know specific physical
      dimensions but still know relative distances should use 'unknown
      scale'.  'No scale' is intended to be used where Media Captures
      from different devices (with potentially different scales) will be
      forwarded alongside one another (e.g. in the case of a middle
      box).

      *  "millimeters" means the scale is in millimeters

      *  "Unknown" means the scale is not necessarily millimeters, but
         the scale is the same for every capture in the capture scene.

      *  "No Scale" means the scale could be different for each capture
         - an MCU provider that advertises two adjacent captures and
         picks sources (which can change quickly) from different
         endpoints might use this value; the scale could be different
         and changing for each capture.  But the areas of capture still
         represent a spatial relation between captures.

   o  The coordinate system is Cartesian X, Y, Z with the origin at a
      spot of the provider's choosing.  The provider must use the same
      origin for all coordinates within the same capture scene.

   The direction of increasing coordinate values is:
   X increases from camera left to camera right
   Y increases from front to back
   Z increases from low to high
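
   As a small illustration (not part of the framework itself), the
   Python sketch below orders video captures from camera-left to
   camera-right using this convention.  The capture names and X values
   (in millimeters) are hypothetical examples.

      # Illustrative only; capture names and X coordinates are
      # hypothetical.  X increases from camera left to camera right,
      # so sorting on X gives the left-to-right rendering order of
      # captures within one capture scene.
      capture_x = {"VC0": -1678, "VC1": 0, "VC2": 1678}

      left_to_right = sorted(capture_x, key=capture_x.get)
      print(left_to_right)   # ['VC0', 'VC1', 'VC2']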

6.  Media Captures and Capture Scenes

   This section describes how media providers can describe the content
   of media to consumers.

6.1.  Media Captures

   Media captures are the fundamental representations of streams that a
   device can transmit.  What a Media Capture actually represents is
   flexible:

   o  It can represent the immediate output of a physical source (e.g.
      camera, microphone) or 'synthetic' source (e.g. laptop computer,
      DVD player).

   o  It can represent the output of an audio mixer or video composer

   o  It can represent a concept such as 'the loudest speaker'

   o  It can represent a conceptual position such as 'the leftmost
      stream'

   To distinguish between multiple instances, video and audio captures
   are numbered such as: VC1, VC2 and AC1, AC2.  VC1 and VC2 refer to
   two different video captures and AC1 and AC2 refer to two different
   audio captures.

   Each Media Capture can be associated with attributes to describe what
   it represents.

6.1.1.  Media Capture Attributes

   Media Capture Attributes describe static information about the
   captures.  A provider uses the media capture attributes to describe
   the media captures to the consumer.  The consumer will select the
   captures it wants to receive.  Attributes are defined by a variable
   and its value.  The currently defined attributes and their values
   are:

   Content: {slides, speaker, sl, main, alt}

   A field with enumerated values which describes the role of the media
   capture and can be applied to any media type.  The enumerated values
   are defined by [RFC4796].  The values for this attribute are the same
   as the mediacnt values for the content attribute in [RFC4796].  This
   attribute can have multiple values, for example content={main,
   speaker}.

   Composed: {true, false}

   A field with a Boolean value which indicates whether or not the Media
   Capture is a mix (audio) or composition (video) of streams.

   This attribute is not intended to describe the layout used when
   compositing video streams.

   Audio Channel Format: {mono, stereo} A field with enumerated values
   which describes the method of encoding used for audio.

   A value of 'mono' means the Audio Capture has one channel.

   A value of 'stereo' means the Audio Capture has two audio channels,
   left and right.

   This attribute applies only to Audio Captures.  A single stereo
   capture is different from two mono captures that have a left-right
   spatial relationship.  A stereo capture maps to a single RTP stream,
   while each mono audio capture maps to a separate RTP stream.

   Switched: {true, false}

   A field with a Boolean value which indicates whether or not the Media
   Capture represents the (dynamic) most appropriate subset of a
   'whole'.  What is 'most appropriate' is up to the provider and could
   be the active speaker, a lecturer or a VIP.

   Point of Capture: {(X, Y, Z)} A field with a single Cartesian (X, Y,
   Z) point value which describes the spatial location, virtual or
   physical, of the capturing device (such as a camera).

   When the Point of Capture attribute is specified, it must include X,
   Y and Z coordinates.  If the point of capture is not specified, it
   means the consumer should not assume anything about the spatial
   location of the capturing device.  Even if the provider specifies an
   area of capture attribute, it does not need to specify the point of
   capture.

   Area of Capture:

   {bottom left(X1, Y1, Z1), bottom right(X2, Y2, Z2), top left(X3, Y3,
   Z3), top right(X4, Y4, Z4)}

   A field with a set of four (X, Y, Z) points as a value which describes
   the spatial location of what is being "captured".  By comparing the
   Area of Capture for different Media Captures within the same capture
   scene a consumer can determine the spatial relationships between them
   and render them correctly.

   The four points should be co-planar.  The four points form a
   quadrilateral, not necessarily a rectangle.

   The quadrilateral described by the four (X, Y, Z) points defines the
   plane of interest for the particular media capture.

   If the area of capture attribute is specified, it must include X, Y
   and Z coordinates for all four points.  If the area of capture is not
   specified, it means the media capture is not spatially related to any
   other media capture (but this can change in a subsequent provider
   advertisement).

   For a switched capture that switches between different sections
   within a larger area, the area of capture should use coordinates for
   the larger potential area.

   EncodingGroup: {<encodeGroupID value>}

   A field with a value equal to the encodeGroupID of the encoding group
   associated with the media capture.
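
   To make the attribute list above concrete, the following Python
   sketch shows one possible in-memory representation of a media
   capture and its attributes.  The class and field names are purely
   illustrative; this document does not define a concrete syntax.

      # Illustrative representation of a media capture; not a normative
      # data format.  Field names mirror the attributes defined above.
      from dataclasses import dataclass, field
      from typing import List, Optional, Tuple

      Point = Tuple[float, float, float]               # (X, Y, Z)

      @dataclass
      class MediaCapture:
          capture_id: str                              # e.g. "VC1" or "AC0"
          media_type: str                              # "video" or "audio"
          encoding_group: str                          # encodeGroupID value
          content: List[str] = field(default_factory=lambda: ["main"])
          composed: bool = False
          switched: bool = False
          audio_channel_format: Optional[str] = None   # "mono" or "stereo"
          point_of_capture: Optional[Point] = None
          area_of_capture: Optional[List[Point]] = None  # 4 corner points

      vc1 = MediaCapture("VC1", "video", "EG1", point_of_capture=(0, 0, 800))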

6.2.  Capture Scene

   In order for a provider's individual media captures to be used
   effectively by a consumer, the provider organizes the media captures
   into capture scenes, with the structure and contents of these capture
   scenes being sent from the provider to the consumer.

   A capture scene is a structure representing the scene that is
   captured by a collection of capture devices.  A capture scene
   includes one or more capture scene entries, with each entry including
   one or more media captures.  A capture scene represents, for example,
   the video image of a group of people seated next to each other, along
   with the sound of their voices, which could be represented by some
   number of VCs and ACs in the capture scene entries.  A middle box may
   also express capture scenes that it constructs from media streams it
   receives.

   A provider may advertise multiple capture scenes or just a single
   capture scene.  A media provider might typically use one capture
   scene for main participant media and another capture scene for a
   computer generated presentation.  A capture scene may include more
   than one type of media.  For example, a capture scene can include
   several capture scene entries for video captures, and several capture
   scene entries for audio captures.

   A provider can express spatial relationships between media captures
   that are included in the same capture scene.  But there is no spatial
   relationship between media captures that are in different capture
   scenes.

   A media provider arranges media captures in a capture scene to help
   the media consumer choose which captures it wants.  The capture scene
   entries in a capture scene are different alternatives the provider is
   suggesting for representing the capture scene.  The media consumer
   can choose to receive all media captures from one capture scene entry
   for each media type (e.g. audio and video), or it can pick and choose
   media captures regardless of how the provider arranges them in
   capture scene entries.

   Media captures within the same capture scene entry must be of the
   same media type - it is not possible to mix audio and video captures
   in the same capture scene entry, for instance.  The provider must be
   capable of encoding and sending all media captures in a single entry
   simultaneously.  A consumer may decide to receive all the media
   captures in a single capture scene entry, but a consumer could also
   decide to receive just a subset of those captures.  A consumer can
   also decide to receive media captures from different capture scene
   entries.

   When a provider advertises a capture scene with multiple entries, it
   is essentially signaling that there are multiple representations of
   the same scene available.  In some cases, these multiple
   representations would typically be used simultaneously (for instance
   a "video entry" and an "audio entry").  In some cases the entries
   would conceptually be alternatives (for instance an entry consisting
   of 3 video captures versus an entry consisting of just a single video
   capture).  In this latter example, the provider would in the simple
   case end up providing to the consumer the entry containing the number
   of video captures that most closely matched the media consumer's
   number of display devices.

   The following is an example of 4 potential capture scene entries for
   an endpoint-style media provider:

   1.  (VC0, VC1, VC2) - left, center and right camera video captures

   2.  (VC3) - video capture associated with loudest room segment

   3.  (VC4) - video capture zoomed out view of all people in the room

   4.  (AC0) - main audio

   The first entry in this capture scene example is a list of video
   captures with a spatial relationship to each other.  Determination of
   the order of these captures (VC0, VC1 and VC2) for rendering purposes
   is accomplished through use of their Area of Capture attributes.  The
   second entry (VC3) and the third entry (VC4) are additional
   alternatives of how to capture the same room in different ways.  The
   inclusion of the audio capture in the same capture scene indicates
   that AC0 is associated with those video captures, meaning it comes
   from the same scene.  The audio should be rendered in conjunction
   with any rendered video captures from the same capture scene.
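
   The same example can be written out as a simple data structure.  The
   Python sketch below is purely illustrative and simply restates the
   four entries listed above.

      # Illustrative only: the example capture scene, with one list per
      # capture scene entry.  All captures within an entry are of the
      # same media type and can be sent simultaneously by the provider.
      capture_scene_1 = [
          ["VC0", "VC1", "VC2"],  # left, center and right camera captures
          ["VC3"],                # video of the loudest room segment
          ["VC4"],                # zoomed out view of all people in the room
          ["AC0"],                # main audio
      ]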

6.2.1.  Capture scene attributes

   Attributes can be applied to capture scenes as well as to individual
   media captures.  Attributes specified at this level apply to all
   constituent media captures.

   Area of Scene attribute

   The area of scene attribute for a capture scene has the same format
   as the area of capture attribute for a media capture.  The area of
   scene is for the entire scene, which is captured by the one or more
   media captures in the capture scene entries.  If the provider does
   not specify the area of scene, but does specify areas of capture,
   then the consumer may assume the area of scene is greater than or
   equal to the outer extents of the individual areas of capture.

   Scale attribute

   An optional attribute indicating if the numbers used for area of
   scene, area of capture and point of capture are in terms of
   millimeters, unknown scale factor, or not any scale, as described in
   Section 5.  If any media captures have an area of capture attribute
   or point of capture attribute, then this scale attribute must also be
   defined.  The possible values for this attribute are:

      "millimeters"
      "unknown"
      "no scale"

6.3.  Simultaneous Transmission Set Constraints

   The provider may have constraints or limitations on its ability to
   send media captures.  One type is caused by the physical limitations
   of capture mechanisms; these constraints are represented by a
   simultaneous transmission set.  The second type of limitation
   reflects the encoding resources available - bandwidth and
   macroblocks/second.  This type of constraint is captured by encoding
   groups, discussed below.

   An endpoint or MCU can send multiple captures simultaneously;
   however, sometimes there are constraints that limit which captures
   can be sent
   simultaneously with other captures.  A device may not be able to be
   used in different ways at the same time.  Provider advertisements are
   made so that the consumer will choose one of several possible
   mutually exclusive usages of the device.  This type of constraint is
   expressed in a Simultaneous Transmission Set, which lists all the
   media captures that can be sent at the same time.  This is easier to
   show in an example.

   Consider the example of a room system where there are 3 cameras each
   of which can send a separate capture covering 2 persons each - VC0,
   VC1, VC2.  The middle camera can also zoom out and show all 6
   persons, VC3.  But the middle camera cannot be used in both modes at
   the same time - it has to either show the space where 2 participants
   sit or the whole 6 seats, but not both at the same time.

   Simultaneous transmission sets are expressed as sets of the MCs that
   could physically be transmitted at the same time (though it may not
   make sense to do so).  In this example the two simultaneous sets are
   shown in Table 1.  The consumer must make sure that it chooses one,
   and not more, of the mutually exclusive sets.  A consumer may choose
   any subset of the media captures in a simultaneous set; it does not
   have to choose all the captures in a simultaneous set if it does not
   want to receive all of them.

                           +-------------------+
                           | Simultaneous Sets |
                           +-------------------+
                           | {VC0, VC1, VC2}   |
                           | {VC0, VC3, VC2}   |
                           +-------------------+

                Table 1: Two Simultaneous Transmission Sets

   A media provider includes the simultaneous sets in its provider
   advertisement.  These simultaneous set constraints apply across all
   the capture scenes in the advertisement.  The simultaneous
   transmission sets MUST allow all the media captures in a particular
   capture scene entry to be used simultaneously.
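
   The consumer-side check implied by this rule can be sketched in a
   few lines of Python (illustrative only), using the sets from
   Table 1: a choice of captures is transmittable only if it is a
   subset of at least one advertised simultaneous transmission set.

      # Illustrative only: validate a consumer's choice against the
      # simultaneous transmission sets of Table 1.
      simultaneous_sets = [{"VC0", "VC1", "VC2"},
                           {"VC0", "VC3", "VC2"}]

      def choice_is_transmittable(chosen, sets):
          # The chosen captures must fit within a single simultaneous set.
          return any(set(chosen) <= s for s in sets)

      print(choice_is_transmittable(["VC0", "VC2"], simultaneous_sets))  # True
      print(choice_is_transmittable(["VC1", "VC3"], simultaneous_sets))  # False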

7.  Encodings

   We have considered how providers can describe the content of media
   to consumers.  We now consider how providers communicate information
   about their ability to send streams.  We introduce two constructs -
   individual encodings and encoding groups.  Consumers then map the
   media captures they want onto encodings, with the encoding parameters
   they want.  This process is described in the sections that follow.

7.1.  Individual Encodings

   An individual encoding represents a way to encode a media capture to
   become an encoded media stream sent from the media provider to the
   media consumer.  An individual encoding has a set of parameters
   characterizing how the media is encoded.  Different media types have
   different parameters, and different encoding algorithms may have
   different parameters.  An individual encoding can be used for only
   one actual encoded media stream at a time.

   The parameters of an individual encoding represent the maximum
   values for certain aspects of the encoding.  A particular
   instantiation into an encoded stream might use lower values than
   these maximums.

   The following tables show the variables for audio and video encoding.

   +--------------+----------------------------------------------------+
   | Name         | Description                                        |
   +--------------+----------------------------------------------------+
   | encodeID     | A unique identifier for the individual encoding    |
   | maxBandwidth | Maximum number of bits per second                  |
   | maxH264Mbps  | Maximum number of macroblocks per second: ((width  |
   |              | + 15) / 16) * ((height + 15) / 16) *               |
   |              | framesPerSecond                                    |
   | maxWidth     | Video resolution's maximum supported width,        |
   |              | expressed in pixels                                |
   | maxHeight    | Video resolution's maximum supported height,       |
   |              | expressed in pixels                                |
   | maxFrameRate | Maximum supported frame rate                       |
   +--------------+----------------------------------------------------+

               Table 2: Individual Video Encoding Parameters

           +--------------+-----------------------------------+
           | Name         | Description                       |
           +--------------+-----------------------------------+
           | maxBandwidth | Maximum number of bits per second |
           +--------------+-----------------------------------+

               Table 3: Individual Audio Encoding Parameters
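
   The maxH264Mbps formula from Table 2 can be evaluated directly.  The
   short Python sketch below (illustrative only) reproduces the
   macroblock arithmetic; the resulting values reappear in the example
   encoding groups of Section 11.1.

      # Macroblocks per second as defined for maxH264Mbps in Table 2:
      # ((width + 15) / 16) * ((height + 15) / 16) * framesPerSecond,
      # using integer division.
      def h264_mbps(width, height, fps):
          return ((width + 15) // 16) * ((height + 15) // 16) * fps

      print(h264_mbps(1920, 1088, 60))   # 489600
      print(h264_mbps(1280,  720, 30))   # 108000
      print(h264_mbps( 960,  544, 30))   #  61200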

7.2.  Encoding Group

   An encoding group includes a set of one or more individual encodings,
   plus some parameters that apply to the group as a whole.  By grouping
   multiple individual encodings together, an encoding group describes
   additional constraints on bandwidth and other parameters for the
   group.  Table 4 shows the parameters and individual encoding sets
   that are part of an encoding group.

   +-------------------+-----------------------------------------------+
   | Name              | Description                                   |
   +-------------------+-----------------------------------------------+
   | encodeGroupID     | A unique identifier for the encoding group    |
   | maxGroupBandwidth | Maximum number of bits per second relating to |
   |                   | all encodings combined                        |
   | maxGroupH264Mbps  | Maximum number of macroblocks per second      |
   |                   | relating to all video encodings combined      |
   | videoEncodings[]  | Set of potential encodings (list of           |
   |                   | encodeIDs)                                    |
   | audioEncodings[]  | Set of potential encodings (list of           |
   |                   | encodeIDs)                                    |
   +-------------------+-----------------------------------------------+

                          Table 4: Encoding Group

   When the individual encodings in a group are instantiated into actual
   encoded media streams, each stream has a bandwidth that must be less
   than or equal to the maxBandwidth for the particular individual
   encoding.  The maxGroupBandwidth parameter gives the additional
   restriction that the sum of all the individual instantiated
   bandwidths must be less than or equal to the maxGroupBandwidth value.

   Likewise, the sum of the macroblocks per second of each instantiated
   encoding in the group must not exceed the maxGroupH264Mbps value.
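
   These two group-wide checks are easy to state in code.  The Python
   sketch below is illustrative only; the stream values are
   hypothetical but consistent with the per-encoding maxima used in the
   examples of Section 11.

      # Illustrative only: instantiated streams must fit both their own
      # encoding's maxima and the encoding group's combined maxima.
      group = {"maxGroupBandwidth": 6000000, "maxGroupH264Mbps": 489600}

      streams = [
          {"bandwidth": 3000000, "h264_mbps": 244800},   # e.g. 1080p30
          {"bandwidth": 2000000, "h264_mbps": 108000},   # e.g. 720p30
      ]

      def within_group_limits(streams, group):
          total_bw = sum(s["bandwidth"] for s in streams)
          total_mbps = sum(s["h264_mbps"] for s in streams)
          return (total_bw <= group["maxGroupBandwidth"] and
                  total_mbps <= group["maxGroupH264Mbps"])

      print(within_group_limits(streams, group))   # True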

   The following diagram illustrates the structure of a media provider's
   Encoding Groups and their contents.

   ,-------------------------------------------------.
   |             Media Provider                      |
   |                                                 |
   |  ,--------------------------------------.       |
   |  | ,--------------------------------------.     |
   |  | | ,--------------------------------------.   |
   |  | | |          Encoding Group              |   |
   |  | | | ,-----------.                        |   |
   |  | | | |           | ,---------.            |   |
   |  | | | |           | |         | ,---------.|   |
   |  | | | | Encoding1 | |Encoding2| |Encoding3||   |
   |  `.| | |           | |         | `---------'|   |
   |    `.| `-----------' `---------'            |   |
   |      `--------------------------------------'   |
   `-------------------------------------------------'

                    Figure 1: Encoding Group Structure

   A media provider advertises one or more encoding groups.  Each
   encoding group includes one or more individual encodings.  Each
   individual encoding can represent a different way of encoding media.
   For example one individual encoding may be 1080p60 video, another
   could be 720p30, with a third being CIF.

   While a typical 3 codec/display system might have one encoding group
   per "codec box", there are many possibilities for the number of
   encoding groups a provider may be able to offer and for the encoding
   values in each encoding group.

   There is no requirement for all encodings within an encoding group to
   be instantiated at once.

8.  Associating Media Captures with Encoding Groups

   Every media capture is associated with an encoding group, which is
   used to instantiate that media capture into one or more encoded
   streams.  Each media capture has an encoding group attribute.  The
   value of this attribute is the encodeGroupID for the encoding group
   with which it is associated.  More than one media capture may use the
   same encoding group.

   The maximum number of streams that can result from a particular
   encoding group constraint is equal to the number of individual
   encodings in the group.  The actual number of streams used at any
   time may be less than this maximum.  Any of the media captures that
   use a particular encoding group can be encoded according to any of
   the individual encodings in the group.  If there are multiple
   individual encodings in the group, then a single media capture can be
   encoded into multiple different streams at the same time, with each
   stream following the constraints of a different individual encoding.

   The Encoding Groups MUST allow all the media captures in a particular
   capture scene entry to be used simultaneously.

9.  Consumer's Choice of Streams to Receive from the Provider

   After receiving the provider's advertised media captures and
   associated constraints, the consumer must choose which media captures
   it wishes to receive, and which individual encodings from the
   provider it wants to use to encode the capture.  Each media capture
   has an encoding group ID attribute which specifies which individual
   encodings are available to be used for that media capture.

   For each media capture the consumer wants to receive, it configures
   one or more of the encodings in that capture's encoding group.  The
   consumer does this by telling the provider the resolution, frame
   rate, bandwidth, etc. when asking for streams for its chosen
   captures.  Upon receipt of this configuration command from the
   consumer, the provider generates streams for each such configured
   encoding and sends those streams to the consumer.
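
   As an informal illustration of such a configuration (the actual
   message syntax is not defined here, and all field names below are
   hypothetical), a consumer might convey something like the following,
   with every requested value at or below the maxima of the named
   encoding:

      # Illustrative only: one possible shape of a configure request.
      # Capture and encoding identifiers refer to entries in a provider
      # advertisement; requested values must not exceed the chosen
      # encoding's advertised maxima.
      configure_request = [
          {"capture": "VC0", "encoding": "ENC0",
           "width": 1920, "height": 1080, "frame_rate": 30,
           "bandwidth": 3000000},
          {"capture": "AC0", "encoding": "ENC9",
           "bandwidth": 64000},
      ]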

   The consumer must have received at least one capture advertisement
   from the provider to be able to configure the provider's generation
   of media streams.

   The consumer is able to change its configuration of the provider's
   encodings any number of times during the call, either in response to
   a new capture advertisement from the provider or autonomously.  The
   consumer need not send a new configure message to the provider when
   it receives a new capture advertisement from the provider unless the
   contents of the new capture advertisement cause the consumer's
   current configure message to become invalid.

   When choosing which streams to receive from the provider, and the
   encoding characteristics of those streams, the consumer needs to take
   several things into account: its local preference, simultaneity
   restrictions, and encoding limits.

9.1.  Local preference

   A variety of local factors will influence the consumer's choice of
   streams to be received from the provider:

   o  if the consumer is an endpoint, it is likely that it would choose,
      where possible, to receive video and audio captures that match the
      number of display devices and audio system it has

   o  if the consumer is a middle box such as an MCU, it may choose to
      receive loudest speaker streams (in order to perform its own media
      composition) and avoid pre-composed video captures

   o  user choice (for instance, selection of a new layout) may result
      in a different set of media captures, or different encoding
      characteristics, being required by the consumer

9.2.  Physical simultaneity restrictions

   There may be physical simultaneity constraints imposed by the
   provider that affect the provider's ability to simultaneously send
   all of the captures the consumer would wish to receive.  For
   instance, a middle box such as an MCU, when connected to a multi-
   camera room system, might prefer to receive both individual camera
   streams of the people present in the room and an overall view of the
   room from a single camera.  Some endpoint systems might be able to
   provide both of these sets of streams simultaneously, whereas others
   may not (if the overall room view were produced by changing the zoom
   level on the center camera, for instance).

9.3.  Encoding and encoding group limits

   Each of the provider's encoding groups has limits on bandwidth and
   macroblocks per second, and the constituent potential encodings have
   limits on the bandwidth, macroblocks per second, video frame rate,
   and resolution that can be provided.  When choosing the media
   captures to be received from a provider, a consumer device must
   ensure that the encoding characteristics requested for each
   individual media capture fits within the capability of the encoding
   it is being configured to use, as well as ensuring that the combined
   encoding characteristics for media captures fit within the
   capabilities of their associated encoding groups.  In some cases,
   this could cause an otherwise "preferred" choice of streams to be
   passed over in favour of different streams - for instance, if a set
   of 3 media captures could only be provided at a low resolution then a
   3 screen device could switch to favoring a single, higher quality,
   stream.

9.4.  Message Flow

   The following diagram shows the basic flow of messages between a
   media provider and a media consumer.  The usage of the "capture
   advertisement" and "configure encodings" message is described above.
   The consumer also sends its own capability message to the provider
   which may contain information about its own capabilities or
   restrictions.

   Diagram for Message Flow

            Media Consumer                         Media Provider
            --------------                         ------------
                  |                                     |
                  |----- Consumer Capability ---------->|
                  |                                     |
                  |                                     |
                  |<---- Capture advertisement ---------|
                  |                                     |
                  |                                     |
                  |------ Configure encodings --------->|
                  |                                     |
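
   Purely as an illustration of the ordering above (the message syntax
   itself is out of scope for this document, and every field name below
   is hypothetical), the three messages could carry content along these
   lines:

      # Illustrative only; not a normative message format.
      consumer_capability = {
          "video_capture_attributes_understood":
              ["content", "composed", "switched",
               "point of capture", "area of capture"],
      }

      capture_advertisement = {
          "capture_scenes": [],         # filled in by the provider
          "simultaneous_sets": [],
          "encoding_groups": [],
      }

      configure_encodings = {
          "streams": [{"capture": "VC0", "encoding": "ENC0"}],
      }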

   In order for a maximally-capable provider to be able to advertise a
   manageable number of video captures to a consumer, it is potentially
   useful for the consumer, at the start of CLUE, to inform the provider
   of its capabilities.  One example would be the video capture
   attribute set - a consumer could tell the provider the complete set
   of video capture attributes it is able to understand, so that the
   provider can reduce the capture scene it advertises, tailoring it to
   the consumer.

   TBD - the content of the consumer capability message needs to be
   better defined.  The authors believe there is a need for this
   message, but have not worked out the details yet.

10.  Extensibility

   One of the most important characteristics of the Framework is its
   extensibility.  Telepresence is a relatively new industry and while
   we can foresee certain directions, we also do not know everything
   about how it will develop.  The standard for interoperability and
   handling multiple streams must be future-proof.

   The framework itself is inherently extensible through expanding the
   data model types.  For example:

   o  Adding more types of media, such as telemetry, can be done by
      defining additional types of captures in addition to audio and
      video.

   o  Adding new functionalities, such as 3-D, will require additional
      attributes describing the captures.

   o  Adding new codecs, such as H.265, can be accomplished by defining
      new encoding variables.

   The infrastructure is designed to be extended rather than requiring
   new infrastructure elements.  Extension comes through adding to
   defined types.

   Assuming the implementation is in something like XML, adding data
   elements and attributes makes extensibility easy.
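
   As a small, hypothetical illustration of this kind of extension
   (neither the capture type nor the attribute shown here is defined by
   this document), a future telemetry capture with a new attribute
   could reuse the same name/value structure used for audio and video
   captures:

      # Hypothetical example only: the "telemetry" capture type and the
      # "depth of capture" attribute are not defined by this framework;
      # they illustrate how new types and attributes extend the
      # existing structure without new machinery.
      telemetry_capture = {
          "capture_id": "TC0",
          "encoding_group": "EG4",
          "content": ["main"],
          "depth of capture": 3500,    # hypothetical attribute, millimeters
      }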

11.  Examples - Using the Framework

   This section shows, through more detailed examples, how to use the
   framework to represent a typical case for telepresence rooms.  First
   an endpoint is illustrated, then an MCU case is shown.

11.1.  Three screen endpoint media provider

   Consider an endpoint with the following description:

   o  3 cameras, 3 displays, a 6 person table

   o  Each video device can provide one capture for each 1/3 section of
      the table

   o  A single capture representing the active speaker can be provided

   o  A single capture representing the active speaker with the other 2
      captures shown picture in picture within the stream can be
      provided

   o  A capture showing a zoomed out view of all 6 seats in the room can
      be provided

   The audio and video captures for this endpoint can be described as
   follows.

   Video Captures:

   o  VC0- (the camera-left camera stream), encoding group=EG0,
      purpose=main;auto-switched:no

   o  VC1- (the center camera stream), encoding group=EG1, purpose=main;
      auto-switched:no

   o  VC2- (the camera-right camera stream), encoding group=EG2,
      purpose=main;auto-switched:no

   o  VC3- (the loudest panel stream), encoding group=EG1,
      purpose=main;auto-switched:yes

   o  VC4- (the loudest panel stream with PiPs), encoding group=EG1,
      purpose=main; composed=true; auto-switched:yes

   o  VC5- (the zoomed out view of all people in the room), encoding
      group=EG1, purpose=main; composed=false; auto-switched:no

   o  VC6- (presentation stream), encoding group=EG1,
      purpose=presentation;auto-switched:no

   The following diagram is a top view of the room with 3 cameras, 3
   displays, and 6 seats.  Each camera is capturing 2 people.  The six
   seats are not all in a straight line.

      ,-. d
     (   )`--.__        +---+
      `-' /     `--.__  |   |
    ,-.  |            `-.._ |_-+Camera 2 (VC2)
   (   ).'        ___..-+-''`+-+
    `-' |_...---''      |   |
    ,-.c+-..__          +---+
   (   )|     ``--..__  |   |
    `-' |             ``+-..|_-+Camera 1 (VC1)
    ,-. |            __..--'|+-+
   (   )|     __..--'   |   |
    `-'b|..--'          +---+
    ,-. |``---..___     |   |
   (   )\          ```--..._|_-+Camera 0 (VC0)
    `-'  \             _..-''`-+
     ,-. \      __.--'' |   |
    (   ) |..-''        +---+
     `-' a

   The two points labeled b and c are intended to be at the midpoint
   between the seating positions, and where the fields of view of the
   cameras intersect.
   The plane of interest for VC0 is a vertical plane that intersects
   points 'a' and 'b'.
   The plane of interest for VC1 intersects points 'b' and 'c'.
   The plane of interest for VC2 intersects points 'c' and 'd'.
   This example uses an area scale of millimeters.

   Areas of capture:
       bottom left    bottom right  top left         top right
   VC0 (-2011,2850,0) (-673,3000,0) (-2011,2850,757) (-673,3000,757)
   VC1 ( -673,3000,0) ( 673,3000,0) ( -673,3000,757) ( 673,3000,757)
   VC2 (  673,3000,0) (2011,2850,0) (  673,3000,757) (2011,3000,757)
   VC3 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757)
   VC4 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757)
   VC5 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757)
   VC6 none

   Points of capture:
   VC0 (-1678,0,800)
   VC1 (0,0,800)
   VC2 (1678,0,800)
   VC3 none
   VC4 none
   VC5 (0,0,800)
   VC6 none

   In this example, the right edge of the VC0 area lines up with the
   left edge of the VC1 area.  It doesn't have to be this way.  There
   could be a gap or an overlap.  One additional thing to note for this
   example is the distance from a to b is equal to the distance from b
   to c and the distance from c to d.  All these distances are 1346 mm.
   This is the planar width of each area of capture for VC0, VC1, and
   VC2.

   Note the text in parentheses (e.g. "the camera-left camera stream")
   is not explicitly part of the model; it is just explanatory text for
   this example, and is not included in the model with the media
   captures and attributes.

   Audio Captures:

   o  AC0 (camera-left), encoding group=EG3, purpose=main, channel
      format=mono

   o  AC1 (camera-right), encoding group=EG3, purpose=main, channel
      format=mono

   o  AC2 (center) encoding group=EG3, purpose=main, channel format=mono

   o  AC3 being a simple pre-mixed audio stream from the room (mono),
      encoding group=EG3, purpose=main, channel format=mono

   o  AC4 audio stream associated with the presentation video (mono)
      encoding group=EG3, purpose=presentation, channel format=mono

   Areas of capture:
       bottom left    bottom right  top left         top right
   AC0 (-2011,2850,0) (-673,3000,0) (-2011,2850,757) (-673,3000,757)
   AC1 (  673,3000,0) (2011,2850,0) (  673,3000,757) (2011,3000,757)
   AC2 ( -673,3000,0) ( 673,3000,0) ( -673,3000,757) ( 673,3000,757)
   AC3 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757)
   AC4 none

   The physical simultaneity information is:

      Simultaneous transmission set #1 {VC0, VC1, VC2, VC3, VC4, VC6}

      Simultaneous transmission set #2 {VC0, VC2, VC5, VC6}

   This constraint indicates it is not possible to use all the VCs at
   the same time.  VC5 cannot be used at the same time as VC1, VC3, or
   VC4.  Also, using every member in the set simultaneously may not make
   sense - for example VC3 (loudest) and VC4 (loudest with PIP).  (In
   addition, there are encoding constraints that make choosing all of
   the VCs in a set impossible.  VC1, VC3, VC4, VC5, VC6 all use EG1 and
   EG1 has only 3 ENCs.  This constraint shows up in the encoding
   groups, not in the simultaneous transmission sets.)

   In this example there are no restrictions on which audio captures can
   be sent simultaneously.

   Encoding Groups:

   This example has three encoding groups associated with the video
   captures.  Each group can have 3 encodings, but with each potential
   encoding having a progressively lower specification.  In this
   example, 1080p60 transmission is possible (as ENC0 has a maxH264Mbps
   value compatible with that) as long as it is the only active encoding
   in the group (as maxGroupH264Mbps for the entire encoding group is
   also 489600).  Significantly, as up to 3 encodings are available per
   group, it is possible to transmit some video captures simultaneously
   that are not in the same entry in the capture scene.  For example VC1
   and VC3 at the same time.

   It is also possible to transmit multiple encodings of a single video
   capture.  For example, VC0 can be encoded using ENC0 and ENC1 at the
   same time - say, one encoding at 1080p30 and one at 720p30 - as long
   as the encoding parameters satisfy the constraints of ENC0, ENC1,
   and EG0.

   encodeGroupID=EG0, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000
       encodeID=ENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                      maxH264Mbps=489600, maxBandwidth=4000000
       encodeID=ENC1, maxWidth=1280, maxHeight=720, maxFrameRate=30,
                      maxH264Mbps=108000, maxBandwidth=4000000
       encodeID=ENC2, maxWidth=960, maxHeight=544, maxFrameRate=30,
                      maxH264Mbps=61200, maxBandwidth=4000000

   encodeGroupID=EG1 maxGroupH264Mbps=489600 maxGroupBandwidth=6000000
       encodeID=ENC3, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                      maxH264Mbps=489600, maxBandwidth=4000000
       encodeID=ENC4, maxWidth=1280, maxHeight=720, maxFrameRate=30,
                      maxH264Mbps=108000, maxBandwidth=4000000
       encodeID=ENC5, maxWidth=960, maxHeight=544, maxFrameRate=30,
                      maxH264Mbps=61200, maxBandwidth=4000000

   encodeGroupID=EG2 maxGroupH264Mbps=489600 maxGroupBandwidth=6000000
       encodeID=ENC6, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                      maxH264Mbps=489600, maxBandwidth=4000000
       encodeID=ENC7, maxWidth=1280, maxHeight=720, maxFrameRate=30,
                      maxH264Mbps=108000, maxBandwidth=4000000
       encodeID=ENC8, maxWidth=960, maxHeight=544, maxFrameRate=30,
                      maxH264Mbps=61200, maxBandwidth=4000000

                Figure 2: Example Encoding Groups for Video
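
   As a rough sketch (the data layout below is an assumption made for
   illustration, not part of the framework), a consumer could verify
   that a proposed use of EG0 stays within both the per-encoding and
   the per-group limits:

   # Sketch: check a proposed set of encodings against per-encoding
   # and per-group limits (values taken from EG0 above).
   EG0 = {
       "maxGroupH264Mbps": 489600,
       "maxGroupBandwidth": 6000000,
       "encodings": {
           "ENC0": {"maxH264Mbps": 489600, "maxBandwidth": 4000000},
           "ENC1": {"maxH264Mbps": 108000, "maxBandwidth": 4000000},
           "ENC2": {"maxH264Mbps": 61200,  "maxBandwidth": 4000000},
       },
   }

   def fits(group, proposal):
       # proposal maps encodeID -> (h264_mbps, bandwidth) as chosen
       # by the consumer
       for enc, (mbps, bw) in proposal.items():
           limit = group["encodings"][enc]
           if mbps > limit["maxH264Mbps"] or bw > limit["maxBandwidth"]:
               return False
       return (sum(m for m, _ in proposal.values())
               <= group["maxGroupH264Mbps"]
               and sum(b for _, b in proposal.values())
               <= group["maxGroupBandwidth"])

   # 1080p30 on ENC0 plus 720p30 on ENC1:
   # fits(EG0, {"ENC0": (244800, 3000000),
   #            "ENC1": (108000, 2000000)})  ->  True

   The same check, restricted to the bandwidth figures, applies to the
   audio group EG3 shown in Figure 3.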

   For audio, there are five potential encodings available, so all five
   audio captures can be encoded at the same time (five encodings at
   64000 bps each exactly fill the group's maxGroupBandwidth of
   320000).

   encodeGroupID=EG3, maxGroupH264Mbps=0, maxGroupBandwidth=320000
       encodeID=ENC9, maxBandwidth=64000
       encodeID=ENC10, maxBandwidth=64000
       encodeID=ENC11, maxBandwidth=64000
       encodeID=ENC12, maxBandwidth=64000
       encodeID=ENC13, maxBandwidth=64000

                Figure 3: Example Encoding Group for Audio

   Capture Scenes:

   The following table represents the capture scenes for this provider.
   Recall that a capture scene is composed of alternative capture scene
   entries covering the same scene.  Capture Scene #1 is for the main
   people captures, and Capture Scene #2 is for presentation.

      Each row in the table is a separate entry in the capture scene.

                           +------------------+
                           | Capture Scene #1 |
                           +------------------+
                           | VC0, VC1, VC2    |
                           | VC3              |
                           | VC4              |
                           | VC5              |
                           | AC0, AC1, AC2    |
                           | AC3              |
                           +------------------+

                           +------------------+
                           | Capture Scene #2 |
                           +------------------+
                           | VC6              |
                           | AC4              |
                           +------------------+

   Different capture scenes are distinct from each other and non-
   overlapping.  A consumer can choose an entry from each capture
   scene.  In this case the three captures VC0, VC1, and VC2 are one
   way of representing the video from the endpoint; these three
   captures should appear adjacent to each other.  Alternatively,
   another way of representing the capture scene is with the single
   capture VC3, which automatically shows the person who is talking.
   Similarly for the VC4 and VC5 alternatives.

   As in the video case, the different entries of audio in Capture Scene
   #1 represent the "same thing", in that one way to receive the audio
   is with the 3 audio captures (AC0, AC1, AC2), and another way is with
   the mixed AC3.  The Media Consumer can choose an audio capture entry
   it is capable of receiving.

   The spatial ordering is conveyed by the media capture attributes
   "area of capture" and "point of capture".

   A Media Consumer would likely want to choose a capture scene entry to
   receive based in part on how many streams it can simultaneously
   receive.  A consumer that can receive three people streams would
   probably prefer to receive the first entry of Capture Scene #1 (VC0,
   VC1, VC2) and not receive the other entries.  A consumer that can
   receive only one people stream would probably choose one of the other
   entries.

   If the consumer can receive a presentation stream too, it would also
   choose to receive the only entry from Capture Scene #2 (VC6).

11.2.  Encoding Group Example

   This is an example of an encoding group to illustrate how it can
   express dependencies between encodings.

  encodeGroupID=EG0, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000
       encodeID=VIDENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                         maxH264Mbps=244800, maxBandwidth=4000000
       encodeID=VIDENC1, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                         maxH264Mbps=244800, maxBandwidth=4000000
       encodeID=AUDENC0, maxBandwidth=96000
       encodeID=AUDENC1, maxBandwidth=96000
       encodeID=AUDENC2, maxBandwidth=96000

   Here, the encoding group is EG0.  Together the encodings can carry
   up to two 1080p30 streams (the H.264 macroblock rate for 1080p30 is
   244800), and each encoding advertises a maxFrameRate of 60 frames
   per second (fps).  To achieve the maximum resolution (1920 x 1088)
   within the macroblock-rate limit, the frame rate is limited to 30
   fps; however, 60 fps can be achieved at a lower resolution if
   required by the consumer.  Although the encoding group is capable of
   transmitting up to 6 Mbit/s in total, no individual video encoding
   can exceed 4 Mbit/s.
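
   For reference, the macroblock-rate figures above follow from the
   usual H.264 calculation of (width/16) x (height/16) macroblocks per
   frame multiplied by the frame rate; the short sketch below
   (illustrative only) reproduces the numbers used in this example:

   # Sketch: H.264 macroblock rate for a given resolution and rate.
   def h264_mbps(width, height, fps):
       return (width // 16) * (height // 16) * fps

   # h264_mbps(1920, 1088, 30) -> 244800, one encoding's maximum
   # two such encodings        -> 489600 = maxGroupH264Mbps
   # h264_mbps(1920, 1088, 60) -> 489600, which exceeds a single
   # encoding's 244800 limit, so 60 fps needs a lower resolution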

   This encoding group also allows up to 3 audio encodings, AUDENC<0-2>.
   It is not required that audio and video encodings reside within the
   same encoding group, but if they do, the group's overall
   maxGroupBandwidth value limits the sum of all audio and video
   encodings configured by the consumer.  A system that does not wish
   or need to combine bandwidth limitations in this way should instead
   use separate encoding groups for audio and video, so that the audio
   and video bandwidth limits do not interact.
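
   As a back-of-the-envelope check (the configured rates here are
   chosen only for illustration), two video encodings at 2.5 Mbit/s
   plus three audio encodings at 96 kbit/s fit within EG0's
   maxGroupBandwidth:

   # Sketch: combined audio + video bandwidth against the group limit.
   video_bw = 2 * 2500000        # two video encodings at 2.5 Mbit/s
   audio_bw = 3 * 96000          # three audio encodings at 96 kbit/s
   assert video_bw + audio_bw <= 6000000   # 5288000 <= maxGroupBandwidth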

   Audio and video can be expressed in separate encoding groups, as in
   this illustration.

  encodeGroupID=EG0, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000
       encodeID=VIDENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                         maxH264Mbps=244800, maxBandwidth=4000000
       encodeID=VIDENC1, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                         maxH264Mbps=244800, maxBandwidth=4000000

  encodeGroupID=EG1, maxGroupH264Mbps=0, maxGroupBandwidth=500000
       encodeID=AUDENC0, maxBandwidth=96000
       encodeID=AUDENC1, maxBandwidth=96000
       encodeID=AUDENC2, maxBandwidth=96000

11.3.  The MCU Case

   This section shows how an MCU might express its Capture Scenes,
   intending to offer different choices for consumers that can handle
   different numbers of streams.  A single audio capture stream is
   provided for all single- and multi-screen configurations; it can be
   associated (e.g. lip-synced) with any combination of video captures
   at the consumer.

   +--------------------+---------------------------------------------+
   | Capture Scene #1   | note                                        |
   +--------------------+---------------------------------------------+
   | VC0                | video capture for single screen consumer    |
   | VC1, VC2           | video capture for 2 screen consumer         |
   | VC3, VC4, VC5      | video capture for 3 screen consumer         |
   | VC6, VC7, VC8, VC9 | video capture for 4 screen consumer         |
   | AC0                | audio capture representing all participants |
   +--------------------+---------------------------------------------+

   If / when a presentation stream becomes active within the conference,
   the MCU might re-advertise the available media as:

        +------------------+--------------------------------------+
        | Capture Scene #2 | note                                 |
        +------------------+--------------------------------------+
        | VC10             | video capture for presentation       |
        | AC1              | presentation audio to accompany VC10 |
        +------------------+--------------------------------------+

11.4.  Media Consumer Behavior

   This section gives an example of how a media consumer might behave
   when deciding how to request streams from the three-screen endpoint
   described in the earlier endpoint example.

   The receive side of a call needs to balance its own requirements
   (based on its number of screens and speakers), its decoding
   capabilities, and the available bandwidth against the provider's
   capabilities in order to optimally configure the provider's streams.
   Typically it would want to receive and decode media from each
   capture scene advertised by the provider.

   A sensible, basic algorithm might be for the consumer to go through
   each capture scene in turn and find the collection of video captures
   that best matches the number of screens it has (this might include
   consideration of screens dedicated to presentation video display
   rather than "people" video) and then decide between alternative
   entries in the video capture scenes based either on hard-coded
   preferences or user choice.  Once this choice has been made, the
   consumer would then decide how to configure the provider's encoding
   groups in order to make best use of the available network bandwidth
   and its own decoding capabilities.
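
   A minimal sketch of such an algorithm (the data layout and the tie-
   breaking rule are assumptions for illustration, not defined by this
   framework) might look like:

   # Sketch: for each capture scene, pick the entry whose number of
   # video captures best matches the number of available screens.
   def choose_entries(capture_scenes, num_screens):
       chosen = []
       for scene in capture_scenes:
           # scene: list of entries; each entry: list of capture IDs
           best = min(scene,
                      key=lambda entry: abs(len(entry) - num_screens))
           chosen.append(best)
       return chosen

   # For Capture Scene #1 above and a 3-screen consumer:
   # choose_entries([[["VC0", "VC1", "VC2"],
   #                  ["VC3"], ["VC4"], ["VC5"]]], 3)
   #   -> [["VC0", "VC1", "VC2"]]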

11.4.1.  One screen consumer

   VC3, VC4 and VC5 are each in a separate entry by themselves, not
   grouped together in a single entry, so the receiving device should
   choose one of them.  The choice would come down to whether to see
   the greatest number of participants simultaneously at roughly equal
   precedence (VC5), a switched view of just the loudest region (VC3),
   or a switched view with PiPs (VC4).  An endpoint device with
   knowledge of these differences could offer a dynamic choice of these
   options, in-call, to the user.

11.4.2.  Two screen consumer configuring the example

   Mixing systems with an even number of screens, "2n", and those with
   "2n+1" cameras (and vice versa) is always likely to be the
   problematic case.  In this instance, the behavior is likely to be
   determined by whether a "2 screen" system is really a "2 decoder"
   system, i.e., whether only one received stream can be displayed per
   screen or whether more than 2 streams can be received and spread
   across the available screen area.  To enumerate 3 possible behaviors
   here for the 2 screen system when it learns that the far end is
   "ideally" expressed via 3 capture streams:

   1.  Fall back to receiving just a single stream (VC3, VC4 or VC5 as
       per the 1 screen consumer case above) and either leave one screen
       blank or use it for presentation if / when a presentation becomes
       active

   2.  Receive 3 streams (VC0, VC1 and VC2) and display them across 2
       screens, either with each capture scaled to 2/3 of a screen and
       the centre capture split across the 2 screens, or (as would be
       necessary if there were large bezels on the screens) with each
       stream scaled to 1/2 the screen width and height and a 4th
       "blank" panel.  This 4th panel could potentially be used for any
       presentation that became active during the call.

   3.  Receive 3 streams, decode all 3, and use control information
       indicating which was the most active to switch between showing
       the left and centre streams (one per screen) and the centre and
       right streams.

   For an endpoint capable of all 3 methods of working described above,
   it might again be appropriate to offer the user the choice of
   display mode.

11.4.3.  Three screen consumer configuring the example

   This is the most straightforward case - the consumer would look to
   identify a set of streams to receive that best matches its available
   screens, so VC0 plus VC1 plus VC2 is the optimal match.  The spatial
   ordering gives sufficient information for the correct video capture
   to be shown on the correct screen.  The consumer would then either
   divide a single encoding group's capability by 3 to determine what
   resolution and frame rate to configure the provider with, or
   configure the individual video captures' encoding groups with
   whatever makes most sense (taking into account the receive-side
   decode capabilities, the overall call bandwidth, the resolution of
   the screens, plus any user preferences such as motion vs.
   sharpness).
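
   As an illustration of the "divide by 3" approach (assuming, purely
   for this sketch, that all three captures shared a single group with
   the limits of EG0 above):

   # Sketch: splitting one group's macroblock budget across 3 captures.
   per_capture = 489600 // 3        # 163200 macroblocks/second each
   fps_at_1080p = 163200 // 8160    # -> 20 fps at 1920x1088
   fps_at_720p = 163200 // 3600     # -> 45 fps at 1280x720
   # So an even split suggests configuring each capture at something
   # like 720p30 rather than full 1080p, subject to the per-encoding
   # limits and the consumer's own preferences.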

12.  Acknowledgements

   Mark Gorzynski contributed much to the approach.  We want to thank
   Stephen Botzko for helpful discussions on audio.

13.  IANA Considerations

   TBD

14.  Security Considerations

   TBD

15.  Changes Since Last Version

   NOTE TO THE RFC-Editor: Please remove this section prior to
   publication as an RFC.

   Changes from 03 to 04:

   1.   Remove sentence from overview - "This constitutes a significant
        change ..."

   2.   Clarify a consumer can choose a subset of captures from a
        capture scene entry or a simultaneous set (in section "capture
        scene" and "consumer's choice...").

   3.   Reword first paragraph of Media Capture Attributes section.

   4.   Clarify a stereo audio capture is different from two mono audio
        captures (description of audio channel format attribute).

   5.   Clarify what it means when coordinate information is not
        specified for area of capture, point of capture, area of scene.

   6.   Change the term "producer" to "provider" to be consistent (it
        was just in two places).

   7.   Change name of "purpose" attribute to "content" and refer to
        RFC4796 for values.

   8.   Clarify simultaneous sets are part of a provider advertisement,
        and apply across all capture scenes in the advertisement.

   9.   Remove sentence about lip-sync between all media captures in a
        capture scene.

   10.  Combine the concepts of "capture scene" and "capture set" into a
        single concept, using the term "capture scene" to replace the
        previous term "capture set", and eliminating the original
        separate capture scene concept.

16.  Informative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3261]  Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston,
              A., Peterson, J., Sparks, R., Handley, M., and E.
              Schooler, "SIP: Session Initiation Protocol", RFC 3261,
              June 2002.

   [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
              Jacobson, "RTP: A Transport Protocol for Real-Time
              Applications", STD 64, RFC 3550, July 2003.

   [RFC4353]  Rosenberg, J., "A Framework for Conferencing with the
              Session Initiation Protocol (SIP)", RFC 4353,
              February 2006.

   [RFC4796]  Hautakorpi, J. and G. Camarillo, "The Session Description
              Protocol (SDP) Content Attribute", RFC 4796,
              February 2007.

   [RFC5117]  Westerlund, M. and S. Wenger, "RTP Topologies", RFC 5117,
              January 2008.

Appendix A.  Open Issues

A.1.  Video layout arrangements and centralized composition

   In the context of a conference with a central MCU, there has been
   discussion about a consumer requesting the provider to provide a
   certain type of layout arrangement or perform a certain composition
   algorithm, such as combining some number of most recent talkers, or
   producing a video layout using a 2x2 grid or 1 large cell with 5
   smaller cells around it.  The current framework does not address
   this.  It is not clear whether this topic should be addressed in
   this framework, in a different part of CLUE, or outside of CLUE
   altogether.

A.2.  Source is selectable

   A Boolean variable.  True indicates the media consumer can request a
   particular media source be mapped to a media capture.  Default is
   false.

   TBD - how does the consumer make the request for a particular source?
   How does the consumer know what is available?  Need to explain better
   how multiple media captures are different from a single media capture
   with choices for the source, and when each concept should be used.

A.3.  Media Source Selection

   The use cases include a case where the person at a receiving endpoint
   can request to receive media from a particular other endpoint, for
   example in a multipoint call to request to receive the video from a
   certain section of a certain room, whether or not people there are
   talking.

   TBD - this framework should address this case.  Maybe need a roster
   list of rooms or people in the conference, with a mechanism to select
   from the roster and associate it with media captures.  This is
   different from selecting a particular media capture from a capture
   scene.  The mechanism to do this will probably need to be different
   than selecting media captures based on capture scenes and attributes.

A.4.  Endpoint requesting many streams from MCU

   TBD - how to do VC selection for a system where the endpoint media
   consumers want to receive many streams and do their own composition,
   rather than the MCU doing transcoding and composing.  An example is
   a 3-screen consumer that wants 3 large loudest-speaker streams plus
   a number of small ones to render as PiPs.  The small ones could
   potentially be chosen by either the endpoint or the MCU.  There are
   other more complicated examples also.  Is the current framework
   adequate to support this?

A.5.  VAD (voice activity detection) tagging of audio streams

   TBD - do we want to have VAD be mandatory?  All audio streams
   originating from a media provider must be tagged with VAD
   information.  This tagging would include an overall energy value for
   the stream plus information on which sections of the capture scene
   are "active".

   Each audio stream which forms a constituent of an entry within a
   capture scene should include this tagging, and the energy value
   within it calculated using a fixed, consistent algorithm.

   When a system determines the most active area of a capture scene
   (either "loudest", or determined by other means such as a button
   press) it should convey that information to the corresponding media
   stream consumer via any audio streams being sent within that capture
   scene.  Specifically, there should be a list of active coordinates
   and their VAD characteristics within the audio stream in addition to
   the overall VAD information for the capture scene.  This is to ensure
   all media stream consumers receive the same, consistent, audio energy
   information whichever audio capture or captures they choose to
   receive for a capture scene.  Additionally, coordinate information
   can be mapped to video captures by a media stream consumer in order
   that it can perform "panel switching" if required.
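
   Purely as a strawman for discussion (none of these field names are
   defined by this framework), the per-stream tagging described above
   might carry something along these lines:

   # Strawman only: one possible shape for per-stream VAD tagging;
   # the field names and values here are hypothetical, not defined
   # by CLUE.
   vad_tag = {
       "overall_energy": 42,          # energy value for the stream
       "active_regions": [            # active areas of the scene
           {"coordinates": (673, 3000, 378), "energy": 42},
       ],
   }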

A.6.  Private Information

   Do we want a way to include private information?

Authors' Addresses

   Allyn Romanow
   Cisco Systems
   San Jose, CA  95134
   USA

   Email: allyn@cisco.com

   Mark Duckworth (editor)
   Polycom
   Andover, MA  01810
   US

   Email: mark.duckworth@polycom.com

   Andrew Pepperell
   Langley, England
   UK

   Email: apeppere@gmail.com

   Brian Baldino
   Cisco Systems
   San Jose, CA  95134
   US

   Email: bbaldino@cisco.com
