CLUE WG                                            M. Duckworth, Ed. 
Internet Draft                                                Polycom 
Intended status: Informational                            A. Pepperell 
Expires: June 24, 2013                                     Silverflare 
                                                             S. Wenger 
                                                                 Vidyo 
                                                     December 24, 2012 
                                                                            
 
                                    
               Framework for Telepresence Multi-Streams 
                   draft-ietf-clue-framework-08.txt 

Abstract 

   This memo offers a framework for a protocol that enables devices 
   in a telepresence conference to interoperate by specifying the 
   relationships between multiple media streams. 

Status of this Memo 

   This Internet-Draft is submitted in full conformance with the 
   provisions of BCP 78 and BCP 79. 

   Internet-Drafts are working documents of the Internet Engineering 
   Task Force (IETF).  Note that other groups may also distribute 
   working documents as Internet-Drafts.  The list of current 
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/. 

   Internet-Drafts are draft documents valid for a maximum of six 
   months and may be updated, replaced, or obsoleted by other 
   documents at any time.  It is inappropriate to use Internet-Drafts 
   as reference material or to cite them other than as "work in 
   progress." 

   This Internet-Draft will expire on June 24, 2013. 

Copyright Notice 

   Copyright (c) 2012 IETF Trust and the persons identified as the 
   document authors.  All rights reserved. 

   This document is subject to BCP 78 and the IETF Trust's Legal 
   Provisions Relating to IETF Documents 
   (http://trustee.ietf.org/license-info) in effect on the date of 
   publication of this document.  Please review these documents 
   carefully, as they describe your rights and restrictions with 
   respect to this document.  Code Components extracted from this 
   document must include Simplified BSD License text as described in 
   Section 4.e of the Trust Legal Provisions and are provided without 
   warranty as described in the Simplified BSD License. 

Table of Contents 

   1. Introduction
   2. Terminology
   3. Definitions
   4. Overview of the Framework/Model
   5. Spatial Relationships
   6. Media Captures and Capture Scenes
      6.1. Media Captures
         6.1.1. Media Capture Attributes
      6.2. Capture Scene
         6.2.1. Capture scene attributes
         6.2.2. Capture scene entry attributes
      6.3. Simultaneous Transmission Set Constraints
   7. Encodings
      7.1. Individual Encodings
      7.2. Encoding Group
   8. Associating Media Captures with Encoding Groups
   9. Consumer's Choice of Streams to Receive from the Provider
      9.1. Local preference
      9.2. Physical simultaneity restrictions
      9.3. Encoding and encoding group limits
      9.4. Message Flow
   10. Extensibility
   11. Examples - Using the Framework
      11.1. Three screen endpoint media provider
      11.2. Encoding Group Example
      11.3. The MCU Case
      11.4. Media Consumer Behavior
         11.4.1. One screen consumer
         11.4.2. Two screen consumer configuring the example
         11.4.3. Three screen consumer configuring the example
   12. Acknowledgements
   13. IANA Considerations
   14. Security Considerations
   15. Changes Since Last Version
   16. Authors' Addresses

1. Introduction 

   Current telepresence systems, though based on open standards such 
   as RTP [RFC3550] and SIP [RFC3261], cannot easily interoperate 
   with each other.  A major factor limiting the interoperability of 
   telepresence systems is the lack of a standardized way to describe 
   and negotiate the use of the multiple streams of audio and video 
   comprising the media flows.  This draft provides a framework for a 
   protocol to enable interoperability by handling multiple streams 
   in a standardized way.  It is intended to support the use cases 
   described in draft-ietf-clue-telepresence-use-cases-02 and to meet 
   the requirements in draft-ietf-clue-telepresence-requirements-01. 

   Conceptually, this memo distinguishes between Media Providers and 
   Media Consumers.  A Media Provider provides Media in the form of 
   RTP packets; a Media Consumer consumes those RTP packets.  Media 
   Providers and Media Consumers can reside in Endpoints or in 
   middleboxes such as Multipoint Control Units (MCUs).  A Media 
   Provider in an Endpoint is usually associated with the generation 
   of media for Media Captures; these Media Captures are typically 
   sourced from cameras, microphones, and the like.  Similarly, the 
   Media Consumer in an Endpoint is usually associated with 
   Renderers, such as screens and loudspeakers.  In middleboxes, 
   Media Providers and Consumers can have the form of outputs and 
   inputs, respectively, of RTP mixers, RTP translators, and similar 
   devices.  Typically, telepresence devices such as Endpoints and 
   middleboxes would perform as both Media Providers and Media 
   Consumers, the former being concerned with those devices' 
   transmitted media and the latter with those devices' received 
   media.  In a few circumstances, a CLUE Endpoint or middlebox may 
   include only Consumer or Provider functionality, such as recorder- 
   type Consumers or webcam-type Providers. 

   One initial motivation for this memo and its companion documents 
   has been that Endpoints according to this memo can, and usually 
   do, have multiple Media Captures and Media Renderers.  While 
   previous system designs can deal with such a situation, what was 
   missing was a mechanism that can associate the Media Captures with 
   each other in space and time.  Further, due to the potentially 
   large number of RTP flows required for a Multimedia Conference 
   involving potentially many Endpoints, each of which can have many 
   Media Captures and Media Renderers, a sensible system design is to 
   multiplex multiple RTP media flows onto the same transport 
   address, so as to avoid using the port number as a multiplexing 
   point and the associated shortcomings, such as NAT/firewall 
   traversal issues. 

   While the actual mapping of those RTP flows to the header fields 
   of the RTP packets is not the subject of this specification, the 
   large number of possible permutations of sensible options a Media 
   Provider may make available to a Media Consumer makes it desirable 
   to have a mechanism that narrows down the number of options that a 
   SIP offer-answer exchange has to consider.  Such 
   information is made available using protocol mechanisms specified 
   in this memo and companion documents, although it should be 
   stressed that its use in an implementation is optional.  Also, 
   there are aspects of the control of both Endpoints and 
   middleboxes/MCUs that dynamically change during the progress of a 
   call, such as audio-level based screen switching, layout changes, 
   and so on, which need to be conveyed.  Note that these control 
   aspects are complementary to those specified in traditional SIP 
   based conference management such as BFCP.  Finally, all this 
   information needs to be conveyed, and the notion of support for it 
   needs to be established.  This is done by the negotiation of a 
   "CLUE channel", a data channel negotiated early during the 
   initiation of a call.  An Endpoint or MCU that rejects the 
   establishment of this data channel, by definition, is not 
   supporting CLUE based mechanisms, whereas an Endpoint or MCU that 
   accepts it is required to use it to the extent specified in this 
   memo and its companion documents. 

   A very brief outline of the call flow used by a simple system in 
   compliance with this memo can be described as follows.   

   An initial offer/answer exchange establishes a CLUE channel 
   between two Endpoints.  With the establishment of that channel, 
   the endpoints have consented to use the CLUE protocol mechanisms 
   and have to adhere to them. 

   Over this CLUE channel, the Provider in each Endpoint conveys its 
   characteristics and capabilities as specified herein (which will 
   typically not be sufficient to set up all media).  The Consumer in 
   the Endpoint receives the information provided by the Provider, 
   and can use it for two purposes.  First, it can, but is not 
   necessarily required to, use the information provided to tailor 
   the SDP it is going to send during the following SIP offer/answer 
   exchange, and its reaction to SDP it receives in that step.  It is 
   often a sensible implementation choice to do so, as the 
   representation of the media information conveyed over the CLUE 
   channel can dramatically cut down on the size of SDP messages used 
   in the O/A exchange that follows.  Second, it takes note of the 
   spatial relationship associated with the Media that are described. 
   It is often sensible to take that spatial relationship into 
   account when tailoring the SDP. 

   This CLUE exchange is followed by an SDP offer/answer exchange 
   that not only establishes those aspects of the media that have not 
   been "negotiated" over CLUE, but also has the side effect of 
   setting up the media transmission itself, potentially involving 
   security exchanges, ICE, and so on.  This step is plain vanilla 
   SIP, with the exception that the SDP used herein can in most cases 
   (but need not) be considerably smaller than the SDP a system would 
   typically need to exchange if there were no pre-established 
   knowledge about the Provider and Consumer characteristics. 

   During the lifetime of a call, further exchanges can occur over 
   the CLUE channel.  In some cases, those further exchanges can be 
   dealt with by Provider or Consumer without any other protocol 
   activity.  For example, voice-activated screen switching, signaled 
   over the CLUE channel, ought not to lead to heavy-handed 
   mechanisms like SIP re-invites.  However, in other cases, after 
   the CLUE negotiation an additional offer/answer exchange may 
   become necessary.  For example, if both sides decide to upgrade 
   the call from a single screen to a multi-screen call and more 
   bandwidth is required for the additional video channels, that 
   could require a new O/A exchange. 

   Numerous optimizations may be possible, and are the implementer's 
   choice.  For example, it may be sensible to establish one or more 
   initial media channels during the initial offer/answer exchange, 
   which would allow, for example, for a fast startup of audio.  
   Depending on the system design, it may be possible to re-use this 
   established channel using only CLUE mechanisms, thereby avoiding 
   further offer/answer exchanges.   

   One aspect of the protocol outlined herein and specified in 
   normative detail in companion documents is that it makes available 
   information regarding the Provider's capabilities to deliver 
   Media, and attributes related to that media such as their spatial 
   relationship, to the Media Consumer.  The operation of the 
   Renderer inside the Consumer is unspecified in that it can choose 
   to ignore some information provided by the Provider, and/or not 
   render media streams available from the Provider (although it has 
   to follow the CLUE protocol and, therefore, has to "accept" the 
   Provider's information).  All CLUE protocol mechanisms are 
   optional in the Consumer in the sense that, while the Consumer 
   must be able to receive (and, potentially, gracefully acknowledge) 
   CLUE messages, it is free to ignore the information provided 
   therein.  Obviously, this is not a particularly sensible design 
   choice. 

   Legacy devices are defined herein as those Endpoints and MCUs 
   that do not support the setup and use of the CLUE channel.  The 
   notion of a device being a legacy device is established during the 
   initial offer/answer exchange, in which the legacy device will not 
   understand the offer for the CLUE channel and will, therefore, 
   reject it.  This is the indication for the CLUE-implementing 
   Endpoint or 
   MCU that the other side of the communication is not compliant with 
   CLUE, and to fall back to whatever mechanism was used before the 
   introduction of CLUE. 

   As for the media, Provider and Consumer have an end-to-end 
   communication relationship with respect to (RTP transported) 
   media; and the mechanisms described herein and in companion 
   documents do not change the aspects of setting up those RTP flows 
   and sessions.  However, it should be noted that forms of RTP 
   multiplexing of multiple RTP flows onto the same transport address 
   are developed concurrently with the CLUE suite of specifications, 
   and it is widely expected that most, if not all, Endpoints or MCUs 
   supporting CLUE will also support those mechanisms.  Some design 
   choices made in this memo reflect this coincidence in spec 
   development timing.  

2. Terminology 

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL 
   NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" 
   in this document are to be interpreted as described in RFC 2119 
   [RFC2119]. 

3. Definitions 

   The terms defined below are used throughout this memo and 
   companion documents and they are normative.  In order to easily 
   identify the use of a defined term, those terms are capitalized. 

   Audio Capture: Media Capture for audio.  Denoted as ACn. 

   Camera-Left and Right: For media captures, camera-left and camera-
   right are from the point of view of a person observing the 
   rendered media.  They are the opposite of stage-left and stage- 
   right. 

   Capture Device: A device that converts audio and video input into 
   an electrical signal, in most cases to be fed into a media 
   encoder. 

   Cameras and microphones are examples of capture devices. 

   Capture Encoding: A specific encoding of a media capture, to be 
   sent by a media provider to a media consumer via RTP. 

   Capture Scene: a structure representing the scene that is captured 
   by a collection of capture devices.  A capture scene includes 
   attributes and one or more capture scene entries, with each entry 
   including one or more media captures. 

   Capture Scene Entry: a list of media captures of the same media 
   type that together form one way to represent the capture scene. 

   Conference: used as defined in [RFC4353], A Framework for 
   Conferencing within the Session Initiation Protocol (SIP). 

   Individual Encoding: A variable with a set of attributes that 
   describes the maximum values of a single audio or video capture 
   encoding.  The attributes include maximum bandwidth and, for 
   video, maximum macroblocks (for H.264), maximum width, maximum 
   height, and maximum frame rate. 

   Encoding Group: A set of encoding parameters representing a total 
   media encoding capability to be sub-divided across potentially 
   multiple Individual Encodings. 

    

   Endpoint: The logical point of final termination through 
   receiving, decoding and rendering, and/or initiation through 
   capturing, encoding, and sending of media streams.  An endpoint 
   consists of one or more physical devices which source and sink 
   media streams, and exactly one [RFC4353] Participant (which, in 
   turn, includes exactly one SIP User Agent).  In contrast to an 
   endpoint, an MCU may also send and receive media streams, but it 
   is neither the initiator nor the final terminator in the sense that 
   Media is Captured or Rendered. Endpoints can be anything from 
   multiscreen/multicamera rooms to handheld devices. 

   Front: the portion of the room closest to the cameras.  Going 
   towards the back of the room means moving away from the cameras. 

   MCU: Multipoint Control Unit (MCU) - a device that connects two or 
   more endpoints together into one single multimedia conference 
   [RFC5117].  An MCU includes an [RFC4353] Mixer.  [Edt. RFC4353 is 
   tardy in requiring that media from the mixer be sent to EACH 
   participant.  I think we have practical use cases where this is 
   not the case.  But the bug (if it is one) is in 4353 and not 
   herein.] 

    

   Media: Any data that, after suitable encoding, can be conveyed 
   over RTP, including audio, video or timed text. 

   Media Capture: a source of Media, such as from one or more Capture 
   Devices.  A Media Capture (MC) may be the source of one or more 
   capture encodings.  A Media Capture may also be constructed from 
   other Media streams.  A middle box can express Media Captures that 
   it constructs from Media streams it receives. 

   Media Consumer: an Endpoint or middle box that receives Media 
   streams. 

   Media Provider: an Endpoint or middle box that sends Media 
   streams. 

   Model: a set of assumptions a telepresence system of a given 
   vendor adheres to and expects the remote telepresence system(s) 
   also to adhere to. 

   Plane of Interest: The spatial plane containing the most relevant 
   subject matter. 

   Render: the process of generating a representation from Media, 
   such as displayed motion video or sound emitted from loudspeakers. 

   Simultaneous Transmission Set: a set of media captures that can be 
   transmitted simultaneously from a Media Provider. 

   Spatial Relation: The arrangement in space of two objects, in 
   contrast to relation in time or other relationships.  See also 
   Camera-Left and Right. 

   Stage-Left and Right: For media captures, stage-left and stage-
   right are the opposite of camera-left and camera-right.  For the 
   case of a person facing (and captured by) a camera, stage-left and 
   stage-right are from the point of view of that person. 

   Stream: a capture encoding sent from a media provider to a media 
   consumer via RTP [RFC3550]. 

   Stream Characteristics: the media stream attributes commonly used 
   in non-CLUE SIP/SDP environments (such as: media codec, bit rate, 
   resolution, profile/level etc.) as well as CLUE specific 
   attributes, such as the ID of a capture or a spatial location. 

   Telepresence: an environment that gives non-co-located users or 
   user groups a feeling of (co-located) presence - the feeling that 
   a Local user is in the same room with other Local users and the 
   Remote parties.  The inclusion of Remote parties is achieved 
   through multimedia communication including at least audio and 
   video signals of high fidelity. 

   Video Capture: Media Capture for video.  Denoted as VCn. 

   Video composite: A single image that is formed from combining 
   visual elements from separate sources. 

4. Overview of the Framework/Model 

   The CLUE framework specifies how multiple media streams are to be 
   handled in a telepresence conference. 

   The main goals include: 

   o  Interoperability 

   o  Extensibility 

   o  Flexibility 

   Interoperability is achieved by the media provider describing the 
   relationships between media streams in constructs that are 
   understood by the consumer, who can then render the media.  
   Extensibility is achieved through abstractions and the generality 
   of the model, making it easy to add new parameters.  Flexibility 
   is achieved largely by having the consumer choose what content and 
   format it wants to receive from what the provider is capable of 
   sending. 

   A transmitting endpoint or MCU describes specific aspects of the 
   content of the media and the formatting of the media streams it 
   can send (advertisement); the receiving end responds to the 
   provider by specifying which content and media streams it wants to 
   receive (configuration).  The provider then transmits the 
   requested content in the specified streams. 

   This advertisement and configuration occurs at call initiation but 
   may also happen at any time throughout the conference, whenever 
   there is a change in what the consumer wants or the provider can 
   send. 

   An endpoint or MCU typically acts as both provider and consumer at 
   the same time, sending advertisements and sending configurations 
   in response to receiving advertisements.  (It is possible to be 
   just one or the other.) 

   The data model is based around two main concepts: a capture and an 
   encoding.  A media capture (MC), such as audio or video, describes 
   the content a provider can send.  Media captures are described in 
   terms of CLUE-defined attributes, such as spatial relationships 
   and purpose of the capture.  Providers tell consumers which media 
   captures they can provide, described in terms of the media capture 
   attributes. 

   A provider organizes its media captures that represent the same 
   scene into capture scenes.  A consumer chooses which media 
   captures it wants to receive according to the capture scenes sent 
   by the provider. 

   In addition, the provider sends the consumer a description of the 
   individual encodings it can send in terms of the media attributes 
   of the encodings, in particular, well-known audio and video 
   parameters such as bandwidth, frame rate, and macroblocks per second. 

   The provider also specifies constraints on its ability to provide 
   media, and the consumer must take these into account in choosing 
   the content and capture encodings it wants.  Some constraints are 
   due to the physical limitations of devices - for example, a camera 
   may not be able to provide zoom and non-zoom views simultaneously.  
   Other constraints are system based constraints, such as maximum 
   bandwidth and maximum macroblocks/second. 

   The following sections discuss these constructs and processes in 
   detail, followed by use cases showing how the framework 
   specification can be used. 
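
   To make the model concrete before diving into the details, the 
   sketch below (Python, purely illustrative - CLUE defines no 
   language bindings, and all names here are hypothetical) shows one 
   way an implementation might represent captures, capture scene 
   entries, and capture scenes: 

      from dataclasses import dataclass, field
      from typing import List

      @dataclass
      class MediaCapture:
          capture_id: str         # e.g. "VC0" (video) or "AC0" (audio)
          media_type: str         # "audio" or "video"
          encoding_group_id: str  # encodeGroupID of the associated group

      @dataclass
      class CaptureSceneEntry:
          # All captures in one entry must be of the same media type.
          media_type: str
          captures: List[MediaCapture] = field(default_factory=list)

      @dataclass
      class CaptureScene:
          # Each entry is one alternative representation of the scene;
          # a consumer may pick captures from one entry or mix across
          # entries.
          entries: List[CaptureSceneEntry] = field(default_factory=list)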

5. Spatial Relationships 

   In order for a consumer to perform a proper rendering, it often 
   needs spatial information about the streams it is receiving.  CLUE 
   defines a coordinate system that allows media 
   providers to describe the spatial relationships of their media 
   captures to enable proper scaling and spatial rendering of their 
   streams.  The coordinate system is based on a few principles: 

   o  Simple systems which do not have multiple Media Captures to 
      associate spatially need not use the coordinate model. 

   o  Coordinates can either be in real, physical units 
      (millimeters), have an unknown scale or have no physical scale.  
      Systems which know their physical dimensions should always 
      provide those real-world measurements.  Systems which don't 
      know specific physical dimensions but still know relative 
      distances should use 'unknown scale'.  'No scale' is intended 
      to be used where Media Captures from different devices (with 
      potentially different scales) will be forwarded alongside one 
      another (e.g. in the case of a middle box). 

      *  "millimeters" means the scale is in millimeters 

      *  "Unknown" means the scale is not necessarily millimeters, 
         but the scale is the same for every capture in the capture 
         scene. 

      *  "No Scale" means the scale could be different for each 
         capture- an MCU provider that advertises two adjacent 
         captures and picks sources (which can change quickly) from 
         different endpoints might use this value; the scale could be 
         different and changing for each capture.  But the areas of 
         capture still represent a spatial relation between captures. 

   o  The coordinate system is Cartesian X, Y, Z with the origin at a 
      spot of the provider's choosing.  The provider must use the 
      same coordinate system with same scale and origin for all 
      coordinates within the same capture scene. 

   The direction of increasing coordinate values is: 
   X increases from camera left to camera right 
   Y increases from front to back 
   Z increases from low to high 
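
   As a minimal illustration of these axis conventions (a 
   hypothetical Python representation, not part of the protocol): 

      from dataclasses import dataclass

      @dataclass
      class Point:
          x: float  # increases from camera-left to camera-right
          y: float  # increases from front to back
          z: float  # increases from low to high

      # With the "millimeters" scale, a camera 1 m right of the
      # origin, mounted 1.2 m high, might be described as:
      camera_position = Point(x=1000.0, y=0.0, z=1200.0)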

6. Media Captures and Capture Scenes 

   This section describes how media providers can describe the 
   content of media to consumers. 

6.1. Media Captures 

   Media captures are the fundamental representations of streams that 
   a device can transmit.  What a Media Capture actually represents 
   is flexible: 

   o  It can represent the immediate output of a physical source 
      (e.g. camera, microphone) or 'synthetic' source (e.g. laptop 
      computer, DVD player). 

   o  It can represent the output of an audio mixer or video composer 

   o  It can represent a concept such as 'the loudest speaker' 

   o  It can represent a conceptual position such as 'the leftmost 
      stream' 

   To distinguish between multiple instances, video and audio 
   captures are numbered such as: VC1, VC2 and AC1, AC2.  VC1 and VC2 
   refer to two different video captures and AC1 and AC2 refer to two 
   different audio captures. 

   Each Media Capture can be associated with attributes to describe 
   what it represents. 

6.1.1. Media Capture Attributes 

   Media Capture Attributes describe static information about the 
   captures.  A provider uses the media capture attributes to 
   describe the media captures to the consumer.  The consumer will 
   select the captures it wants to receive.  Attributes are defined 
   by a variable and its value.  The currently defined attributes and 
   their values are: 

   Content: {slides, speaker, sl, main, alt} 

   A field with enumerated values which describes the role of the 
   media capture and can be applied to any media type.  The values 
   for this attribute are the same as the mediacnt values for the 
   'content' attribute defined in [RFC4796].  This attribute can have 
   multiple values, for example content={main, speaker}. 

   Composed: {true, false} 

   A field with a Boolean value which indicates whether or not the 
   Media Capture is a mix (audio) or composition (video) of streams. 

   This attribute is useful for a media consumer to avoid nesting a 
   composed video capture into another composed capture or rendering. 
   This attribute is not intended to describe the layout a media 
   provider uses when composing video streams. 

   Audio Channel Format: {mono, stereo} A field with enumerated 
   values which describes the method of encoding used for audio. 

   A value of 'mono' means the Audio Capture has one channel. 

   A value of 'stereo' means the Audio Capture has two audio 
   channels, left and right. 

   This attribute applies only to Audio Captures.  A single stereo 
   capture is different from two mono captures that have a left-right 
   spatial relationship.  A stereo capture maps to a single RTP 
   stream, while each mono audio capture maps to a separate RTP 
   stream. 

   Switched: {true, false} 

   A field with a Boolean value which indicates whether or not the 
   Media Capture represents the (dynamic) most appropriate subset of 
   a 'whole'.  What is 'most appropriate' is up to the provider and 
   could be the active speaker, a lecturer or a VIP. 

   Point of Capture: {(X, Y, Z)} 

   A field with a single Cartesian (X, Y, Z) point value which 
   describes the spatial location, virtual or physical, of the 
   capturing device (such as camera). 

   When the Point of Capture attribute is specified, it must include 
   X, Y and Z coordinates.  If the point of capture is not specified, 
   it means the consumer should not assume anything about the spatial 
   location of the capturing device.  Even if the provider specifies 
   an area of capture attribute, it does not need to specify the 
   point of capture. 

   Point on Line of Capture: {(X,Y,Z)} 

   A field with a single Cartesian (X, Y, Z) point value (virtual or 
   physical) which describes a position in space of a second point on 
   the axis of the capturing device; the first point being the Point 
   of Capture (see above).  This point MUST lie between the Point of 
   Capture and the Area of Capture. 

   The Point on Line of Capture MUST be ignored if the Point of 
   Capture is not present for this capture device.  When the Point on 
   Line of Capture attribute is specified, it must include X, Y and Z 
   coordinates.  These coordinates MUST NOT be identical to the Point 
   of Capture coordinates.  If the Point on Line of Capture is not 
   specified, no assumptions are made about the axis of the capturing 
   device. 

   Area of Capture: 

   {bottom left(X1, Y1, Z1), bottom right(X2, Y2, Z2), top left(X3, 
   Y3, Z3), top right(X4, Y4, Z4)} 

   A field with a set of four (X, Y, Z) points as a value which 
   describe the spatial location of what is being "captured".  By 
   comparing the Area of Capture for different Media Captures within 
   the same capture scene a consumer can determine the spatial 
   relationships between them and render them correctly. 

   The four points should be co-planar.  The four points form a 
   quadrilateral, not necessarily a rectangle. 

   The quadrilateral described by the four (X, Y, Z) points defines 
   the plane of interest for the particular media capture. 

   If the area of capture attribute is specified, it must include X, 
   Y and Z coordinates for all four points.  If the area of capture 
   is not specified, it means the media capture is not spatially 
   related to any other media capture (but this can change in a 
   subsequent provider advertisement). 

   For a switched capture that switches between different sections 
   within a larger area, the area of capture should use coordinates 
   for the larger potential area. 

   EncodingGroup: {<encodeGroupID value>} 

   A field with a value equal to the encodeGroupID of the encoding 
   group associated with the media capture. 

   Max Capture Encodings: {unsigned integer} 

   An optional attribute indicating the maximum number of capture 
   encodings that can be simultaneously active for the media capture. 
   If absent, this parameter defaults to 1.  The minimum value for 
   this attribute is 1.  The number of simultaneous capture encodings 
   is also limited by the restrictions of the encoding group for the 
   media capture. 
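
   The attributes above can be collected into a single illustrative 
   structure (field names are hypothetical; the normative syntax is 
   defined in companion documents), together with a check of the 
   Point on Line of Capture rules: 

      from dataclasses import dataclass, field
      from typing import List, Optional, Tuple

      Point = Tuple[float, float, float]        # (X, Y, Z)
      Quad = Tuple[Point, Point, Point, Point]  # area-of-capture corners

      @dataclass
      class MediaCaptureAttributes:
          content: List[str] = field(default_factory=list)  # e.g. ["main"]
          composed: bool = False
          audio_channel_format: Optional[str] = None  # "mono"/"stereo"
          switched: bool = False
          point_of_capture: Optional[Point] = None
          point_on_line_of_capture: Optional[Point] = None
          area_of_capture: Optional[Quad] = None
          encoding_group: Optional[str] = None  # an encodeGroupID
          max_capture_encodings: int = 1        # defaults to 1 if absent

      def line_of_capture_usable(a: MediaCaptureAttributes) -> bool:
          # The Point on Line of Capture is ignored without a Point
          # of Capture, and must not equal it.
          if a.point_on_line_of_capture is None:
              return True
          return (a.point_of_capture is not None and
                  a.point_of_capture != a.point_on_line_of_capture)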

6.2. Capture Scene 

   In order for a provider's individual media captures to be used 
   effectively by a consumer, the provider organizes the media 
   captures into capture scenes, with the structure and contents of 
   these capture scenes being sent from the provider to the consumer. 

   A capture scene is a structure representing the scene that is 
   captured by a collection of capture devices.  A capture scene 
   includes one or more capture scene entries, with each entry 
   including one or more media captures.  A capture scene represents, 
   for example, the video image of a group of people seated next to 
   each other, along with the sound of their voices, which could be 
   represented by some number of VCs and ACs in the capture scene 
   entries.  A middle box may also express capture scenes that it 
   constructs from media streams it receives. 

   A provider may advertise multiple capture scenes or just a single 
   capture scene.  A media provider might typically use one capture 
   scene for main participant media and another capture scene for a 
   computer generated presentation.  A capture scene may include more 
   than one type of media.  For example, a capture scene can include 
   several capture scene entries for video captures, and several 
   capture scene entries for audio captures. 

   A provider can express spatial relationships between media 
   captures that are included in the same capture scene.  But there 
   is no spatial relationship between media captures that are in 
   different capture scenes. 

   A media provider arranges media captures in a capture scene to 
   help the media consumer choose which captures it wants.  The 
   capture scene entries in a capture scene are different 
   alternatives the provider is suggesting for representing the 
   capture scene.  The media consumer can choose to receive all media 
   captures from one capture scene entry for each media type (e.g. 
   audio and video), or it can pick and choose media captures 
   regardless of how the provider arranges them in capture scene 
   entries.  Different capture scene entries of the same media type 
   are not necessarily mutually exclusive alternatives. 

   Media captures within the same capture scene entry must be of the 
   same media type - it is not possible to mix audio and video 
   captures in the same capture scene entry, for instance.  The 
   provider must be capable of encoding and sending all media 
   captures in a single entry simultaneously.  A consumer may decide 
   to receive all the media captures in a single capture scene entry, 
   but a consumer could also decide to receive just a subset of those 
   captures.  A consumer can also decide to receive media captures 
   from different capture scene entries. 

   When a provider advertises a capture scene with multiple entries, 
   it is essentially signaling that there are multiple 
   representations of the same scene available.  In some cases, these 
   multiple representations would typically be used simultaneously 
   (for instance a "video entry" and an "audio entry").  In some 
   cases the entries would conceptually be alternatives (for instance 
   an entry consisting of 3 video captures versus an entry consisting 
   of just a single video capture).  In this latter example, the 
   provider would in the simple case end up providing to the consumer 
   the entry containing the number of video captures that most 
   closely matched the media consumer's number of display devices. 

   The following is an example of 4 potential capture scene entries 
   for an endpoint-style media provider: 

   1.  (VC0, VC1, VC2) - left, center and right camera video captures 

   2.  (VC3) - video capture associated with loudest room segment 

   3.  (VC4) - video capture with a zoomed-out view of all people in 
   the room 

   4.  (AC0) - main audio 

   The first entry in this capture scene example is a list of video 
   captures with a spatial relationship to each other.  Determination 
   of the order of these captures (VC0, VC1 and VC2) for rendering 
   purposes is accomplished through use of their Area of Capture 
   attributes.  The second entry (VC3) and the third entry (VC4) are 
   additional alternatives of how to capture the same room in 
   different ways.  The inclusion of the audio capture in the same 
   capture scene indicates that AC0 is associated with those video 
   captures, meaning it comes from the same scene.  The audio should 
   be rendered in conjunction with any rendered video captures from 
   the same capture scene. 
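
   The same example written out in the illustrative Python style 
   used earlier (self-contained; names are hypothetical): 

      from dataclasses import dataclass, field
      from typing import List

      @dataclass
      class Entry:
          media_type: str
          captures: List[str] = field(default_factory=list)

      example_scene = [
          Entry("video", ["VC0", "VC1", "VC2"]),  # left, center, right
          Entry("video", ["VC3"]),   # loudest room segment
          Entry("video", ["VC4"]),   # zoomed-out view of the room
          Entry("audio", ["AC0"]),   # main audio
      ]

      # A one-screen consumer might choose VC3 (or VC4) plus AC0; a
      # three-screen consumer might choose VC0, VC1, VC2 plus AC0.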

6.2.1. Capture scene attributes 

   Attributes can be applied to capture scenes as well as to 
   individual media captures.  Attributes specified at this level 
   apply to all constituent media captures. 

   Description attribute - list of {<description text>, <language 
   tag>} 

   The optional description attribute is a list of human readable 
   text strings which describe the capture scene.  If there is more 
   than one string in the list, then each string in the list should 
   contain the same description, but in a different language.  A 
   provider that advertises multiple capture scenes can provide 
   descriptions for each of them.  This attribute can contain text in 
   any number of languages. 

   The language tag identifies the language of the corresponding 
   description text.  The possible values for a language tag are the 
   values of the 'Subtag' column for the "Type: language" entries in 
   the "Language Subtag Registry" at [IANA-Lan] originally defined in 
   [RFC5646].  A particular language tag value MUST NOT be used more 
   than once in the description attribute list. 
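
   A quick check of this uniqueness rule might look as follows 
   (illustrative only): 

      def descriptions_valid(descriptions):
          # descriptions: list of (text, language_tag) pairs; each
          # language tag may appear at most once.
          tags = [tag for _, tag in descriptions]
          return len(tags) == len(set(tags))

      print(descriptions_valid([("Main room", "en"),
                                ("Salle principale", "fr")]))  # True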

   Area of Scene attribute 

   The area of scene attribute for a capture scene has the same 
   format as the area of capture attribute for a media capture.  The 
   area of scene is for the entire scene, which is captured by the 
   one or more media captures in the capture scene entries.  If the 
   provider does not specify the area of scene, but does specify 
   areas of capture, then the consumer may assume the area of scene 
   is greater than or equal to the outer extents of the individual 
   areas of capture. 

   Scale attribute 

   An optional attribute indicating if the numbers used for area of 
   scene, area of capture and point of capture are in terms of 
   millimeters, unknown scale factor, or not any scale, as described 
   in Section 5.  If any media captures have an area of capture 
   attribute or point of capture attribute, then this scale attribute 
   must also be defined.  The possible values for this attribute are: 

      "millimeters" 

      "unknown" 

      "no scale" 

6.2.2. Capture scene entry attributes 

   Attributes can be applied to capture scene entries.  Attributes 
   specified at this level apply to the capture scene entry as a 
   whole. 

   Scene-switch-policy: {site-switch, segment-switch} 

   A media provider uses this scene-switch-policy attribute to 
   indicate its support for different switching policies.  In the 
   provider's advertisement, this attribute can have multiple values, 
   which means the provider supports each of the indicated policies.  
   The consumer, when it requests media captures from this capture 
   scene entry, should also include this attribute but with only the 
   single value (from among the values indicated by the provider) 
   indicating the consumer's choice for which policy it wants the 
   provider to use.  If the provider does not support any of these 
   policies, it should omit this attribute. 
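
   A minimal sketch of this rule, with hypothetical names: 

      provider_policies = {"site-switch", "segment-switch"}
      consumer_choice = {"segment-switch"}  # exactly one provider value

      # The consumer picks a single policy from those advertised.
      assert len(consumer_choice) == 1
      assert consumer_choice <= provider_policies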

   The "site-switch" policy means all captures are switched at the 
   same time to keep captures from the same endpoint site together.  
   Let's say the speaker is at site A and everyone else is at a 
   "remote" site. 

   When the room at site A is shown, all the camera images from site 
   A are forwarded to the remote sites.  Therefore at each receiving 
   remote site, all the screens display camera images from site A. 
   This can be used to preserve full size image display, and also 
   provide full visual context of the displayed far end, site A. In 
   site switching, there is a fixed relation between the cameras in 
   each room and the displays in remote rooms.  The room or 
   participants being shown is switched from time to time based on 
   who is speaking or by manual control. 

   The "segment-switch" policy means different captures can switch at 
   different times, and can be coming from different endpoints.  
   Still using site A as the speaker's site, and "remote" to refer 
   to all the other sites, in segment switching, rather than sending 
   all the images from site A, only the image containing the speaker 
   at site A is shown.  The camera images of the current speaker and 
   previous speakers (if any) are forwarded to the other sites in the 
   conference. 

   Therefore the screens in each site are usually displaying images 
   from different remote sites - the current speaker at site A and 
   the previous ones.  This strategy can be used to preserve full 
   size image display, and also capture the non-verbal communication 
   between the speakers.  In segment switching, the display depends 
   on the activity in the remote rooms - generally, but not 
   necessarily, based on audio/speech detection. 

6.3. Simultaneous Transmission Set Constraints 

   The provider may have constraints or limitations on its ability to 
   send media captures.  One type is caused by the physical 
   limitations of capture mechanisms; these constraints are 
   represented by a simultaneous transmission set.  The second type 
   of limitation reflects the encoding resources available - 
   bandwidth and macroblocks/second.  This type of constraint is 
   captured by encoding groups, discussed below. 

   An endpoint or MCU can send multiple captures simultaneously, 
   however sometimes there are constraints that limit which captures 
   can be sent simultaneously with other captures.  A device may not 
   be able to be used in different ways at the same time.  Provider 
   advertisements are made so that the consumer will choose one of 
   several possible mutually exclusive usages of the device.  This 
   type of constraint is expressed in a Simultaneous Transmission 
   Set, which lists all the media captures that can be sent at the 
   same time.  This is easier to show in an example. 

   Consider the example of a room system where there are 3 cameras, 
   each of which can send a separate capture covering 2 persons each - 
   VC0, VC1, VC2.  The middle camera can also zoom out and show all 6 
   persons, VC3.  But the middle camera cannot be used in both modes 
   at the same time - it has to either show the space where 2 
   participants sit or the whole 6 seats, but not both at the same 
   time. 

   Simultaneous transmission sets are expressed as sets of the MCs 
   that could physically be transmitted at the same time (though it 
   may not make sense to do so).  In this example the two 
   simultaneous sets are shown in Table 1.  The consumer must make 
   sure that it chooses one, and only one, of the mutually exclusive 
   sets.  A consumer may choose any subset of the media captures in a 
   simultaneous set; it does not have to choose all the captures in a 
   simultaneous set if it does not want to receive all of them. 

                           +-------------------+ 
                           | Simultaneous Sets | 
                           +-------------------+ 
                           | {VC0, VC1, VC2}   | 
                           | {VC0, VC3, VC2}   | 
                           +-------------------+ 

                Table 1: Two Simultaneous Transmission Sets 

   A media provider includes the simultaneous sets in its provider 
   advertisement.  These simultaneous set constraints apply across 
   all the capture scenes in the advertisement.  The simultaneous 
   transmission sets MUST allow all the media captures in a 
   particular capture scene entry to be used simultaneously. 
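
   A consumer-side check of these rules could look as follows (a 
   minimal sketch; function and variable names are hypothetical): 

      from typing import List, Set

      def allowed(chosen: Set[str], sim_sets: List[Set[str]]) -> bool:
          # Chosen captures are transmittable if they all fit within
          # one simultaneous set; any subset of a set is allowed.
          return any(chosen <= s for s in sim_sets)

      sets = [{"VC0", "VC1", "VC2"}, {"VC0", "VC3", "VC2"}]  # Table 1

      print(allowed({"VC0", "VC2"}, sets))  # True: subset of a set
      print(allowed({"VC1", "VC3"}, sets))  # False: mixes the two sets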

7. Encodings 

   We have considered how providers can describe the content of media 
   to consumers.  We will now consider how the providers communicate 
   information about their abilities to send streams.  We introduce 
   two constructs - individual encodings and encoding groups.  
   Consumers will then map the media captures they want onto the 
   encodings, with the encoding parameters they want.  This process 
   is described below. 

7.1. Individual Encodings 

   An individual encoding represents a way to encode a media capture 
   to become a capture encoding, to be sent as an encoded media 
   stream from the media provider to the media consumer.  An 
   individual encoding has a set of parameters characterizing how the 
   media is encoded. 

   Different media types have different parameters, and different 
   encoding algorithms may have different parameters.  An individual 
   encoding can be assigned to only one capture encoding at a time.  

   The parameters of an individual encoding represent the maximum 
   values for certain aspects of the encoding.  A particular 
   instantiation into a capture encoding might use lower values than 
   these maximums. 

   The following tables show the parameters for video and audio 
   encodings. 

   +--------------+--------------------------------------------------+
   | Name         | Description                                      |
   +--------------+--------------------------------------------------+
   | encodeID     | A unique identifier for the individual encoding  |
   | maxBandwidth | Maximum number of bits per second                |
   | maxH264Mbps  | Maximum number of macroblocks per second:        |
   |              | ((width + 15) / 16) * ((height + 15) / 16) *     |
   |              | framesPerSecond                                  |
   | maxWidth     | Video resolution's maximum supported width,      |
   |              | expressed in pixels                              |
   | maxHeight    | Video resolution's maximum supported height,     |
   |              | expressed in pixels                              |
   | maxFrameRate | Maximum supported frame rate                     |
   +--------------+--------------------------------------------------+

               Table 2: Individual Video Encoding Parameters 

           +--------------+-----------------------------------+ 
           | Name         | Description                       | 
           +--------------+-----------------------------------+ 
           | maxBandwidth | Maximum number of bits per second | 
           +--------------+-----------------------------------+ 

               Table 3: Individual Audio Encoding Parameters 
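
   As a worked example of the maxH264Mbps formula from Table 2 (the 
   helper below is illustrative, not part of the protocol): 

      def h264_mbps(width: int, height: int, fps: int) -> int:
          # (n + 15) // 16 rounds a pixel dimension up to whole
          # 16x16 macroblocks.
          return ((width + 15) // 16) * ((height + 15) // 16) * fps

      # 1080p60: 120 * 68 * 60 macroblocks per second
      print(h264_mbps(1920, 1080, 60))  # 489600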

7.2. Encoding Group 

   An encoding group includes a set of one or more individual 
   encodings, plus some parameters that apply to the group as a 
   whole.  By grouping multiple individual encodings together, an 
   encoding group describes additional constraints on bandwidth and 
   other parameters for the group.  Table 4 shows the parameters and 
   individual encoding sets that are part of an encoding group. 

   +-------------------+-----------------------------------------------+
   | Name              | Description                                   |
   +-------------------+-----------------------------------------------+
   | encodeGroupID     | A unique identifier for the encoding group    |
   | maxGroupBandwidth | Maximum number of bits per second relating to |
   |                   | all encodings combined                        |
   | maxGroupH264Mbps  | Maximum number of macroblocks per second      |
   |                   | relating to all video encodings combined      |
   | videoEncodings[]  | Set of potential encodings (list of           |
   |                   | encodeIDs)                                    |
   | audioEncodings[]  | Set of potential encodings (list of           |
   |                   | encodeIDs)                                    |
   +-------------------+-----------------------------------------------+

                          Table 4: Encoding Group 

   When the individual encodings in a group are instantiated into 
   capture encodings, each capture encoding has a bandwidth that must 
   be less than or equal to the maxBandwidth for the particular 
   individual encoding.  The maxGroupBandwidth parameter gives the 
   additional restriction that the sum of all the individual capture 
   encoding bandwidths must be less than or equal to the 
   maxGroupBandwidth value. 

   Likewise, the sum of the macroblocks per second of each 
   instantiated encoding in the group must not exceed the 
   maxGroupH264Mbps value. 
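
   These two group-level checks can be expressed compactly (an 
   illustrative sketch; each capture encoding must additionally 
   respect the per-encoding maximums of its own individual encoding): 

      from typing import List, Tuple

      # Each instantiated capture encoding as (bandwidth, h264_mbps).
      def within_group_limits(encodings: List[Tuple[int, int]],
                              max_group_bw: int,
                              max_group_mbps: int) -> bool:
          total_bw = sum(bw for bw, _ in encodings)
          total_mbps = sum(mbps for _, mbps in encodings)
          return (total_bw <= max_group_bw and
                  total_mbps <= max_group_mbps)

      # Two 2 Mbit/s 720p30 streams in a 6 Mbit/s group with
      # maxGroupH264Mbps = 489600 (720p30 is 80 * 45 * 30 = 108000):
      print(within_group_limits([(2_000_000, 108000),
                                 (2_000_000, 108000)],
                                6_000_000, 489600))  # True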

   The following diagram illustrates the structure of a media 
   provider's Encoding Groups and their contents. 

   ,-------------------------------------------------. 
   |             Media Provider                      | 
   |                                                 | 
   |  ,--------------------------------------.       | 
   |  | ,--------------------------------------.     | 
   |  | | ,--------------------------------------.   | 
   |  | | |          Encoding Group              |   | 
   |  | | | ,-----------.                        |   | 
   |  | | | |           | ,---------.            |   | 
   |  | | | |           | |         | ,---------.|   | 
   |  | | | | Encoding1 | |Encoding2| |Encoding3||   | 
   |  `.| | |           | |         | `---------'|   | 
   |    `.| `-----------' `---------'            |   | 
   |      `--------------------------------------'   | 
   `-------------------------------------------------' 

                    Figure 1: Encoding Group Structure 

   A media provider advertises one or more encoding groups.  Each 
   encoding group includes one or more individual encodings.  Each 
   individual encoding can represent a different way of encoding 
   media.  For example, one individual encoding may be 1080p60 video, 
   another could be 720p30, with a third being CIF. 
    

   While a typical 3 codec/display system might have one encoding 
   group per "codec box", there are many possibilities for the number 
   of encoding groups a provider may be able to offer and for the 
   encoding values in each encoding group. 

   There is no requirement for all encodings within an encoding group 
   to be instantiated at once. 

8. Associating Media Captures with Encoding Groups 

   Every media capture is associated with an encoding group, which is 
   used to instantiate that media capture into one or more capture 
   encodings.  Each media capture has an encoding group attribute.  
   The value of this attribute is the encodeGroupID for the encoding 
   group with which it is associated.  More than one media capture 
   may use the same encoding group. 

   The maximum number of streams that can result from a particular 
   encoding group is equal to the number of individual 
   encodings in the group.  The actual number of capture encodings 
   used at any time may be less than this maximum.  Any of the media 
   captures that use a particular encoding group can be encoded 
   according to any of the individual encodings in the group.  If 
   there are multiple individual encodings in the group, then the 
   media consumer can configure the media provider to encode a single 
   media capture into multiple different capture encodings at the 
   same time, subject to the Max Capture Encodings constraint, with 
   each capture encoding following the constraints of a different 
   individual encoding. 

   The Encoding Groups MUST allow all the media captures in a 
   particular capture scene entry to be used simultaneously. 
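
   A sketch of how a consumer (or provider) might validate a proposed 
   mapping of captures to individual encodings under the rules in 
   this section - each encoding must come from the capture's group, 
   and an individual encoding serves at most one capture encoding at 
   a time (illustrative names only): 

      from typing import Dict, List, Tuple

      def assignment_valid(capture_group: Dict[str, str],
                           group_encodings: Dict[str, List[str]],
                           chosen: List[Tuple[str, str]]) -> bool:
          # chosen: (capture_id, encode_id) pairs.
          used = set()
          for capture_id, encode_id in chosen:
              group = capture_group[capture_id]
              if encode_id not in group_encodings[group]:
                  return False  # encoding not in the capture's group
              if encode_id in used:
                  return False  # individual encoding already in use
              used.add(encode_id)
          return True

      # VC0 and VC1 share group "EG0", which has two encodings:
      print(assignment_valid({"VC0": "EG0", "VC1": "EG0"},
                             {"EG0": ["ENC0", "ENC1"]},
                             [("VC0", "ENC0"), ("VC1", "ENC1")]))  # True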

9. Consumer's Choice of Streams to Receive from the Provider 

   After receiving the provider's advertised media captures and 
   associated constraints, the consumer must choose which media 
   captures it wishes to receive, and which individual encodings from 
   the provider it wants to use to encode the captures.  Each media 
   capture has an encoding group ID attribute which specifies which 
   individual encodings are available to be used for that media 
   capture. 

   For each media capture the consumer wants to receive, it 
   configures one or more of the encodings in that capture's encoding 
   group.  The consumer does this by telling the provider the 
   resolution, frame rate, bandwidth, etc. when asking for capture 
   encodings for its chosen captures.  Upon receipt of this 
   configuration command from the consumer, the provider generates a 
   stream for each such configured capture encoding and sends those 
   streams to the consumer. 

   The consumer must have received at least one capture advertisement 
   from the provider to be able to configure the provider's 
   generation of media streams. 

   The consumer is able to change its configuration of the provider's 
   encodings any number of times during the call, either in response 
   to a new capture advertisement from the provider or autonomously.  
   The consumer need not send a new configure message to the provider 
   when it receives a new capture advertisement from the provider 
   unless the contents of the new capture advertisement cause the 
   consumer's current configure message to become invalid. 

   When choosing which streams to receive from the provider, and the 
   encoding characteristics of those streams, the consumer needs to 
   take several things into account: its local preference, 
   simultaneity restrictions, and encoding limits. 

9.1. Local preference 

   A variety of local factors will influence the consumer's choice of 
   streams to be received from the provider: 

   o  if the consumer is an endpoint, it is likely that it would 
      choose, where possible, to receive video and audio captures 
      that match the number of display devices and audio system it 
      has 

   o  if the consumer is a middle box such as an MCU, it may choose 
      to receive loudest speaker streams (in order to perform its own 
      media composition) and avoid pre-composed video captures 

   o  user choice (for instance, selection of a new layout) may 
      result in a different set of media captures, or different 
      encoding characteristics, being required by the consumer 

9.2. Physical simultaneity restrictions 

   There may be physical simultaneity constraints imposed by the 
   provider that affect the provider's ability to simultaneously send 
   all of the captures the consumer would wish to receive.  For 
   instance, a middle box such as an MCU, when connected to a multi-
   camera room system, might prefer to receive both individual camera 
   streams of the people present in the room and an overall view of 
   the room from a single camera.  Some endpoint systems might be 
   able to provide both of these sets of streams simultaneously, 
   whereas others may not (if the overall room view were produced by 
   changing the zoom level on the center camera, for instance). 

9.3. Encoding and encoding group limits 

   Each of the provider's encoding groups has limits on bandwidth and 
   macroblocks per second, and the constituent potential encodings 
   have limits on the bandwidth, macroblocks per second, video frame 
   rate, and resolution that can be provided.  When choosing the 
   media captures to be received from a provider, a consumer device 
   must ensure that the encoding characteristics requested for each 
   individual media capture fit within the capability of the 
   encoding it is being configured to use, as well as ensuring that 
   the combined encoding characteristics for media captures fit 
   within the capabilities of their associated encoding groups.  In 
   some cases, this could cause an otherwise "preferred" choice of 
   capture encodings to be passed over in favor of different capture 
   encodings - for instance, if a set of 3 media captures could only 
   be provided at a low resolution then a 3 screen device could 
   switch to favoring a single, higher quality, capture encoding. 
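
   A consumer-side feasibility check along these lines might look 
   like the following sketch.  This is a rough illustration only: 
   the request tuple layout and the 16x16-macroblock arithmetic are 
   assumptions layered on the limit names used in this document's 
   examples. 

      def mbps_for(width, height, frame_rate):
          # Macroblocks are 16x16 pixels; this returns macroblocks
          # per second for a given resolution and frame rate.
          blocks = ((width + 15) // 16) * ((height + 15) // 16)
          return blocks * frame_rate

      def fits(requests, group):
          # requests: list of (encoding, width, height, rate, bw)
          total_mbps = 0
          total_bw = 0
          for enc, w, h, rate, bw in requests:
              if w > enc.max_width or h > enc.max_height:
                  return False
              if rate > enc.max_frame_rate:
                  return False
              if bw > enc.max_bandwidth:
                  return False
              if mbps_for(w, h, rate) > enc.max_h264_mbps:
                  return False
              total_mbps += mbps_for(w, h, rate)
              total_bw += bw
          # The combined load must also fit the group-wide limits.
          return (total_mbps <= group.max_group_h264_mbps and
                  total_bw <= group.max_group_bandwidth)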

9.4. Message Flow 

   The following diagram shows the basic flow of messages between a 
   media provider and a media consumer.  The usage of the "capture 
   advertisement" and "configure encodings" messages is described 
   above. The consumer also sends its own capability message to the 
   provider which may contain information about its own capabilities 
   or restrictions. 

   Diagram for Message Flow 

            Media Consumer                         Media Provider 
            --------------                         ------------ 
                  |                                     | 
                  |----- Consumer Capability ---------->| 
                  |                                     | 
                  |                                     | 
                  |<---- Capture advertisement ---------| 
                  |                                     | 
                  |                                     | 
                  |------ Configure encodings --------->| 
                  |                                     | 

   In order for a maximally-capable provider to be able to advertise 
   a manageable number of video captures to a consumer, it is 
   potentially useful for the consumer, at the start of CLUE, to 
   inform the provider of its capabilities.  One example here would 
   be the video capture attribute set - a consumer could tell the 
   provider the complete set of video capture attributes it is able 
   to understand, so that the provider can tailor the capture scenes 
   it advertises to the consumer. 

   TBD - the content of the consumer capability message needs to be 
   better defined.  The authors believe there is a need for this 
   message, but have not worked out the details yet. 

10. Extensibility 

   One of the most important characteristics of the Framework is its 
   extensibility.  Telepresence is a relatively new industry, and 
   while we can foresee certain directions, we do not know 
   everything about how it will develop.  The standard for 
   interoperability and handling of multiple streams must be future-
   proof.  The framework itself is inherently extensible through 
   expanding the data model types.  For example: 

   o  Adding more types of media, such as telemetry, can be done by 
      defining additional types of captures in addition to audio and 
      video. 

   o  Adding new functionality, such as 3-D video, will require 
      additional attributes describing the captures. 

   o  Adding new codecs, such as H.265, can be accomplished by 
      defining new encoding variables. 

   The infrastructure is designed to be extended rather than 
   requiring new infrastructure elements.  Extension comes through 
   adding to defined types. 

   Assuming the implementation is in something like XML, adding data 
   elements and attributes makes extensibility easy. 
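
   For instance, in a hypothetical (non-normative) sketch in the 
   same vein as the earlier ones, a capture description that carries 
   its attributes as open-ended name/value pairs can absorb new 
   capture types and attributes without structural change; the 
   "3d-mode" attribute and its value below are invented 
   placeholders, not defined values: 

      @dataclass
      class Capture:
          capture_id: str
          media_type: str  # "audio", "video", or a future type
          attributes: dict = field(default_factory=dict)

      # A future 3-D extension only adds an attribute; consumers
      # ignore attributes they do not understand.
      vc = Capture("VC0", "video",
                   {"content": "main", "3d-mode": "side-by-side"})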

11. Examples - Using the Framework 

   This section shows in more detail some examples of how to use the 
   framework to represent typical cases for telepresence rooms.  
   First an endpoint is illustrated, then an MCU case is shown. 

11.1. Three screen endpoint media provider 

   Consider an endpoint with the following description: 

   3 cameras, 3 displays, a 6 person table 

   o  Each video device can provide one capture for each 1/3 section 
      of the table 

   o  A single capture representing the active speaker can be 
      provided 

   o  A single capture representing the active speaker with the other 
      2 captures shown picture in picture within the stream can be 
      provided 

   o  A capture showing a zoomed out view of all 6 seats in the room 
      can be provided 

   The audio and video captures for this endpoint can be described as 
   follows. 

   Video Captures: 

   o  VC0- (the camera-left camera stream), encoding group=EG0, 
      content=main, switched=false 

   o  VC1- (the center camera stream), encoding group=EG1, 
      content=main, switched=false 

   o  VC2- (the camera-right camera stream), encoding group=EG2, 
      content=main, switched=false 

   o  VC3- (the loudest panel stream), encoding group=EG1, 
      content=main, switched=true 

   o  VC4- (the loudest panel stream with PiPs), encoding group=EG1, 
      content=main, composed=true, switched=true 

   o  VC5- (the zoomed out view of all people in the room), encoding 
      group=EG1, content=main, composed=false, switched=false 

   o  VC6- (presentation stream), encoding group=EG1, content=slides, 
      switched=false 

   The following diagram is a top view of the room with 3 cameras, 3 
   displays, and 6 seats.  Each camera is capturing 2 people.  The 
   six seats are not all in a straight line. 

      ,-. D 
     (   )`--.__        +---+ 
      `-' /     `--.__  |   | 
    ,-.  |            `-.._ |_-+Camera 2 (VC2) 
   (   ).'        ___..-+-''`+-+ 
    `-' |_...---''      |   | 
    ,-.c+-..__          +---+ 
   (   )|     ``--..__  |   | 
    `-' |             ``+-..|_-+Camera 1 (VC1) 
    ,-. |            __..--'|+-+ 
   (   )|     __..--'   |   | 
    `-'b|..--'          +---+ 
    ,-. |``---..___     |   | 
   (   )\          ```--..._|_-+Camera 0 (VC0) 
    `-'  \             _..-''`-+ 
     ,-. \      __.--'' |   | 
    (   ) |..-''        +---+ 
     `-' a 
    

   The two points labeled b and c are intended to be at the 
   midpoints between the seating positions, where the fields of view 
   of adjacent cameras intersect. 

   The plane of interest for VC0 is a vertical plane that intersects 
   points 'a' and 'b'.  

   The plane of interest for VC1 intersects points 'b' and 'c'. The 
   plane of interest for VC2 intersects points 'c' and 'd'. 

   This example uses an area scale of millimeters. 

   Areas of capture: 

       bottom left    bottom right  top left         top right 
   VC0 (-2011,2850,0) (-673,3000,0) (-2011,2850,757) (-673,3000,757) 
   VC1 ( -673,3000,0) ( 673,3000,0) ( -673,3000,757) ( 673,3000,757) 
    VC2 (  673,3000,0) (2011,2850,0) (  673,3000,757) (2011,2850,757) 
    VC3 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,2850,757) 
    VC4 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,2850,757) 
    VC5 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,2850,757) 
   VC6 none 

   Points of capture: 
   VC0 (-1678,0,800) 
   VC1 (0,0,800) 
   VC2 (1678,0,800) 
   VC3 none 
   VC4 none 
   VC5 (0,0,800) 
   VC6 none 

   In this example, the right edge of the VC0 area lines up with the 
   left edge of the VC1 area.  It does not have to be this way; there 
   could be a gap or an overlap.  One additional thing to note for 
   this example is that the distance from a to b is equal to the 
   distances from b to c and from c to d.  All these distances are 
   1346 mm.  This is the planar width of each area of capture for 
   VC0, VC1, and VC2. 

   Note the text in parentheses (e.g. "the camera-left camera 
   stream") is not explicitly part of the model; it is just 
   explanatory text for this example and is not included in the 
   model with the media captures and attributes.  Also, the 
   "composed" boolean attribute doesn't say anything about how a 
   capture is composed, so the media consumer can't tell, based on 
   this attribute, that VC4 is composed of a "loudest panel with 
   PiPs". 

   Audio Captures: 

   o  AC0 (camera-left), encoding group=EG3, content=main, channel 
      format=mono 

   o  AC1 (camera-right), encoding group=EG3, content=main, channel 
      format=mono 

   o  AC2 (center) encoding group=EG3, content=main, channel 
      format=mono 

   o  AC3 (a simple pre-mixed audio stream for the whole room), 
      encoding group=EG3, content=main, channel format=mono 

   o  AC4 (the audio stream associated with the presentation video), 
      encoding group=EG3, content=slides, channel format=mono 

   Areas of capture: 

       bottom left    bottom right  top left         top right 
   AC0 (-2011,2850,0) (-673,3000,0) (-2011,2850,757) (-673,3000,757) 
    AC1 (  673,3000,0) (2011,2850,0) (  673,3000,757) (2011,2850,757) 
    AC2 ( -673,3000,0) ( 673,3000,0) ( -673,3000,757) ( 673,3000,757) 
    AC3 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,2850,757) 
   AC4 none 

   The physical simultaneity information is: 

      Simultaneous transmission set #1 {VC0, VC1, VC2, VC3, VC4, VC6} 

      Simultaneous transmission set #2 {VC0, VC2, VC5, VC6} 

   This constraint indicates it is not possible to use all the VCs 
   at the same time.  VC5 cannot be used at the same time as VC1, 
   VC3, or VC4.  Also, using every member in a set simultaneously 
   may not make sense - for example VC3 (loudest) and VC4 (loudest 
   with PiPs).  (In addition, there are encoding constraints that 
   make choosing all of the VCs in a set impossible: VC1, VC3, VC4, 
   VC5, and VC6 all use EG1, and EG1 has only 3 ENCs.  This 
   constraint shows up in the encoding groups, not in the 
   simultaneous transmission sets.) 

   In this example there are no restrictions on which audio captures 
   can be sent simultaneously. 
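
   A consumer could check a candidate selection of video captures 
   against these sets with a simple subset test, as in this 
   illustrative sketch (the set literals reproduce the example 
   above): 

      SIMULTANEOUS_SETS = [
          {"VC0", "VC1", "VC2", "VC3", "VC4", "VC6"},  # set #1
          {"VC0", "VC2", "VC5", "VC6"},                # set #2
      ]

      def transmittable(chosen):
          # Feasible only if every chosen capture fits within one
          # simultaneous transmission set.
          return any(chosen <= s for s in SIMULTANEOUS_SETS)

      transmittable({"VC0", "VC1", "VC2"})  # True: subset of set #1
      transmittable({"VC1", "VC5"})         # False: VC5 excludes VC1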

   Encoding Groups: 

   This example has three encoding groups associated with the video 
   captures.  Each group can have up to 3 encodings, with each 
   potential encoding having a progressively lower specification.  
   In this example, 1080p60 transmission is possible (as ENC0 has a 
   maxH264Mbps value compatible with that) as long as it is the only 
   active encoding in the group (as the maxH264Mbps for the entire 
   encoding group is also 489600).  Significantly, as up to 3 
   encodings are available per group, it is possible to transmit 
   some video captures simultaneously that are not in the same entry 
   in the capture scene - for example VC1 and VC3 at the same time. 

   It is also possible to transmit multiple capture encodings of a 
   single video capture.  For example, VC0 can be encoded using ENC0 
   and ENC1 at the same time, as long as the encoding parameters 
   satisfy the constraints of ENC0, ENC1, and EG0, such as one at 
   1080p30 and one at 720p30. 
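
   Checking this against the EG0 values in Figure 2 below, the 
   macroblock arithmetic (16x16-pixel macroblocks) works out as 
   follows: 

      enc0 = (1920 // 16) * (1088 // 16) * 30  # 244800 <= 489600
      enc1 = (1280 // 16) * (720 // 16) * 30   # 108000 <= 108000
      total = enc0 + enc1    # 352800 <= maxGroupH264Mbps of 489600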

   encodeGroupID=EG0, maxGroupH264Mbps=489600, 
   maxGroupBandwidth=6000000 
       encodeID=ENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60, 
                      maxH264Mbps=489600, maxBandwidth=4000000 
       encodeID=ENC1, maxWidth=1280, maxHeight=720, maxFrameRate=30, 
                      maxH264Mbps=108000, maxBandwidth=4000000 
       encodeID=ENC2, maxWidth=960, maxHeight=544, maxFrameRate=30, 
                      maxH264Mbps=61200, maxBandwidth=4000000 
   encodeGroupID=EG1, maxGroupH264Mbps=489600, 
   maxGroupBandwidth=6000000 
       encodeID=ENC3, maxWidth=1920, maxHeight=1088, maxFrameRate=60, 
                      maxH264Mbps=489600, maxBandwidth=4000000 
       encodeID=ENC4, maxWidth=1280, maxHeight=720, maxFrameRate=30, 
                      maxH264Mbps=108000, maxBandwidth=4000000 
       encodeID=ENC5, maxWidth=960, maxHeight=544, maxFrameRate=30, 
                      maxH264Mbps=61200, maxBandwidth=4000000 
   encodeGroupID=EG2, maxGroupH264Mbps=489600, 
   maxGroupBandwidth=6000000 
       encodeID=ENC6, maxWidth=1920, maxHeight=1088, maxFrameRate=60, 
                      maxH264Mbps=489600, maxBandwidth=4000000 
       encodeID=ENC7, maxWidth=1280, maxHeight=720, maxFrameRate=30, 
                      maxH264Mbps=108000, maxBandwidth=4000000 
       encodeID=ENC8, maxWidth=960, maxHeight=544, maxFrameRate=30, 
                      maxH264Mbps=61200, maxBandwidth=4000000 

                Figure 2: Example Encoding Groups for Video 

   For audio, there are five potential encodings available, so all 
   five audio captures can be encoded at the same time. 

   encodeGroupID=EG3, maxGroupH264Mbps=0, maxGroupBandwidth=320000 
       encodeID=ENC9, maxBandwidth=64000 
       encodeID=ENC10, maxBandwidth=64000 
       encodeID=ENC11, maxBandwidth=64000 
       encodeID=ENC12, maxBandwidth=64000 
       encodeID=ENC13, maxBandwidth=64000 

                Figure 3: Example Encoding Group for Audio 

   Capture Scenes: 

   The following table represents the capture scenes for this 
   provider.  Recall that a capture scene is composed of alternative 
   capture scene entries covering the same scene.  Capture Scene #1 
   is for the main people captures, and Capture Scene #2 is for 
   presentation. 

   Each row in the table is a separate entry in the capture scene. 

                           +------------------+ 
                           | Capture Scene #1 | 
                           +------------------+ 
                           | VC0, VC1, VC2    | 
                           | VC3              | 
                           | VC4              | 
                           | VC5              | 
                           | AC0, AC1, AC2    | 
                           | AC3              | 
                           +------------------+ 

                           +------------------+ 
                           | Capture Scene #2 | 
                           +------------------+ 
                           | VC6              | 
                           | AC4              | 
                           +------------------+ 

   Different capture scenes are unique to each other and non-
   overlapping.  A consumer can choose an entry from each capture 
   scene.  In this case the three captures VC0, VC1, and VC2 are one 
   way of representing the video from the endpoint; these three 
   captures should appear adjacent to each other.  Alternatively, 
   another way of representing the capture scene is with the capture 
   VC3, which automatically shows the person who is talking.  
   Similarly for the VC4 and VC5 alternatives. 

   As in the video case, the different entries of audio in Capture 
   Scene #1 represent the "same thing", in that one way to receive 
   the audio is with the 3 audio captures (AC0, AC1, AC2), and 
   another way is with the mixed AC3.  The Media Consumer can choose 
   an audio capture entry it is capable of receiving. 

   The spatial ordering is conveyed by the media capture attributes 
   area of capture and point of capture. 

   A Media Consumer would likely want to choose a capture scene entry 
   to receive based in part on how many streams it can simultaneously 
   receive.  A consumer that can receive three people streams would 
   probably prefer to receive the first entry of Capture Scene #1 
   (VC0, VC1, VC2) and not receive the other entries.  A consumer 
   that can receive only one people stream would probably choose one 
   of the other entries. 

   If the consumer can receive a presentation stream too, it would 
   also choose to receive the only entry from Capture Scene #2 (VC6). 

11.2. Encoding Group Example 

   This is an example of an encoding group to illustrate how it can 
   express dependencies between encodings. 

   encodeGroupID=EG0, maxGroupH264Mbps=489600, 
   maxGroupBandwidth=6000000 
       encodeID=VIDENC0, maxWidth=1920, maxHeight=1088, 
   maxFrameRate=60, 
                         maxH264Mbps=244800, maxBandwidth=4000000 
       encodeID=VIDENC1, maxWidth=1920, maxHeight=1088, 
   maxFrameRate=60, 
                         maxH264Mbps=244800, maxBandwidth=4000000 
       encodeID=AUDENC0, maxBandwidth=96000 
       encodeID=AUDENC1, maxBandwidth=96000 
       encodeID=AUDENC2, maxBandwidth=96000 

   Here, the encoding group is EG0.  It can transmit up to two 
   1080p30 capture encodings (the maxH264Mbps value for 1080p30 is 
   244800), although each individual encoding advertises a 
   maxFrameRate of 60 frames per second (fps).  To achieve the 
   maximum resolution (1920 x 1088), the frame rate is limited to 30 
   fps.  However, 60 fps can be achieved at a lower resolution if 
   required by the consumer.  Although the encoding group is capable 
   of transmitting up to 6 Mbit/s, no individual video encoding can 
   exceed 4 Mbit/s. 

   This encoding group also allows up to 3 audio encodings, 
   AUDENC<0-2>.  It is not required that audio and video encodings 
   reside within the same encoding group, but if they do, then the 
   group's overall maxBandwidth value is a limit on the sum of all 
   audio and video encodings configured by the consumer.  A system 
   that does not wish or need to combine bandwidth limitations in 
   this way should instead use separate encoding groups for audio 
   and video, so that the bandwidth limitations on audio and video 
   do not interact. 

   Audio and video can be expressed in separate encoding groups, as 
   in this illustration. 

   encodeGroupID=EG0, maxGroupH264Mbps=489600, 
   maxGroupBandwidth=6000000 
       encodeID=VIDENC0, maxWidth=1920, maxHeight=1088, 
   maxFrameRate=60, 
                         maxH264Mbps=244800, maxBandwidth=4000000 
       encodeID=VIDENC1, maxWidth=1920, maxHeight=1088, 
   maxFrameRate=60, 
                         maxH264Mbps=244800, maxBandwidth=4000000 
   encodeGroupID=EG1, maxGroupH264Mbps=0, maxGroupBandwidth=500000 
       encodeID=AUDENC0, maxBandwidth=96000 
       encodeID=AUDENC1, maxBandwidth=96000 
       encodeID=AUDENC2, maxBandwidth=96000 

11.3. The MCU Case 

   This section shows how an MCU might express its Capture Scenes, 
   intending to offer different choices for consumers that can handle 
   different numbers of streams.  A single audio capture stream is 
   provided for all single and multi-screen configurations, and it 
   can be associated (e.g. lip-synced) with any combination of video 
   captures at the consumer. 

   +--------------------+---------------------------------------------+ 
   | Capture Scene #1   | note                                        | 
   +--------------------+---------------------------------------------+ 
   | VC0                | video capture for single screen consumer    | 
   | VC1, VC2           | video capture for 2 screen consumer         | 
   | VC3, VC4, VC5      | video capture for 3 screen consumer         | 
   | VC6, VC7, VC8, VC9 | video capture for 4 screen consumer         | 
   | AC0                | audio capture representing all participants | 
   +--------------------+---------------------------------------------+ 

   If / when a presentation stream becomes active within the 
   conference, the MCU might re-advertise the available media as: 

        +------------------+--------------------------------------+ 
        | Capture Scene #2 | note                                 | 
        +------------------+--------------------------------------+ 
        | VC10             | video capture for presentation       | 
        | AC1              | presentation audio to accompany VC10 | 
        +------------------+--------------------------------------+ 

11.4. Media Consumer Behavior 

   This section gives an example of how a media consumer might behave 
   when deciding how to request streams from the three screen 
   endpoint described in the previous section. 

   The receive side of a call needs to balance its requirements 
   (based on the number of screens and speakers), its decoding 
   capabilities and available bandwidth, and the provider's 
   capabilities in order to optimally configure the provider's 
   streams.  Typically it would want to receive and decode media 
   from each capture scene advertised by the provider. 

   A basic algorithm might be for the consumer to go through each 
   capture scene in turn and find the collection of video captures 
   that best matches the number of screens it has (this might 
   include consideration of screens dedicated to presentation video 
   display rather than "people" video) and then decide between 
   alternative entries in the video capture scenes based either on 
   hard-coded preferences or user choice.  Once this choice has been 
   made, the consumer would then decide how to configure the 
   provider's encoding groups in order to make best use of the 
   available network bandwidth and its own decoding capabilities, as 
   sketched below.  
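
   In Python-like pseudocode this might read as follows.  The helper 
   names and the scoring rule are invented for illustration, and a 
   real consumer would also apply the simultaneity and encoding-
   group checks sketched earlier: 

      def choose_entries(capture_scenes, num_screens):
          chosen = []
          for scene in capture_scenes:
              # Prefer the entry whose video capture count best
              # matches the number of available screens.
              best = min(scene.entries,
                         key=lambda e: abs(len(e.video_captures)
                                           - num_screens))
              chosen.append(best)
          return chosen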

11.4.1. One screen consumer 

   VC3, VC4 and VC5 are each in different entries by themselves, not 
   grouped together in a single entry, so the receiving device 
   should choose one of them.  The choice would come down to 
   whether to see the greatest number of participants simultaneously 
   at roughly equal precedence (VC5), a switched view of just the 
   loudest region (VC3) or a switched view with PiPs (VC4).  An 
   endpoint device with a small amount of knowledge of these 
   differences could offer a dynamic choice of these options, in-
   call, to the user. 

11.4.2. Two screen consumer configuring the example 

   Mixing systems with an even number of screens ("2n") and those 
   with "2n+1" cameras (and vice versa) is always likely to be 
   problematic.  In this instance, the behavior is likely to be 
   determined by whether a "2 screen" system is really a "2 decoder" 
   system, i.e., whether only one received stream can be displayed 
   per screen or whether more than 2 streams can be received and 
   spread across the available screen area.  Three possible 
   behaviors for the 2 screen system, when it learns that the far 
   end is "ideally" expressed via 3 capture streams, are: 

   1. Fall back to receiving just a single stream (VC3, VC4 or VC5 as 
      per the 1 screen consumer case above) and either leave one 
      screen blank or use it for presentation if / when a 
      presentation becomes active. 

   2. Receive 3 streams (VC0, VC1 and VC2) and display them across 
      the 2 screens, either with each capture scaled to 2/3 of a 
      screen and the center capture split across the 2 screens or, 
      as would be necessary if there were large bezels on the 
      screens, with each stream scaled to 1/2 the screen width and 
      height and with a 4th "blank" panel.  This 4th panel could 
      potentially be used for any presentation that became active 
      during the call. 

   3. Receive 3 streams, decode all 3, and use control information 
      indicating which was the most active to switch between showing 
      the left and center streams (one per screen) and the center 
      and right streams. 

   For an endpoint capable of all 3 methods of working described 
   above, again it might be appropriate to offer the user the choice 
   of display mode. 

11.4.3. Three screen consumer configuring the example 

   This is the most straightforward case: the consumer would look to 
   identify a set of streams to receive that best matches its 
   available screens, and so VC0 plus VC1 plus VC2 would match 
   optimally.  The spatial ordering gives sufficient information for 
   the correct video capture to be shown on the correct screen.  The 
   consumer would then either divide a single encoding group's 
   capability by 3 to determine what resolution and frame rate to 
   configure the provider with, or configure the individual video 
   captures' encoding groups with whatever makes most sense (taking 
   into account the receive-side decode capabilities, overall call 
   bandwidth, and the resolution of the screens, plus any user 
   preferences such as motion vs sharpness). 
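
   As a rough illustration of the divide-by-3 approach against the 
   Figure 2 numbers: 

      per_capture = 489600 // 3     # 163200 macroblocks per second
      assert 108000 <= per_capture  # 720p30 fits each share
      assert 244800 > per_capture   # 1080p30 does not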

12. Acknowledgements 

   Mark Gorzynski contributed much to the approach.  We want to 
   thank Stephen Botzko for helpful discussions on audio. 

13. IANA Considerations 

   TBD 

14. Security Considerations 

   TBD 

15. Changes Since Last Version 

   NOTE TO THE RFC-Editor: Please remove this section prior to 
   publication as an RFC. 

   Changes from 06 to 07: 

   1. Ticket #9.  Rename Axis of Capture Point attribute to Point on 
      Line of Capture.  Clarify the description of this attribute. 

   2. Ticket #17.  Add "capture encoding" definition.  Use this new 
      term throughout document as appropriate, replacing some usage 
      of the terms "stream" and "encoding". 

   3. Ticket #18.  Add Max Capture Encodings media capture attribute.  

   4. Add clarification that different capture scene entries are not 
      necessarily mutually exclusive. 

   Changes from 05 to 06: 

   1. Capture scene description attribute is a list of text strings, 
      each in a different language, rather than just a single string. 

   2. Add new Axis of Capture Point attribute. 

   3. Remove appendices A.1 through A.6. 
   4. Clarify that the provider must use the same coordinate system 
      with same scale and origin for all coordinates within the same 
      capture scene. 

   Changes from 04 to 05: 

   1. Clarify limitations of "composed" attribute. 

   2. Add new section "capture scene entry attributes" and add the 
      attribute "scene-switch-policy". 

   3. Add capture scene description attribute and description 
      language attribute. 

   4. Editorial changes to examples section for consistency with the 
      rest of the document. 

   Changes from 03 to 04: 

   1. Remove sentence from overview - "This constitutes a significant 
      change ..." 

   2. Clarify a consumer can choose a subset of captures from a 
      capture scene entry or a simultaneous set (in section "capture 
      scene" and "consumer's choice..."). 

   3. Reword first paragraph of Media Capture Attributes section. 

   4. Clarify a stereo audio capture is different from two mono audio 
      captures (description of audio channel format attribute). 

   5. Clarify what it means when coordinate information is not 
      specified for area of capture, point of capture, area of scene. 

   6. Change the term "producer" to "provider" to be consistent (it 
      was just in two places). 

   7. Change name of "purpose" attribute to "content" and refer to 
      RFC4796 for values. 

   8. Clarify simultaneous sets are part of a provider advertisement, 
      and apply across all capture scenes in the advertisement. 

   9. Remove sentence about lip-sync between all media captures in a 
      capture scene. 
   10.   Combine the concepts of "capture scene" and "capture set" 
      into a single concept, using the term "capture scene" to 
      replace the previous term "capture set", and eliminating the 
      original separate capture scene concept. 

   Informative References 

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate 
              Requirement Levels", BCP 14, RFC 2119, March 1997. 

   [RFC3261]  Rosenberg, J., Schulzrinne, H., Camarillo, G., 
              Johnston, A., Peterson, J., Sparks, R., Handley, M., 
              and E. Schooler, "SIP: Session Initiation Protocol", 
              RFC 3261, June 2002. 

   [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V. 
              Jacobson, "RTP: A Transport Protocol for Real-Time 
              Applications", STD 64, RFC 3550, July 2003. 

   [RFC4353]  Rosenberg, J., "A Framework for Conferencing with the 
              Session Initiation Protocol (SIP)", RFC 4353, 
              February 2006. 

   [RFC4796]  Hautakorpi, J. and G. Camarillo, "The Session 
              Description Protocol (SDP) Content Attribute", 
              RFC 4796, February 2007. 

   [RFC5117]  Westerlund, M. and S. Wenger, "RTP Topologies", 
              RFC 5117, January 2008. 

   [RFC5646]  Phillips, A. and M. Davis, "Tags for Identifying 
              Languages", BCP 47, RFC 5646, September 2009. 

   [IANA-Lan] IANA, "Language Subtag Registry", 
              <http://www.iana.org/assignments/ 
              language-subtag-registry>. 

16. Authors' Addresses 

   Mark Duckworth (editor) 
   Polycom 
   Andover, MA  01810 
   USA 

   Email: mark.duckworth@polycom.com 

    

   Andrew Pepperell 
   Silverflare 
   Uxbridge, England 
   UK 

   Email: apeppere@gmail.com 

    

   Stephan Wenger 
   Vidyo, Inc. 
   433 Hackensack Ave. 
   Hackensack, N.J. 07601 
   USA 
    
   Email: stewe@stewe.org 
