Internet Engineering Task Force                               R. Montero
Internet-Draft                                    University of A Coruna
Intended status: Informational                           August 13, 2020
Expires: February 14, 2021


Protocol for Evaluating Reinforcement Learning Environments in Real Time
                          draft-perlert-wg-00

Abstract

   This document defines a simple UDP-based protocol for communication
   between a server that simulates a reinforcement learning environment
   and a client that observes it and responds with actions.

   Reinforcement learning problems are usually defined within the scope
   of a Markov Decision Process (MDP), where an agent sends an action
   belonging to an action space to an environment.  The environment acts
   as a black box, returning an observation and a reward to the agent,
   whose goal is to maximize the total reward obtained.

   Although the problem statement is easy to understand, there are no
   conventions on how a reinforcement learning simulation should
   communicate with a client agent, either in a local network or over
   the Internet.  Establishing such a convention is especially useful
   for multiagent support and analysis.

   PERLERT, the protocol defined in this document, assumes that server
   and client have shared certain information beforehand through
   another channel, such as a web page served over HTTP.  For example,
   the client must know a port number and an instance number before
   it can participate in a simulation run on a server.

   Also, although complete feedback from the environment is often
   desirable, PERLERT focuses on real-time interaction in which human
   agents can interact with AI agents, even if that means that some
   information is lost due to network packet loss.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.




Montero                 Expires February 14, 2021               [Page 1]


Internet-Draft                   PERLERT                     August 2020


   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on February 14, 2021.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
     1.1.  Requirements Language . . . . . . . . . . . . . . . . . .   3
   2.  Communication Phases  . . . . . . . . . . . . . . . . . . . .   3
   3.  Messages Specification  . . . . . . . . . . . . . . . . . . .   3
     3.1.  Terms . . . . . . . . . . . . . . . . . . . . . . . . . .   3
     3.2.  Client Message Types  . . . . . . . . . . . . . . . . . .   5
     3.3.  Server Message Types  . . . . . . . . . . . . . . . . . .   6
   4.  UDP/IP Ports  . . . . . . . . . . . . . . . . . . . . . . . .   7
   5.  Example Case  . . . . . . . . . . . . . . . . . . . . . . . .   8
   6.  Additional Considerations . . . . . . . . . . . . . . . . . .   8
   7.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .   9
   8.  Security Considerations . . . . . . . . . . . . . . . . . . .   9
   9.  Normative References  . . . . . . . . . . . . . . . . . . . .   9
   Author's Address  . . . . . . . . . . . . . . . . . . . . . . . .  10

1.  Introduction

   This document specifies PERLERT (Protocol for Evaluating
   Reinforcement Learning Environments in Real Time).

   It is intended to be used in the analysis of reinforcement learning
   problems.  In such problems, an agent sends an action to an
   environment.  The environment acts as a black box, returning an
   observation and a reward to the agent, whose goal is to maximize the
   total reward obtained.

   The main purpose of PERLERT is to make it easier to test and
   integrate agents with different implementations, and to run
   simulation servers separately from those agents.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

2.  Communication Phases

   Client and server exchange PERLERT messages in two separate phases.

   lobby
       This phase lets agent clients register themselves in the
       available slots advertised by the server.  It is especially
       useful for environments with multiagent support.

   rollout
       This is the main phase.  The term "rollout" is used here as a
       synonym of "simulation".  In this phase the loop:

       (action) -> (observation, reward)

       ...takes place until the server notifies clients that the
       simulation has finished.

3.  Messages Specification

   Messages defined in the following sections MUST be implemented as
   UDP/IP datagrams [RFC768].

   Also, all messages SHOULD use the same text encoding.  It is
   RECOMMENDED that both server and client encode messages using UTF-8
   [RFC3629].

3.1.  Terms

   In order of appearance:

   SERVER_INSTANCE_NAME  Tag used to distinguish different environments
       hosted by the same server, e.g.: "cartpole".

   SERVER_INSTANCE_NUMBER  Non-negative integer used to distinguish
       different instances of the same environment hosted by the same
       server, e.g.: "0".

   HEADER  Shorthand for SERVER_INSTANCE_NAME:SERVER_INSTANCE_NUMBER,
       e.g.: "cartpole:0".

   SERVER_LOBBY_PORT  UDP/IP port on which the server listens for
       incoming messages related to the lobby phase.  Clients need to
       know the SERVER_LOBBY_PORT beforehand.

   SERVER_ROLLOUT_PORT  UDP/IP port on which the server listens for
       incoming messages related to the rollout phase.  The server
       notifies clients of this port right before the simulation
       starts.

   CLIENT_PORT  UDP/IP port of an agent client.  The server SHOULD NOT
       send datagrams to clients that have not registered first
       through the process described in the following sections.

   AGENT_KEY  Key used to identify one available agent slot, e.g.:
       "agent0".

   AGENT_TAG  Tag used to identify one agent filling one available slot.
       Specific clients can use a custom tag to identify themselves
       within the scope of the lobby phase, e.g.: "john_doe_q_learning".

   BOOL_VALUE  The literal "true" or "false", without the quotation
       marks.

   ACTION  Action chosen by an agent.  It MUST NOT contain the colon
       character (:), semicolon (;), or equal sign (=).  There are no
       other restrictions on how this field is formed as long as it is
       well understood by both client and server, e.g.: "move_left" or
       "5,6.78".

   SLOT_STATUS  The literal "open" or "close", without the quotation
       marks.

   AGENT_KIND  Freeform field used to differentiate aspects of agents
       relevant during the lobby phase, e.g.: "citizen" or "zombie".  It
       MUST NOT contain the colon character (:), semicolon (;), comma
       (,) or equal sign (=).  There are no other restrictions on how
       this field is formed as long as it is well understood by both
       client and server.

   READY_STATUS  The literal "ready" or "not_ready", without the
       quotation marks.

   AGENT_SLOT  Shorthand for
       AGENT_KEY=SLOT_STATUS,AGENT_KIND,AGENT_TAG,READY_STATUS;

   [AGENT_SLOT]  A sequence of one or more (1..n) AGENT_SLOT entries.

   MESSAGE  Informative message sent by server instances during lobby
       phase.

   TIMESTAMP  Number of milliseconds elapsed since the UNIX epoch
       (January 1, 1970, 00:00:00 UTC) according to server time.

   STEP_NUMBER  Non-negative integer indicating the step number of a
       running simulation.

   OBSERVATION  Observation for an agent received upon a simulation step
       run on the server.  It MUST NOT contain the semicolon character
       (;), or equal sign (=).  There are no other restrictions on how
       this field is formed as long as it is well understood by both
       client and server, e.g.: "x:0.54,y:0.95".

   REWARD  Reward for an agent received upon a simulation step run on
       the server, usually modeled as a single floating point value.  It
       MUST NOT contain the semicolon character (;), or equal sign (=).

   EXTRA  Additional information for an agent received upon a simulation
       step run on the server.  It MUST NOT contain the semicolon
       character (;), or equal sign (=).  There are no other
       restrictions on how this field is formed as long as it is well
       understood by both client and server, e.g.:
       "did_jump:true,jump_length:6.84".

3.2.  Client Message Types

   This section specifies the content format for the message types that
   shall be implemented by PERLERT clients.

   lobby information request
       Message sent by clients to request lobby information associated
       with a given server instance.

       HEADER;lobby

   lobby registration request
       Message sent by clients to request to participate in a simulation
       server instance.

       HEADER;register=AGENT_KEY,AGENT_TAG

       Clients are allowed to issue multiple lobby registration
       requests, but only the last one correctly received by the server
       will take effect.

   lobby ready request
       Message sent by clients to inform the server whether they are
       ready to participate in the simulation or not.

       HEADER;ready=AGENT_KEY,BOOL_VALUE

   rollout action
       Message sent by clients to inform the server of the action to be
       run in the simulation.  Clients need not send a "rollout action"
       message for every simulation timestep.  Instead, the server uses
       the last action received from each client and feeds it into the
       environment until it receives a new one.  Until an agent client
       provides a valid action, the server instance can choose which
       action to feed into the environment simulation.

       HEADER;action=ACTION
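
   The four client message formats above can be sketched as small
   Python helpers.  This is a non-normative example: the function
   names are hypothetical, and only the ACTION restriction stated in
   Section 3.1 is checked.

```python
FORBIDDEN_IN_ACTION = ":;="  # characters ACTION MUST NOT contain


def make_header(instance_name: str, instance_number: int) -> str:
    """Build HEADER as SERVER_INSTANCE_NAME:SERVER_INSTANCE_NUMBER."""
    return f"{instance_name}:{instance_number}"


def lobby_information_request(header: str) -> str:
    return f"{header};lobby"


def lobby_registration_request(header: str, agent_key: str,
                               agent_tag: str) -> str:
    return f"{header};register={agent_key},{agent_tag}"


def lobby_ready_request(header: str, agent_key: str, ready: bool) -> str:
    bool_value = "true" if ready else "false"
    return f"{header};ready={agent_key},{bool_value}"


def rollout_action(header: str, action: str) -> str:
    bad = [c for c in FORBIDDEN_IN_ACTION if c in action]
    if bad:
        raise ValueError(f"ACTION contains forbidden characters: {bad}")
    return f"{header};action={action}"
```

   Each helper returns text; a client would encode it (for example
   with str.encode("utf-8")) before sending it as a UDP datagram.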

3.3.  Server Message Types

   This section specifies the content format for the message types that
   shall be implemented by PERLERT servers.

   lobby information
       Message sent by servers to inform clients about the lobby agent
       slots.  This datagram MUST be sent to a client upon receiving a
       "lobby information request" from it, and to all clients whenever
       the lobby is altered by a "lobby registration request" or a
       "lobby ready request".

       HEADER;[AGENT_SLOT]

       The message format MAY omit the trailing semicolon character (;).

   lobby registration response
       Message sent by servers upon a successful registration request.

       HEADER;registered=AGENT_KEY

       Servers MUST NOT allow a single client to be registered in
       multiple slots.  Before registering a client in an agent slot,
       the server must remove that client from any slot in which it was
       previously registered.

       Servers MUST register clients with a default "not_ready" status.

   lobby message
       Message sent by servers to registered clients containing relevant
       general information.

       HEADER;message=MESSAGE

   lobby start
       Message sent by servers to all registered clients announcing the
       UDP/IP port for the rollout when the simulation is about to
       start.  The server can choose to start the simulation at any
       time, but it MUST NOT do so while any client is in a "not_ready"
       status.

       HEADER;start=port:SERVER_ROLLOUT_PORT

   rollout step
       Message sent by servers to all registered clients containing the
       information provided by the environment for a single step.  Note
       that "rollout step" messages should be sent as a regular stream
       carrying enough data per unit of time for clients to render the
       environment properly, without exceeding a reasonable number of
       UDP packets.  It is RECOMMENDED to send at most 30 "rollout step"
       packets per second.

       HEADER:TIMESTAMP:STEP_NUMBER;obs=OBSERVATION;reward=REWARD;done=BOOL_VALUE

       The server MAY send additional information by appending an extra
       particle:

       HEADER:TIMESTAMP:STEP_NUMBER;obs=OBSERVATION;reward=REWARD;done=BOOL_VALUE;extra=EXTRA

       Because many messages of this type are sent over the network, it
       is recommended that they be as compact as possible.  For example,
       it is RECOMMENDED that floating point values in the OBSERVATION
       or the REWARD be rounded to the minimum number of decimal places
       needed.
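
   A client could decode a "rollout step" datagram along the lines of
   the following Python sketch.  It is non-normative: RolloutStep and
   parse_rollout_step are hypothetical names, the datagram is assumed
   to be UTF-8 encoded, and OBSERVATION, REWARD and EXTRA are kept as
   text because their inner structure is left to client and server.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RolloutStep:
    instance_name: str
    instance_number: int
    timestamp_ms: int       # TIMESTAMP, milliseconds since the UNIX epoch
    step_number: int
    observation: str
    reward: str
    done: bool
    extra: Optional[str] = None


def parse_rollout_step(datagram: bytes) -> RolloutStep:
    text = datagram.decode("utf-8")
    # Semicolons only separate particles; fields never contain them.
    prefix, *particles = text.split(";")
    # The prefix is HEADER:TIMESTAMP:STEP_NUMBER, where HEADER itself is
    # SERVER_INSTANCE_NAME:SERVER_INSTANCE_NUMBER.
    name, number, timestamp, step = prefix.rsplit(":", 3)
    # Each particle is key=value; values never contain "=".
    fields = dict(p.split("=", 1) for p in particles if p)
    if fields["done"] not in ("true", "false"):
        raise ValueError(f"bad BOOL_VALUE: {fields['done']!r}")
    return RolloutStep(
        instance_name=name,
        instance_number=int(number),
        timestamp_ms=int(timestamp),
        step_number=int(step),
        observation=fields["obs"],
        reward=fields["reward"],
        done=fields["done"] == "true",
        extra=fields.get("extra"),  # None when the particle is absent
    )
```

   Splitting on the semicolon first is safe because OBSERVATION, REWARD
   and EXTRA are not allowed to contain that character.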

4.  UDP/IP Ports

   All messages sent by one client MUST use the same UDP/IP source
   CLIENT_PORT during the whole information exchange process, from the
   moment the agent sends a "lobby registration request" to the server
   until it receives a "rollout step" message with the "done" flag set
   to "true".

   "lobby information", "lobby registration response", "lobby message",
   and "lobby start" datagrams MUST use the same UDP/IP source
   SERVER_LOBBY_PORT for a given server instance.

   "rollout step" datagrams MUST use the same UDP/IP source
   SERVER_ROLLOUT_PORT for a given server instance.
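
   One way for a client to keep a single source port is to create one
   UDP socket, bind it once, and reuse it for both phases.  The Python
   sketch below is non-normative: the function names are hypothetical,
   and the 1024-byte buffer and 1-second timeout are assumptions made
   for this example only.

```python
import socket
from typing import Tuple

BUFFER_SIZE = 1024  # assumed fixed receive buffer size, in bytes


def open_client_socket(bind_host: str = "0.0.0.0",
                       timeout_s: float = 1.0) -> socket.socket:
    """Create the single UDP socket used for the whole session."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_host, 0))   # port 0: the OS picks CLIENT_PORT
    sock.settimeout(timeout_s)  # do not block forever on packet loss
    return sock


def send_message(sock: socket.socket, server_host: str,
                 server_port: int, message: str) -> None:
    """Send one PERLERT message to the lobby or rollout port."""
    sock.sendto(message.encode("utf-8"), (server_host, server_port))


def receive_message(sock: socket.socket) -> Tuple[str, int]:
    """Return the message text and the server port it came from."""
    data, (_host, port) = sock.recvfrom(BUFFER_SIZE)
    return data.decode("utf-8"), port
```

   Because the server uses different source ports for each phase, the
   port returned by receive_message lets the client tell lobby
   datagrams from "rollout step" datagrams.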

5.  Example Case

   This section provides a brief example of datagrams exchanged by one
   client and one server during a PERLERT session.

          CLIENT                                           SERVER

          ==================== LOBBY PHASE ======================
          UDP port: 55555                         UDP port: 32322

          city:7;lobby -------------------------------------->

             <-------------- city:7;agent0=open,citizen,cpu,ready

          city:7;register=agent0,patrick -------------------->

             <-------------------------- city:7;registered=agent0

          city:7;ready=agent0,true -------------------------->

             <--------- city:7;agent0=close,citizen,patrick,ready
             <----------- city:7;message=Simulation will start...

             <--------------------------- city:7;start=port:32323

          ==================== ROLLOUT PHASE =====================
          UDP port: 55555                         UDP port: 32323

          city:7;action=walk -------------------------------->

             <-- city:7:1590853116323:0;obs=45;reward=0;done=false
             <-- city:7:1590853121058:1;obs=47;reward=0;done=false
             <-- city:7:1590853126423:2;obs=48;reward=1;done=false
             <-- city:7:1590853130429:3;obs=49;reward=0;done=false
             <--- city:7:1590853134833:4;obs=51;reward=1;done=true

                                 Figure 1

6.  Additional Considerations

   Because packet loss might prevent some PERLERT information from
   reaching the other end, the following considerations apply:

   After sending the "lobby start" message, the server instance SHOULD
   keep the SERVER_LOBBY_PORT open for five (5) seconds and resend the
   "lobby start" message to any client that contacts that port after
   the simulation has started.

   After the simulation has finished for a given client, that is, once
   a "rollout step" message carries the "done" flag set to "true", the
   server instance SHOULD keep the SERVER_ROLLOUT_PORT open for ten (10)
   seconds and listen for datagrams from that client.  The server
   instance SHOULD resend the appropriate "rollout step" datagram upon
   receiving a client message within that period.
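
   The two grace periods above can be modeled on the server side as a
   small resend window.  The Python sketch below is non-normative:
   ResendWindow and reply_for are hypothetical names, and times are
   plain seconds that a server might take from a monotonic clock.

```python
from dataclasses import dataclass
from typing import Optional

LOBBY_GRACE_S = 5.0     # seconds SERVER_LOBBY_PORT stays open
ROLLOUT_GRACE_S = 10.0  # seconds SERVER_ROLLOUT_PORT stays open


@dataclass
class ResendWindow:
    datagram: bytes  # the "lobby start" or final "rollout step" sent
    sent_at: float   # when it was first sent, in seconds
    grace_s: float   # how long late client messages are answered

    def reply_for(self, now: float) -> Optional[bytes]:
        """Return the datagram to resend, or None once the window closed."""
        if 0.0 <= now - self.sent_at <= self.grace_s:
            return self.datagram
        return None
```

   A server would call reply_for whenever a client datagram arrives on
   the corresponding port and send the returned bytes back, if any.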

7.  IANA Considerations

   This memo includes no request to IANA.

8.  Security Considerations

   Both client and server implementations SHOULD use a fixed buffer size
   as small as possible for receiving the UDP/IP packets.

   Both client and server MAY cipher the content of the messages.
   Although the use of asymmetric public/private key pairs is
   recommended, symmetric ciphering with a pre-shared key is also
   encouraged.

   PERLERT is especially vulnerable to IP spoofing attacks, because
   actions received during the rollout phase are identified only by the
   IP address of the sender.  Using a VPN is RECOMMENDED in order to
   tunnel the information exchange.

9.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
              2003, <https://www.rfc-editor.org/info/rfc3629>.

   [RFC768]   Postel, J., "User Datagram Protocol", STD 6, RFC 768,
              DOI 10.17487/RFC0768, August 1980,
              <https://www.rfc-editor.org/info/rfc768>.

Author's Address

   Ruben Montero
   University of A Coruna
   Rua San Roque 9
   A Coruna, Galicia  15002
   ES

   Phone: +34 692 983 851
   Email: ruben.montero@udc.es