IoT Operations Working Group                             F. Foukalas
Internet-Draft                                           A. Tziouvaras
Intended status: Draft Standard                          September 27, 2021
Expires: September, 2022



draft-distributed-ml-iot-edge-cmp-foukalas-01.txt

Status of this Memo

This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF).  Note that other groups may also distribute
working documents as Internet-Drafts.  The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

This Internet-Draft will expire on September 22, 2022.


Copyright Notice

Copyright (c) 2021 IETF Trust and the persons identified as the
document authors. All rights reserved.

        This document is subject to BCP 78 and the IETF Trust's Legal
        Provisions Relating to IETF Documents
        (https://trustee.ietf.org/license-info) in effect on the date of
        publication of this document.  Please review these documents
        carefully, as they describe your rights and restrictions with
        respect to this document.

Abstract

The next-generation Internet requires decentralized and distributed
intelligence in order to offer new types of experience that serve
users' interests. Such services will be enabled by deploying
intelligence over a large number of IoT devices in the form of a
distributed protocol. This protocol orchestrates the machine
learning (ML) application in order to train on the aggregated data
available from the IoT devices. Training is not an easy task in
such a distributed environment, where the number of connected IoT
devices scales up and the demands for both interoperability and
computing are high. This draft addresses both issues by combining
two emerging technologies, edge AI and fog computing. The protocol
procedures aggregate the data collected by the IoT devices into a
fog node and apply edge AI for data analysis at the edge of the
infrastructure. The analysis of the IoT requirements resulted in an
end-to-end ML protocol specification, which is presented throughout
this draft.

Table of Contents

1. Introduction
2. Background and terminology
3. Edge computing architecture
4. Protocol stages
4.1. Initial configuration
4.2. FL training
4.3. Cloud update
5. Security Considerations
6. IANA Considerations
7. Conclusions
8. References
8.1. Normative References
9. Acknowledgments


1. Introduction

Offering robust IoT services requires addressing several challenges
through the integration of edge computing with IoT, known as IoT
edge computing. The concept of IoT edge computing has not yet been
specified in detail, although two recent drafts have already
described some aspects of such an Internet architecture. Such an
architecture is particularly useful when deploying distributed
machine learning in the future Internet, where edge artificial
intelligence will play an important role. To this end, this draft
first provides the IoT edge computing architecture, which includes
the elements necessary to deploy distributed machine learning.
Second, it describes the three stages of such distributed
intelligence as protocol procedures: the initial configuration, the
learning and the cloud update. Details are given for all the
protocol procedures of distributed machine learning for IoT edge
computing.

2. Background and terminology

Below we list a number of terms related to the distributed
machine learning solution:

End devices: End devices [1] are IoT devices that collect
data while also having computing and networking capabilities.
An end device can be any type of device that can connect to the
edge gateway and is equipped with sensors for data collection.

Edge gateway: The edge gateway is a server that is located at
the edge of the network [1]. It provides large computational
and networking capabilities and coordinates the FL process.
The edge gateway relieves traffic on the network backhaul, as the
end devices connect to the edge instead of the cloud.

Cloud: The cloud supports very large computational capabilities [1]
and is geographically located far from the end devices. It is
accessible to the edge gateway and remains agnostic to the number
and type of participating end devices. As a result, the cloud does
not have an active role in the FL training process.

Federated learning (FL): FL is a distributed ML technique which
utilizes a large number of end devices that train their ML models
locally without communicating with each other. The locally trained
models are dispatched to the edge gateway, which aggregates the
collected models into one global model. The global model is then
broadcast to the end devices so that the next training round can
begin. During the FL process, the end devices do not share their
local data; only the trained models are exchanged with the edge
gateway.

Constrained Application Protocol (CoAP): CoAP is a lightweight
protocol, typically run over UDP, for communication between two
entities [RFC 7252]. CoAP is well suited to devices with limited
computational capabilities, as it does not require a full protocol
stack to operate. CoAP defines four message types: Confirmable
(CON), Non-confirmable (NON), Acknowledgement (ACK) and Reset (RST).
CON messages provide reliable transmission. A Confirmable message is
retransmitted using a default timeout and exponential back-off
between retransmissions, until the recipient sends an Acknowledgement
message (ACK) with the same Message ID. When a recipient is not able
to process a Confirmable message, it replies with a Reset message
(RST) instead of an Acknowledgement. NON messages do not require
reliable transmission. They are not acknowledged, but still carry a
Message ID for duplicate detection. When a recipient is not able to
process a Non-confirmable message, it may reply with a Reset message
(RST).
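
The four message types can be illustrated with a minimal Python
sketch of the 4-byte fixed CoAP header defined in RFC 7252 (version,
type, token length, code, Message ID). The function names
pack_header, unpack_header and empty_reply are hypothetical helpers
chosen for this illustration; options and payload are not handled.

```python
import struct
from enum import IntEnum

COAP_VERSION = 1  # RFC 7252 fixes the version field to 1


class MsgType(IntEnum):
    CON = 0  # Confirmable
    NON = 1  # Non-confirmable
    ACK = 2  # Acknowledgement
    RST = 3  # Reset


def pack_header(msg_type, code, message_id, token=b""):
    """Build the 4-byte fixed header followed by the token (hypothetical helper)."""
    if len(token) > 8:
        raise ValueError("token length must be 0..8 bytes")
    # First byte: 2-bit version, 2-bit type, 4-bit token length.
    first = (COAP_VERSION << 6) | (int(msg_type) << 4) | len(token)
    return struct.pack("!BBH", first, code, message_id) + token


def unpack_header(datagram):
    """Return (type, code, message_id, token) parsed from a raw datagram."""
    first, code, message_id = struct.unpack("!BBH", datagram[:4])
    if first >> 6 != COAP_VERSION:
        raise ValueError("unsupported CoAP version")
    token_length = first & 0x0F
    token = datagram[4:4 + token_length]
    return MsgType((first >> 4) & 0x03), code, message_id, token


def empty_reply(reply_type, request):
    """Build an empty ACK or RST that echoes the request's Message ID."""
    _, _, message_id, _ = unpack_header(request)
    return pack_header(reply_type, 0x00, message_id)  # code 0.00 = Empty
```

For example, empty_reply(MsgType.RST, request) produces the Reset
message that a recipient may send when it cannot process a NON
request; the matching Message ID is what lets the sender correlate
the reply with its original message.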

3. Edge computing architecture

Fig. 1 below depicts the IoT architecture we employ, where the three
main entities are the end devices, the edge gateway and the cloud
server. Below we describe the functionality of each entity and how
it interacts with the rest of the architecture:

End devices: End devices can be classified into constrained and
non-constrained according to the processing capabilities they
employ. Previous work in [2] classifies the end devices into the
following categories:

Class 0 (C0): This class contains sensor-like devices. Although
they may answer keep-alive signals and send basic indications,
they most likely do not have the resources to securely
communicate with the Internet directly (larger devices act as
proxies, gateways, or servers) and cannot be secured or managed
comprehensively in the traditional sense.

Class 1 (C1): Such devices are quite constrained in code space
and processing capabilities and cannot easily talk to other
Internet nodes nor employ a full protocol stack. Thus they
are considered ideal for the Constrained Application Protocol
(CoAP) over UDP.

Class 2 (C2): C2 devices are less constrained and capable of
supporting most of the same protocol stacks as servers and
laptop computers.

Other (C3): Devices with capabilities significantly beyond that
of Class 2 are left uncategorized (Others). They may still be
constrained by a limited energy supply, but can largely use
existing protocols unchanged.

In this draft, the IoT architecture considers cameras as C1 devices
and mobile phones as C2 or Other (C3) devices. Each device stores a
local data set independently from the others and does not have any
access to the data sets of the rest of the devices. End devices are
also responsible for training their local ML model and for
reporting the trained model to the edge gateway for the
aggregation process.

Edge gateway: The edge gateway is responsible for collecting the
locally trained models from the end devices and for aggregating
such models into a global model. Further, the edge gateway is
responsible for dispatching the trained model to the cloud in
order to make it available to the developers. In order to support
the aforementioned services the edge gateway employs the
following controller interfaces:

Southbound interface: The southbound interface is responsible
for handling the communication between the edge gateway and
the end devices [5]. It also performs the resource discovery,
resource authentication, device configuration and global model
dispatch tasks. The resource discovery process detects and
identifies the devices that participate in the FL training and
establishes a communication link between the edge and each
device. The resource authentication process authenticates the
end devices by matching each device's unique ID against a
trusted ID list that is stored at the edge. The device
configuration process broadcasts the ML model hyperparameters to
the participating end devices. Finally, the global model
dispatch operation broadcasts the aggregated global model to the
trusted connected devices.

Central controller: The central controller is the core component
of Network Artificial Intelligence and is sometimes called the
"Network Brain" [4]. It carries out the FL aggregation process and
is responsible for stopping the FL process when the model converges.
It also performs the data sharing, global model training, global
model aggregation and device scheduling functionalities.
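
As a minimal sketch of these two duties, the Python fragment below
aggregates local models and checks for convergence. This draft does
not prescribe an aggregation rule or a convergence criterion; the
sample-count-weighted average (in the style of FedAvg), the flat
list-of-floats model representation, the tolerance value and the
function names are all assumptions made for illustration.

```python
def aggregate(local_models, sample_counts):
    """Weighted average of local weight vectors.

    Assumption: each model is a flat list of floats and each device
    reports how many samples it trained on.
    """
    if not local_models or len(local_models) != len(sample_counts):
        raise ValueError("need one sample count per local model")
    total = sum(sample_counts)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    size = len(local_models[0])
    return [
        sum(model[i] * count
            for model, count in zip(local_models, sample_counts)) / total
        for i in range(size)
    ]


def has_converged(previous, current, tolerance=1e-3):
    """True when no weight moved by more than `tolerance` (assumed criterion)."""
    return max(abs(p - c) for p, c in zip(previous, current)) <= tolerance
```

Under these assumptions the central controller would call aggregate
once per round and end the FL process as soon as has_converged
returns True for two consecutive global models.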

Northbound interface: The northbound interface is provided by a
gateway component to a remote network [5], e.g. a cloud, home
or enterprise network. The northbound interface is a data plane
interface, which manages the communication of the edge gateway
with the cloud. Accordingly, the northbound interface is
responsible for the model sharing and the model publish
functionalities. Model sharing is the function through which the
edge is authenticated by the cloud as a trusted party and thus
gains the right to upload the trained FL model to the cloud.
Model publish is the process of uploading the trained model to
the cloud in order to make it available to the developers.

Cloud server: The cloud server may provide virtually unlimited
storage and processing power [3].  The reliance of IoT on
back-end cloud computing brings additional advantages such
as flexibility and efficiency.  The cloud hosts the trained FL
model, which can be used by developers for AR applications.

FL model: The FL model should operate separately from the dataset
used for the training process. In this sense, the ML model
architecture and the dataset type may change without affecting the
overall FL training process. This interoperability is ensured as
we design the FL independently of the web protocol and thus, the
end device-edge communication is not affected by any changes in
the IoT architecture. Further, the datasets of each device
are stored locally and interact only with the local FL model
while the edge does not have any access to them. As a result
the functionality of the FL training is not affected by either
the dataset type or size, or by the FL model architecture.


+------------------------------------------------------------------+
|                                                                  |
| +------------------------+                                       |
| | End devices            |                                       |
| | * Data collection      |                                       |
| | * Reporting            |                                       |
| | * Local model training |                                       |
| +------------------------+                                       |
|       | FL training                                              |
|       |                                                          |
| +---------------------------------------------------------------+|
| | Edge gateway                                                  ||
| |                                                               ||
| | +------------------+  +----------------+  +-----------------+ ||
| | | Southbound       |  | Central        |  | Northbound      | ||
| | | interface        |  | controller     |  | interface       | ||
| | |                  |  |                |  |                 | ||
| | | * Resource       |  | * Device       |  | * Model sharing | ||
| | |   discovery      |  |   scheduling   |  | * Model publish | ||
| | | * Resource       |  | * Global model |  +-----------------+ ||
| | |   authentication |  |   aggregation  |                      ||
| | | * Device         |  +----------------+                      ||
| | |   configuration  |                                          ||
| | | * Global model   |                                          ||
| | |   dispatch       |                                          ||
| | +------------------+                                          ||
| |                                                               ||
| +---------------------------------------------------------------+|
|      |                                                           |
|      | Model to cloud                                            |
| +---------------+                                                |
| | Cloud server  |                                                |
| |               |                                                |
| | * Store model |                                                |
| +---------------+                                                |
|                                                                  |
+------------------------------------------------------------------+
Figure 1: Protocol architecture


4. Protocol stages

In this section we describe the stages which are used by the Edge
computing protocol to perform the FL process.

4.1. Initial configuration

Fig. 2 below depicts the initial configuration stage of the Edge
IoT protocol using CoAP. The initial configuration stage provides
the functionalities needed to establish the communication link
between the IoT devices and the edge gateway and to identify the
end devices that will participate in the training process. These
functionalities are as follows:

1. Resource discovery: The end devices are discovered by the edge
and use CoAP to inform the edge gateway about their computational
capabilities. More specifically, each end device sends a NON
message to the edge containing the resource type of the device,
i.e. C0, C1, C2 or C3. The NON message type is not confirmable and
thus the edge replies with an RST message only if it cannot process
the received message, e.g. because of a transmission error. The edge
then decides which device types may participate in the training
process and sends back a NON message containing the resource
discovery decision to the corresponding devices.

2. Resource authentication: The end devices are authenticated by the
edge as trusted parties and are allowed to participate in the training
process. Unauthenticated devices cannot participate in the training.
To this end, each previously discovered end device sends a NON
message to the edge containing the ID of the transmitting device.
The edge informs a device that it failed to receive the
corresponding ID by dispatching an RST message. Once the edge has
collected the IDs of all the devices, it performs the device
authentication process, which designates the end devices that will
participate in the FL process. Finally, each device is informed
about the edge decision by a NON message that contains the
authentication outcome. Only authenticated end devices are eligible
to participate in the FL training.

3. Device scheduling: The edge gateway selects how many of the
authenticated end devices will participate in the training
and informs them about its decision. To this end, it dispatches a
NON message containing the scheduling information to each of the
authenticated devices. A device sends back an RST response in case
of transmission failure, which causes the edge to retransmit the
message. After successful transmission of the original NON message,
the eligible devices proceed to the device configuration phase.

4. Device configuration: The edge gateway uses CoAP to broadcast
the FL model hyperparameters to the end devices so that they can
properly configure their local models. To this end, the end devices
first dispatch a NON message informing the edge about their
available computational resources. The edge sends back an RST
response in case of transmission error, or no message in case of
successful delivery. The edge then processes the obtained
information and designates the model architecture and ML parameters
that will be used for the FL process. Finally, it broadcasts these
decisions back to the end devices through a NON message, and all
the eligible devices enter the training phase.

After the initial configuration process completes, the Edge IoT protocol
continues to the FL training stage.

+------------------------------------------------------------------+
|  +-------------+                 +--------------+                |
|  | End devices |                 | Edge gateway |                |
|  +-------------+                 +--------------+                |
|         |   Non message {Resource type}  |                       |
|         |------------------------------->|                       |
|         |                                |                       |
|         |                      +------------------+              |
|         |                      |Resource discovery|              |
|         |                      +------------------+              |
|         |                                |                       |
|         |   Non message {discovery}      |                       |
|         |<-------------------------------|                       |
|         |   Non message {Device ID}      |                       |
|         |------------------------------->|                       |
|         |                                |                       |
|         |                    +-----------------------+           |
|         |                    |Resource Authentication|           |
|         |                    +-----------------------+           |
|         |                                |                       |
|         |   Non message {Authentication} |                       |
|         |<-------------------------------|                       |
|         |                       +-----------------+              |
|         |                       |Device scheduling|              |
|         |                       +-----------------+              |
|         |                                |                       |
|         |  Non message {Scheduling info.}|                       |
|         |<-------------------------------|                       |
|         |                                |                       |
|         |   Non message {Avl. Resources} |                       |
|         |------------------------------->|                       |
|         |                                |                       |
|         |                        +----------------+              |
|         |                        |FL configuration|              |
|         |                        +----------------+              |
|         |                                |                       |
|         |   Non message {Hyperparameters}|                       |
|         |<-------------------------------|                       |
|         |                                |                       |
+------------------------------------------------------------------+
Figure 2: Protocol initial configuration stage.
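
The decision logic that the edge gateway applies across the four
steps of Figure 2 can be sketched in Python as a chain of filters.
This is an illustration under stated assumptions, not a normative
procedure: the Device record, the first-come-first-served scheduling
rule, the hyperparameter names and all function names are
hypothetical, and the CoAP message exchange itself is omitted.

```python
from dataclasses import dataclass
from enum import Enum


class DeviceClass(Enum):
    C0 = 0
    C1 = 1
    C2 = 2
    C3 = 3  # "Other" in the classification of Section 3


@dataclass(frozen=True)
class Device:
    device_id: str
    device_class: DeviceClass


def discover(devices, allowed_classes):
    """Step 1: keep devices whose reported resource type is accepted."""
    return [d for d in devices if d.device_class in allowed_classes]


def authenticate(devices, trusted_ids):
    """Step 2: keep devices whose ID appears in the trusted ID list."""
    return [d for d in devices if d.device_id in trusted_ids]


def schedule(devices, max_participants):
    """Step 3: pick at most `max_participants` devices.

    Assumption: first come, first served; the draft leaves the
    selection policy open.
    """
    return devices[:max_participants]


def configure(devices, hyperparameters):
    """Step 4: give each scheduled device its own copy of the hyperparameters."""
    return {d.device_id: dict(hyperparameters) for d in devices}


def initial_configuration(devices, allowed_classes, trusted_ids,
                          max_participants, hyperparameters):
    """Run the four steps in order and return the per-device configuration."""
    discovered = discover(devices, allowed_classes)
    authenticated = authenticate(discovered, trusted_ids)
    scheduled = schedule(authenticated, max_participants)
    return configure(scheduled, hyperparameters)
```

Each step only narrows the set produced by the previous one, which
mirrors the order of the exchanges in Figure 2: a device that fails
discovery is never asked for its ID, and a device that fails
authentication is never scheduled.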

4.2. FL training

The FL training is the stage in which the actual FL takes place.
Fig. 3 depicts the functionalities we employ in order to support the
FL process. These functionalities are as follows:

1. Local model training: The end devices that are eligible to
participate in the FL training send a NON message to request the ML
model from the edge. The edge responds with an RST message if
necessary, to trigger a retransmission of the original NON message.
The edge then dispatches the global model to the end devices, again
using the NON message format. A device responds with an RST message
in case the transmission resulted in errors, and the edge then
retransmits the NON message to the corresponding device. Afterwards,
each device proceeds to train the model locally using its local data
set.

2. Device reporting: Once a device completes the local model training,
it dispatches its model to the edge gateway through the device
reporting process. Due to the constrained nature of the participating
devices, the end device-edge communication uses the NON message
format. To this end, the devices dispatch their IDs and the locally
trained models to the edge via NON messages, which are not followed
by an ACK from the edge. If the edge fails to obtain a local model,
it notifies the corresponding end device with an RST reply, which
triggers a retransmission of the original NON message to the edge.
After the edge obtains every local model, it conducts the global
model aggregation process and produces one global model, which is
broadcast back to the devices. The FL training process is repeated
until the predefined number of FL rounds is reached.

After the FL training completes, the edge computing protocol enters
the cloud update stage.

+------------------------------------------------------------------+
|  +-------------+                 +--------------+                |
|  | End devices |                 | Edge gateway |                |
|  +-------------+                 +--------------+                |
|         |   Non message {Model request}  |                       |
|         |------------------------------->|                       |
|         |                                |                       |
|         |   Non message {Global model}   |                       |
|         |<-------------------------------|                       |
|   +------------+                         |                       |
|   | Local model|                         |                       |
|   |  training  |                         |                       |
|   +------------+                         |                       |
|         |  Non message {Local model}     |                       |
|         |------------------------------->|                       |
|         |                                |                       |
|         |                        +------------------------+      |
|         |                        |Global model aggregation|      |
|         |                        +------------------------+      |
|         |   Non message {Model request}  |                       |
|         |------------------------------->|                       |
|         |                                |                       |
|         |   Non message {Global model}   |                       |
|         |<-------------------------------|                       |
|   +------------+                         |                       |
|   | Local model|                         |                       |
|   |  training  |                         |                       |
|   +------------+                         |                       |
|         |                                |                       |
|         |                                |                       |
|                                                                  |
+------------------------------------------------------------------+
Figure 3: Protocol training stage.
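
The round structure of Figure 3 can be illustrated with a small
Python simulation. It is a sketch under stated assumptions rather
than the protocol itself: the one-parameter model y = weight * x,
the gradient-descent settings, the unweighted mean used for
aggregation and the function names are all hypothetical, and the
NON/RST message exchange is replaced by plain function calls.

```python
def train_locally(weight, dataset, learning_rate=0.05, epochs=5):
    """Gradient descent on mean squared error for the toy model y = weight * x.

    Assumption: `dataset` is a list of (x, y) pairs that never leaves
    the device.
    """
    for _ in range(epochs):
        gradient = sum(2 * (weight * x - y) * x for x, y in dataset) / len(dataset)
        weight -= learning_rate * gradient
    return weight


def run_federated_training(local_datasets, rounds, initial_weight=0.0):
    """Repeat dispatch, local training, reporting and aggregation."""
    global_weight = initial_weight
    for _ in range(rounds):
        # Each device starts from the current global model and trains on
        # its own data; only the resulting weight is reported to the edge.
        reported = [train_locally(global_weight, data) for data in local_datasets]
        # Unweighted mean as a stand-in for the global model aggregation.
        global_weight = sum(reported) / len(reported)
    return global_weight
```

With three devices whose private data all follow y = 3 * x, calling
run_federated_training with a predefined number of rounds moves the
global weight towards 3 even though the edge never sees any (x, y)
pair, only the reported weights.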

4.3. Cloud update

Fig. 4 below depicts the cloud update stage of the Edge computing
protocol which is invoked after the FL training completes.
Cloud update consists of the following functionalities:

1. Model sharing: The edge gateway informs the cloud of its
intention to upload the trained FL model. The cloud then
authenticates the edge and decides whether it can be considered a
trusted party. When the model sharing process completes
successfully, the edge is authenticated and can proceed to the
model publish functionality. Because no IoT devices participate in
this communication, we use the more reliable CON message format
instead of relying on NON messages. To this end, the edge
dispatches a CON message containing its ID to the cloud to inform
it that the FL process has been completed. The cloud responds with
an ACK or RST reply that indicates whether the initial request was
successfully delivered. The cloud then performs the edge
authorization procedure according to the received ID and sends a
CON message to the edge that contains the authorization result.

2. Model publish: The edge sends the trained model and the model
version to the cloud through a CON message. The edge then waits
for an ACK or RST reply, depending on the success of the
transmission. If the model is transmitted without errors, the
cloud responds with an ACK message. Transmission errors instead
result in an RST reply from the cloud, which triggers a
retransmission from the edge. When the cloud successfully obtains
the trained ML model, it stores it and makes it available to the
users.
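
The dependency between the two functionalities above, where model
publish is only possible after a successful model sharing step, can
be sketched as follows. The Cloud class, its method names, the
trusted edge ID list and the version-keyed store are hypothetical
and serve only to illustrate the ordering; CoAP transport and real
authentication are left out.

```python
class Cloud:
    """Toy cloud endpoint: authorizes edge gateways, then stores models."""

    def __init__(self, trusted_edge_ids):
        # Assumption: the cloud is provisioned with a list of trusted edge IDs.
        self._trusted = set(trusted_edge_ids)
        self._authorized = set()
        self.models = {}  # maps model version -> model

    def share(self, edge_id):
        """Model sharing: return True if the edge is a trusted party."""
        if edge_id in self._trusted:
            self._authorized.add(edge_id)
            return True
        return False

    def publish(self, edge_id, model, version):
        """Model publish: store the model only for an authorized edge."""
        if edge_id not in self._authorized:
            raise PermissionError("edge gateway has not completed model sharing")
        self.models[version] = model
```

An edge gateway would first call share with its ID and proceed to
publish only if the authorization result is positive; a publish
attempt from an edge that skipped or failed model sharing is
rejected.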

+------------------------------------------------------------------+
|  +-------------+                      +-----+                    |
|  |Edge gateway |                      |Cloud|                    |
|  +-------------+                      +-----+                    |
|         |     CON message {Edge ID}      |                       |
|         |------------------------------->|                       |
|         |                                |                       |
|         |         ACK/RST reply          |                       |
|         |<-------------------------------|                       |
|         |                         +--------------+               |
|         |                         |Authentication|               |
|         |                         +--------------+               |
|         |  CON message {authorization}   |                       |
|         |<-------------------------------|                       |
|         |                                |                       |
|         |         ACK/RST reply          |                       |
|         |------------------------------->|                       |
|         |                                |                       |
|         | CON message {Model, version}   |                       |
|         |------------------------------->|                       |
|         |                                |                       |
|         |         ACK/RST reply          |                       |
|         |<-------------------------------|                       |
|         |                           +-----------+                |
|         |                           |Model store|                |
|         |                           +-----------+                |
|         |                                |                       |
|         |                                |                       |
+------------------------------------------------------------------+
Figure 4: Protocol cloud update stage.
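
The reliability of the CON exchanges in Figure 4 comes from the
retransmission rule of RFC 7252. The Python sketch below simulates
that rule with the default transmission parameters of RFC 7252
(ACK_TIMEOUT of 2 seconds, ACK_RANDOM_FACTOR of 1.5, MAX_RETRANSMIT
of 4). The deliver callback, which stands in for the network and
returns "ACK", "RST" or None for a lost message, is a hypothetical
test hook, and no real time passes. Following RFC 7252, the sketch
retransmits only when no reply arrives; how an RST is handled at the
application level is left to the caller.

```python
import random

ACK_TIMEOUT = 2.0        # seconds, RFC 7252 default
ACK_RANDOM_FACTOR = 1.5  # RFC 7252 default
MAX_RETRANSMIT = 4       # RFC 7252 default


def timeout_schedule(rng):
    """Timeouts for the initial transmission and each retransmission.

    The first timeout is drawn between ACK_TIMEOUT and
    ACK_TIMEOUT * ACK_RANDOM_FACTOR, then doubled every time.
    """
    initial = rng.uniform(ACK_TIMEOUT, ACK_TIMEOUT * ACK_RANDOM_FACTOR)
    return [initial * 2 ** i for i in range(MAX_RETRANSMIT + 1)]


def send_confirmable(payload, deliver, rng=None):
    """Simulate sending a CON message.

    Returns (reply, retransmissions, simulated_wait_seconds), where
    reply is "ACK", "RST" or None if every attempt timed out.
    """
    rng = rng or random.Random()
    waited = 0.0
    for attempt, timeout in enumerate(timeout_schedule(rng)):
        reply = deliver(payload)  # hypothetical network hook
        if reply in ("ACK", "RST"):
            return reply, attempt, waited
        waited += timeout  # no reply: wait out the timeout, then retry
    return None, MAX_RETRANSMIT, waited
```

With the default parameters, a sender gives up after at most four
retransmissions, having waited between 62 and 93 simulated seconds
in total, which is why CON is a reasonable choice for the less
frequent but larger edge-to-cloud transfers.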

5. Security Considerations

The FL training process is a difficult task, as the achievable
accuracy of the model is affected by the characteristics of the local
data sets. Local data sets are the data collected by the end devices
and stored locally on each device. To ensure data privacy, no data
exchange takes place between the end devices or between the end
devices and the edge gateway. The edge gateway thus aggregates the
local models without using any local data set information, and the
data privacy of each end device is preserved. Regarding data
security, the end device-edge gateway communication can be encrypted
using any existing encryption technique such as AES. Such an
encryption mechanism can be applied either to data sharing between
the end devices and the edge or to the messages exchanged between
those entities, similarly to [6]. The encryption mechanism can be
applied directly to the transmitted CoAP messages, provided that a
decryption process is deployed on the receiver side. Nonetheless, the
implementation and deployment of such a technique is outside the
scope of this work.

6. IANA Considerations

There are no IANA considerations related to this document.

7. Conclusions

In this draft we present an FL protocol suitable for distributed
ML in an IoT network. We provide a functional architecture that
consists of a number of end devices, an edge gateway and a cloud
server. To support the FL training process, we provide three
distinct protocol stages that coordinate the distributed learning
process: the initial configuration, the FL training and the cloud
update. Each stage provides the necessary functionalities to the
FL process. The FL training process is conducted using the CoAP
communication protocol and takes place between the end devices
and the edge gateway. After the training finishes, the trained FL
model is stored in the cloud and is made accessible to the users.

8. References

8.1. Normative References

[1] "IoT Edge Computing Challenges and Functions", IETF draft,
https://tools.ietf.org/html/draft-hong-t2trg-iot-edge-computing-01,
Jul. 2020.
[2] F. Pisani, F. M. C. de Oliveira, E. S. Gama, R. Immich, L. F.
Bittencourt, E. Borin, "Fog Computing on Constrained Devices:
Paving the Way for the Future IoT", arXiv,
https://arxiv.org/abs/2002.05300, Mar. 2019.
[3] "Distributed fault management for IoT Networks", IETF draft,
https://tools.ietf.org/html/draft-hongcs-t2trg-dfm-00, Dec. 2018.
[4] "IoT Edge Computing: Initiatives, Projects and Products",
IETF draft,
https://tools.ietf.org/html/draft-defoy-t2trg-iot-edge-computing-background-00,
May 2020.
[5] IETF iot-edge-computing draft,
https://www.potaroo.net/ietf/idref/draft-hong-t2trg-iot-edge-computing/#ref-RFC6291
[6] M. A. Rahman, M. S. Hossain, M. S. Islam, N. A. Alrajeh
and G. Muhammad, "Secure and Provenance Enhanced Internet
of Health Things Framework: A Blockchain Managed Federated
Learning Approach", IEEE Access, vol. 8, pp. 205071-205087,
Nov. 2020.

8.2. Non-normative References

[RFC 7252] "The Constrained Application Protocol (CoAP)",
https://tools.ietf.org/html/rfc7252, Jun. 2014.


9. Acknowledgments

This work has been funded by the NGI TRUST 3rd Open
Call with reference number: 2019003.



Copyright (c) 2021 IETF Trust and the persons identified
as authors of the code. All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:

    Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.

    Redistributions in binary form must reproduce the above
    copyright notice, this list of conditions and the following
    disclaimer in the documentation and/or other materials provided
    with the distribution.

    Neither the name of Internet Society, IETF or IETF Trust, nor
    the names of specific contributors, may be used to endorse or
    promote products derived from this software without specific
    prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES,
INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR
TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
OF THE POSSIBILITY OF SUCH DAMAGE.

Authors' Addresses

Fotis Foukalas
Cognitive Innovations
Kifisias 125-127, 11524, Athens, Greece
Email: fotis@cogninn.com

Athanasios Tziouvaras
Cognitive Innovations
Kifisias 125-127, 11524, Athens, Greece
Email: thanasis@cogninn.com