
Minutes IETF118: rift
minutes-118-rift-01

Meeting Minutes Routing In Fat Trees (rift) WG
Date and time 2023-11-07 14:30
Title Minutes IETF118: rift
State Active
Other versions plain text
Last updated 2023-12-01

Thanks to Yuehua Wei (ZTE) for taking the minutes.

Video recording: https://www.youtube.com/watch?v=dOclSQ9wJu0

------------------------------------------------------------

- Chairs' remarks - Jeff & Jeffrey, 5 minutes

- Updates from Jordan Head, 20 minutes
  - Base spec: https://datatracker.ietf.org/doc/draft-ietf-rift-rift/
Jim [AD]: Hey Jordan, thanks for that and for taking care of all of that for
me; very much appreciated. The only comment I had was the TTL thing: you're
going to put some text in the applicability document on that, right? Is that
something you're doing? I'd like to try to get that moved at the same time, so
if we can kill two birds with one stone that would be great.

Jordan: Yeah, I've been driving the discussion a bit and gotten some responses,
but I think we're at the point where I'll just write some text and propose it;
ask for forgiveness rather than permission, so to speak, just to make sure
we're covered. There were a couple of other points on the applicability side,
but nothing that relates to anything normative like this.

Jim: So that one's pretty much done as well, right?

Jordan: Minor stuff. Okay, perfect, thanks.

  - KV registry: https://datatracker.ietf.org/doc/draft-ietf-rift-kv-registry/
Jeffrey [Chair]: I have a question. I had thought this document was about the
registry itself, but it seems we have now added this mechanism for key targets,
and it's handling all those things. Should the draft be renamed and the title
changed?

Jordan: That's probably not an awful idea. I think Tony's going to mention it,
but we've had other mechanisms described there as well besides key targets,
like the southbound tie-breaking and so forth. But go ahead...

Tony: The only thing changing in the RIFT document is this: first we thought we
would have the key, which was always outside, with the content just a blob, and
we wanted to throw the target into this blob. But we decided to split it into
the key, the target, and then the blob, which is really the value, right? So
the schema changed, and we have to register this code point for the content of
the key-value TIE. But in terms of what this thing does, the RIFT spec doesn't
say anything; that's all still farmed out to the key-value spec.

Jeffrey [Chair]: Right, so the KV registry spec is really not only about the
registry but also about the behavior.

Tony: Right, it also specifies the behavior of this field. Since we defined the
tie-breaking of the key-value store in the RIFT spec, you could argue that we
should put the target text into the RIFT spec as well.

Jeffrey [Chair]: No, that's not my point. This KV registry document also
defines the target behavior, so I think the document should be renamed.

Tony: Fair enough.

Jordan: Yes, I'll take that for next time, Jeffrey, that's fine. Since it's
just a title change it's not a big deal.

Sandy Zhang: In some scenarios, may I understand the key target as the route
target in MP-BGP? Can it be used like that?

Tony: Yes, no, maybe. People are probably confused because I'm not sure
everybody knows how a Bloom filter works. The idea is fairly simple: you take
something fairly big and generate multiple hash functions of it. Say you get
three hash functions, three bits, and you flip those three bits on; that way
you can put 100,000 targets into 64 bits, okay? Of course you will get false
positives, but you don't get false negatives, so you may address more people
than you intend, but you have a very small filter.

Whereas with the route target in BGP you have, by policy, a perfect match; here
you don't have a perfect match. You have something that statistically works
incredibly well, but it delivers false positives and you have to deal with
that. It's a well-known, often-used technique with plenty of research papers.
But the equivalence with the route target breaks down here because it's not a
perfect match, right?
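[Minute-taker's note: Tony's Bloom-filter description can be sketched in a few
lines. This is purely illustrative; the class, the SHA-256-based hashing, and
the 64-bit/3-hash parameters are just the example values from the discussion,
not anything specified in the KV draft.]

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions each set one bit in an m-bit word."""

    def __init__(self, m_bits=64, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = 0  # the whole filter fits in a single integer

    def _positions(self, item):
        # Derive k (roughly independent) bit positions from hashes of the item.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # True may be a false positive (addressing more targets than intended);
        # False is always correct, i.e. no false negatives.
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = BloomFilter()
bf.add("target-1")
assert bf.might_contain("target-1")  # an added target is never missed
```

This is exactly the trade-off Tony describes: a 64-bit filter can summarize a
huge target set, at the cost of occasionally matching nodes that were never
added.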

- Update on the interop testing in Hackathon, Tony P., 10 minutes

- RIFT meets dragonfly:
https://datatracker.ietf.org/doc/draft-przygienda-rift-dragonfly/, Tony P., 25
minutes

Jeffrey [Chair]: Tony, do you want to take the questions now or later?

Rod Van Meter (Keio University): It's a clarification about this diagram.

Tony: Sure. I know, it's tons of information, and it gets worse.

Rod Van Meter: This diagram, you said eight links?

Tony: Eight edges. There's no node in the middle; it's an octagon. It's a
regular structure: the original dragonfly was just a full mesh, with these
little wings which are all full meshes, and if you have four of them and align
them correctly it looks like a dragonfly. Think of it this way: those are the
routers, and those are the two planes and how they are connected. Yes, I should
probably have drawn big blocks, but I was too lazy, and there would be two more
of these on the top-left, or rather think of them as behind. It doesn't matter,
because those are Clos planes; those are Clos fabrics on the Dragonfly+, as far
as I could figure out, because there's really no clean paper that explains in
research terms what it is. That was as much as I could reconstruct from all
kinds of ideas flying around.

Linda: I'm a little confused by this picture too. You're saying the red
nodes...

Tony: There are no red nodes, it's just a red plane.

Linda: Red plane.

Tony: Yeah, you could say so. Those would be the red nodes.

Linda: But you have a box connecting the green and red. Does that mean it's
just one node?

Tony: That's an important concept: you can see it in different ways. You can
see two completely disconnected planes, you can see half a full mesh, or you
can see two planes that you can somehow connect together; you keep those links.

Linda: So I see the red plane is only one hop away from each node. Why do you
say there are two hops?

Tony: What do we have in the middle? This cross? Nothing, nothing; it's just
that if you start to draw things like that, the lines intersect. So yeah, my
bad. Only those things are nodes; it's an octagon.

Linda: Okay, so the middle one is not really a connection; there's nothing
there.

Tony: Sorry, there are six red links, in fact. There is no node in the middle;
there's nothing there. So you've got shortest paths and non-shortest paths.

Linda: So you have one-hop paths and some two-hop paths.

Tony: Two hops and one hop, yes. Sorry, that was implicit. Okay, cool.

Jeff [Chair]: I'll bring up a couple of points and we can start the discussion.
1) Why is this important? People have been trying dragonfly-like topologies in
the data center to save on interfaces, but practically the complexity doesn't
justify the deployment. Where it becomes really interesting is when you use the
inter-links to interconnect data centers, as Tony said, and there's a very
important point: today it's pretty much impossible to get a data center over 50
megawatts; there's just no power to cool it and power it. So a lot of people
have started building data centers in the US in pairs of 50-60 megawatt data
centers, and this naturally leads to this kind of topology. I've got twice 50
megawatts of data center, and within each data center you run whatever you
like, most probably MP-BGP. That's number one; this is why this is so
important.
2) Number two: this provides you loop-free routing. It doesn't explain how to
get traffic onto the links, but practically the cheapest way is to go on the
shortest link, which also gives you low latency. You want to be able to use the
longer links, but you need to understand that, again looking at the target,
this is really a machine-learning cluster. In collective operations you cannot
afford to have parts of the collective see different latency, because it's all
about job completion time, so you need to make sure that whatever your GPU is
sending follows the same path. How do you get traffic through in case of
congestion on another link? Again, another problem to solve, not here. But
practically you need to know when to switch from shortest path to non-shortest
path, and that's not in the routing protocol, at least as of now. Adaptive
routing has applicability here: if you try to do load balancing more granular
than per-flow, you end up in a case where some of your packets go on the
shortest link and some don't, and performance drops to 3%. So it's really
important to understand, from an applicability perspective, how to deploy this
and how to signal potential congestion, available bandwidth, or failures on the
inter-fabric links. All of this will need to be worked out, at least some of
it. This is where I think we should start the discussion.
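[Minute-taker's note: the shortest-first, fall-back-under-congestion behavior
Jeff describes is explicitly not specified by any draft yet. Purely as a
sketch, with the function name, congestion map, and 0.8 threshold all being
assumptions, such a selection rule might look like:]

```python
def pick_path(shortest_paths, non_shortest_paths, congestion, threshold=0.8):
    """Illustrative only: prefer a direct (shortest) path, falling back to a
    non-shortest path only when every direct link is congested past
    `threshold`. `congestion` maps a path name to its worst-link utilization
    in [0, 1]."""
    usable = [p for p in shortest_paths if congestion[p] < threshold]
    if usable:
        # Keep a whole flow on one path so a collective sees uniform latency.
        return min(usable, key=lambda p: congestion[p])
    # All direct links congested: take the least-loaded two-hop path.
    return min(non_shortest_paths, key=lambda p: congestion[p])
```

The point of the sketch is Jeff's caveat: the decision needs congestion and
failure signals that the routing protocol does not carry today.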

Tony: Yeah, but the nice thing, if you start to look at this, with the direct
path and this one alternative hop, is that this is an incredibly resilient
structure. You have to kill tons of connectivity before this thing literally
starts to become unreachable. When I was looking at the stuff, if you build,
say, three Clos planes and then this thing in between, you have to nuke it
before the stuff actually stops having any path to get anywhere. I kind of
hated dragonflies; I thought they were too dense and nobody could figure out
the routing. Now I'm starting to like them, of course, because I think I've
figured them out. All right, I think that's it.

Jeff [Chair]: One more comment. There's a draft crossing the Routing working
group that focuses on BGP in Dragonfly+, if you want it in terms you're more
familiar with: VRFs, BGP policies. It explains how this can be done with BGP
policies, where rather than understanding whether a link comes from the fabric
or from the inter-link, you just use different VRFs, and you can't really use
an IS-IS path to figure out where you are. It will help you better understand
the applicability of a regular routing protocol to this.

Dima: Thanks, Tony, it's really impressive what you did with RIFT. I just want
to comment on the computation-scalability problem, because essentially, if we
are trying to use the silicon to the maximum, then the number of groups will
probably be half the radix of the top-of-fabric switches plus one: we have half
the interfaces going south and half going north to other groups, and the plus
one is our local group. That could be 33 or 65 for the current generation of
silicon, something like that. But I think there's no need to do the full
computation for every group, because the reason to do the full computation is
if you're going to go through an intermediate group and reach the leaves in
that group.
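[Minute-taker's note: Dima's arithmetic can be spelled out in a tiny sketch.
The radix values below are only illustrative of current switch silicon, not
figures from any draft.]

```python
def max_groups(tof_radix):
    """Half the ToF ports go south (into the local fabric), half go north
    (one link per remote group); plus one for the local group itself."""
    return tof_radix // 2 + 1

for radix in (64, 128):
    print(radix, "->", max_groups(radix))  # 64 -> 33, 128 -> 65
```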

Tony: No. The only reason to do the full computations I mentioned is if you
really want negative and positive disaggregation to tackle the cases where you
have to steer into the fabric on the direct plane, because you can only reach
the leaf in the other fabric via this plane. It's the cases where the other
fabric breaks and forces your fabric to disaggregate all the way. You see my
point; I don't even know how many links I would have to break, and how. That's
the only reason to run the computations.

Dima: Yeah, my point is that it's probably possible to do fewer computations
than a full computation for every member fabric.

Tony: Yeah, that's what I wrote. I said: leave out the inter-fabric links when
you do positive and negative disaggregation, and it's good enough, most likely.

Dima: Yeah, because negative disaggregation is needed if you cannot go through
a particular top-of-fabric switch.

Tony: Yeah, totally right.

Dima: And I wanted to second what Jeff said: there is a power-scalability
problem with how much you can get out of one particular data center, but this
topology looks like a good fit for data-center campuses, or any aggregation of
data centers which are not too far from each other, where you want more or
less uniform connectivity and a lot of bandwidth between them, because it
scales better than trying to add yet another level to the Clos.

Tony: Right. So if anybody feels like a little mental exercise, especially the
profs here: now imagine running this thing on an optical ring,
counter-rotating. What will happen if the ring gets cut in one place? What will
this topology look like, and what will happen? Because that's the next layer of
problems in the network, right? Because you run this whole thing on lambdas
over a ring.

Dima: Yeah, that's it for me. Thanks.

Jeff [Chair]: Next question.

Jingyou: I'm from Fiberhome. I don't have any comments, just a minor
suggestion. Those figures look handsome, but they are a little difficult for me
to understand, so I suggest maybe we add some formulas or give some examples or
use cases.

Tony: I used to be in academia; I don't do formulas anymore. I could write it
beautifully in three formulas, and I could talk to you about Banyan trees and
Banyan-tree formulas, and nobody would grok anything whatsoever. I'm reserving
that for the journal paper.

Linda: From Futurewei. I'm just curious: you have multiple planes, and each
plane has its own topology. Can you use different IS-IS areas to solve the
problem? A plane could be area two and...

Tony: Look, you could run IS-IS in the core, right? But then we wouldn't have
extended RIFT; you could only do shortest-path, one hop. You don't get
bisectional bandwidth with IS-IS unless you hack IS-IS to the point where it's
not IS-IS anymore. So...

Linda: But you can use some kind of policy on the side so that you can...

Tony: IS-IS doesn't have policies.

Jeff [Chair]: That's why we use BGP.

Linda: How about using BGP? There is a draft... we have a draft on that:
basically it has some kind of metrics to influence the path selection, so
instead of choosing the shortest path, we add some other weight so that with
that weight added the longer path may be chosen.

Tony: My comment would be: once your policy grows complicated enough, you may
as well start to carry the packets by hand; that may be more efficient.

Linda: Yeah, of course. But here we're talking about multiple paths, where the
shortest path may not be the best path, and how we balance them.

Tony: Dima has a draft where he has shown, basically with a lot of VPNs, how
you can solve that stuff, because the horizon idea is actually Dima's idea, not
mine. I was standing in front of this sucking my teeth: how the hell do you
compute the shortest path properly here? It was Dima's idea that we can
actually build a horizon, because he built the horizon using VPNs in BGP,
because that's how you use them; they basically reflect the horizon. That's the
BGP mechanism.

Linda: Okay, so do we have some ideas on how to do this?

Jeff [Chair]: Oh, we know exactly how to do it with BGP. That was presented at
the last RIFT working group meeting.

Tony: Yeah, we talked about the BGP stuff, modulo little details like where the
couple of hundred lines of BGP policy go and how you stitch that stuff properly
so it doesn't break. Plus, of course, the BGP will stitch with the VPNs, and
you have to start to think: okay, where are your tunnels? What will happen
here? Because the tunnels start to develop their own logic about how to go from
one place to another, and you have to control them so they take the path that
you want. But it's all doable. Like I say, ultimately you can get enough people
to carry packets by hand, and with enough beating you will get what you want.

Jeff [Chair]: There's another level of complication when you start doing
overlay, which is mandatory if you do multi-tenancy, right? If you do it on the
switch, think about VXLAN and VPN, which is the common way to do it today:
you're going to build a structure with an underlay VPN and another VPN for the
tenant. It becomes really complex to manage.

Linda: It may not be VPN per se, but anyway, I'll just throw some ideas out
here.

Tony: Yeah, it's solvable. This is trying to solve it in a very ZTP way with a
very cheap forwarding plane; that was always RIFT, right?

Sandy Zhang: ZTE. I'd like to make sure I understand right: how do the ToF
nodes know whether a flow is intra-fabric or inter-fabric?

Tony: That's a very justified question. That's where RIFT solves the problem
and where BGP will have a hard time, right? We know the direction of the
fabric, so we know who is south and who is north, and now we can differentiate
whether it's an inter-fabric link or a horizontal link. On the inter-link, the
adjacency will clearly tell you which horizon it is on.

Sandy Zhang: Yes. I think the FIB, the forwarding table in the ToF, will show
whether the route is inter-fabric or intra-fabric, so when the ToF receives the
flow, it will know how to forward it.

Tony: Correct; which FIB to throw it to, precisely. And thanks to ZTE, because
we spent a lot of time at the hackathon starting to ask questions. I had only
drawn a very simple case, like a three-node thingy and a four-node thingy, and
they asked "yeah, and five?" and I wasn't sure, so I actually had to draw the
figure to work out this presentation, because with three, oversimplified,
everything works. But this is exactly how it worked: the incoming interface
tells you which FIB to go to. I was slightly skeptical of that demand, and I
looked, and yes, even the cheapest silicon can do this these days, because it's
actually a very common requirement if you run any kind of VRF: you have to know
this is a VRF link, so it's a completely different FIB; otherwise it won't
work.
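[Minute-taker's note: the per-incoming-interface FIB selection Tony describes
can be sketched as follows. This is purely illustrative; the table layout,
interface names, and prefixes are all made up, not from the draft.]

```python
# Each incoming interface is bound to one FIB, exactly as a VRF binding would
# do; inter-fabric (horizontal) links get a separate FIB from intra-fabric
# links.
FIBS = {
    "intra": {"10.0.0.0/24": "eth-south-1"},    # routes within the local fabric
    "inter": {"10.1.0.0/24": "eth-horizon-2"},  # routes learned over inter-fabric links
}

IFACE_TO_FIB = {
    "eth-north-1": "intra",    # normal fabric link
    "eth-horizon-1": "inter",  # inter-fabric (horizontal) link
}

def forward(in_iface, prefix):
    """Pick the FIB based on the incoming interface, then look up the prefix."""
    fib = FIBS[IFACE_TO_FIB[in_iface]]
    return fib.get(prefix)  # None if no route in that FIB

print(forward("eth-horizon-1", "10.1.0.0/24"))  # eth-horizon-2
```

As Tony notes, this is the same lookup structure any VRF-capable silicon
already implements: the ingress port selects the table before the longest-match
lookup runs.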

Sandy Zhang: So maybe some flag could be added in the forwarding table to
distinguish them?

Tony: How you solve it would be over-specification. The spec tells you: look,
this is the computation that you used to build this FIB...

Jeff [Chair]: In BGP it's configuration logic: you have two different virtual
routers to treat fabric and inter-fabric routes. Here, based on the fabric ID,
you see it's you or it's not you, so it's built into the protocol; you don't
need an additional management task to identify a particular interface.

Tony: Okay, so please look over the stuff. Maybe you'll find the whole thing is
just made up, I don't know. I'm pretty confident that this stuff holds up, but
who knows; it's never been done before. I never saw any kind of dynamic routing
for dragonflies where anybody explained how it's supposed to work. All this
fancy stuff like dragonflies, hypercubes, or toroidal meshes we used in
supercomputers, where links never fail, so it's simple: dynamic routing is
overvalued. This is the first time I see something cooked up, except Dima's
stuff, which is basically stitching BGP magic, so it's not really routing; it's
more like hand-carrying packets the right way with a lot of policy magic. Which
is fine; a lot of people seem to consider that pretty good job security these
days.

Jeff [Chair]: Okay, thanks Tony, great presentation. Academia has been trying
to solve non-shortest-path routing probably for as long as routing has existed;
this is a very good example of how it can be done with the right protocol,
simply and elegantly. We are exactly on time. Thirty minutes from now we are
going to have an AIDC side meeting, which will talk in more detail about the
workloads these kinds of topologies are dedicated to; it's really a
machine-learning application. We'll figure out how to record it, hopefully
somewhere in the cloud. Thanks everyone, and we'll see you in Australia.