Minutes IETF110: irtfopen
minutes-110-irtfopen-00

Meeting Minutes IRTF Open Meeting (irtfopen) RAG
Date and time 2021-03-08 12:00
Title Minutes IETF110: irtfopen
State Active
Last updated 2021-03-09

IRTF Open Meeting
=================

Monday, 8 March 2021, at 12:00-14:00 UTC
Room: Room 2

Chair: Colin Perkins
Minutes: Mat Ford

## 12:00 Introduction and Status Update
      Colin Perkins, IRTF Chair

Slides:
https://datatracker.ietf.org/meeting/110/materials/slides-110-irtfopen-agenda-00

## 12:15 Learning in situ: a randomized experiment in video streaming
      Francis Y. Yan, 2021 ANRP Recipient

  Slides:
  https://datatracker.ietf.org/meeting/110/materials/slides-110-irtfopen-learning-in-situ-a-randomized-experiment-in-video-streaming-00

  Discussion
  ----------

Ali Begen: I am entirely sold on the idea of having all the sessions
incorporated into a single algorithm's confidence intervals (it is obvious that
people in different places/networks, watching different videos do get different
performance from the same player/algorithm).

mglt: I am wondering whether there is any assumption that the traffic remains
the same, i.e. has the same characteristics, during the two years, or whether
that is not a necessary assumption.

Francis Yan (FY): We have not found any significant difference over time that
requires retraining, but we retrain the model every day and monitor its
performance in case there is the need to adapt to time-varying characteristics.

Spencer Dawkins: I'm curious what this will look like over QUIC

FY: That'd be fun to try! We only tried BBR and Cubic, and saw BBR outperform
Cubic reliably (with a large amount of data collected from randomized
experiment), but we only used BBR in the paper.

Audrey Randall: Did the "two years" number mean the years of video streamed or
the time the experiment was conducted over?

FY: It was referring to experiment time; not the wall clock time.

David Oran: Did you investigate the possibility that having a single server
(complex) in a single topological location on the Internet might bias the
results in some way?

FY: That's a good question that we have not investigated. We only have one
server at Stanford unfortunately, and admittedly adding CDN or multiple servers
may influence the findings.

Jonathan Hoyland: Did you see users refreshing the page more often or having
shorter sessions on the worse performing algorithms?

FY: We had a figure that plots the distribution of watch times; users chose to
watch Fugu longer, but beyond that observation we didn't fully understand
exactly what factors caused the difference.

Mirja Kühlewind: There are actually more optimized congestion controllers for
video; see the RMCAT working group (e.g. SCReAM or NADA).

FY: Yes, we were aware of SCReAM and other congestion-control schemes that are
proposed for video streaming specifically, but ABR doesn't actually require as
low latency as videoconferencing. Reliable transmission like BBR works just
fine in our experience, but you're right that it's a potential direction for
future work to continue the co-design of CC and ABR.

Mirja: Yes, that's what I was thinking of; more co-design of the two.

Spencer Dawkins: I heard you say that your users were in the US. Have you
thought about people further away?

FY: Unfortunately we are only approved by Stanford lawyers to stream to U.S.
users currently to avoid potential legal issues

Spencer: I thought it might be something like that. Perhaps we should talk!

David Oran: One further observation/question: can Fugu predict whether having a
lower base quality in the manifest of encoded qualities would eliminate more
stalls? I ask because there's anecdotal evidence that stalls occur under
network conditions where no data would get through before the playout point
expires. If not, having lower qualities would likely produce better QoE than
allowing a stall.

FY: We already have a more fine-grained ?? compared with most industrial
players. We have 10 versions for each video chunk including 4 resolutions with
different CRF encoding parameters and we spread out the bitrates. We have a
dashboard to monitor whether the 10 bitrates are evenly spread out in terms of
their sizes and their S/N values. The lowest base quality is already low enough
for users.

Zaid AlBanna: On slide 29, does calculating the median rather than the mean
show similar spreads in the quality parameters?

FY: We didn't calculate the median values but I would assume they showed
similar results.

Mirja Kühlewind: Are the ABR algorithms you studied used by major VoD platforms
in practice?

FY: All four of the other ABR algorithms are research algorithms. In terms of
industry adoption, probably a variant of BBA is used by Netflix. It was
proposed by a colleague of mine at Netflix. It's a heuristic algorithm based
only on the buffer size. MPC and robust MPC were proposed in Sigcomm 2015. And
Pensieve was proposed in Sigcomm 2017 - they are both research algorithms but
I don't know if they're being used by industry.

MK: I understood that the algorithms used change very frequently so more
insight here could be helpful, although probably hard to get.

FY: Not as much of a black box as we'd originally thought. For instance BBA is
pretty simple - below a threshold playback buffer size (maybe 3 seconds) we ask
the video server to send the lowest quality, above 12 seconds we send the
highest quality, and between 3 seconds and 12 seconds we use linear
interpolation between the different bitrates, so that's pretty interpretable.
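
Editor's illustration: the buffer-based rule described above can be sketched
in a few lines of Python. This is a minimal sketch under stated assumptions,
not the BBA authors' or Puffer's implementation: the function name, the
parameter names, and the choice to pick the highest ladder rung that does not
exceed the interpolated target are all hypothetical. Only the 3 second and
12 second thresholds come from the discussion.

```python
from typing import Sequence


def bba_bitrate(buffer_s: float,
                bitrates_kbps: Sequence[float],
                lower_s: float = 3.0,
                upper_s: float = 12.0) -> float:
    """Choose a bitrate from the playback buffer level alone.

    lower_s and upper_s are the thresholds quoted in the discussion
    (about 3 s and 12 s); everything else here is an assumption.
    """
    if not bitrates_kbps:
        raise ValueError("need at least one bitrate")
    ladder = sorted(bitrates_kbps)
    if buffer_s <= lower_s:
        return ladder[0]       # buffer nearly empty: lowest quality
    if buffer_s >= upper_s:
        return ladder[-1]      # buffer comfortably full: highest quality
    # Linear interpolation between the lowest and highest bitrates.
    frac = (buffer_s - lower_s) / (upper_s - lower_s)
    target = ladder[0] + frac * (ladder[-1] - ladder[0])
    # Assumption: snap down to the highest rung not above the target.
    return max(rate for rate in ladder if rate <= target)
```

With a hypothetical ladder of 300, 750, 1200 and 2400 kbit/s, a 7.5 second
buffer sits halfway between the thresholds, so the interpolated target is
1350 kbit/s and the sketch selects the 1200 kbit/s rung.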

Ali Begen: To my knowledge, Netflix never confirmed or denied using BBA or any
other algorithm.

FY: BBA was proposed by TY Huang when she studied at Stanford advised by Nick
McKeown.

Ali: That was a decade ago, so probably revised since then.

FY: Yes, when we presented Puffer at Netflix they did not mention what they are
using.

Simon Leinen: Has there been similar work on videoconferencing systems (Zoom,
Jitsi, Meetecho :-)?

FY: Yes, the work corresponding to ABR in videoconferencing is bandwidth
estimation or congestion control for real-time video, which also adapts to
varying network conditions by changing the sending bitrate of the video
encoder.

Microsoft is organising a grand-challenge on bandwidth estimation for real-time
communications at MMSys this year (And here is the link:
https://2021.acmmmsys.org/rtc_challenge.php). Ali Begen is collaborating with
me on this.

Jonathan Hoyland: Did you have a chance to look at the P99 or the P99.9
performance of Fugu?

FY: The paper didn't include tail performance, but we did look at it. In the
figures we have included confidence intervals. We are confident that the
interval contains the mean value. P99 or P95 - I believe we checked this. I
agree that tail performance is critical.

CP: Can you say something about the challenges of running a research experiment
as a grad student at this scale?

FY: In my experience, having real users is like having real impact. So this is
exciting. But availability is the biggest challenge. When we write some code it
can work 99% of the time, but real users mean the system should always be
available. As soon as Puffer stopped working I would receive user complaints by
email - that's why we built a monitoring system. I receive alerts every time
there's a service interruption, so I'm on call 24/7. I'm almost the only
engineer working on developing this so it's really hard to maintain. We're
testing research schemes and research code tends to be messy. Our code is
relatively high quality, but running other research code on the production
system causes outages.

CP: To what extent do the types of issues you are running into apply to other
types of network measurement research? Would you find similar issues with
confidence intervals if you repeated other kinds of measurement experiments?

FY: Yes, I would expect our findings to generalise to other network measurement
experiments. We did see heavy-tailed user behaviour. Part of the reason we
observed such noisy results is the inherent issues and heavy-tailed nature of
the network. In past research on congestion control, we observed results very
different from those reported in the literature. When you measure congestion
control on larger real-world testbeds over a long time, you tend to see
different and noisier results.

Jonathan Hoyland: Is this a question of code optimisation?

FY: Running research code on a production system is a hard problem. People tend
to fix potential bugs in research code, but we wanted to faithfully evaluate
against other algorithms.

JH: Did you compare server compute time for the algorithms?

FY: Yes, we took into account compute time and resources required. Fortunately
none of the algorithms we looked at takes too many compute resources; it
typically takes a few milliseconds to compute ABR decisions online. We did look
at this but didn't report it in the paper because it was not a bottleneck.

Renan Krishna: What information was included in the state updates to the MPC
controller?

FY: The past 8 chunks, their transmission times and sizes, the size of the
chunk to send next, plus low-level TCP statistics. Those are the inputs to TTP.
For the model-based controller the input is the current playback buffer level
and all the necessary chunk sizes. It needs to run dynamic programming, as a
value iteration algorithm, online.
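
Editor's illustration: the predictor inputs listed in this answer can be
pictured as one flat feature vector. The sketch below is hypothetical and is
not taken from the Puffer code base: the names `ChunkRecord` and
`build_ttp_input`, the ordering of the fields, the zero-padding of short
histories, and the treatment of the TCP statistics as an opaque list of
numbers are all assumptions. Only the history length of 8 and the kinds of
input come from the discussion.

```python
from dataclasses import dataclass
from typing import List, Sequence

HISTORY = 8  # "past 8 chunks", as stated in the discussion


@dataclass
class ChunkRecord:
    """One previously sent chunk (hypothetical record type)."""
    size_bytes: float
    transmission_time_s: float


def build_ttp_input(history: Sequence[ChunkRecord],
                    next_chunk_size_bytes: float,
                    tcp_stats: Sequence[float]) -> List[float]:
    """Flatten the listed inputs into a single feature vector.

    Layout (an assumption): 8 transmission times, 8 sizes, the size of
    the chunk to send next, then the low-level TCP statistics.
    """
    recent = list(history)[-HISTORY:]
    pad = HISTORY - len(recent)  # assumption: zero-pad early in a session
    times = [0.0] * pad + [c.transmission_time_s for c in recent]
    sizes = [0.0] * pad + [c.size_bytes for c in recent]
    return (times + sizes + [float(next_chunk_size_bytes)]
            + [float(x) for x in tcp_stats])
```

Calling the function once per candidate version of the next chunk would give
one predicted transmission time per candidate, which is the kind of input a
model-based controller needs alongside the playback buffer level.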

Roland Bless: Does CC kick in at all? Does it leave slow start often?

FY: The Pensieve paper in Sigcomm 2017 sets the TCP timeout such that it hardly
leaves the congestion avoidance phase, because when you send a video chunk
every 2 to 4 seconds you don't want congestion control to ramp up every time
starting from the slow-start phase. In our case I believe we also disabled slow
start.

Colin Perkins: Thanks Francis, really nice talk.

## 13:00 Trufflehunter: Cache Snooping Rare Domains at Large Public DNS Resolvers
      Audrey Randall, 2021 ANRP Recipient

Slides:
https://datatracker.ietf.org/meeting/110/materials/slides-110-irtfopen-trufflehunter-cache-snooping-rare-domains-at-large-public-dns-resolvers-00

Discussion
----------

Jonathan Hoyland: If two people are doing cache snooping at the same time then
do they both get borked results? (Assuming multiple DNS servers and queries
interleaved in some way)

Audrey Randall (AR): Not if they are using Recursion Desired = False and the
resolver respects the flag. I can elaborate further on audio if you like.

Andrew Campling: Is de-anonymizing data from ISP resolvers really an issue, at
least for those with millions of users?

AR: I would argue not. The real issue is open resolvers that were never
designed to be resolvers, because they're really just tiny home routers. You
could run our technique on large ISPs as well.

Jim Reid: How do you deal with anycasting, given that you don't always know
which resolver is answering?

AR: We use CHAOS TXT location queries that are provided as debugging tools by
the public resolvers.

Vladimír Čunát: +nsid is a better tool, I believe, but it could have wider
support.

AR: Thanks for the recommendation, I hadn't heard of that.

Jonathan Hoyland: Do you ever see two hits on a single ghost cache?

AR: Yes, quite often

Suzanne Woolf: NSID is discussed in RFC 5001. But it turns out that trying to
replace CHAOS TXT is a lot like trying to replace WHOIS.

Jim Reid: Could Google be doing predictive pre-fetching, i.e. a lookup of A is
almost always followed by a lookup of B?

AR: Do you mean as an explanation for the ghost caches? We considered that but
didn't think it was probable, since the frontend caches always seemed to store
the maximum TTLs.

We didn't study whether Google prefetches the sorts of domains we study, but we
considered it unlikely for these particular domains since they're so rare.

J Ignacio Alvarez-Hamelin: How do you detect stalkerware/spyware using DNS
queries?

AR: We downloaded and profiled the stalkerware apps to see which DNS queries
the apps and their dashboards make, and then used our cache snooping technique
to measure how often those domains were in cache.

Jim Reid: Where do you get the domain names for contract cheating services?

AR: They're easy to find if you google them - we're measuring hits on the main
landing page of these services.

Andrew Campling: Apologies if you've mentioned this already. Some (all?) of the
cloud-based resolvers offer some form of malicious content filtering, some on
their lead resolver, others on a less high profile version. Which were you
testing?

AR: Stalkerware is not usually filtered, as it was not considered a threat
until quite recently. The dual-use issue means you could end up blocking a
legitimate application used for a legal purpose. Secondly, if you block you
risk the stalker thinking that the person being stalked is trying to take
action against the stalker, and research suggests that is the point at which
surveillance can turn violent. So it's not necessarily a good thing to block
this stuff, so most threat feeds don't.

Jim Reid: Where did the list of typosquatted names come from?

AR: A couple of old research papers, which is why we weren't expecting to see
any of them, and it was interesting that a couple were getting up to 100
hits/day.

Jonathan Hoyland: Number of dual-use apps seems very low compared to the amount
of stalkerware. Is that what you actually measured?

AR: The particular dual-use apps that we measured seemed to be less common than
the most common types of overt stalkerware. We didn't measure the dual-use apps
as much - we were focussed on the overt stalkerware. Based on previous
literature in this space, most surveillance is not done by overt apps, but
through misconfigured sharing settings and so on.

Jim Reid: Has RD-bit setting stuff had any influence on your work or the
behaviour of the resolver caches?

AR: The kind of cache snooping that we're doing wouldn't be possible without
the RD-bit: without it we would be poisoning the caches ourselves. If the
RD-bit is off and you make the query and it isn't in cache, then it won't get
put into cache. Google was the exception - we found that if we hit a backend
resolver that did have the query cached and then hit a frontend cache that did
not, the record would be copied from the backend to the frontend. So our own
probes would fill the caches, and it was a challenge to remove the poisoning
that we had done.
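
Editor's illustration: the RD-bit mentioned here is a single flag in the DNS
message header (Recursion Desired, mask 0x0100 in the 16-bit flags field). The
sketch below builds a query in wire format with that bit cleared, using only
the Python standard library. It is a minimal sketch and not Trufflehunter's
code: the function names, the fixed transaction ID and the example domain are
hypothetical, and nothing is sent on the network.

```python
import struct

RD_MASK = 0x0100  # Recursion Desired bit in the DNS header flags


def build_query(name: str, qtype: int = 1, qclass: int = 1,
                recursion_desired: bool = False,
                txid: int = 0x1234) -> bytes:
    """Build a one-question DNS query; RD is cleared unless requested."""
    flags = RD_MASK if recursion_desired else 0
    # Header: ID, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0.
    header = struct.pack("!HHHHHH", txid, flags, 1, 0, 0, 0)
    # QNAME: length-prefixed labels terminated by a zero byte.
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(label)]) + label.encode("ascii")
                     for label in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, qclass)


def rd_is_set(packet: bytes) -> bool:
    """Report whether the RD bit is set in a DNS message."""
    (flags,) = struct.unpack("!H", packet[2:4])
    return bool(flags & RD_MASK)


probe = build_query("example.com")  # RD cleared: ask the cache only
print(rd_is_set(probe))
```

A resolver that honours the cleared bit answers only from its cache, which is
why such a probe should not itself insert the record being measured.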

Unbound will return a REFUSED response to a query with the RD-bit unset. This
is why we could only measure half of the caches at Quad9, because if they had a
resolver with Unbound software that received the original query then it became
invisible to Trufflehunter.

AR: Allow queries with the RD-bit unset if you want to enable these kinds of
measurement studies on resolvers where there are few to no de-anonymisation
risks for users, e.g. large public DNS services. If you are running DNS
resolving software that will get put on small home routers that could be
misconfigured as open resolvers then I would say no, you want to have defences
against cache snooping, which could include not allowing queries with the
RD-bit unset. So the question is how large the resolver is.

Colin Perkins: It may be worth chatting with some of the folks in the dnsop
working group.

AR: Yes, happy to do that.

Jonathan Hoyland: Would be great to see results from around the world as abuse
techniques can vary quite significantly by culture / geography.

AR: The measurement platform that we're using, which is how we get results from
around the US, doesn't have quite as many nodes around the world, but yes, it
would be great to expand to that. It would be reasonably straightforward to do
that - we stayed focussed on the US initially so as not to put too much load on
the platform, and then it just ended up that was the data we had when we
decided to publish. We are at the moment working on a tool similar to
Trufflehunter, but instead of measuring domain usage we're trying to measure
DNS hijacking, and we do want to expand that worldwide using similar
techniques.

Vladimír Čunát: Default settings aren't targeted for huge instances like those
studied.

Colin Perkins: We're seeing increasing use of different types of DNS transport
over TLS or HTTPS and new techniques like Oblivious DNS. Do any of these make a
difference to the type of work you're doing?

AR: DNSSEC and DNS over TLS or DNS over HTTPS don't impact our technique. The
queries still get cached the same way. I'm not familiar with Oblivious DNS.

Colin Perkins: This is a new development that I'm not super familiar with -
using encryption and proxy resolvers to anonymise queries.

AR: I suspect that as long as queries are still arriving at public resolvers
then it won't interfere with our technique. If it provides an extra layer of
anonymisation for people making queries, then awesome.

Colin Perkins: Presumably DNS over HTTPS may allow some of the stalkerware apps
to make DNS queries in a more controlled way and so avoid public resolvers.

AR: It might. Stalkerware apps tend to be incredibly unsophisticated - they
don't try to obfuscate code, lots of them crash on installation, and they seem
to have lots of bugs and problems in general. I would be surprised to see them
adopting more sophisticated techniques any time soon, although there are some
that are ahead of the game, like FlexiSPY, that would be more of a problem.

Jonathan Hoyland: This might mess with your geolocation

AR: Yes, but we realise anycast is not a great way to geolocate anyway, so
our geolocation is already suspect.

Colin Perkins: We will try to connect you with operators of public resolver
services and developers of Oblivious DNS.

AR: I'd love that.

## 14:00 Close

Colin Perkins: Thanks again to Audrey and Francis for joining at an early hour
for both - two great talks and some really good discussion.

Applied Networking Research Workshop call for papers will be announced soon.

Recordings of the talks, and links to the papers, will be made available
from https://irtf.org/anrp/