Minutes IETF110: irtfopen
minutes-110-irtfopen-00
Meeting Minutes | IRTF Open Meeting (irtfopen)
---|---
Date and time | 2021-03-08 12:00
Title | Minutes IETF110: irtfopen
State | Active
Last updated | 2021-03-09
IRTF Open Meeting
=================

Monday, 8 March 2021, 12:00-14:00 UTC
Room: Room 2
Chair: Colin Perkins
Minutes: Mat Ford

## 12:00 Introduction and Status Update

Colin Perkins, IRTF Chair
Slides: https://datatracker.ietf.org/meeting/110/materials/slides-110-irtfopen-agenda-00

## 12:15 Learning in situ: a randomized experiment in video streaming

Francis Y. Yan, 2021 ANRP Recipient
Slides: https://datatracker.ietf.org/meeting/110/materials/slides-110-irtfopen-learning-in-situ-a-randomized-experiment-in-video-streaming-00

Discussion
----------

Ali Begen: I am entirely sold on the idea of having all the sessions incorporated into a single algorithm's confidence intervals (it is obvious that people in different places/networks, watching different videos, get different performance from the same player/algorithm).

mglt: I am wondering whether there is any assumption that the traffic has the same characteristics over the two years, or whether that is not a necessary assumption.

Francis Yan (FY): We have not found any significant difference over time that requires retraining, but we retrain the model every day and monitor its performance in case there is a need to adapt to time-varying characteristics.

Spencer Dawkins: I'm curious what this will look like over QUIC.

FY: That'd be fun to try! We only tried BBR and Cubic, and saw BBR outperform Cubic reliably (with a large amount of data collected from the randomized experiment), but we only used BBR in the paper.

Audrey Randall: Did the "two years" number mean the years of video streamed or the time the experiment was conducted over?

FY: It was referring to experiment time, not the wall-clock time.

David Oran: Did you investigate whether having a single server (complex) in a single topological location on the Internet might bias the results in any way?

FY: That's a good question that we have not investigated.
We only have one server at Stanford, unfortunately, and admittedly adding a CDN or multiple servers may influence the findings.

Jonathan Hoyland: Did you see users refreshing the page more often, or having shorter sessions, on the worse-performing algorithms?

FY: We had a figure that plots the distribution of watch times; users chose to watch Fugu longer, but beyond that observation we didn't fully understand exactly what factors caused the difference.

Mirja Kühlewind: There are actually congestion controllers more optimized for video; see the rmcat working group (e.g. SCReAM or NADA).

FY: Yes, we were aware of SCReAM and other congestion controls proposed specifically for video streaming, but ABR doesn't actually require latency as low as videoconferencing does. Reliable transmission like BBR works just fine in our experience, but you're right that it's a potential direction for future work to continue the co-design of CC and ABR.

Mirja: Yes, that's what I was thinking of; more co-design of the two.

Spencer Dawkins: I heard you say that your users were in the US. Have you thought about people further away?

FY: Unfortunately we are currently only approved by Stanford lawyers to stream to U.S. users, to avoid potential legal issues.

Spencer: I thought it might be something like that. Perhaps we should talk!

David Oran: One further observation/question: can Fugu predict whether having a lower base quality in the manifest of encoded qualities would eliminate more stalls? I ask because there's anecdotal evidence that stalls occur under network conditions where no data would get through before the playout point expires. If not, having lower qualities would likely produce better QoE than allowing a stall.

FY: We already have a more fine-grained ?? compared with most industrial players. We have 10 versions of each video chunk, including 4 resolutions with different CRF encoding parameters, and we spread out the bitrates.
We have a dashboard to monitor whether the 10 bitrates are evenly spread out in terms of their sizes and their S/N values. The lowest base quality is already low enough for users.

Zaid AlBanna: On slide 29, does calculating the median versus the mean show similar spreads in the quality parameters?

FY: We didn't calculate the median values, but I would assume they would show similar results.

Mirja Kühlewind: Are the ABR algorithms you studied used by major VoD platforms in practice?

FY: All four other ABR algorithms are research algorithms. In terms of industry adoption, probably a variant of BBA is used by Netflix. It was proposed by a colleague of mine at Netflix. It's a heuristic-based algorithm based only on the buffer size. MPC and robust MPC were proposed at Sigcomm 2015, and Pensieve was proposed at Sigcomm 2017; they are both research algorithms, but I don't know if they're being used by industry.

MK: I understood that the algorithms used change very frequently, so more insight here could be helpful, although probably hard to get.

FY: Not as much of a black box as we'd originally thought. For instance, BBA is pretty simple: below a threshold playback buffer size (maybe 3 seconds) we ask the video server to send the lowest quality; above 12 seconds we send the highest quality; between 3 seconds and 12 seconds we use linear interpolation between the different bitrates, so that's pretty interpretable.

Ali Begen: To my knowledge, Netflix never confirmed or denied using BBA or any other algorithm.

FY: BBA was proposed by Te-Yuan Huang when she studied at Stanford, advised by Nick McKeown.

Ali: That was a decade ago, so probably revised since then.

FY: Yes, when we presented Puffer at Netflix they did not mention what they are using.

Simon Leinen: Has there been similar work on videoconferencing systems (Zoom, Jitsi, Meetecho :-)?
FY: Yes, the work corresponding to ABR in video conferencing is bandwidth estimation, or congestion control for realtime video; it also adapts to varying network conditions by changing the sending bitrate of the video encoder. Microsoft is organising a grand challenge on bandwidth estimation for real-time communications at MMSys this year (link: https://2021.acmmmsys.org/rtc_challenge.php). Ali Begen is collaborating with me on this.

Jonathan Hoyland: Did you have a chance to look at the P99 or the P99.9 performance of Fugu?

FY: The paper didn't include tail performance, but we did look at it. In the figures we have included confidence intervals; we are confident that the interval contains the mean value. P99 or P95, I believe we checked. I agree that tail performance is critical.

CP: Can you say something about the challenges of running a research experiment at this scale as a grad student?

FY: In my experience, having real users is like having real impact, so this is exciting. But availability is the biggest challenge. When we write some code it can work 99% of the time, but real users mean the system should always be available. As soon as Puffer stopped working I would receive user complaints by email; that's why we built a monitoring system. I receive alerts every time there's a service interruption, so I'm on call 24/7. I'm almost the only engineer working on developing this, so it's really hard to maintain. We're testing research schemes, and research code tends to be messy. Our code is relatively high quality, but running other research code on the production system causes outages.

CP: To what extent do the types of issues you are running into apply to other types of network measurement research? Would you find similar issues with confidence intervals if you repeated other kinds of measurement experiments?

FY: Yes, I would expect our findings to generalise to other network measurement experiments. We did see heavy-tailed user behaviour.
Part of the reason we observed the noisy part of the Internet is the inherent issues and heavy-tailed nature of the network. When we measured congestion control, we observed results very different from those reported in past research. When you measure congestion control on larger real-world test beds over a long time, you tend to see different, noisier results.

Jonathan Hoyland: Is this a question of code optimisation?

FY: Running research code on a production system is a hard problem. People tend to fix potential bugs in research code, but we wanted to faithfully evaluate against the other algorithms.

JH: Did you compare server compute time for the algorithms?

FY: Yes, we took into account compute time and resources required. Fortunately, none of the algorithms we looked at takes too many compute resources; it typically takes a few milliseconds to compute ABR decisions online, so it is not a bottleneck. We did look at this but didn't report it in the paper because it was not a bottleneck.

Renan Krishna: What information was included in the state updates to the MPC controller?

FY: The past 8 chunks, their transmission times and sizes, the size of the chunk to send next, plus low-level TCP statistics. Those are the inputs to TTP. For the model-based controller the input is the current playback buffer level and all the necessary chunk sizes. It needs to run dynamic programming, as a value iteration algorithm, online.

Roland Bless: Does CC kick in at all? Does it leave slow start often?

FY: The Pensieve paper in Sigcomm 2017 sets the TCP timeout such that it hardly leaves the congestion avoidance phase. When you send a video chunk every 2 to 4 seconds, you don't want congestion control to ramp up from the slow-start phase every time. In our case I believe we also disabled slow start.

Colin Perkins: Thanks Francis, really nice talk.
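The buffer-based (BBA-style) rule Francis describes above can be sketched as follows. This is an illustration only: the 3 s and 12 s thresholds come from his description, but the bitrate ladder and function name are made up for the example, not Puffer's actual configuration.

```python
def bba_select(buffer_s, bitrates_kbps, low_s=3.0, high_s=12.0):
    """Pick a bitrate from a ladder based only on playback buffer level:
    lowest quality below low_s, highest above high_s, and linear
    interpolation across the ladder in between."""
    ladder = sorted(bitrates_kbps)
    if buffer_s <= low_s:
        return ladder[0]
    if buffer_s >= high_s:
        return ladder[-1]
    # Map buffer occupancy in (low_s, high_s) linearly onto ladder indices.
    frac = (buffer_s - low_s) / (high_s - low_s)
    return ladder[round(frac * (len(ladder) - 1))]

# Illustrative 5-rung ladder (kbit/s): a nearly empty buffer picks the
# lowest rung, a full one the highest, and 7.5 s lands mid-ladder.
ladder = [300, 750, 1200, 2400, 4800]
print(bba_select(1.0, ladder), bba_select(7.5, ladder), bba_select(20.0, ladder))
```

As noted in the discussion, published BBA variants actually map buffer level to chunk sizes rather than nominal bitrates, but the interpolation idea is the same.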
## 13:00 Trufflehunter: Cache Snooping Rare Domains at Large Public DNS Resolvers

Audrey Randall, 2021 ANRP Recipient
Slides: https://datatracker.ietf.org/meeting/110/materials/slides-110-irtfopen-trufflehunter-cache-snooping-rare-domains-at-large-public-dns-resolvers-00

Discussion
----------

Jonathan Hoyland: If two people are doing cache snooping at the same time, do they both get borked results? (Assuming multiple DNS servers and queries interleaved in some way.)

Audrey Randall (AR): Not if they are using Recursion Desired = False and the resolver respects the flag. I can elaborate further on audio if you like.

Andrew Campling: Is de-anonymizing data from ISP resolvers really an issue, at least for those with millions of users?

AR: I would argue not. The real issue is open resolvers that were never designed to be resolvers, because they're really just tiny home routers. You could run our technique on large ISPs as well.

Jim Reid: How do you deal with anycasting? You don't always know which resolver is answering.

AR: We use the CHAOS TXT location queries that are provided as debugging tools by the public resolvers.

Vladimír Čunát: +nsid is a better tool, I believe, but it could have wider support.

AR: Thanks for the recommendation, I hadn't heard of that.

Jonathan Hoyland: Do you ever see two hits on a single ghost cache?

AR: Yes, quite often.

Suzanne Woolf: NSID is discussed in RFC 5001. But it turns out that trying to replace CHAOS TXT is a lot like trying to replace WHOIS.

Jim Reid: Could Google be doing predictive pre-fetching? i.e., a lookup of A is almost always followed by a lookup of B.

AR: Do you mean as an explanation for the ghost caches? We considered that but didn't think it was probable, since the frontend caches always seemed to be storing the maximum TTLs. We didn't study whether Google prefetches the sorts of domains we study, but we considered it unlikely for these particular domains since they're so rare.
J Ignacio Alvarez-Hamelin: How do you detect stalkerware/spyware using DNS queries?

AR: We downloaded and profiled the stalkerware apps to see which DNS queries the apps and their dashboards make, and then used our cache snooping technique to measure how often those domains were in cache.

Jim Reid: Where do you get the domain names for contract cheating services?

AR: They're easy to find if you Google them; we're measuring hits on the main landing page of these services.

Andrew Campling: Apologies if you've mentioned this already. Some (all?) of the cloud-based resolvers offer some form of malicious content filtering, some on their lead resolver, others on a less high-profile version. Which were you testing?

AR: Stalkerware is not usually filtered, as it was not considered a threat until quite recently. The dual-use issue means you could end up blocking a legitimate application used for a legal purpose. Secondly, if you block, you risk the stalker thinking that the person being stalked is trying to take action against them, and research suggests that is the point at which surveillance can turn violent. So it's not necessarily a good thing to block this stuff, and most threat feeds don't.

Jim Reid: Where did the list of typosquatted names come from?

AR: A couple of old research papers, which is why we weren't expecting to see any of them; it was interesting that a couple were getting up to 100 hits/day.

Jonathan Hoyland: The number of dual-use apps seems very low compared to the amount of stalkerware. Is that what you actually measured?

AR: The particular dual-use apps that we measured seemed to be less common than the most common types of overt stalkerware. We didn't measure the dual-use apps as much; we were focussed on the overt stalkerware.
Based on previous literature in this space, most surveillance is not done by overt apps, but through misconfigured sharing settings and so on.

Jim Reid: Has RD-bit setting had any influence on your work or the behaviour of the resolver caches?

AR: The kind of cache snooping that we're doing wouldn't be possible without the RD-bit: if we couldn't unset it, we would be poisoning the caches ourselves. If the RD-bit is off and you make a query that isn't in cache, it won't get put into cache. Google was the exception: we found that if we hit a backend resolver that did have the query cached, and then hit a frontend cache that did not, the record would be copied from the backend to the frontend. So our own probes would fill the caches, and it was a challenge to remove the poisoning that we had done. Unbound will return a REFUSED response to a query with the RD-bit unset. This is why we could only measure half of the caches at Quad9: if a resolver running Unbound received the original query, it became invisible to Trufflehunter.

AR: Allow queries with the RD-bit unset if you want to enable these kinds of measurement studies on resolvers where there are few to no de-anonymisation risks for users, e.g. large public DNS services. If you are running DNS resolver software that will be put on small home routers that could be misconfigured as open resolvers, then no: you want defences against cache snooping, which could include not allowing queries with the RD-bit unset. So the question is how large the resolver is.

Colin Perkins: Maybe worth chatting with some of the folks in the dnsop working group.

AR: Yes, happy to do that.

Jonathan Hoyland: Would be great to see results from around the world, as abuse techniques can vary quite significantly by culture/geography.

AR: The measurement platform we're using, which is how we get results from around the US, doesn't have quite as many nodes around the world, but yes, it would be great to expand to that.
It would be reasonably straightforward to do that; we stayed focussed on the US initially so as not to put too much load on the platform, and then it just ended up that was the data we had when we decided to publish. At the moment we are working on a tool similar to Trufflehunter, but instead of measuring domain usage we're trying to measure DNS hijacking, and we do want to expand that worldwide using similar techniques.

Vladimír Čunát: Default settings aren't targeted at huge instances like those studied.

Colin Perkins: We're seeing increasing use of different types of DNS transport, over TLS or HTTPS, and new techniques like Oblivious DNS. Do any of these make a difference to the type of work you're doing?

AR: DNSSEC and DNS over TLS or DNS over HTTPS don't impact our technique; the queries still get cached the same way. I'm not familiar with Oblivious DNS.

Colin Perkins: This is a new development that I'm not super familiar with: using encryption and proxy resolvers to anonymise queries.

AR: I suspect that as long as queries are still arriving at public resolvers, it won't interfere with our technique. If it provides an extra layer of anonymisation for people making queries, then awesome.

Colin Perkins: Presumably DNS over HTTPS may allow some of the stalkerware apps to make DNS queries in a more controlled way and so avoid public resolvers.

AR: It might. Stalkerware apps tend to be incredibly unsophisticated: they don't try to obfuscate code, lots of them crash on installation, and they seem to have lots of bugs and problems in general. I would be surprised to see them adopting more sophisticated techniques any time soon, though there are some that are ahead of the game, like FlexiSpy, that would be more of a problem.

Jonathan Hoyland: This might mess with your geolocation.

AR: Yes, but we realise anycast is not a great way to do geolocation anyway, so our geolocation is already suspect.
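For reference, the RD (Recursion Desired) flag discussed above is a single bit in the DNS header's flags field: a cache-snooping probe clears it so the resolver answers only from cache rather than fetching (and thereby caching) the record. A minimal sketch of building such a query packet, using only the standard library; the domain name is a placeholder and this is not Trufflehunter's actual code:

```python
import struct

def dns_query(name, qtype=1, recursion_desired=False):
    """Build a raw DNS query packet (RFC 1035 wire format).
    Leaving the RD flag (0x0100 in the flags field) clear asks the
    resolver to answer from cache only, the basis of cache snooping."""
    flags = 0x0100 if recursion_desired else 0x0000
    # Header: ID, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0.
    header = struct.pack(">HHHHHH", 0x1234, flags, 1, 0, 0, 0)
    # QNAME: length-prefixed labels terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack(">HH", qtype, 1)  # QTYPE=A, QCLASS=IN
    return header + question

probe = dns_query("example.com")
# The RD bit in the second 16-bit header word is clear for a snooping probe.
assert struct.unpack(">H", probe[2:4])[0] & 0x0100 == 0
```

The same effect is what `dig +norecurse` produces; the point AR makes is that a resolver must honour (or, like Unbound, refuse) such RD-unset queries.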
Colin Perkins: We will try to connect you with operators of public resolver services and developers of Oblivious DNS.

AR: I'd love that.

## 14:00 Close

Colin Perkins: Thanks again to Audrey and Francis for joining at an early hour for both; two great talks and some really good discussion. The Applied Networking Research Workshop call for papers will be announced soon. Recordings of the talks, and links to the papers, will be made available from https://irtf.org/anrp/