Pathdiag: Automatic TCP Diagnosis Matt Mathis John Heffner Ragu Reddy 8/01/08 mathis/papers/...


Pathdiag: Automatic TCP Diagnosis

Matt Mathis

John Heffner

Ragu Reddy

8/01/08

http://www.psc.edu/~mathis/papers/

PathDiag20080108.ppt

1/8


Outline

• Why is the end-to-end problem so difficult?

• The pathdiag solution

• How it works

• Features

• Other issues


Why is the end-to-end problem so difficult?

• By design TCP/IP hides the ‘net from upper layers

– TCP/IP provides basic reliable data delivery

– The “hour glass” between applications and networks

• This is a good thing, because it allows:

– Invisible recovery from data loss, etc.

– Old applications to use new networks

– New applications to use old networks

• But then (nearly) all problems have the same symptom

– Less than expected performance

– The details are hidden from nearly everyone


TCP tuning is painful debugging

• All problems reduce performance

– But the specific symptoms are hidden

• Any one problem can prevent good performance

– Completely masking all other problems

• Trying to fix the weakest link of an invisible chain

– General tendency is to guess and “fix” random parts

– Repairs are sometimes “random walks”

– Repair one problem at a time, at best

• The solution is to instrument TCP


The Web100 project

• Use TCP's ideal diagnostic vantage point

– What is limiting the data rate?

– RFC 4898 TCP-ESTATS-MIB

• Standards track

• Prototypes for Linux (www.Web100.org) and Windows Vista

– Also TCP Autotuning

• Automatically adjusts TCP buffers

• Linux 2.6.17 default maximum window size is 4 MBytes

• Announced for Vista - details unknown

• But this has led to a new insight:

– Nearly all symptoms scale with round trip time


Nearly all symptoms scale with RTT

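• Examples

– TCP buffer space: $\mathrm{Rate} = \mathrm{Window} / \mathrm{RTT}$

– Packet loss: $\mathrm{Rate} = \dfrac{\mathrm{MSS}}{\mathrm{RTT}} \cdot \dfrac{1}{\sqrt{\mathrm{Loss}}}$

• Think: the extra time needed to overcome a flaw is proportional to the RTT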
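To make the scaling concrete, here is a minimal Python sketch that evaluates both relations at several round trip times. It assumes the packet-loss relation is the inverse-square-root model, Rate = (MSS / RTT) / sqrt(Loss); the 64 KiB window, 1460-byte MSS, and 0.01% loss rate are illustrative values, not figures from the slides.

```python
import math

def window_limited_rate(window_bytes: float, rtt_s: float) -> float:
    """Rate = Window / RTT, in bytes per second."""
    return window_bytes / rtt_s

def loss_limited_rate(mss_bytes: float, rtt_s: float, loss: float) -> float:
    """Rate = (MSS / RTT) * (1 / sqrt(Loss)), in bytes per second."""
    return (mss_bytes / rtt_s) / math.sqrt(loss)

# Illustrative values only (assumptions, not taken from the slides).
WINDOW_BYTES = 64 * 1024   # a 64 KiB socket buffer
MSS_BYTES = 1460           # Ethernet-sized segment
LOSS = 1e-4                # 0.01% packet loss

for rtt_ms in (1, 10, 100):
    rtt_s = rtt_ms / 1000
    by_window = window_limited_rate(WINDOW_BYTES, rtt_s) * 8 / 1e6
    by_loss = loss_limited_rate(MSS_BYTES, rtt_s, LOSS) * 8 / 1e6
    print(f"RTT {rtt_ms:>3} ms: window-limited {by_window:8.1f} Mb/s, "
          f"loss-limited {by_loss:8.1f} Mb/s")
```

The same buffer size and the same loss rate give a rate 100 times lower at 100 ms than at 1 ms, which is why a flaw can be invisible on a local path and crippling on a long one.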


Symptom scaling breaks diagnostics

• Local Client to Server

– Flaw has insignificant symptoms

– All applications work, including all standard diagnostics

– False pass on all diagnostic tests

• Remote Client to Server: all applications fail

– Leading to faulty implication of other components

• Implies that the flaw is in the wide area network


The confounded problems

• For nearly all network flaws

– The only symptom is reduced performance

– But the reduction is scaled by RTT

• Therefore, flaws are undetectable on short paths

– False pass for even the best conventional diagnostics

– Leads to faulty inductive reasoning about flaw locations

– Diagnosis often relies on tomography and complicated inference techniques

• This is the real end-to-end problem


The pathdiag solution

• Test a short section of the path

– Most often first or last mile

• Use Web100 to collect detailed TCP statistics

– Loss, delay, queuing properties, etc

• Use models to extrapolate results to the full path

– Assume that the rest of the path is ideal

– You have to specify the end-to-end performance goal

• Data rate and RTT

• Pass/Fail on the basis of the extrapolated performance
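The extrapolation step can be sketched roughly as follows. This is not pathdiag's actual code: the function names are hypothetical, the 1460-byte MSS and the measured values are made up, and the loss model is assumed to be Rate = (MSS / RTT) / sqrt(Loss). The target of 10 MBytes/s over 20 ms is the baseline goal quoted later in the deck.

```python
MSS_BYTES = 1460  # assumed segment size; not given in the slides

def required_window(target_rate_Bps: float, target_rtt_s: float) -> float:
    """Window needed to sustain the target: Window = Rate * RTT."""
    return target_rate_Bps * target_rtt_s

def loss_budget(target_rate_Bps: float, target_rtt_s: float,
                mss: int = MSS_BYTES) -> float:
    """Largest loss rate compatible with the target, obtained by
    solving Rate = (MSS / RTT) / sqrt(Loss) for Loss."""
    return (mss / (target_rtt_s * target_rate_Bps)) ** 2

def extrapolate(measured_loss: float, measured_max_window: float,
                target_rate_Bps: float, target_rtt_s: float):
    """Hypothetical pass/fail check: measurements come from the short
    tested section, and the rest of the path is assumed to be ideal."""
    checks = {
        "window": measured_max_window >= required_window(target_rate_Bps, target_rtt_s),
        "loss": measured_loss <= loss_budget(target_rate_Bps, target_rtt_s),
    }
    return all(checks.values()), checks

# Made-up measurements from a short test section.
passed, checks = extrapolate(
    measured_loss=1e-4,
    measured_max_window=4 * 1024 * 1024,
    target_rate_Bps=10e6,   # 10 MBytes/s
    target_rtt_s=0.020,     # 20 ms
)
print("PASS" if passed else "FAIL", checks)
```

With these numbers the loss budget is about 5.3e-5, so a measured loss rate of 1e-4 fails the extrapolated test even though the same loss rate would allow far more than 10 MBytes/s on a 1 ms local path.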


Deploy as a Diagnostic Server

• Use pathdiag in a Diagnostic Server (DS)

• Specify End to End target performance

– From server (S) to client (C) (RTT and data rate)

• Measure the performance from DS to C

– Use Web100 in the DS to collect detailed statistics

• On both the path and client

– Extrapolate performance assuming ideal backbone

• Pass/Fail on the basis of extrapolated performance


Pathdiag output


Pathdiag output


Pathdiag

• One click automatic performance diagnosis

– Designed for (non-expert) end users

• Future version will better support both expert and non-expert

– Accurate end-systems and last mile diagnosis

• Eliminate most false pass results

• Accurate distinction between host and path flaws

• Accurate and specific identification of most flaws

– Basic networking tutorial info

• Help the end user understand the problem

• Help train 1st tier support (sysadmin or netadmin)

• Backup documentation for support escalation

• Empower the user to get it fixed

– The same reports for users and admins


Under the covers

• Same base algorithm as “Windowed Ping” [Mathis, INET’94]

– Aka “mping”

– See http://www.psc.edu/~mathis/wping/

– Killer diagnostic in use at PSC in the early 90s

– Stopped being useful with the advent of “fast path” routers

• Use a simple fixed window protocol

– Scan window size in 1 second steps

• Pathdiag clamps cwnd to control the TCP window

• Varies step size – fine steps near interesting features

– Measure data rate, loss rate, RTT, etc as window changes

– Reports reflect key features of the measured data
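The window scan can be illustrated with a toy simulation. Everything here is an assumption made for illustration: `toy_path` is a hypothetical single-bottleneck model (100 Mb/s, 10 ms base RTT, 100 kB of queue) standing in for real Web100 measurements, and the knee detection is a simple heuristic, not pathdiag's algorithm.

```python
def toy_path(window_bytes, bottleneck_Bps=12.5e6, base_rtt_s=0.010,
             queue_bytes=100_000):
    """Hypothetical path model: return (rate, rtt, loss) for a fixed window."""
    bdp = bottleneck_Bps * base_rtt_s            # bandwidth-delay product
    excess = max(0.0, window_bytes - bdp)        # data beyond what the pipe holds
    queued = min(excess, queue_bytes)            # part that fits in the queue
    dropped = excess - queued                    # part that overflows the queue
    rtt = base_rtt_s + queued / bottleneck_Bps   # queueing inflates the RTT
    rate = (window_bytes - dropped) / rtt
    loss = dropped / window_bytes
    return rate, rtt, loss

def scan(windows):
    """Measure the toy path at each window size."""
    return [(w, *toy_path(w)) for w in windows]

def refine_near_knee(coarse, fine_step=5_000):
    """Rescan with fine steps across the first coarse step where the
    data rate stops growing (the 'interesting feature')."""
    for prev, cur in zip(coarse, coarse[1:]):
        if cur[1] <= prev[1] * 1.01:
            return scan(range(prev[0], cur[0] + 1, fine_step))
    return []

coarse = scan(range(25_000, 300_001, 25_000))
for w, rate, rtt, loss in coarse + refine_near_knee(coarse):
    print(f"window {w:>7} B  rate {rate * 8 / 1e6:6.1f} Mb/s  "
          f"RTT {rtt * 1e3:5.1f} ms  loss {loss:.3f}")
```

In this model the rate rises linearly with the window until the pipe is full, then the RTT grows as the queue fills, and finally loss appears once the queue overflows. Those are the three curves the following slides plot against window size.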


Window Size vs Data Rate


Window Size vs Loss Rate


Window Size vs RTT


Window Size vs Power

Power=Rate/RTT
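A small sketch shows why power is a useful curve to plot. It assumes a hypothetical lossless single-bottleneck path (12.5 MBytes/s, 10 ms base RTT); the model and its parameters are illustrative, not taken from the slides.

```python
def rate_and_rtt(window_bytes, bottleneck_Bps=12.5e6, base_rtt_s=0.010):
    """Toy model: once the pipe is full, extra window only adds queueing delay."""
    rtt = max(base_rtt_s, window_bytes / bottleneck_Bps)
    return window_bytes / rtt, rtt

def power(window_bytes):
    """Power = Rate / RTT."""
    rate, rtt = rate_and_rtt(window_bytes)
    return rate / rtt

def best_window(windows):
    """Window size with the highest power."""
    return max(windows, key=power)

windows = range(25_000, 300_001, 25_000)
for w in windows:
    print(f"window {w:>7} B  power {power(w):.3e}")
print("peak power at window =", best_window(windows), "bytes")
```

In this model power peaks at 125,000 bytes, which is the bandwidth-delay product (12.5 MBytes/s × 10 ms): below it the rate is still climbing, above it only the RTT grows.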


Key NPAD/pathdiag features

• Results are intended for end-users

– Provides a list of specific items to be corrected

• Failed tests are show stoppers for fast apps

– Includes explanations and tutorial information

– Clear differentiation between client and path problems

– Accurate escalation to network or system admins

– The reports are public and can be viewed by either

• Coverage for a majority of OS and last-mile network flaws

– Coverage is one way – need to reverse client and server

– Does not test the application – need application tools

– Does not check routing – need traceroute

– Eliminates nearly all(?) false pass results


More features

• Tests become more sensitive as the path gets shorter

– Conventional diagnostics become less sensitive

– Depending on models, perhaps too sensitive

• New problem is false fail (e.g. queue space tests)

• Flaws no longer completely mask other flaws

– A single test often detects several flaws

• E.g. can find both OS and network flaws in the same test

– They can be repaired concurrently

• Archived DS results include raw web100 data

– Can reprocess with updated reporting SW

• New reports from old data

– Critical feedback for the NPAD project

• We really want to collect “interesting” failures


Impact

• Automatically diagnose first level problems

– Easily expose all path bottlenecks that limit performance to less than 10 MByte/s

– Easily expose all end-system/OS problems that limit performance to less than 10 MByte/s

• (Will become moot as autotuning is deployed)

• Empower the users to apply the proper motivation

• Still need to recalibrate user expectations

– Less than 1 gigabyte / 2 minutes is too slow

– Many paths should support 5 gigabytes/minute

• Less than 1 Gb/s


Recalibrate user expectations

• Long history of very poor network performance

– Users do not know what to expect

– Users have become completely numb

– Users have no clue about how poorly they are doing

• Goal: New baseline expectations for R&E users:

– 10 MBytes/s (80 Mb/s) over a 20 ms path.

• Everyone should be able to reach these rates by default

• People who can’t should know why or be angry
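The rates quoted on this slide and the previous one are easy to cross-check. The sketch below assumes decimal units (1 gigabyte = 1000 MBytes, 1 MByte = 10^6 bytes); the helper names are made up for illustration.

```python
def mbytes_per_s(gigabytes: float, seconds: float) -> float:
    """Transfer rate in MBytes/s, assuming 1 gigabyte = 1000 MBytes."""
    return gigabytes * 1000 / seconds

def mbits_per_s(rate_mbytes_per_s: float) -> float:
    """Convert MBytes/s to Mb/s."""
    return rate_mbytes_per_s * 8

def window_bytes(rate_mbytes_per_s: float, rtt_ms: float) -> float:
    """TCP window needed to sustain a rate: Window = Rate * RTT."""
    return rate_mbytes_per_s * 1e6 * rtt_ms / 1000

slow = mbytes_per_s(1, 120)   # "1 gigabyte / 2 minutes"
fast = mbytes_per_s(5, 60)    # "5 gigabytes/minute"
print(f"1 GB in 2 min = {slow:.2f} MBytes/s = {mbits_per_s(slow):.1f} Mb/s")
print(f"5 GB in 1 min = {fast:.2f} MBytes/s = {mbits_per_s(fast):.1f} Mb/s")
print(f"10 MBytes/s over 20 ms needs a window of {window_bytes(10, 20):,.0f} bytes")
```

The "too slow" threshold works out to roughly 8.3 MBytes/s (about 67 Mb/s), just under the 10 MBytes/s baseline goal, and 5 gigabytes per minute is about 667 Mb/s, consistent with "less than 1 Gb/s". The baseline goal needs a window of about 200,000 bytes, well within the 4 MByte autotuning limit mentioned earlier.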


What about impact of the test traffic?

• Pathdiag server is single threaded

– Only one test at a time

• Same load as any well tuned TCP application

– Protected by TCP “fairness”

• Large flows are generally “softer” than small flows

• Large flows are easily disturbed by small flows

• Note that any short RTT flow is stiffer than a long RTT flow


NPAD/pathdiag deployment

• Why should a campus networking organization care?

– “Zero effort” solution to mis-tuned end-systems

– Accurate reports of real problems

• You have the same view as the user

• Saves time when there really is a problem

• You can document reality for management

• Suggestion:

– Require pathdiag reports for all performance problems


Download and install

• User documentation:

http://www.psc.edu/networking/projects/pathdiag/

• Follow the link to “Installing a Server”

– Easily customized with a site specific skin

– Designed to be easily upgraded with new releases

• Roughly every 2 months

• Improving reports through ongoing field experience

– Drops into existing NDT servers

• Plans for future integration

• Enjoy!


Backup slides


The Wizard Gap


The Wizard Gap Updated

• Experts have topped out end systems & links

– 10 Gb/s NIC bottleneck

– 40 Gb/s “link” bandwidth (striped)

• Median I2 bulk rate is 3 Mbit/s

– See http://netflow.internet2.edu/weekly/

• Current Gap is about 3000:1

• Closing the first factor of 30 should now be “easy”


Pathdiag

• Initial version aimed at “NSF domain scientists”

– People with non-networking analytical background

• Report designed to

– accurately identify subsystem

– provide tutorial

– provide good escalation to network or host admin

– support the user as the ultimate judge of success

• Future plan to split reports

– Even easier for non-experts

– Better information for experts