Using Load-Balancing To Build High-Performance Routers
Isaac Keslassy
Ph.D. Oral Examination
Department of Electrical Engineering
Stanford University
2
Typical Router Architecture
[Figure: inputs and outputs at line rate R, connected by a switch fabric under the control of a scheduler.]
3
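Definitions: Traffic Matrix
Traffic matrix: $\Lambda = [\lambda_{ij}]$, where $\lambda_{ij}$ is the rate of traffic from input $i$ to output $j$
Uniform traffic matrix: $\lambda_{ij} = \lambda$
[Figure: inputs $i = 1, \dots, N$ and outputs $j = 1, \dots, N$, each at line rate R.]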
4
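Definitions: 100% Throughput
100% throughput: for any traffic matrix whose row and column sums are less than R, $\lambda_{ij} < \mu_{ij}$, where $\mu_{ij}$ is the service rate from input $i$ to output $j$
[Figure: inputs $i = 1, \dots, N$ and outputs $j = 1, \dots, N$, each at line rate R.]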
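The admissibility condition in this definition (every row sum and column sum below R) can be checked mechanically. The following is a minimal Python sketch; the function name `is_admissible` and the two sample matrices are illustrative assumptions, not taken from the slides.

```python
def is_admissible(traffic, line_rate):
    """Return True if every row sum and every column sum is below line_rate."""
    n = len(traffic)
    row_sums = [sum(row) for row in traffic]
    col_sums = [sum(traffic[i][j] for i in range(n)) for j in range(n)]
    return all(s < line_rate for s in row_sums + col_sums)


R = 1.0  # line rate, normalized (assumption for the example)

# Uniform matrix: every input sends 0.3 to every output.
uniform = [[0.3] * 3 for _ in range(3)]

# Hotspot matrix: two inputs each send 0.6 to output 0, so column 0 sums to 1.2.
hotspot = [[0.6, 0.0, 0.0],
           [0.6, 0.0, 0.0],
           [0.0, 0.0, 0.0]]

print(is_admissible(uniform, R))
print(is_admissible(hotspot, R))
```

The hotspot matrix is rejected because no switch can deliver more than R to a single output, so the 100% throughput guarantee is only asked for matrices that pass this check.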
5
Router Wish List
Scale to High Linecard Speeds
- No Centralized Scheduler
- Optical Switch Fabric
- Low Packet-Processing Complexity
Scale to High Number of Linecards
- High Number of Linecards
- Arbitrary Arrangement of Linecards
Provide Performance Guarantees
- 100% Throughput Guarantee
- Delay Guarantee
- No Packet Reordering
6
Stanford 100Tb/s Router
“Optics in Routers” project http://yuba.stanford.edu/or/
Some challenging numbers:
- 100Tb/s
- 160Gb/s linecards
- 640 linecards
7
100% Throughput in a Mesh Fabric
Router capacity = $NR$
Switch capacity = $N^2R$
[Figure: inputs and outputs at line rate R joined by a full mesh; the mesh links are marked with question marks and then with rate R.]
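To get a feel for the gap between router capacity NR and mesh capacity N²R, the sketch below plugs in the figures quoted for the Stanford 100Tb/s router (N = 640 linecards, R = 160 Gb/s). The helper names are illustrative assumptions.

```python
def router_capacity_gbps(n, r_gbps):
    """Aggregate line rate of the router: N * R."""
    return n * r_gbps


def full_rate_mesh_capacity_gbps(n, r_gbps):
    """Total capacity of a mesh in which each of the N*N links runs at rate R."""
    return n * n * r_gbps


N = 640   # linecards
R = 160   # Gb/s per linecard

router = router_capacity_gbps(N, R)          # 102,400 Gb/s = 102.4 Tb/s
mesh = full_rate_mesh_capacity_gbps(N, R)    # 65,536,000 Gb/s, about 65.5 Pb/s

print(f"router capacity: {router / 1e3:.1f} Tb/s")
print(f"mesh capacity:   {mesh / 1e6:.3f} Pb/s")
```

The mesh would need N times the capacity of the router itself, which is why a mesh with full-rate links is not attractive at this scale.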
8
If Traffic Is Uniform
[Figure: the mesh fabric with every link at rate R/N; each input at rate R sends R/N to each output.]
9
Real Traffic is Not Uniform
[Figure: the mesh fabric with links at rate R/N and inputs sending at rate R; a question mark indicates that non-uniform traffic may not fit.]
10
Load-Balanced Switch
[Figure: two meshes in series, each with links at rate R/N. The first mesh is the load-balancing stage and the second is the forwarding stage.]
100% throughput for weakly mixing traffic (Valiant, C.-S. Chang)
11
Load-Balanced Switch
[Figure: animation step on the two-mesh switch with links at rate R/N; packets labeled 1, 2 and 3 are shown.]
12
Load-Balanced Switch
[Figure: animation step on the two-mesh switch with links at rate R/N; the packets labeled 1, 2 and 3 are shown at separate ports.]
13
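Intuition: 100% Throughput
[Figure: the two-mesh load-balanced switch with all links at rate R/N.]
Arrivals to second mesh: $b = U a$, where $U = \frac{1}{N} \begin{pmatrix} 1 & \cdots & 1 \\ \vdots & \ddots & \vdots \\ 1 & \cdots & 1 \end{pmatrix}$ and $a$ is the traffic matrix
Capacity of second mesh: $C = R\,U$
Second mesh: arrival rate < service rate: $C - b = R\,U - U a > 0$ (entrywise, because each column sum of $a$ is less than R)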
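The intuition on this slide can be tried numerically: the first mesh spreads every input evenly over the N middle ports, so each link of the second mesh sees 1/N of a column sum of the traffic matrix, which stays below its capacity R/N. The sketch below is a minimal illustration in plain Python; the function names and the sample matrix are assumptions made for the example.

```python
def spread_uniformly(a):
    """Arrivals to the second mesh: entry (k, j) is (1/N) * sum_i a[i][j]."""
    n = len(a)
    col_sums = [sum(a[i][j] for i in range(n)) for j in range(n)]
    return [[col_sums[j] / n for j in range(n)] for _ in range(n)]


def second_mesh_slack(a, line_rate):
    """Capacity minus arrivals on every link of the second mesh."""
    n = len(a)
    b = spread_uniformly(a)
    return [[line_rate / n - b[k][j] for j in range(n)] for k in range(n)]


R = 1.0  # normalized line rate (assumption)

# Strongly non-uniform but admissible traffic: input i sends 0.9 to output i only.
a = [[0.9, 0.0, 0.0],
     [0.0, 0.9, 0.0],
     [0.0, 0.0, 0.9]]

slack = second_mesh_slack(a, R)
print(all(x > 0 for row in slack for x in row))
```

A single mesh with R/N links could not carry this matrix directly, since each diagonal entry 0.9 exceeds R/N; after spreading, every link of the second mesh has positive slack.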
14
15
Packet Reordering
[Figure: packets 1 and 2 travel through the two-mesh switch, whose links all run at rate R/N.]
16
Bounding Delay Difference Between Middle Ports
[Figure: packets 1 and 2 queued at different middle ports of the two-mesh switch; the difference is measured in cells.]
17
UFS (Uniform Frame Spreading)
[Figure: packets 1, 2 and 3 in the two-mesh switch under UFS.]
18
FOFF (Full Ordered Frames First)
[Figure: packets 1 and 2 in the two-mesh switch under FOFF.]
19
FOFF (Full Ordered Frames First)
Input Algorithm
- N FIFO queues corresponding to the N output flows
- Spread each flow uniformly: if the last packet was sent to middle port k, send the next to k+1.
- Every N time-slots, pick a flow:
  - If a full frame exists, pick it and spread it like UFS
  - Else if all frames are partial, pick one in round-robin order and send it
[Figure: input queues numbered up to N; sample contents 1 2 3 and 1 2.]
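The input algorithm above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the author's implementation: the class name `FoffInput`, the method names, and the tie-breaking rule among several full frames (lowest flow index first) are all hypothetical.

```python
from collections import deque


class FoffInput:
    """Sketch of one FOFF input with N FIFO queues, one per output flow."""

    def __init__(self, n):
        self.n = n
        self.queues = [deque() for _ in range(n)]
        self.next_port = [0] * n  # next middle port to use for each flow
        self.rr = 0               # round-robin pointer over partial frames

    def enqueue(self, output, packet):
        self.queues[output].append(packet)

    def pick_frame(self):
        """Run once every N time-slots; return a list of (middle_port, packet)."""
        n = self.n
        # Full frames first (assumption: lowest-numbered flow wins ties).
        flow = next((f for f in range(n) if len(self.queues[f]) >= n), None)
        if flow is None:
            # All frames are partial: serve non-empty flows in round-robin order.
            for step in range(n):
                candidate = (self.rr + step) % n
                if self.queues[candidate]:
                    flow = candidate
                    self.rr = (candidate + 1) % n
                    break
        if flow is None:
            return []
        sent = []
        for _ in range(min(n, len(self.queues[flow]))):
            port = self.next_port[flow]
            sent.append((port, self.queues[flow].popleft()))
            # If the last packet went to middle port k, the next goes to k+1.
            self.next_port[flow] = (port + 1) % n
        return sent


inp = FoffInput(3)
for p in ("a", "b"):
    inp.enqueue(0, p)           # partial frame for output 0
for p in ("x", "y", "z"):
    inp.enqueue(1, p)           # full frame for output 1
first = inp.pick_frame()        # the full frame is served first
second = inp.pick_frame()       # then the partial frame
```

Because each flow remembers where it stopped, a partial frame followed by later packets of the same flow still covers the middle ports evenly.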
20
Bounding Reordering
[Figure: packets 1, 2 and 3 in the two-mesh switch; two labels N.]
21
FOFF
Output properties
- N FIFO queues corresponding to the N middle ports
- Buffer size less than $N^2$ packets
- If there are $N^2$ packets, one of the head-of-line packets is in order
[Figure: queues numbered up to N at an output, holding packets 1 1 1, 2 2 and 3 3 3.]
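One way to picture the output side is a resequencer that only ever inspects head-of-line packets. The sketch below assumes that each packet carries a per-flow sequence number; the slides do not say how the output recognizes the next in-order packet, so that mechanism, the class name `FoffOutput` and its methods are hypothetical.

```python
from collections import deque


class FoffOutput:
    """Sketch of one output with N FIFO queues, one per middle port."""

    def __init__(self, n):
        self.queues = [deque() for _ in range(n)]
        self.expected = {}  # next sequence number per flow (assumed mechanism)

    def receive(self, middle_port, flow, seq, payload):
        self.queues[middle_port].append((flow, seq, payload))

    def release(self):
        """Release head-of-line packets that are next in order for their flow."""
        out = []
        progress = True
        while progress:
            progress = False
            for q in self.queues:
                if q:
                    flow, seq, payload = q[0]
                    if seq == self.expected.get(flow, 0):
                        q.popleft()
                        self.expected[flow] = seq + 1
                        out.append(payload)
                        progress = True
        return out


out = FoffOutput(3)
out.receive(1, flow=7, seq=1, payload="p1")   # arrives early via middle port 1
early = out.release()                         # nothing can leave yet
out.receive(2, flow=7, seq=2, payload="p2")
out.receive(0, flow=7, seq=0, payload="p0")   # the missing packet arrives
ordered = out.release()
```

Only the heads of the N queues are examined, which matches the property that one of the head-of-line packets is in order once the buffer is full.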
22
FOFF Properties
Property 1: FOFF maintains packet order.
Property 2: FOFF has O(1) complexity.
Property 3: Congestion buffers operate independently.
Property 4: FOFF maintains an average packet delay within a constant of that of an ideal output-queued router.
Corollary: FOFF has 100% throughput for any adversarial traffic.
23
Output-Queued Router?
[Figure: the mesh fabric with inputs and outputs at line rate R; the mesh links are marked with question marks and with rate R.]
24
25
From Two Meshes to One Mesh
[Figure: the two-mesh switch with links at rate R/N; one In port and one Out port are marked as belonging to one linecard.]
26
From Two Meshes to One Mesh
[Figure: each linecard (In/Out) connects at rate R to the first mesh and at rate R to the second mesh.]
27
From Two Meshes to One Mesh
[Figure: each linecard (In/Out) connects at rate 2R to a single combined mesh.]
28
Many Fabric Options
Options
- Space: Full uniform mesh
- Time: Round-robin crossbar
- Wavelength: Static WDM
[Figure: each linecard (In/Out) connects to any spreading device over N channels C1, C2, …, CN, each at rate 2R/N.]
29
AWGR (Arrayed Waveguide Grating Router)
A Passive Optical Component
- Wavelength i on input port j goes to output port (i+j-1) mod N
- Can shuffle information from different inputs
[Figure: Linecards 1 to N send wavelengths 1, 2 … N into an NxN AWGR, whose outputs 1, 2 … N go to Linecards 1 to N.]
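The routing rule "(i+j-1) mod N" means that each input reaches every output through exactly one wavelength. The Python sketch below tabulates the rule for a small N. It assumes that ports and wavelengths are numbered 1 to N and that a residue of 0 denotes port N; the slide does not spell out that convention, and the function names are illustrative.

```python
def awgr_output_port(wavelength, input_port, n):
    """Output port for a given wavelength and input port, numbered 1..n."""
    port = (wavelength + input_port - 1) % n
    return port if port != 0 else n  # assumption: residue 0 stands for port n


def routing_table(n):
    """Map (wavelength, input_port) to output port for all combinations."""
    return {(i, j): awgr_output_port(i, j, n)
            for i in range(1, n + 1)
            for j in range(1, n + 1)}


N = 4
table = routing_table(N)
for j in range(1, N + 1):
    outputs = [table[(i, j)] for i in range(1, N + 1)]
    print(f"input {j}: wavelengths 1..{N} reach outputs {outputs}")
```

Each printed row is a permutation of the output ports, so a linecard that transmits on all N wavelengths spreads its traffic over all N linecards without any active switching.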
30
Static WDM Switching: Packaging
[Figure: four linecards (In/Out) each send wavelengths A, B, C, D into an AWGR, which is passive and uses almost zero power; each linecard receives one wavelength from all four senders (A, A, A, A; B, B, B, B; C, C, C, C; D, D, D, D).]
N WDM channels, each at rate 2R/N
31
32
Scaling Problem
- For N < 64, an AWGR is a good solution.
- We want N = 640.
- Need to decompose.
33
A Different Representation of the Mesh
[Figure: linecards (In/Out) attached to a mesh, drawn in two equivalent ways; link labels R and 2R.]
34
A Different Representation of the Mesh
[Figure: the mesh redrawn between In and Out linecards; link labels R and 2R/N.]
35
Example: N=8
[Figure: linecards 1 to 8 on each side, fully meshed with links at rate 2R/8.]
36
When N is Too Large
Decompose into groups (or racks)
[Figure: linecards 1 to 8 on each side, arranged in groups; link labels 2R, 4R and 4R/4.]
37
When N is Too Large
Decompose into groups (or racks)
[Figure: Group/Rack 1 to Group/Rack G on each side, each with linecards 1 to L at 2R per linecard; each group carries 2RL in total, and the group-to-group links run at 2RL/G.]
38
39
When Linecards Fail
[Figure: Group/Rack 1 to Group/Rack G on each side, each with linecards 1 to L at 2R per linecard; link labels 2RL and 2RL/G.]
Solution: replace mesh with sum of permutations
[Figure: the mesh with 2RL/G links drawn as a sum of permutations (= + +); remaining labels: ≤, 2RL, 2RL/G, G *.]
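A uniform mesh between G groups can be written as a sum of G permutations, each of which connects every transmitting group to exactly one receiving group. The sketch below checks this with cyclic shifts; the choice of cyclic shifts and the function names are assumptions for illustration, since the slide does not say which permutations are used.

```python
def cyclic_permutation(g, shift):
    """Permutation matrix sending group `row` to group (row + shift) mod g."""
    return [[1 if (row + shift) % g == col else 0 for col in range(g)]
            for row in range(g)]


def add_matrices(matrices):
    """Entrywise sum of a list of equally sized square matrices."""
    g = len(matrices[0])
    return [[sum(m[r][c] for m in matrices) for c in range(g)]
            for r in range(g)]


G = 4
permutations = [cyclic_permutation(G, s) for s in range(G)]
mesh = add_matrices(permutations)

# Every group pair is covered exactly once, i.e. the uniform mesh pattern.
print(mesh == [[1] * G for _ in range(G)])
```

Scaling every permutation by the per-link rate 2RL/G recovers the mesh rates, and each permutation is a configuration that a single switch can realize.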
40
Hybrid Electro-Optical Architecture Using MEMS Switches
[Figure: Group/Rack 1 to Group/Rack G on each side, each with linecards 1 to L at 2R per linecard, interconnected through MEMS switches.]
41
When Linecards Fail
[Figure: Group/Rack 1 to Group/Rack G on each side, each with linecards 1 to L at 2R per linecard, interconnected through MEMS switches.]
42
Fiber Link Capacity
[Figure: Group/Rack 1 to Group/Rack G on each side, each with linecards 1 to L at 2R per linecard, interconnected through MEMS switches; each fiber is fed by a laser/modulator and a MUX.]
Link Capacity ≈ 64 λ’s * 5 Gb/s/λ = 320 Gb/s = 2R
43
Example: 2 Groups of 2 Linecards
[Figure: Group/Rack 1 and Group/Rack 2 on each side, each with linecards 1 and 2 at 2R per linecard and 4R per group; the links between groups run at 2R.]
44
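Number of MEMS Switches
G groups, $L_i$ linecards in group $i$, $N = \sum_{i=1}^{G} L_i$, $L = \max_k L_k$
Theorem: $M \equiv L + G - 1$ MEMS switches are sufficient for bandwidth.
Examples:
$N = 640,\ L = 16,\ G = 40 \Rightarrow M = 55$
$L = G = \sqrt{N} \Rightarrow M \approx 2\sqrt{N}$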
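The bound in the theorem is easy to evaluate for a given arrangement of linecards. The short Python sketch below does so; the function name and the list-of-group-sizes input format are illustrative assumptions.

```python
def mems_switches_sufficient(linecards_per_group):
    """M = L + G - 1, with G the number of groups and L the largest group size."""
    g = len(linecards_per_group)
    l = max(linecards_per_group)
    return l + g - 1


# 640 linecards arranged as 40 groups of 16 linecards.
print(mems_switches_sufficient([16] * 40))

# 2 groups of 2 linecards.
print(mems_switches_sufficient([2, 2]))

# Unequal groups: only the largest group size enters the bound.
print(mems_switches_sufficient([16, 8, 4]))
```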
45
Packet Schedule
[Figure: Group A and Group B on each side, each with linecards 1 and 2 at 2R per linecard and 4R per group; the links between groups run at 2R.]
46
Rules for Packet Schedule
At each time-slot:
- Each transmitting linecard sends one packet
- Each receiving linecard receives one packet
- (MEMS constraint) Each transmitting group i sends at most one packet to each receiving group j through each MEMS connecting them
In a schedule of N time-slots:
- Each transmitting linecard sends exactly one packet to each receiving linecard
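These rules can be turned into a small checker. The Python sketch below is a simplified illustration: it assumes that the same linecards act as transmitters and receivers and that exactly one MEMS connection joins each pair of groups in a time-slot, which fits the example of 2 groups of 2 linecards but is not stated as a general rule. The function name and data layout are hypothetical; the two sample schedules are the 4-slot schedules for linecards A1, A2, B1, B2 shown in this deck.

```python
def schedule_violations(schedule, group_of):
    """Return a list of (where, reason) tuples; an empty list means valid.

    schedule maps each transmitting linecard to its receiver in each time-slot.
    """
    senders = list(schedule)
    n_slots = len(next(iter(schedule.values())))
    violations = []
    for t in range(n_slots):
        receivers = [schedule[tx][t] for tx in senders]
        if len(set(receivers)) != len(receivers):
            violations.append((t, "a receiver gets more than one packet"))
        # Assumed MEMS constraint: one packet per (tx group, rx group) per slot.
        pairs = [(group_of[tx], group_of[schedule[tx][t]]) for tx in senders]
        if len(set(pairs)) != len(pairs):
            violations.append((t, "a group pair is used more than once"))
    for tx in senders:
        if sorted(schedule[tx]) != sorted(senders):
            violations.append((tx, "does not reach every receiver exactly once"))
    return violations


group_of = {"A1": "A", "A2": "A", "B1": "B", "B2": "B"}

bad = {"A1": ["A1", "A2", "B1", "B2"],
       "A2": ["B2", "A1", "A2", "B1"],
       "B1": ["B1", "B2", "A1", "A2"],
       "B2": ["A2", "B1", "B2", "A1"]}

good = {"A1": ["A1", "A2", "B1", "B2"],
        "A2": ["B2", "B1", "A2", "A1"],
        "B1": ["B1", "B2", "A1", "A2"],
        "B2": ["A2", "A1", "B2", "B1"]}

print(schedule_violations(good, group_of))
print(schedule_violations(bad, group_of))
```

Slot indices in the result are zero-based, so index 1 corresponds to T+2 and index 3 to T+4.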
47
Packet Schedule
             T+1  T+2  T+3  T+4
Tx Group A
  Tx LC A1   ?    ?    ?    ?
  Tx LC A2   ?    ?    ?    ?
Tx Group B
  Tx LC B1   ?    ?    ?    ?
  Tx LC B2   ?    ?    ?    ?
48
Packet Schedule
             T+1  T+2  T+3  T+4
Tx Group A
  Tx LC A1   A1   A2   B1   B2
  Tx LC A2   B2   A1   A2   B1
Tx Group B
  Tx LC B1   B1   B2   A1   A2
  Tx LC B2   A2   B1   B2   A1
49
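Bad Packet Schedule
             T+1  T+2  T+3  T+4
Tx Group A
  Tx LC A1   A1   A2   B1   B2
  Tx LC A2   B2   A1   A2   B1
Tx Group B
  Tx LC B1   B1   B2   A1   A2
  Tx LC B2   A2   B1   B2   A1
At T+2 both Group A linecards send to Group A, and at T+4 both send to Group B, so a transmitting group sends two packets to the same receiving group in one time-slot.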
50
Group Schedule
             T+1  T+2  T+3  T+4
Tx Group A   AB   AB   AB   AB
Tx Group B   AB   AB   AB   AB
51
Good Packet Schedule
             T+1  T+2  T+3  T+4
Tx Group A
  Tx LC A1   A1   A2   B1   B2
  Tx LC A2   B2   B1   A2   A1
Tx Group B
  Tx LC B1   B1   B2   A1   A2
  Tx LC B2   A2   A1   B2   B1
Theorem: There exists a polynomial-time algorithm that finds the correct packet schedule.
52
53
Summary
The load-balanced switch
- Does not need any centralized scheduling
- Can use a mesh
Using FOFF
- It keeps packets in order
- It guarantees 100% throughput
Using the hybrid electro-optical architecture
- It scales to high port numbers
- It tolerates linecard failure
54
Summary of Contributions
Load-Balanced Switch
I. Keslassy and N. McKeown, “Maintaining Packet Order in Two-Stage Switches,” Proceedings of IEEE Infocom '02, New York, June 2002.
I. Keslassy, S.-T. Chuang, K. Yu, D. Miller, M. Horowitz, O. Solgaard and N. McKeown, “Scaling Internet Routers Using Optics,” ACM SIGCOMM '03, Karlsruhe, Germany, August 2003. Also in Computer Communication Review, vol. 33, no. 4, p. 189, October 2003.
I. Keslassy, S.-T. Chuang and N. McKeown, “A Load-Balanced Switch with an Arbitrary Number of Linecards,” to appear in Proceedings of IEEE Infocom ’04, Hong Kong, March 2004.
I. Keslassy, C.-S. Chang, N. McKeown and D.-S. Lee, “Maximizing the Throughput of Fixed Interconnection Networks,” in preparation.
55
Summary of Contributions
Packet-Switch Scheduling
I. Keslassy and N. McKeown, “Analysis of Scheduling Algorithms That Provide 100% Throughput in Input-Queued Switches,” Proceedings of the 39th Annual Allerton Conference on Communication, Control, and Computing, Monticello, Illinois, October 2001.
I. Keslassy, M. Kodialam, T. V. Lakshman and D. Stiliadis, “On Guaranteed Smooth Scheduling for Input-Queued Switches,” Proceedings of IEEE Infocom '03, San Francisco, California, April 2003.
I. Keslassy, R. Zhang-Shen and N. McKeown, “Maximum Size Matching is Unstable for Any Packet Switch,” IEEE Communications Letters, Vol. 7, No. 10, pp. 496-498, Oct. 2003.
I. Keslassy, M. Kodialam, T. V. Lakshman and D. Stiliadis, “On Guaranteed Smooth Scheduling for Input-Queued Switches,” submitted to IEEE/ACM Transactions on Networking.
56
Summary of Contributions
Scheduling in Optical Networks
I. Keslassy, M. Kodialam, T. V. Lakshman and D. Stiliadis, “Scheduling Schemes for Delay Graphs with Applications to Optical Packet Networks,” to appear in Proceedings of IEEE HPSR ’04, Phoenix, Arizona, April 2004.
Scheduling in Wireless Networks
I. Keslassy, M. Kodialam and T. V. Lakshman, “Faster Algorithms for Minimum-Energy Scheduling of Wireless Data Transmissions,” Modeling and Optimization in Mobile, Ad Hoc and Wireless Networks (WiOpt '03), INRIA Sophia-Antipolis, France, March 2003.
57
Summary of Contributions
Router Buffer Sizing
G. Appenzeller, I. Keslassy and N. McKeown, “Sizing Router Buffers,” submitted to ACM SIGCOMM ’04.
Image Classification
I. Keslassy, M. Kalman, D. Wang, and B. Girod, “Classification of Compound Images Based on Transform Coefficient Likelihood,” Proceedings of the International Conference on Image Processing (ICIP '01), Thessaloniki, Greece, October 2001.
58
Merci!
- Nick McKeown
- Balaji Prabhakar
- Mark Horowitz, David Miller, Olav Solgaard
- John and Kate Wakerly (Stanford Graduate Fellowship)
- SNRC, DARPA/MARCO, Cisco, NSF
- Da Rui and Nandita
- Group Members: Gireesh, Greg, Guido, Martin, Masayoshi, Matthew, Mingjie, Pablo, Sundar, Theresa, Yashar
- Friends and Colleagues: Abtin, Alan, Allen, Amalia, Amelia, Anamaya, Ananthan, Arjun, Athina, Bill, Brian, Chang, Chandra, Changhua, Chao-Kai, Chao-Lin, Christine, Christophe, Damon, Dana, Daniel, Danny, David, Denise, Derek, Devavrat, Dimitri, Elif, Emilio, Eric, Flavio, Giulio, Hanna, In-Sung, Ingrid, Joachim, Jonathan, Ken, Kevin, Kostas, Kyoungsik, Lakshman, Laurence, Lizzi, Marcy, Marissa, Mark, Maureen, Max-David, Mayank, Milind, Mina, Mohsen, Murali, Myles, Nathan, Neda, Neha, Nick, Ofer, Paolo, Pascal, Paul, Peter, Prashanth, Rivi, Rong, Ruben, Ryan, Sam, Sylvia, Tali, Vinayak, Vincent, Yoav, … and the audience!
- In memory of my departed grandparents Z’’L.
- To My Family: Mamie, Papa, Maman, Michael and the numerous cousins…
Thank you.