Wednesday, 19 May 2010

10 GbE Virtual Networking Performance for Virtualized Windows and Linux Machines

vNIC Networking

Existing performance improvement techniques:

Minimize copying during Tx
Make use of TSO (TCP Segmentation Offload)
Moderate the virtual interrupt rate (heuristic-based)
Generally reduce the number of “VMExits”
NetQueue for scaling with multiple vNICs
Limited use of LRO (for guest OSes that support it)
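TSO and LRO only help when the guest actually exposes those offloads on its vNIC, so it is worth checking them inside the guest. Below is a minimal sketch, assuming a Linux guest with the standard ethtool utility installed; the interface name eth0 is just an example.

#!/usr/bin/env python
# Report the TSO/LRO state of a vNIC inside a Linux guest (illustrative sketch).
# Assumes the standard ethtool utility is available; "eth0" is a placeholder.
import subprocess
import sys

def offload_state(iface, feature):
    # Parse `ethtool -k IFACE` output and return "on"/"off" for the feature.
    out = subprocess.check_output(["ethtool", "-k", iface], universal_newlines=True)
    for line in out.splitlines():
        if line.strip().startswith(feature + ":"):
            return line.split(":", 1)[1].strip().split()[0]
    return "unknown"

if __name__ == "__main__":
    iface = sys.argv[1] if len(sys.argv) > 1 else "eth0"
    for feature in ("tcp-segmentation-offload", "large-receive-offload"):
        print("%-26s %s" % (feature, offload_state(iface, feature)))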

VMDirectPath technology

Direct VM access to device hardware
FPT (Fixed Passthrough) in ESX 4.0; not compatible with VMotion

Ways of Measuring Virtual Networking Performance

Metrics

Bandwidth
Packet rate, particularly when packet sizes are small (see the calculation sketch after this list)

Scaling within VM
Increase number of connections
Increase number of vCPUs

Scaling across VMs
Increase number of VMs
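Bandwidth alone hides the per-packet cost, so it helps to translate a throughput figure into an approximate packet rate. The small calculation below uses illustrative values only; it shows why ~9 Gbps of standard-MTU traffic already means several hundred thousand packets per second, and why small packets push the rate much higher.

# Convert a TCP throughput figure into an approximate packet rate.
# Values are illustrative; 1500 bytes is the standard Ethernet MTU payload.

def packets_per_second(throughput_gbps, payload_bytes):
    # Approximate packets/s needed to carry the given payload throughput.
    return throughput_gbps * 1e9 / (payload_bytes * 8)

# ~9 Gbps of MTU-size packets -> roughly 750k packets/s
print("%.0f pkts/s" % packets_per_second(9.0, 1500))
# The same bandwidth with small 500-byte packets triples the packet rate
print("%.0f pkts/s" % packets_per_second(9.0, 500))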

Test Platform Systems:

ESX
2-socket, Quad-core Intel Xeon X5560 @ 2.80 GHz (Nehalem) system
Each core has L1 and 256KB L2 caches
Each socket has shared 8MB L3 cache
6 GB RAM (DDR3-1066 MHz)
pNIC: Intel 82598EB (Oplin) 10GigE, PCIe x8
ESX 4.0

Other machine
2-socket Intel Xeon X5335 @ 2.66 GHz (Clovertown)
RHEL 5.1
Intel Oplin 10GigE NIC (ixgbe driver, version 1.3.16.1-lro); 8 Rx queues, 1 Tx queue
16GB RAM

Microbenchmark:
Netperf, 5 TCP connections
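Such a run can be driven by launching several netperf TCP_STREAM instances in parallel. The sketch below is only an illustration of that harness; the target address, test duration, and buffer sizes are placeholder values, and it assumes netperf is installed locally with netserver already running on the target machine.

#!/usr/bin/env python
# Launch N parallel netperf TCP_STREAM tests (illustrative sketch).
# The target host, duration, and buffer sizes below are placeholders.
import subprocess

TARGET = "10.0.0.2"       # netserver host (placeholder address)
CONNECTIONS = 5           # matches the 5-connection microbenchmark
DURATION = 60             # seconds per test
SOCKET_SIZE = 64 * 1024   # local/remote socket buffer size (-s/-S), bytes
MESSAGE_SIZE = 16 * 1024  # send size (-m), bytes

procs = []
for _ in range(CONNECTIONS):
    cmd = ["netperf", "-H", TARGET, "-t", "TCP_STREAM", "-l", str(DURATION),
           "--", "-s", str(SOCKET_SIZE), "-S", str(SOCKET_SIZE),
           "-m", str(MESSAGE_SIZE)]
    procs.append(subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                  universal_newlines=True))

# Print the final line of each netperf report, which carries the throughput.
for p in procs:
    out, _ = p.communicate()
    print(out.strip().splitlines()[-1])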

Single vNIC TCP Performance: Linux VM



Results with RHEL5 VM:

Test configs:
Spectrum of socket and message sizes

Tx and Rx both reach ~9 Gbps (~wire speed) with 64 kB or auto-tuned socket sizes
Rx bandwidth of 9+ Gbps => over 800k Rx pkts/s (standard MTU-size packets)
Very small 8 kB socket size:

Latency bound; reaches ~2 Gbps throughput

Number of vCPUs makes little difference in the micro-benchmark

Slight drop in Rx throughput going from 2 to 4 vCPUs due to cache effects
vSMP: additional CPU cycles for applications


Single vNIC TCP Performance: Windows VM


Results with a Windows Server 2008 VM (Enterprise Edition, SP1):

Very similar to Linux VM performance; key differences:

Windows Tx does not use auto-tuning
Rx throughput reaches peak of ~9Gbps with 2 vCPUs
Rx throughput higher than Linux at smaller socket sizes for vSMP VMs


TCP Throughput Scaling with # Connections


Results with Win2k8, 2-vCPU VM:

Large socket size runs:
Reach 9+ Gbps with very few connections (just a bit over 4)

Small socket size, moderate message size:
- Throughput continues to scale as the number of connections increases to 20
- Latency bound

Small socket, very small message size:
Throughput flattens out at close to 3 Gbps for Rx and close to 2 Gbps for Tx


Multi-VM Scaling: RHEL5 UP VMs


VMs are uniprocessor (UP) RHEL5 VMs

For large socket sizes, 9+ Gbps (wire speed) sets the limit

Slight throughput increase going to 2 VMs
No throughput drop as more VMs are added

For small socket size, throughput scales as more VMs are added

For Tx, all the way through 8 VMs
For Rx, scaling flattens out after 4 to 5 VMs


In all cases, aggregate throughput exceeds 5 Gbps

No scalability limit imposed by virtualization, only the physical limit! :)


Multi-VM Scaling: Win2k8 UP VMs


VMs are uniprocessor (UP) Win2k8 VMs

For large socket sizes, 9+ Gbps (wire speed) sets the limit
Very similar to Linux VM case; differences:
Large socket size: slightly lower Rx throughput at single VM
Small socket, moderate message size (512 bytes): Rx scales extremely well, reaching 9+ Gbps
Small socket sizes: Tx throughput somewhat lower than achieved with the RHEL5 VM

In all cases, aggregate throughput exceeds 4 Gbps

Key difference in the medium-socket Rx case (8 kB socket, 512-byte messages): Windows shows higher throughput than Linux because Windows generates more acknowledgments


TSO’s Role in Tx Throughput


TSO plays a significant role in the netperf Tx microbenchmark
Large TSO packets (>25 kB average size) with Linux and auto-tuning of the socket size
Beneficial even for small message and socket sizes on Linux when transmitting fast enough for aggregation

TSO very beneficial in virtual networking
Zero-copy Tx + large TSO packets: amortizes network virtualization overhead across a lot of data

Motivates looking at packet rate as an additional performance metric

TSO benefits grow with socket size. Windows shows the reverse effect: lacking socket auto-tuning, it uses larger socket sizes than Linux.
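To see why the amortization matters, compare the fixed per-packet virtualization cost paid per MTU-sized frame versus per large TSO packet. The per-packet overhead below is an assumed placeholder rather than a measured value; only the ratio is the point.

# Back-of-the-envelope TSO amortization (assumed, not measured, overhead figure).
MTU_PAYLOAD = 1500            # bytes per wire-size segment
TSO_SIZE = 25 * 1024          # ~25 kB average TSO packet, as quoted above
PER_PACKET_OVERHEAD_US = 2.0  # assumed fixed virtualization cost per packet

def overhead_per_mb(packet_bytes):
    # Virtualization overhead (in microseconds) to move 1 MB at this packet size.
    return (1e6 / packet_bytes) * PER_PACKET_OVERHEAD_US

print("without TSO: %.0f us per MB" % overhead_per_mb(MTU_PAYLOAD))  # ~1333 us
print("with TSO:    %.0f us per MB" % overhead_per_mb(TSO_SIZE))     # ~78 us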


Network Utilization of Sample Workloads



Very significant workloads generate only a modest amount of network traffic!

Exchange Server: LoadGen, a tool for Exchange benchmarking
TPC-C-like benchmark: similar to TPC-C, with heavy CPU usage and transaction rates

The point: the previous tests showed how much throughput is achievable, yet real application network throughput for Exchange is far lower!


Network Utilization of Sample Workloads (2)

SPECweb2005


SPECweb2005 3 modules:
Banking: SSL connections
E-Commerce: a mix of SSL and non-SSL communication
Support: downloading patches, manuals, etc.

User sessions per module: 2300, 3200, and 2200, respectively

For the Support workload (the highest network-bandwidth workload):
Bandwidth usage highly skewed toward Tx bandwidth:
> 40 to 1 Tx to Rx bandwidth ratio
Tx traffic takes modest advantage of TSO (avg ~3x standard MTU size)
Rx traffic has small packets (avg ~500 bytes), mostly requests
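Putting those numbers together: a >40:1 byte ratio with ~4.5 kB average Tx packets and ~500-byte Rx packets works out to only about 4 to 5 Tx packets per Rx packet. The short sketch below simply restates that arithmetic with the approximate figures quoted above.

# Rough packet-rate ratio for the SPECweb2005 Support workload,
# using the approximate figures quoted above.
TX_TO_RX_BYTES = 40.0   # >40:1 Tx:Rx bandwidth ratio
AVG_TX_PKT = 3 * 1500   # Tx averages ~3x standard MTU thanks to TSO
AVG_RX_PKT = 500        # Rx packets are mostly small requests

tx_to_rx_packets = TX_TO_RX_BYTES * AVG_RX_PKT / AVG_TX_PKT
print("Tx:Rx packet ratio ~ %.1f : 1" % tx_to_rx_packets)  # ~4.4 : 1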

Workload studies references:
Microsoft Exchange Server 2007 Performance on VMware vSphere™ 4, http://www.vmware.com/resources/techresources/10021
SPECweb2005 Performance on ESX Server 3.5, http://www.vmware.com/resources/techresources/1031

Source: Virtual Network Performance
http://www.vmworld.com/docs/DOC-3875
