Virtualization – Part III: VMware
Ahmad Ibrahim

Topics Covered – My Cheat Sheet
- Virtualization (review): what is virtualization; definition of classical virtualization; trap-and-emulate; memory management
- x86 Virtualization: what are the challenges; memory tricks; binary translation
- Approaches to Server Virtualization (what are the solutions): full virtualization; paravirtualization (OS-assisted virtualization); hardware-assisted virtualization; charts
- Memory Management: memory tax (chart); ballooning; content-based page sharing
CS 5204 – Fall, 2009

Overview
- Virtualization
- x86 Virtualization
- Approaches to Server Virtualization
- Memory Resource Management Techniques

What is Virtualization?
[Figure: two virtual containers (App. A and App. B; App. C and App. D) running on a virtualization layer above the hardware]
Virtualization allows one computer to do the job of multiple computers by sharing the resources of a single hardware platform across multiple environments.

VMware Product Suite
- Desktop – runs in a host OS
  - VMware Workstation (1999) – runs on a PC
  - VMware Fusion – runs on Mac OS X
  - VMware Player – can run, but not create, images
- Server
  - VMware Server (GSX Server) – hosted on Linux or Windows
  - VMware ESX (ESX Server) – no host OS
  - VMware ESXi (ESX 3i) – freeware (July 2008)

Terminology
- Virtual Machine: an abstracted, isolated operating system environment
- Virtual Machine Monitor (VMM), aka Hypervisor: capable of virtualizing all hardware resources, including processors, memory, storage, and peripherals
[Figure: several VMMs running on a hypervisor that provides base functionality (e.g., scheduling) and enhanced functionality]

Popek & Goldberg: Virtualization Criteria
“Formal Requirements for Virtualizable Third Generation Architectures” (1974)
Properties of classical virtualization:
1. Equivalence (fidelity): a program running under a VMM should exhibit behavior identical to that of running on the equivalent machine.
2. Efficiency (performance): a statistically dominant fraction of machine instructions may be executed without VMM intervention.
3. Resource control (safety): the VMM is in full control of virtualized resources.

Strategies: CPU Virtualization
[Figure: a privileged instruction in the GuestOS traps into the VMM, which emulates the change on the resource]
- De-privileging: the VMM emulates the effect on system/hardware resources of privileged instructions whose execution traps into the VMM (aka trap-and-emulate).
- Typically achieved by running the GuestOS at a lower hardware privilege level than the VMM.
- Problematic on some architectures where sensitive instructions do not trap when executed at a deprivileged level.
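The trap-and-emulate loop can be sketched as a toy Python model. Everything here is an illustrative assumption, not VMware code: the `Hardware`, `VMM`, and `PrivilegeTrap` names are hypothetical, the privileged set is a tiny made-up subset, and the VMM only emulates a virtual interrupt flag.

```python
class PrivilegeTrap(Exception):
    """Raised when a privileged instruction executes at CPL > 0."""


PRIVILEGED = {"cli", "sti", "hlt"}  # hypothetical, illustrative subset only


class Hardware:
    """Toy CPU: privileged ops fault unless running at CPL 0."""

    def __init__(self, cpl: int) -> None:
        self.cpl = cpl
        self.interrupts_enabled = True  # the *real* interrupt flag

    def execute(self, op: str) -> None:
        if op in PRIVILEGED and self.cpl != 0:
            raise PrivilegeTrap(op)
        if op == "cli":
            self.interrupts_enabled = False
        elif op == "sti":
            self.interrupts_enabled = True
        # every other op is treated as harmless and "runs natively"


class VMM:
    """Hypothetical VMM that de-privileges the guest and emulates traps."""

    def __init__(self) -> None:
        self.hw = Hardware(cpl=1)  # guest kernel runs de-privileged (ring 1)
        self.virtual_if = True     # the guest's virtual interrupt flag
        self.traps = 0

    def run_guest(self, ops: list) -> None:
        for op in ops:
            try:
                self.hw.execute(op)  # non-privileged ops run directly
            except PrivilegeTrap:
                self.traps += 1
                self.emulate(op)     # effect applied to virtual state only

    def emulate(self, op: str) -> None:
        if op == "cli":
            self.virtual_if = False
        elif op == "sti":
            self.virtual_if = True


vmm = VMM()
vmm.run_guest(["add", "cli", "mov", "sti", "cli"])
print(vmm.traps, vmm.virtual_if, vmm.hw.interrupts_enabled)  # 3 False True
```

The guest's `cli` disables only its virtual interrupt flag; the real flag stays under VMM control, which is the point of de-privileging.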

Strategies: Memory Virtualization
- Primary/shadow structures
- Memory traces
Goals:
- Isolation/protection of Guest OS address spaces
- Efficient MM address translation
- Avoid the two levels of translation on every access
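A minimal Python sketch of the primary/shadow idea follows. It assumes a single-level page table per guest and models a memory trace as a hook on every guest page-table write. The class and method names are hypothetical. Real VMMs typically fill shadow entries lazily on hidden page faults. The eager update here is a simplification.

```python
class ShadowMMU:
    """Toy model: guest PT (virtual -> physical) plus VMM pmap (physical ->
    machine), collapsed into one shadow table (virtual -> machine)."""

    def __init__(self, pmap: dict) -> None:
        self.guest_pt: dict = {}  # primary structure, owned by the guest
        self.pmap = pmap          # guest physical page -> machine page (VMM-owned)
        self.shadow: dict = {}    # what the hardware actually walks

    def guest_write_pte(self, vpn: int, ppn: int) -> None:
        """Guest updates its page table; a (simulated) memory trace on that
        write lets the VMM keep the shadow entry consistent."""
        self.guest_pt[vpn] = ppn
        self.shadow[vpn] = self.pmap[ppn]

    def translate(self, vpn: int) -> int:
        """One lookup per access instead of guest_pt followed by pmap."""
        return self.shadow[vpn]


mmu = ShadowMMU(pmap={0: 42, 1: 7})  # hypothetical machine page numbers
mmu.guest_write_pte(vpn=5, ppn=1)
print(mmu.translate(5))  # 7
```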

Popek & Goldberg: Classically Virtualizable
According to Popek and Goldberg, “an architecture is virtualizable if the set of sensitive instructions is a subset of the set of privileged instructions.”
Is x86 virtualizable? No.
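The subset condition can be checked directly with Python sets. The instruction lists below are a small illustrative sample of pre-VT-x x86, not a complete classification. The set names and the `mov_to_cr3` spelling are assumptions made for the example.

```python
# Illustrative (not exhaustive) classification of some pre-VT-x x86 instructions.
SENSITIVE = {"lgdt", "lidt", "mov_to_cr3", "hlt",
             "popf", "pushf", "sgdt", "sidt", "smsw"}
PRIVILEGED = {"lgdt", "lidt", "mov_to_cr3", "hlt"}  # fault when CPL > 0


def classically_virtualizable(sensitive: set, privileged: set) -> bool:
    """Popek & Goldberg condition: sensitive instructions are a subset of privileged ones."""
    return sensitive <= privileged


offenders = SENSITIVE - PRIVILEGED  # sensitive but do not trap
print(classically_virtualizable(SENSITIVE, PRIVILEGED), sorted(offenders))
# False ['popf', 'pushf', 'sgdt', 'sidt', 'smsw']
```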

Overview
- Virtualization
- x86 Virtualization
- Approaches to Server Virtualization
- Memory Resource Management Techniques

Challenges to x86 Virtualization (1)
Lack of a trap when sensitive instructions run at user level
- Classic example: the popf instruction
- The same instruction behaves differently depending on execution mode:
  - User mode: changes ALU flags
  - Kernel mode: changes ALU and system flags
- Does not generate a trap in user mode
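The popf problem can be modeled in a few lines of Python. This is a simplified sketch: it uses the real EFLAGS bit positions but models only a few flags, treats IF as the only system flag, and assumes IOPL = 0. Under that assumption, a POPF at CPL > 0 leaves IF unchanged and raises no exception, so a de-privileged guest kernel silently fails to disable interrupts.

```python
# Real EFLAGS bit positions; only a handful of flags are modeled.
CF, PF, AF, ZF, SF, IF, OF = 0, 2, 4, 6, 7, 9, 11
ALU_FLAGS = sum(1 << bit for bit in (CF, PF, AF, ZF, SF, OF))
SYSTEM_FLAGS = 1 << IF  # simplification: only the interrupt flag


def popf(eflags: int, popped: int, cpl: int) -> int:
    """Simplified POPF returning the new EFLAGS; assumes IOPL = 0.

    At CPL 0, ALU and system flags are both loaded. At CPL > 0, the
    system-flag update is silently dropped and no exception is raised.
    """
    writable = (ALU_FLAGS | SYSTEM_FLAGS) if cpl == 0 else ALU_FLAGS
    return (eflags & ~writable) | (popped & writable)


# A guest kernel tries to disable interrupts by popping a value with IF = 0.
before = 1 << IF
native = popf(before, 0, cpl=0)        # on bare metal: IF cleared
deprivileged = popf(before, 0, cpl=1)  # guest kernel in ring 1: IF still set, no trap
print(bool(native & (1 << IF)), bool(deprivileged & (1 << IF)))  # False True
```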

Challenges to x86 Virtualization (2)
Visibility of privileged state
- Sensitive register instructions: read or change sensitive registers and/or memory locations, such as a clock register or interrupt registers
- Protection system instructions: reference the storage protection system, memory, or the address relocation system
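A toy Python model shows how privileged state leaks. It assumes a de-privileged guest, a GDT base chosen by the VMM, and a virtual GDTR that the VMM maintains. The register values and class names are hypothetical. LGDT is privileged and traps, so it can be emulated. SGDT reads the register without trapping, so the guest sees the VMM's real value rather than its own.

```python
from dataclasses import dataclass


@dataclass
class CPU:
    cpl: int   # current privilege level (0 = most privileged)
    gdtr: int  # real GDT base register, owned by the VMM


@dataclass
class VCPU:
    gdtr: int  # guest's *virtual* GDT base, maintained by the VMM


def lgdt(cpu: CPU, vcpu: VCPU, base: int) -> None:
    """Privileged write: at CPL > 0 it traps, and the VMM emulates it on the vCPU."""
    if cpu.cpl > 0:
        vcpu.gdtr = base  # trap-and-emulate path
    else:
        cpu.gdtr = base


def sgdt(cpu: CPU) -> int:
    """Sensitive read that does not trap at any CPL: returns the real register."""
    return cpu.gdtr


cpu = CPU(cpl=1, gdtr=0xFFFF8000)  # hypothetical VMM-chosen GDT base
vcpu = VCPU(gdtr=0)
lgdt(cpu, vcpu, 0x1000)            # guest loads its GDT (emulated)
print(hex(sgdt(cpu)), hex(vcpu.gdtr))  # 0xffff8000 0x1000
```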

Binary Translation: Characteristics
- Binary – input is machine-level code
- Dynamic – occurs at runtime
- On demand – code is translated only when needed for execution
- System level – makes no assumptions about guest code
- Subsetting – translates from the full instruction set to a safe subset
- Adaptive – adjusts code based on guest behavior to achieve efficiency
Most instructions are translated IDENT(ical), i.e., copied unchanged. Others, such as control-flow and privileged instructions, are rewritten.

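Binary Translation
[Figure: guest code ([x], [y]) is translated one TU at a time by the binary translator; each resulting CCF is stored in the translation cache, located via a hash table, and executed, with a continuation leading to the next TU (steps 1–5)]
- TC: translation cache
- TU: translation unit (usually a basic block)
- CCF: compiled code fragment
[Chart: % of time spent translating vs. running time: few cache hits at first, then translation drops off once the working set is captured]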
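The translate-on-miss cycle can be sketched in Python. This is a sketch under simplifying assumptions, not VMware's translator. Guest code is a hypothetical dict of basic blocks keyed by guest PC. "Translation" copies instructions IDENT except for a made-up sensitive set, which becomes callouts. "Execution" just records the CCF's contents. The point is that after warm-up the working set lives in the cache and lookups become hits.

```python
from __future__ import annotations

from dataclasses import dataclass

SENSITIVE = {"popf", "cli", "sti"}  # hypothetical, illustrative subset only


@dataclass
class BasicBlock:
    """A translation unit (TU): straight-line guest code plus its successor."""
    instrs: list[str]
    next_pc: int | None  # None means the guest halts


def translate(block: BasicBlock) -> list[str]:
    """Build a CCF: IDENT copy, except sensitive ops become callouts."""
    return [f"callout:{op}" if op in SENSITIVE else op for op in block.instrs]


class TranslationCache:
    def __init__(self, guest: dict[int, BasicBlock]) -> None:
        self.guest = guest
        self.tc: dict[int, list[str]] = {}  # hash table: guest PC -> CCF
        self.translations = 0
        self.hits = 0

    def lookup(self, pc: int) -> list[str]:
        if pc in self.tc:
            self.hits += 1
        else:  # miss: translate on demand, then cache
            self.tc[pc] = translate(self.guest[pc])
            self.translations += 1
        return self.tc[pc]

    def run(self, pc: int | None, max_blocks: int = 1000) -> list[str]:
        """'Execute' CCFs (here: record them), following each TU's successor."""
        trace: list[str] = []
        for _ in range(max_blocks):
            if pc is None:
                break
            trace.extend(self.lookup(pc))
            pc = self.guest[pc].next_pc
        return trace


# Hypothetical guest: block 0 falls into a loop between blocks 1 and 2.
guest = {
    0: BasicBlock(["mov", "add"], 1),
    1: BasicBlock(["popf", "cmp"], 2),
    2: BasicBlock(["sub"], 1),
}
tc = TranslationCache(guest)
trace = tc.run(0, max_blocks=10)
print(tc.translations, tc.hits)  # 3 7
```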

Eliminating Faults/Traps
- Privileged instructions – eliminated by simple binary translation (BT)
- Non-privileged instructions – eliminated by adaptive BT:
  (a) detect a CCF containing an instruction that traps frequently
  (b) generate a new translation of the CCF that avoids the trap (perhaps inserting a call-out to an interpreter), and patch the original translation to execute the new translation
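Steps (a) and (b) can be sketched as a small Python class. The trap threshold, the `jmp:` patch encoding, and op names such as `store_pte` are hypothetical choices made for illustration. The sketch does not reflect how VMware's translator represents CCFs.

```python
from __future__ import annotations


class AdaptiveTranslator:
    """Toy adaptive BT: retranslate a CCF once it has trapped `threshold` times."""

    def __init__(self, threshold: int = 3) -> None:  # threshold is hypothetical
        self.threshold = threshold
        self.tc: dict[str, list[str]] = {}  # CCF name -> ops
        self.trap_counts: dict[str, int] = {}

    def install(self, name: str, ccf: list[str]) -> None:
        self.tc[name] = ccf

    def execute(self, name: str, trapping_ops: set[str]) -> int:
        """Run a CCF and return how many traps it took."""
        ccf = self.tc[name]
        if ccf and ccf[0].startswith("jmp:"):  # follow the patched entry
            return self.execute(ccf[0][len("jmp:"):], trapping_ops)
        traps = 0
        for op in ccf:
            if op in trapping_ops:  # e.g. a write to memory the VMM traces
                traps += 1
                self._on_trap(name, op)
        return traps

    def _on_trap(self, name: str, op: str) -> None:
        # (a) detect a CCF whose instruction traps frequently
        self.trap_counts[name] = self.trap_counts.get(name, 0) + 1
        if self.trap_counts[name] == self.threshold:
            new_name = name + "'"
            # (b) new translation: the trapping op becomes a callout ...
            self.tc[new_name] = [f"callout:{o}" if o == op else o
                                 for o in self.tc[name]]
            # ... and the original translation is patched to jump to it
            self.tc[name] = [f"jmp:{new_name}"]


bt = AdaptiveTranslator()
bt.install("B1", ["mov", "store_pte", "add"])  # hypothetical ops
traps = [bt.execute("B1", {"store_pte"}) for _ in range(5)]
print(traps)  # [1, 1, 1, 0, 0]
```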

Binary Translation – Performance Advantages
Avoids privileged-instruction traps. Example: Pentium privileged instruction (rdtsc)
- Trap-and-emulate: 2030 cycles
- Callout-and-emulate: 1254 cycles
- BT emulation: 216 cycles (but the TSC value is stale)
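The relative cost of the three approaches follows from the cycle counts above. This short Python calculation uses only those quoted numbers and expresses each as a speedup over trap-and-emulate.

```python
# Cycle counts for emulating rdtsc, as quoted above.
cycles = {
    "trap-and-emulate": 2030,
    "callout-and-emulate": 1254,
    "BT emulation": 216,
}
baseline = cycles["trap-and-emulate"]
speedups = {method: round(baseline / c, 1) for method, c in cycles.items()}
for method, c in cycles.items():
    print(f"{method:<20} {c:>5} cycles  {speedups[method]:>4}x vs. trapping")
```

So a callout is roughly 1.6x cheaper than a trap, and in-cache BT emulation roughly 9.4x cheaper, at the cost of a stale TSC value.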

Overview
- Virtualization
- x86 Virtualization
- Approaches to Server Virtualization
- Memory Resource Management Techniques

Approaches to Server Virtualization
- 1st Generation: Full virtualization (binary translation)
  - Software based
  - VMware and Microsoft
- 2nd Generation: Paravirtualization
  - Cooperative virtualization
  - Modified guest
  - VMware, Xen
- 3rd Generation: Hardware-assisted virtualization
  - Unmodified guest
  - VMware and Xen on virtualization-aware hardware platforms
[Figure: virtual machine stacks for each generation, e.g., dynamic translation for full virtualization]

1st Generation: Full Virtualization

Full Virtualization – Drawbacks
- Hardware emulation comes with a performance price.
- In traditional x86 architectures, OS kernels expect to run privileged code in Ring 0.
  - However, because Ring 0 is controlled by the VMM, guest OSs are forced to execute at Ring 1/3, which requires the VMM to trap and emulate (or translate) their privileged instructions.
- Due to these performance limitations, paravirtualization and hardware-assisted virtualization were developed.
[Figure: Traditional x86 architecture – Application in Ring 3, Operating System in Ring 0. Full virtualization – Application in Ring 3, Guest OS in Ring 1/3, Virtual Machine Monitor in Ring 0.]

2nd Generation: Paravirtualization

Paravirtualization Challenges
- Guest OS and hypervisor are tightly coupled
  - Relies on separate kernels for native execution and for the virtual machine
  - Tight coupling inhibits compatibility
- Changes to the guest OS are invasive
  - Inhibits maintainability and supportability
  - Guest kernel must be recompiled when the hypervisor is updated

Hardware Support for Virtualization

Software vs. Hardware
Hardware extensions allow classical virtualization on the x86. The overhead comes with exits: if there are no exits, the guest runs at native speed.
Hardware advantages:
- Code density is preserved – no translation
- Precise exceptions – BT performs extra work to recover guest state for faults and interrupts in non-IDENT code
- System calls run without VMM intervention
Software advantages:
- Trap elimination – traps are replaced with callouts, which are usually faster
- Emulation speed – a callout goes directly to the emulation routine, whereas hardware must fetch and decode the trapping instruction, then emulate it
- Callout avoidance – BT can avoid many callouts by using in-TC emulation

Summary

Overview
- Virtualization
- x86 Virtualization
- Approaches to Server Virtualization
- Memory Resource Management Techniques

Memory Resource Management
VMM (meta-level) memory management:
- Must identify both the VM and the pages within that VM to replace
- VMM replacement decisions may have unintended interactions with the GuestOS page replacement policy
- Worst-case scenario: double paging
Strategies:
- Eliminating duplicate pages – even identical pages across different GuestOSs
  - The VMM has sufficient perspective
  - Clear savings when running numerous copies of the same GuestOS
- “Ballooning” – add memory demands on the GuestOS so that the GuestOS decides which pages to replace
  - Also used in Xen
- Allocation algorithm
  - Balances memory utilization vs. performance isolation guarantees
  - “Taxes” idle memory

Content-Based Page Sharing
- A hash table contains entries for shared pages already marked “copy-on-write”.
- A key for a candidate page is generated from a hash value of the page's contents.
- A full comparison is made between the candidate page and a page with a matching key value.
- Pages that match are shared: the VMM's mappings for both pages point to the same machine page.
- If no match is found, a “hint” frame is added to the hash table for possible future matches.
- Writing to a shared page causes a page fault, which causes a separate copy to be created for the writing GuestOS.
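The steps above can be sketched as a small Python class. This is a simplified sketch, not ESX's implementation. SHA-1 from `hashlib` is an arbitrary stand-in for the content hash. One dict maps both shared pages and hints. A write replaces the whole page. All names are hypothetical.

```python
import hashlib


class PageSharer:
    """Toy content-based page sharing with copy-on-write (COW)."""

    def __init__(self) -> None:
        self.machine = {}  # machine page number (MPN) -> contents
        self.refs = {}     # MPN -> number of mappings
        self.p2m = {}      # (VM, guest physical page) -> MPN
        self.table = {}    # content hash -> MPN (shared page or hint)
        self._next_mpn = 0

    def _alloc(self, data: bytes) -> int:
        mpn, self._next_mpn = self._next_mpn, self._next_mpn + 1
        self.machine[mpn], self.refs[mpn] = data, 1
        return mpn

    def _release(self, mpn: int) -> None:
        self.refs[mpn] -= 1
        if self.refs[mpn] == 0:
            del self.machine[mpn], self.refs[mpn]

    def map_page(self, vm: str, ppn: int, data: bytes) -> None:
        self.p2m[(vm, ppn)] = self._alloc(data)

    def scan(self, vm: str, ppn: int) -> bool:
        """Try to share one candidate page; return True if it was merged."""
        mpn = self.p2m[(vm, ppn)]
        key = hashlib.sha1(self.machine[mpn]).digest()  # key from page contents
        other = self.table.get(key)
        if (other is not None and other != mpn and other in self.machine
                and self.machine[other] == self.machine[mpn]):  # full comparison
            self.p2m[(vm, ppn)] = other  # both now map one machine page
            self.refs[other] += 1
            self._release(mpn)
            return True
        self.table[key] = mpn  # no match: record a hint
        return False

    def write(self, vm: str, ppn: int, data: bytes) -> None:
        mpn = self.p2m[(vm, ppn)]
        if self.refs[mpn] > 1:  # shared page: COW fault -> private copy
            self.refs[mpn] -= 1
            self.p2m[(vm, ppn)] = self._alloc(data)
        else:
            self.machine[mpn] = data


s = PageSharer()
zero = bytes(4096)
s.map_page("vm1", 0, zero)
s.map_page("vm2", 0, zero)
s.map_page("vm2", 1, b"x" * 4096)
merged = [s.scan(vm, ppn) for vm, ppn in list(s.p2m)]
print(merged, len(s.machine))  # [False, True, False] 2
s.write("vm2", 0, b"y" * 4096)  # breaks sharing via COW
print(len(s.machine))  # 3
```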

Page Sharing Performance
- Identical Linux systems running the same benchmark: a “best case” scenario
- A large fraction (67%) of memory is sharable
- A considerable amount and percentage of memory is reclaimed
- Aggregate system throughput is essentially unaffected

Ballooning: Inflate
Inflating the balloon:
- The balloon requests additional “pinned” pages from the GuestOS.
- Inflating the balloon causes the GuestOS to select pages to be replaced using its own page replacement policy.
- The balloon informs the VMM which physical page frames it has been allocated.
- The VMM frees the machine page frames corresponding to the physical page frames allocated to the balloon (thus freeing machine memory to allocate to other GuestOSs).

Ballooning: Deflate
Deflating the balloon:
- The VMM reclaims machine page frames.
- The VMM communicates with the balloon.
- The balloon unpins/frees the physical page frames corresponding to the new machine page frames.
- The GuestOS uses its page replacement policy to page in needed pages.
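Inflate and deflate can be sketched together in Python. This is a toy model under stated assumptions: the guest uses LRU replacement, and a plain Python set named `vmm_reclaimed` stands in for the VMM's record of which guest physical frames it may leave unbacked. The class names are hypothetical, and no real balloon-driver interface is implied. The key property shown is that the guest's own policy chooses which pages are evicted.

```python
from collections import OrderedDict


class GuestOS:
    """Toy guest with a fixed number of physical frames and LRU replacement."""

    def __init__(self, frames: int) -> None:
        self.free = list(range(frames))  # free guest physical frames (PPNs)
        self.resident = OrderedDict()    # guest page -> PPN, least recently used first
        self.paged_out = set()           # pages the guest chose to evict

    def alloc_frame(self) -> int:
        """Return a frame, evicting by the guest's own LRU policy if none is free."""
        if self.free:
            return self.free.pop()
        victim, ppn = self.resident.popitem(last=False)
        self.paged_out.add(victim)
        return ppn

    def touch(self, page: str) -> None:
        if page in self.resident:
            self.resident.move_to_end(page)
        else:
            self.paged_out.discard(page)  # page in (or first use)
            self.resident[page] = self.alloc_frame()


class Balloon:
    """Toy balloon driver; `reclaimed` stands in for the VMM's bookkeeping."""

    def __init__(self, guest: GuestOS, reclaimed: set) -> None:
        self.guest, self.reclaimed, self.pinned = guest, reclaimed, []

    def inflate(self, n: int) -> None:
        for _ in range(n):
            ppn = self.guest.alloc_frame()  # guest decides what to page out
            self.pinned.append(ppn)         # pinned: guest will not reuse it
            self.reclaimed.add(ppn)         # VMM may free the backing machine page

    def deflate(self, n: int) -> None:
        for _ in range(min(n, len(self.pinned))):
            ppn = self.pinned.pop()
            self.reclaimed.discard(ppn)     # VMM has backed this PPN again
            self.guest.free.append(ppn)     # guest can use it to page in


guest = GuestOS(frames=4)
for p in ["a", "b", "c", "d", "a"]:
    guest.touch(p)
vmm_reclaimed = set()
balloon = Balloon(guest, vmm_reclaimed)
balloon.inflate(2)
print(sorted(guest.paged_out), sorted(vmm_reclaimed))  # ['b', 'c'] [1, 2]
balloon.deflate(2)
guest.touch("b")  # guest pages "b" back in using a returned frame
```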

Measuring Cross-VM Memory Usage
Each GuestOS is given a number of shares, S, against the total available machine memory. The shares-per-page ratio represents the “price” that a GuestOS is willing to pay for a page of memory. The price is determined as follows:

$$\rho = \frac{S}{P\,\bigl(f + k\,(1 - f)\bigr)}$$

where S is the number of shares, P is the page allocation, f is the fractional usage, and k is the idle page cost.
The idle page cost is $k = 1/(1-\tau)$, where $0 \le \tau < 1$ is the “tax rate”, which defaults to 0.75.
The fractional usage, f, is determined by sampling (what fraction of 100 randomly selected pages are accessed in each 30-second period) and smoothing (using three different weights).
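The price formula can be computed directly. In Waldspurger's ESX Server design, memory is reclaimed from the VM with the lowest shares-per-page ratio. The sketch below assumes that rule. The share counts, page counts, and usage fractions are hypothetical. With τ = 0 an idle VM and a busy VM with equal shares and allocations pay the same price. With the default τ = 0.75 the idle VM's price drops, so it becomes the reclamation victim.

```python
from __future__ import annotations

import math


def idle_page_cost(tau: float) -> float:
    """k = 1 / (1 - tau), where tau is the idle memory tax rate, 0 <= tau < 1."""
    if not 0 <= tau < 1:
        raise ValueError("tax rate must satisfy 0 <= tau < 1")
    return 1.0 / (1.0 - tau)


def shares_per_page(shares: float, pages: int, f: float, tau: float = 0.75) -> float:
    """rho = S / (P * (f + k * (1 - f))); idle pages are charged k times more."""
    k = idle_page_cost(tau)
    return shares / (pages * (f + k * (1.0 - f)))


def pick_victim(vms: dict[str, tuple[float, int, float]], tau: float = 0.75) -> str:
    """Reclaim from the VM with the lowest ratio (it 'pays' least per page)."""
    return min(vms, key=lambda name: shares_per_page(*vms[name], tau=tau))


# Hypothetical VMs: (shares S, pages P, fractional usage f).
vms = {"VM1": (1000, 65536, 0.1), "VM2": (1000, 65536, 0.9)}
print(pick_victim(vms))  # VM1
```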

Memory Tax Experiment
- VM1 idles; VM2 runs a memory-intensive workload.
- Initially, with τ = 0 (no idle memory tax), VM1 and VM2 converge to the same memory allocation despite VM2's greater need for memory.
- When the idle memory tax is applied at the default level (75%), VM1 relinquishes memory to VM2, which improves the performance of VM2 by over 30%.

