In the HVM case the "PV kernel" is actually hvmloader and isn't PV at all. The job of hvmloader is to load the information that qemu will use to build the environment that "fools" an unmodified guest into believing it is running on a physical machine. In summary, hvmloader pretty much just fills in data structures, as follows:
- Start info page (not sure if present for hvm)
- Shared info page (not sure if present for hvm)
- BIOS (necessary for real mode during booting); this includes the interrupt vector table (IVT).
In qemu terminology the "kernel" image is located at /usr/lib/xen/boot/hvmloader.
The BIOS is provided by hvmloader: it is a binary blob within hvmloader that is loaded at the correct address in guest physical space. It is not provided by qemu, although it obviously needs to work in concert with qemu. Currently Xen uses its own BIOS (rombios, originally from the Bochs project) and is in the process of switching to SeaBIOS, which also happens to be the BIOS used by upstream qemu and KVM. Information on working with upstream qemu and SeaBIOS (the future of Xen) is here: QEMUUpstream
When hvmloader is finished it jumps to the BIOS entry point. From this point things look mostly like a native machine booting. Qemu uses the hvmloader binary blob as a kernel (hence the kernel parameter in the description file) when building the domain; the BIOS then boots the guest kernel (this is done via the block device that qemu provides to reach the guest kernel image, or via a CD-ROM using El Torito).
Qemu is very relevant in Xen since it provides the ability to run unmodified guests. It does this by fooling the guest into thinking that it is in fact on a physical machine, so qemu is in charge of:
- physical to pseudo-physical address translation
- emulated device communication with host hardware
- syscall to hypercall translation
- more (I still do not have the complete picture here)
Device Model: Currently Xen uses a modified (stripped-down) version of qemu called qemu-dm. Its job is to run as a user-space process and mediate between the guest and the host using emulated devices. In general, all device access on x86 is done via either memory-mapped or I/O-mapped reads and writes (this includes things like PCI CFG space accesses as well as accesses to registers on specific devices). The process works as follows:
- In an HVM guest we can trap on any such read or write.
- For memory-mapped I/O we do this via shadow page tables (see below) or HAP, basically by marking those regions as unmapped and figuring it out in the hypervisor's page fault handler.
- For I/O-mapped (port) I/O, the VM exit condition is based on flags in the Virtual Machine Control Structure (VMCS), which cause a trap when the guest does I/O. When the guest executes such an instruction, the CPU itself (not qemu) performs the VM exit: it saves the guest state, switches to the host CPU state and hands control to the hypervisor, which then makes the appropriate decisions.
- So now we are in the hypervisor and we know that the guest trapped on a specific instruction trying to do some sort of I/O.
- So the hypervisor emulates the instruction and figures out if it was a read or a write, how big it was, what the address was, what the value was, etc.
- This gets packaged up as an IOREQ and placed on a shared memory ring which is shared with the relevant userspace qemu process.
- How does qemu-dm deliver the info to the VM?
- qemu's sole purpose is basically to read IOREQs from that shared ring and handle the emulation of whichever device is at the address.
- Device emulations ("device models") register handlers for various addresses, and the core qemu code keeps track of them and calls the right ones.
- There is also an event channel between the hypervisor and qemu, so qemu knows when to look into the ring for more stuff to do.
- Once the emulation is complete, qemu signals the hypervisor, which resumes the VCPU (which has been blocked this whole time) at the point right after the trapping instruction.
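The ring-and-handlers flow described above can be sketched in a few lines of Python. This is only a toy model, not Xen's or qemu's actual code: the `IoReq` fields, the `DeviceModel` class, the `ScratchRegister` device and port `0x80` are all hypothetical names chosen for illustration, and the event channel and VCPU blocking are left out. Returning all ones for a read that no device claims is also an assumption of this sketch.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class IoReq:
    """Hypothetical, simplified I/O request (not Xen's real ioreq layout)."""
    addr: int
    size: int          # access width in bytes
    is_write: bool
    data: int = 0      # value being written (ignored for reads)


Handler = Callable[[IoReq], int]


class DeviceModel:
    """Toy stand-in for the core device-model loop."""

    def __init__(self) -> None:
        self.ring: deque = deque()                      # stands in for the shared ring
        self.handlers: Dict[Tuple[int, int], Handler] = {}

    def register(self, start: int, length: int, handler: Handler) -> None:
        # A device emulation claims the address range [start, start + length).
        self.handlers[(start, start + length)] = handler

    def find_handler(self, addr: int) -> Optional[Handler]:
        for (start, end), handler in self.handlers.items():
            if start <= addr < end:
                return handler
        return None

    def service_ring(self) -> List[int]:
        # Drain the ring, dispatching each request to the device that owns it.
        results = []
        while self.ring:
            req = self.ring.popleft()
            handler = self.find_handler(req.addr)
            if handler is not None:
                results.append(handler(req))
            elif req.is_write:
                results.append(0)                       # unclaimed write: dropped
            else:
                results.append((1 << (8 * req.size)) - 1)  # assumption: all ones
        return results


class ScratchRegister:
    """Hypothetical one-byte device that remembers the last value written."""

    def __init__(self) -> None:
        self.value = 0

    def handle(self, req: IoReq) -> int:
        if req.is_write:
            self.value = req.data & 0xFF
            return 0
        return self.value


dm = DeviceModel()
scratch = ScratchRegister()
dm.register(0x80, 1, scratch.handle)

# The "hypervisor" side queues a write followed by a read of the same port.
dm.ring.append(IoReq(addr=0x80, size=1, is_write=True, data=0x5A))
dm.ring.append(IoReq(addr=0x80, size=1, is_write=False))
results = dm.service_ring()   # results == [0, 0x5A]
```

In the real system the value produced for a read travels back to the hypervisor, which places it in the guest's register before resuming the VCPU; here it is simply collected in `results`.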
This is a quick explanation extracted from The Definitive Guide to the Xen Hypervisor. With shadow page tables, whenever the guest attempts to update CR3 (described further below) or modify the page tables, the CPU traps into the hypervisor and allows it to emulate the update.
Nested Page Tables
NPT. Each guest is allowed to manipulate CR3 directly; however, the semantics of the register are modified. The guest sees a completely virtualized address space and only sets up mappings within the range allocated by the hypervisor. The hypervisor manipulates the MMU to set up the mappings, but does not need to get involved while the guest is running. This is accomplished using a tagged translation lookaside buffer (TTLB). Each TLB entry has a virtual machine identifier associated with it and is only valid within the virtual machine for which it was created. Intel and AMD chips both include such a TTLB, but the implementations differ.
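To make the tagging idea concrete, here is a minimal sketch of a TLB whose entries are keyed by a virtual machine identifier as well as the virtual page. It is a software toy, not a description of how Intel or AMD hardware implements it; `TaggedTLB`, `vmid` and the page and frame numbers are hypothetical names and values.

```python
from typing import Dict, Optional, Tuple


class TaggedTLB:
    """Toy TLB whose entries carry a VM identifier tag."""

    def __init__(self) -> None:
        # (vmid, virtual page number) -> physical frame number
        self.entries: Dict[Tuple[int, int], int] = {}

    def insert(self, vmid: int, vpage: int, frame: int) -> None:
        self.entries[(vmid, vpage)] = frame

    def lookup(self, vmid: int, vpage: int) -> Optional[int]:
        # An entry only hits for the VM it was created for.
        return self.entries.get((vmid, vpage))

    def flush_vm(self, vmid: int) -> None:
        # Drop one VM's entries and leave everyone else's cached.
        self.entries = {k: v for k, v in self.entries.items() if k[0] != vmid}


tlb = TaggedTLB()
tlb.insert(vmid=1, vpage=0x10, frame=0xAAA)
tlb.insert(vmid=2, vpage=0x10, frame=0xBBB)

hit_vm1 = tlb.lookup(1, 0x10)    # 0xAAA
hit_vm2 = tlb.lookup(2, 0x10)    # 0xBBB, same virtual page, different VM
miss_vm3 = tlb.lookup(3, 0x10)   # None, no entry tagged for VM 3
```

Because the tag is part of the key, switching between VM 1 and VM 2 does not require flushing the whole TLB: each VM simply stops matching the other's entries.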
CR3 is the CPU register used when virtual addressing is enabled, that is, when the PG bit is set in CR0. CR3 enables the processor to translate virtual addresses into physical addresses by locating the page directory and page tables for the current task. Typically, the upper 20 bits of CR3 become the page directory base register (PDBR).
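As a worked illustration, the sketch below extracts the PDBR from the upper 20 bits of CR3 and walks a two-level page table. It assumes classic 32-bit paging with 4 KiB pages only (no PAE, no 4 MiB pages, no permission checks), and the function names, the `memory` dictionary and all addresses are made up for the example.

```python
from typing import Callable, Optional, Tuple

PAGE_MASK = 0xFFFFF000   # upper 20 bits: a 4 KiB-aligned physical address
PRESENT = 0x1            # bit 0 of a page directory / page table entry


def pdbr_from_cr3(cr3: int) -> int:
    # The low 12 bits of CR3 hold flags, not address bits.
    return cr3 & PAGE_MASK


def split_vaddr(vaddr: int) -> Tuple[int, int, int]:
    # 10 bits of directory index, 10 bits of table index, 12 bits of offset.
    return (vaddr >> 22) & 0x3FF, (vaddr >> 12) & 0x3FF, vaddr & 0xFFF


def translate(cr3: int, vaddr: int,
              read_u32: Callable[[int], int]) -> Optional[int]:
    dir_index, table_index, offset = split_vaddr(vaddr)
    pde = read_u32(pdbr_from_cr3(cr3) + 4 * dir_index)
    if not pde & PRESENT:
        return None      # would be a page fault on real hardware
    pte = read_u32((pde & PAGE_MASK) + 4 * table_index)
    if not pte & PRESENT:
        return None
    return (pte & PAGE_MASK) | offset


# Hypothetical physical memory holding one directory entry and one table entry.
memory = {
    0x1004: 0x2001,      # directory entry 1 -> page table at 0x2000, present
    0x200C: 0x5001,      # table entry 3     -> frame at 0x5000, present
}


def read_u32(addr: int) -> int:
    return memory.get(addr, 0)


cr3 = 0x00001018                                 # PDBR 0x1000 plus flag bits
paddr = translate(cr3, 0x00403ABC, read_u32)     # 0x5ABC
unmapped = translate(cr3, 0x00800000, read_u32)  # None
```

The address 0x00403ABC splits into directory index 1, table index 3 and offset 0xABC, so the walk reads the entries at 0x1004 and 0x200C and ends at physical address 0x5ABC.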
Processor support for virtualization is provided by a form of processor operation called VMX operation. There are two kinds of VMX operation: VMX root operation and VMX non-root operation. In general, a VMM will run in VMX root operation and guest software will run in VMX non-root operation. Transitions between VMX root operation and VMX non-root operation are called VMX transitions. There are two kinds of VMX transitions. Transitions into VMX non-root operation are called VM entries (performed with the VMLAUNCH and VMRESUME instructions; VMCALL, by contrast, is an instruction the guest uses to call into the VMM, and it causes a VM exit). Transitions from VMX non-root operation to VMX root operation are called VM exits (VMEXIT). See attached file (pdf without filename extension).
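The two kinds of transition can be pictured as a tiny state machine. This is purely a conceptual sketch; the `Mode` and `ToyCpu` names and the exit reason string are hypothetical, and nothing here corresponds to real VMX instructions or VMCS fields.

```python
from enum import Enum


class Mode(Enum):
    ROOT = "VMX root operation"          # where the VMM runs
    NON_ROOT = "VMX non-root operation"  # where guest software runs


class ToyCpu:
    """Tracks which VMX mode a (pretend) CPU is in."""

    def __init__(self) -> None:
        self.mode = Mode.ROOT
        self.log = []

    def vm_entry(self) -> None:
        # Transition into VMX non-root operation.
        if self.mode is not Mode.ROOT:
            raise RuntimeError("VM entry is only possible from VMX root operation")
        self.mode = Mode.NON_ROOT
        self.log.append("VM entry")

    def vm_exit(self, reason: str) -> None:
        # Transition from VMX non-root back to VMX root operation.
        if self.mode is not Mode.NON_ROOT:
            raise RuntimeError("VM exit is only possible from VMX non-root operation")
        self.mode = Mode.ROOT
        self.log.append("VM exit: " + reason)


cpu = ToyCpu()
cpu.vm_entry()                    # VMM starts running the guest
cpu.vm_exit("I/O instruction")    # guest touches an emulated device
cpu.vm_entry()                    # VMM resumes the guest after handling it
```

After these three calls the CPU is back in non-root operation and `cpu.log` holds the entry, the exit with its reason, and the second entry, which mirrors the trap-and-resume cycle described in the device model section.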