I recently decided to rebuild my entire computer system and go to a fully virtualized arrangement, including GPU passthrough, etc. And what a fight it was. After a lot of googling and finding out different things on different forums I finally have a system that actually works the way I want it to. But getting all that information was not easy. So here is the sum of my collected wisdom brought together into one place for easier reference. It's also here so I can refer back to it in the future to work out how I put this system together.
The first thing you need to do is make sure that your hardware is up to the task. I don't mean just having enough memory, etc., but that your CPU is capable of VT-d for PCIe passthrough. I had to replace my CPU (an i7-3770K, which I got for the overclocking abilities) with one that had VT-d (the i7-3770, the non-K version, so no overclocking). Newer CPUs generally all have VT-d (or the AMD equivalent), so this is now less of an issue, but just check to make sure.
Next is your motherboard. Not all motherboards are equal, and how the PCIe slots are arranged can play a big part in how you design your PCIe passthrough. Also understanding what can and can't be done helps big time. Get the manual for your motherboard: it will detail how the PCIe slots are mapped. For example you may well have two x16 slots for GPUs. But these will actually be a single x16 slot, and if you insert a second GPU half the lanes of the x16 will be rerouted to the other slot, making it a pair of x8. But these are still effectively the same slot and both GPUs will be tied together in the same IOMMU group (more on those later). Or, like my motherboard, you may have a number of x1 slots plus a x4 slot, but the x4 slot is the same physical lanes as the x1 slots, so you can use either the x1 slots or the x4 slot, but not both.
Needless to say the BIOS (EFI) has to be configured to enable VT-x and VT-d (or the equivalent for AMD). Also you must ensure that your computer boots from the GPU that you want to use for the host OS.
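As a quick sanity check from a running Linux system, the sketch below looks for the CPU's virtualization flag. Note the limits: it only covers VT-x (vmx) or AMD-V (svm), not VT-d, which still has to be confirmed from the CPU's spec sheet and the BIOS settings. cpu_virt_flag is a made-up helper name.

```shell
# Report which hardware virtualization flag the CPU advertises.
# $1 is the cpuinfo file to read (defaults to /proc/cpuinfo).
cpu_virt_flag() {
    cpuinfo="${1:-/proc/cpuinfo}"
    if grep -qw vmx "$cpuinfo" 2>/dev/null; then
        echo "vmx"      # Intel VT-x
    elif grep -qw svm "$cpuinfo" 2>/dev/null; then
        echo "svm"      # AMD-V
    else
        echo "none"     # not present, or disabled in the BIOS
    fi
}

cpu_virt_flag
```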
The host OS
Setting up the host OS is pretty straightforward. You don't need anything fancy. I use Ubuntu, but any Linux flavour will do fine. libvirt and virt-manager are pretty much all you need. The main thing is setting up the kernel to run IOMMU. That means adding some parameters to the kernel command line, which on Ubuntu is set in /etc/default/grub. For Intel chips the parameter is intel_iommu=on.
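A minimal sketch of that line in /etc/default/grub, going by the kernel command line that shows up in the dmesg output further down (which ends in "ro intel_iommu=on"). Which GRUB variable was actually used is an assumption; if your file already has values in it, keep them and append the parameter.

```shell
# /etc/default/grub (excerpt)
# Assumption: the parameter goes in GRUB_CMDLINE_LINUX_DEFAULT.
# Keep any values already present and append intel_iommu=on to them.
GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on"
```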
Then rebuild grub and reboot:
    $ sudo update-grub
    $ sudo reboot
Now you should have IOMMU running, and it's time to investigate your groups. First grep dmesg to see what IOMMU groups are set up:
    $ sudo dmesg | grep iommu
    [ 0.000000] Command line: BOOT_IMAGE=/BOOT/ubuntu_w24h3v@/vmlinuz-5.11.0-18-lowlatency root=ZFS=rpool/ROOT/ubuntu_w24h3v ro intel_iommu=on
    [ 0.061022] Kernel command line: BOOT_IMAGE=/BOOT/ubuntu_w24h3v@/vmlinuz-5.11.0-18-lowlatency root=ZFS=rpool/ROOT/ubuntu_w24h3v ro intel_iommu=on
    [ 0.206371] iommu: Default domain type: Translated
    [ 0.537171] pci 0000:00:00.0: Adding to iommu group 0
    [ 0.537187] pci 0000:00:01.0: Adding to iommu group 1
    [ 0.537198] pci 0000:00:02.0: Adding to iommu group 2
    [ 0.537207] pci 0000:00:14.0: Adding to iommu group 3
    [ 0.537219] pci 0000:00:16.0: Adding to iommu group 4
    [ 0.537227] pci 0000:00:1a.0: Adding to iommu group 5
    [ 0.537237] pci 0000:00:1b.0: Adding to iommu group 6
    [ 0.537246] pci 0000:00:1c.0: Adding to iommu group 7
    [ 0.537255] pci 0000:00:1c.1: Adding to iommu group 8
    [ 0.537264] pci 0000:00:1c.3: Adding to iommu group 9
    [ 0.537274] pci 0000:00:1c.4: Adding to iommu group 10
    [ 0.537287] pci 0000:00:1c.5: Adding to iommu group 11
    [ 0.537296] pci 0000:00:1c.6: Adding to iommu group 12
    [ 0.537305] pci 0000:00:1c.7: Adding to iommu group 13
    [ 0.537314] pci 0000:00:1d.0: Adding to iommu group 14
    [ 0.537332] pci 0000:00:1f.0: Adding to iommu group 15
    [ 0.537342] pci 0000:00:1f.2: Adding to iommu group 15
    [ 0.537351] pci 0000:00:1f.3: Adding to iommu group 15
    [ 0.537356] pci 0000:01:00.0: Adding to iommu group 1
    [ 0.537360] pci 0000:01:00.1: Adding to iommu group 1
    [ 0.537370] pci 0000:03:00.0: Adding to iommu group 16
    [ 0.537374] pci 0000:04:04.0: Adding to iommu group 16
    [ 0.537390] pci 0000:05:00.0: Adding to iommu group 17
    [ 0.537400] pci 0000:05:00.1: Adding to iommu group 17
    [ 0.537409] pci 0000:06:00.0: Adding to iommu group 18
    [ 0.537413] pci 0000:07:00.0: Adding to iommu group 11
    [ 0.537423] pci 0000:09:00.0: Adding to iommu group 19
    [ 0.537432] pci 0000:0a:00.0: Adding to iommu group 20
    [ 0.901962] intel_iommu=on
Great. That's all working. Now we can see what arrangements we can make from these groups.
lspci will show your PCIe devices, and you can match the PCI bus addresses with the groups. Here's my pci device list:
    $ lspci
    00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor DRAM Controller (rev 09)
    00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor PCI Express Root Port (rev 09)
    00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor Graphics Controller (rev 09)
    00:14.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller (rev 04)
    00:16.0 Communication controller: Intel Corporation 7 Series/C216 Chipset Family MEI Controller #1 (rev 04)
    00:1a.0 USB controller: Intel Corporation 7 Series/C216 Chipset Family USB Enhanced Host Controller #2 (rev 04)
    00:1b.0 Audio device: Intel Corporation 7 Series/C216 Chipset Family High Definition Audio Controller (rev 04)
    00:1c.0 PCI bridge: Intel Corporation 7 Series/C216 Chipset Family PCI Express Root Port 1 (rev c4)
    00:1c.1 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 2 (rev c4)
    00:1c.3 PCI bridge: Intel Corporation 7 Series/C216 Chipset Family PCI Express Root Port 4 (rev c4)
    00:1c.4 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 5 (rev c4)
    00:1c.5 PCI bridge: Intel Corporation 82801 PCI Bridge (rev c4)
    00:1c.6 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 7 (rev c4)
    00:1c.7 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 8 (rev c4)
    00:1d.0 USB controller: Intel Corporation 7 Series/C216 Chipset Family USB Enhanced Host Controller #1 (rev 04)
    00:1f.0 ISA bridge: Intel Corporation Z77 Express Chipset LPC Controller (rev 04)
    00:1f.2 SATA controller: Intel Corporation 7 Series/C210 Series Chipset Family 6-port SATA Controller [AHCI mode] (rev 04)
    00:1f.3 SMBus: Intel Corporation 7 Series/C216 Chipset Family SMBus Controller (rev 04)
    01:00.0 VGA compatible controller: NVIDIA Corporation GM204 [GeForce GTX 970] (rev a1)
    01:00.1 Audio device: NVIDIA Corporation GM204 High Definition Audio Controller (rev a1)
    03:00.0 PCI bridge: PLX Technology, Inc. PEX8112 x1 Lane PCI Express-to-PCI Bridge (rev aa)
    04:04.0 Multimedia audio controller: C-Media Electronics Inc CMI8788 [Oxygen HD Audio]
    05:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Caicos XT [Radeon HD 7470/8470 / R5 235/310 OEM]
    05:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Caicos HDMI Audio [Radeon HD 6450 / 7450/8450/8490 OEM / R5 230/235/235X OEM]
    06:00.0 USB controller: VIA Technologies, Inc. VL80x xHCI USB 3.0 Controller (rev 03)
    07:00.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 30)
    09:00.0 Ethernet controller: Qualcomm Atheros AR8151 v2.0 Gigabit Ethernet (rev c0)
    0a:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9172 SATA 6Gb/s Controller (rev 11)
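Matching addresses to groups by eye gets tedious. The kernel also exposes the grouping under /sys/kernel/iommu_groups, so a small loop can print it directly. This is only a sketch, and list_iommu_groups is a made-up helper name; the optional argument exists purely so it can be pointed at a different directory.

```shell
# Print "group <n>: <pci address>" for every device in every IOMMU group.
# $1 is the directory to scan (defaults to /sys/kernel/iommu_groups).
list_iommu_groups() {
    root="${1:-/sys/kernel/iommu_groups}"
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue          # glob matched nothing: IOMMU is off
        group="${dev%/devices/*}"          # strip "/devices/<address>"
        echo "group ${group##*/}: ${dev##*/}"
    done | sort -t' ' -k2,2n -k3,3         # numeric by group, then by address
}

list_iommu_groups
```

Each address in the output can be handed to lspci -s to get the device name from the list above.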
Yes, I have three graphics cards there. The internal Intel graphics, an nVidia GTX 970 and a rather old and slow AMD Radeon HD7470. The arrangement I desire is to have one graphics card for the host (preferably the slowest), the other slow graphics for a low-priority guest OS, and the main nVidia graphics card for the main guest OS. The AMD graphics card I have connected through a PCIe x1 to x16 adapter cable (often sold for bitcoin mining) so that I could use my x1 slots for my nice sound card, so that is really slow. It would be nice to use that for the host OS. However there is a problem with that: I can't set that graphics card to be the "boot" graphics card. I can only select between the internal Intel GPU and "the PCIe" graphics, with no way of selecting which PCIe graphics. So I am limited to using the Intel graphics for the host.
That leaves the nVidia and the AMD for guests. These are on PCIe addresses 01:00.x and 05:00.x. Matching those to the IOMMU groups above we see they are in groups 1 and 17 respectively. IOMMU groups pass through to the guest OS in their entirety, so everything in group 1 will be passed through along with the nVidia graphics card. 00:01.0 is also in group 1, but that is just the root port for the nVidia's set of lanes, so that's great. For group 17 there's nothing else in the same group. That's fantastic. So our graphics cards are arranged perfectly for optimal use.
But we're not done yet. Now we need to remove those graphics cards from the host OS and make them available for use by the guests. This means more kernel parameters. You need to find the vendor and device ID numbers (VID and PID) for the devices, which you can do with lspci -n. For many graphics cards you have two entries (.0 and .1), one for the GPU and one for the HDMI audio channel. We need them both (they're in the same IOMMU group after all):
    $ lspci -n | grep 01:00.
    01:00.0 0300: 10de:13c2 (rev a1)
    01:00.1 0403: 10de:0fbb (rev a1)
    $ lspci -n | grep 05:00.
    05:00.0 0300: 1002:6778
    05:00.1 0403: 1002:aa98
Now we need to add those to the kernel boot parameters to assign them to the "virtual" kernel driver. This will stop them being used by the OS and leave them free to be used by the guests:
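Which parameter to use depends on the stub driver. As a sketch only, and assuming the driver is vfio-pci (a common choice for KVM passthrough, but an assumption here), its ids option takes a comma-separated list of vendor:device pairs. The helper below, with the made-up name vfio_ids_param, builds that string from lspci -n output.

```shell
# Read `lspci -n` lines on stdin and print a vfio-pci.ids= kernel parameter.
# Assumes the third whitespace-separated field is the vendor:device pair,
# as in the lspci -n output shown above.
vfio_ids_param() {
    ids=$(awk '{ print $3 }' | paste -sd, -)
    echo "vfio-pci.ids=$ids"
}

# Typical use, feeding it the four functions found above:
#   lspci -n | grep -e '^01:00\.' -e '^05:00\.' | vfio_ids_param
```

The resulting string would go on the kernel command line next to intel_iommu=on. Depending on the distribution, vfio-pci may also need to be loaded early (for example from the initramfs) for this to take effect before the regular GPU drivers claim the cards.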
Rebuild grub and reboot once more and voila, those video cards will now not be seen by the OS.
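One way to confirm that the host really has let go of the cards is to look at which kernel driver each function is bound to in sysfs. The sketch below does that; pci_driver is a made-up helper name, and seeing vfio-pci in the output assumes that is the stub driver in use.

```shell
# Print the kernel driver currently bound to a PCI device, or "none".
# $1 = full PCI address (e.g. 0000:01:00.0), $2 = sysfs root (defaults to /sys).
pci_driver() {
    link="${2:-/sys}/bus/pci/devices/$1/driver"
    if [ -L "$link" ]; then
        basename "$(readlink "$link")"   # the symlink target's last component is the driver name
    else
        echo "none"                      # no driver bound, or no such device
    fi
}

# The four functions of the two guest GPUs from the listing above.
for addr in 0000:01:00.0 0000:01:00.1 0000:05:00.0 0000:05:00.1; do
    echo "$addr: $(pci_driver "$addr")"
done
```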
So far it's all been pretty much run-of-the-mill. Nothing out of the ordinary. But now it's time to get down and dirty with installing guest operating systems in virtual machines. This is where most of the gathered information from many other sources is important.
First, something that no one told me, and I had to find out for myself:
- Guest operating systems must be EFI, not BIOS
Well, that's not strictly true: there is a way of doing GPU passthrough with BIOS guests, but it's messy and can cause other problems. So just don't. In virt-manager, when you go to install your fresh OS into a fresh VM, make sure you select the "Customize configuration before install" option. Then:
- In "Overview":
- Make sure chipset is Q35
- Make sure firmware is UEFI x86_64
- In storage settings:
- Make sure disks are SATA not VirtIO
- Delete the following hardware:
- Display Spice
- Sound ich9
- Channel qemu-ga
- Channel spice
- Video QXL
You may need to remove some items before it will allow you to remove others.
Now you are ready to add your PCI hardware. Step one is simple enough: select Add Hardware, go to PCI Host Device, and select your GPU (and its associated audio if it has one). You should also consider routing one of your PCIe USB host devices through as well. My system has three, which is convenient: one for the host and one for each of the guests I'll be running at once (note: you can share them between guests as long as they don't run at the same time - I have multiple variants of the same guest with different USB hosts attached).
Now we need more magic. But first we need to do one brief boot of the guest. We don't need to install the OS yet, just boot it. So go ahead and make as if you were going to install the OS. Then once it starts booting you can just forcibly turn it off.
What this does is parse the configuration and sort it out, and allocate internal PCIe bus numbers to the devices you have passed through. Now you can go back into the configuration and we can do some manual tweaking.
- Re-number the incorrectly numbered PCIe devices
By default libvirt will place the VGA card and its associated audio device on two separate guest buses with consecutive numbers. This is wrong. You need to change it so they are both on the same bus but with different function numbers.
For example my nVidia is (on the host) on 01:00.0 and 01:00.1, but libvirt puts them, for the guest, on 04:00.0 and 05:00.0. So select the audio device in the configuration and edit the XML settings (note: you will have to enable this functionality in virt-manager's preferences). Reduce the bus= entry by 1, so it matches the bus entry of the VGA device, and set function= to 1. Then hit apply.
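To make the edit concrete, here is a sketch that applies the same change to a single guest-side address line with sed. retarget_audio_address is a made-up helper, and the sample line simply follows libvirt's usual address syntax with the bus numbers from the example above. Only feed it the guest-side address of the audio function, not the host address inside the source element.

```shell
# Rewrite a libvirt PCI <address .../> line so it sits on bus $1, function 0x1.
# Reads XML on stdin; only the bus= and function= attributes are touched,
# and both are written back with single quotes.
retarget_audio_address() {
    sed -E \
        -e "s/bus=['\"]0x[0-9a-fA-F]+['\"]/bus='$1'/" \
        -e "s/function=['\"]0x[0-9a-fA-F]+['\"]/function='0x1'/"
}

echo "<address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>" \
    | retarget_audio_address 0x04
# prints: <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x1'/>
```

libvirt's documentation also describes a multifunction='on' attribute for the function-0 address of a slot that carries several functions. The steps above don't mention it, so treat it as something to check if libvirt rejects the edited configuration.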
- For nVidia you need to fool the drivers
If you want to use the nVidia proprietary drivers in Linux you need to fool them into thinking they're running on an nVidia licensed system. nVidia don't like you running on a VM, since they provide a licensed (paid for) "Virtual" GPU system that they would much rather you paid them lots of money to use. Fortunately, though, it's easy to tell the drivers that they should work.
- Select the "Overview" section
- Edit the XML
- Change the first line (normally <domain type="kvm">) to:
<domain xmlns:qemu="http://libvirt.org/schemas/domain/qemu/1.0" type="kvm">
- Scroll to the bottom and find the closing </devices> tag.
If there is no <qemu:commandline> block after it, then create one. If there is, then just insert the content into it. You should end up with something like:
    .... snip ...
      </devices>
      <qemu:commandline>
        <qemu:arg value="-cpu"/>
        <qemu:arg value="host,hv_time,kvm=off,hv_vendor_id=null"/>
      </qemu:commandline>
    </domain>
And hit apply.
Now you can go ahead and actually install your OS. But first you will need to go into Boot Options and set it to boot from your DVD installation media, since libvirt only automatically does that on the first boot.
You may notice we have deleted the sound device, so you won't get any sound out of the system. We did this because the normal sound implementation relies on you having the virt-manager window to the guest open, which is far from ideal. The ideal solution would be to pass your sound card itself through to the guest and have that handle everything. Unfortunately that's not possible with my card: it doesn't support the "reset" mechanisms required for passthrough. So:
Consider using PulseAudio's networking ability. On the guest, edit /etc/pulse/default.pa and add, at the end:
    load-module module-native-protocol-tcp auth-ip-acl=127.0.0.1;192.168.0.0/16
    load-module module-zeroconf-discover
You may need to install those modules, and you want to make sure that the IP range suits your network setup.
Similarly, on the host, add:
    load-module module-native-protocol-tcp auth-ip-acl=127.0.0.1;192.168.0.0/16
    load-module module-zeroconf-publish
Now you should have your host audio device available in the guest.
These get a little more tricky. I plumb mine directly into pulseaudio. On the host, in /etc/pulse/default.pa, add (at the end):
load-module module-native-protocol-unix auth-anonymous=1 socket=/tmp/pulse-socket
This will create an anonymous endpoint to your pulseaudio daemon in /tmp. You can then create an audio device in the configuration for your guest that connects to it. Edit the XML in overview. If you haven't already, change the top line to be:
<domain xmlns:qemu="http://libvirt.org/schemas/domain/qemu/1.0" type="kvm">
And at the bottom, just as for the nVidia tweaks, add the following to the <qemu:commandline> block (which you can create, if it's not there, directly after </devices>):
    <qemu:arg value="-device"/>
    <qemu:arg value="ich9-intel-hda,bus=pcie.0,addr=0x1b"/>
    <qemu:arg value="-device"/>
    <qemu:arg value="hda-micro,audiodev=hda"/>
    <qemu:arg value="-audiodev"/>
    <qemu:arg value="pa,id=hda,server=unix:/tmp/pulse-socket"/>
- Set up multiple copies of the same VM with different hardware arrangements. Share the same HD image(s) between them. Choose which variant to boot depending on how you want to work. For instance, I have:
- Ubuntu, nVidia GFX, Main Keyboard
- Ubuntu, nVidia GFX, Secondary Keyboard
- Ubuntu, AMD GFX, Main Keyboard
- Ubuntu, AMD GFX, Secondary Keyboard
- Ubuntu, Full GFX, Main Keyboard
- "Barrier" is a FOSS fork of Synergy and is fantastic for sharing your mouse and keyboard across your multiple running VMs, and even your host, for a seamless working environment.
- Set your memory in the VM to be the most you will want, then turn down the current allocation. Most guests will allow you to change the allocated memory on the fly, but only up to the maximum set in the configuration.
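The current allocation can also be changed from the host's command line with virsh, which comes with libvirt. Its setmem command takes the size in KiB by default, so the sketch below just does the conversion and prints the command rather than running it. gib_to_kib is a made-up helper and "ubuntu-nvidia" is a hypothetical domain name.

```shell
# Convert a whole number of GiB to the KiB value that virsh setmem expects by default.
gib_to_kib() {
    echo $(( $1 * 1024 * 1024 ))
}

# Print (rather than run) the command that would set a running guest to 8 GiB.
# "ubuntu-nvidia" is a hypothetical domain name; substitute your own.
echo "virsh setmem ubuntu-nvidia $(gib_to_kib 8) --live"
```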