
September 16, 2017

QEMU Virtual Machine PCIe Device Passthrough Using vfio-pci


When running a QEMU virtual machine (VM) on a Linux-based operating system, you can dedicate PCIe devices to the VM using the vfio-pci kernel module so that the VM controls them directly. There are many reasons to do this: a USB controller, so that anything plugged into that controller's USB ports is connected to the VM; a SATA controller, for booting from or controlling drives directly; a network card; a sound card; or even a graphics card for high-performance graphics acceleration inside the VM. This guide shows how to do this without blacklisting the kernel driver of the device you want to pass through. Blacklisting a kernel module prevents every device in the system from using it, which is a problem if several devices share the same module. With this method, multiple devices in the same system can use the same kernel module; you could even have two identical devices (such as two of the same GPU), keep one for the host, and pass the other into the VM.
The system I will be using for this tutorial is a Lenovo ThinkPad T420s running Debian 9.1 and Linux kernel 4.12. The same steps should apply to other machines, including ones with different hardware and Linux distributions. I currently use my desktop with an AMD FX 8350 CPU and GIGABYTE 990FXA-UD3 motherboard (also running Debian 9.1 and kernel 4.12) to run a VM with a GTX 1070 and USB controller passed through for mining Ethereum, occasionally playing games, and running the Adobe CC software suite. I have used this same method on Arch Linux, openSUSE, and Ubuntu, all with success, although I do not recommend using any Linux kernel older than 4.4.


Ensuring Your Hardware is Capable of Intel VT-d or AMD-Vi
The first step is to make sure that your hardware is capable of this type of virtualization. You need a motherboard, CPU, and BIOS that provide an IOMMU and support Intel VT-x and Intel VT-d, or AMD-V and AMD-Vi. Some motherboards use different terminology for these; for example, my 990FXA-UD3 lists AMD-V as "SVM" and AMD-Vi as "IOMMU controller". Some systems have these features but implement them poorly, or in a way that makes it impossible to use certain devices with vfio-pci. For example, many newer AMD X370 chipset motherboards place many PCIe devices in one IOMMU group, which prevents you from using any of them individually with vfio-pci (more on this later). Once you have confirmed that your hardware is capable and have enabled these options in your UEFI/BIOS, you can proceed to the next step, preparing your system.
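
As a quick sanity check from a running Linux system, the CPU half of the requirement can be read from /proc/cpuinfo. The following is a minimal sketch, not part of the original setup: the function name cpu_virt_flag is made up for illustration, and the optional file argument exists only so the function can be tried against a saved copy of cpuinfo. It only reports Intel VT-x ("vmx") or AMD-V ("svm"); it cannot tell you whether VT-d or AMD-Vi is present or enabled in the firmware.

```shell
# Print "vmx" (Intel VT-x) or "svm" (AMD-V) if the first CPU advertises it.
# Prints nothing and returns non-zero if neither flag is found.
# $1 (optional): file to read instead of /proc/cpuinfo (hypothetical test hook).
cpu_virt_flag() {
    grep '^flags' "${1:-/proc/cpuinfo}" \
        | head -n 1 \
        | tr ' \t' '\n\n' \
        | grep -xE 'vmx|svm'
}
```

Running cpu_virt_flag with no arguments inspects the current machine. An empty result usually means virtualization support is disabled in the UEFI/BIOS or is not available on the CPU.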


Preparing Your System
Next you will have to pass the correct kernel parameters to the Linux kernel to enable IOMMU.
Open the following text file in your text editor of choice:
/etc/default/grub

You should see a line with the following text:
/etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet"

Append the following options inside the quotation marks, separated by spaces: "intel_iommu=on" or "amd_iommu=on", depending on whether you have an Intel or AMD CPU. AMD users may also want to try "iommu=pt iommu=1" instead of "amd_iommu=on" if they notice increased boot times with "amd_iommu=on". Both ways will work, although I have not tested whether there are performance differences between the two.

This line should now look something like this:
/etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on"

Save the file and then run the following command as root:
update-grub

Once that has completed, reboot the system.
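
After the reboot it can be worth confirming that the option actually reached the kernel. The sketch below checks the kernel command line for the options mentioned above; the function name iommu_param_present and its optional file argument are assumptions made for illustration, not something the original setup requires.

```shell
# Succeed if the kernel command line contains one of the IOMMU options
# used in this guide (intel_iommu=on, amd_iommu=on, or iommu=pt).
# $1 (optional): file to read instead of /proc/cmdline (hypothetical test hook).
iommu_param_present() {
    tr ' ' '\n' < "${1:-/proc/cmdline}" \
        | grep -qxE '(intel|amd)_iommu=on|iommu=pt'
}
```

For example, "iommu_param_present && echo ok" prints "ok" only when one of the options is active. If it prints nothing, check that the grub file was saved and that update-grub ran without errors.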

Note that there are also more advanced kernel parameters pertaining to IOMMU that may be useful to you, but for now the ones above should be enough.
See https://www.kernel.org/doc/html/v4.12/admin-guide/kernel-parameters.html for a list of more kernel parameters.

After enabling IOMMU, you can check the IOMMU device map to ensure that there is proper separation between the devices you wish to pass into your VM and the ones you wish to remain attached to the host. There are a few ways to do this; the simplest is to run the following command:
ls /sys/kernel/iommu_groups/*/devices

The output should look something like this:
...
/sys/kernel/iommu_groups/11/devices:
0000:00:1f.0  0000:00:1f.2  0000:00:1f.3
...
/sys/kernel/iommu_groups/14/devices:
0000:0d:00.0
...

You can see that group 11 has three devices in it, which means they are not isolated from each other. If you wanted to pass one of these devices through to your VM, you could not just pick one; you would have to pass through every device in the IOMMU group.
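
On systems with many groups, the raw directory listing gets hard to read. As a rough sketch (the function name list_iommu_groups and its optional root argument are invented here for illustration), the same information can be printed as one "group, address" pair per line, sorted by group number:

```shell
# Print "<group><TAB><PCI address>" for every device, sorted by group number.
# $1 (optional): directory to scan instead of /sys/kernel/iommu_groups
#                (hypothetical test hook).
list_iommu_groups() {
    root="${1:-/sys/kernel/iommu_groups}"
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue      # glob did not match: no groups found
        group="${dev#"$root"/}"        # strip the leading directory
        group="${group%%/*}"           # keep only the group number
        printf '%s\t%s\n' "$group" "${dev##*/}"
    done | sort -n
}
```

Any group number that appears on more than one line contains devices that can only be passed through together.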

To see which devices these PCI addresses correspond to, you can run another command:
lspci

The output should look something like this:
...

00:1f.0 ISA bridge: Intel Corporation QM67 Express Chipset Family LPC Controller (rev 04)

00:1f.2 SATA controller: Intel Corporation 6 Series/C200 Series Chipset Family 6 port SATA AHCI Controller (rev 04)

00:1f.3 SMBus: Intel Corporation 6 Series/C200 Series Chipset Family SMBus Controller (rev 04)

...

0d:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller (rev 04)


Thankfully, the following three devices, which are all in IOMMU group 11, do not need to be used by the guest:
...

00:1f.0 ISA bridge: Intel Corporation QM67 Express Chipset Family LPC Controller (rev 04)

00:1f.2 SATA controller: Intel Corporation 6 Series/C200 Series Chipset Family 6 port SATA AHCI Controller (rev 04)

00:1f.3 SMBus: Intel Corporation 6 Series/C200 Series Chipset Family SMBus Controller (rev 04)

...

You need to run the lspci command again, but this time you will need more information, specifically the device IDs of the devices you wish to pass through. Run the following commands to create a text file and write the output of the lspci command to it; you will need it later. Replace "/home/user/lspci-output.txt" with the path of the file in which you wish to save the output.
touch /home/user/lspci-output.txt


lspci -vnn > /home/user/lspci-output.txt

Then open the file with your text editor of choice or run the following command to print it to your terminal, again replacing "/home/user/lspci-output.txt" with the path of the file in which you saved the output.
cat /home/user/lspci-output.txt

The output should look like this. Note the device IDs of the devices you wish to pass through; each one is a pair of four-character strings separated by a colon. In this example it is "1033:0194" on the first line:
/home/user/lspci-output.txt
...

0d:00.0 0c03: 1033:0194 (rev 04) (prog-if 30 [XHCI])

 Subsystem: 17aa:21d2

 Flags: bus master, fast devsel, latency 0, IRQ 16

 Memory at f0c00000 (64-bit, non-prefetchable) [size=8K]

 Capabilities: <access denied>

 Kernel driver in use: xhci_hcd

 Kernel modules: xhci_pci
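
If you would rather not pick the ID out of the lspci output by eye, the same two values are exposed in sysfs as the "vendor" and "device" files of each PCI device. The helper below is a small sketch; the name pci_device_id and the optional second argument are assumptions for illustration only.

```shell
# Print the "vendor:device" ID for a PCI address such as 0000:0d:00.0.
# $1: PCI address
# $2 (optional): directory to use instead of /sys/bus/pci/devices
#                (hypothetical test hook).
pci_device_id() {
    dir="${2:-/sys/bus/pci/devices}/$1"
    vendor=$(cat "$dir/vendor") || return 1   # file contains e.g. 0x1033
    device=$(cat "$dir/device") || return 1   # file contains e.g. 0x0194
    printf '%s:%s\n' "${vendor#0x}" "${device#0x}"
}
```

For the USB controller in this example, "pci_device_id 0000:0d:00.0" should print the same "1033:0194" that appears in the lspci output.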


Create and open the following file in your text editor of choice:
/etc/modprobe.d/vfio.conf


Enter the following text, adding the device IDs of the devices you wish to pass through to your VM, separated by commas.
/etc/modprobe.d/vfio.conf
options vfio-pci ids=1033:0194

Some devices (such as many graphics cards) need to be bound to the vfio-pci driver early in the boot process. You could do this by just blacklisting the driver for that device, but that is a poor and unnecessary solution, as it prevents other devices from using the same driver. A better way is to create a module alias for the device you wish to pass through. To do this, run the following command to obtain the modalias ID for your device. Replace "0000:0d:00.0" with the PCI address of your device.
cat /sys/bus/pci/devices/0000:0d:00.0/modalias

The output should look like this:
pci:v00001033d00000194sv000017AAsd000021D2bc0Csc03i30

Add the output of the previous command to "/etc/modprobe.d/vfio.conf", with "alias" at the beginning of the line and "vfio-pci" at the end.
/etc/modprobe.d/vfio.conf
alias pci:v00001033d00000194sv000017AAsd000021D2bc0Csc03i30 vfio-pci
options vfio-pci ids=1033:0194
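
When passing through several devices, typing the alias and ID lines by hand becomes error-prone. The following sketch generates the same two kinds of lines shown above from sysfs. The function name vfio_conf_lines and the SYSFS_PCI variable are hypothetical and exist only for this illustration; review the output before writing it to the real file.

```shell
# Print vfio.conf contents for the PCI addresses given as arguments:
# one "alias ... vfio-pci" line per device, then a single "options" line.
# SYSFS_PCI (optional, hypothetical test hook): directory to read instead
# of /sys/bus/pci/devices.
vfio_conf_lines() {
    base="${SYSFS_PCI:-/sys/bus/pci/devices}"
    ids=""
    for addr in "$@"; do
        printf 'alias %s vfio-pci\n' "$(cat "$base/$addr/modalias")"
        v=$(cat "$base/$addr/vendor")
        d=$(cat "$base/$addr/device")
        ids="${ids:+$ids,}${v#0x}:${d#0x}"   # build the comma-separated ID list
    done
    printf 'options vfio-pci ids=%s\n' "$ids"
}
```

Called as "vfio_conf_lines 0000:0d:00.0", it prints the two lines from the example above. Redirecting the output to /etc/modprobe.d/vfio.conf as root overwrites whatever the file contained before.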

After making changes to the "/etc/modprobe.d/vfio.conf" file, you will need to update your initramfs for the changes to take effect. Do this by running the following command, which creates a new initramfs for each Linux kernel on the system:
update-initramfs -u -k all

After the command has finished, reboot your system before continuing to the next step.

Before you continue to writing the script that will run your VM, you should ensure that your devices have been configured correctly. Do this by attempting to unbind your devices from the kernel driver they were using and bind them to the vfio-pci kernel module. To do this, run the following three commands as root. Replace "0000:0d:00.0" with the PCI address of your device and "xhci_hcd" with the "Kernel driver in use" value for your device from the file you created earlier with the output of the lspci -vnn command.
modprobe vfio-pci

echo '0000:0d:00.0' > /sys/bus/pci/drivers/xhci_hcd/unbind

echo '0000:0d:00.0' > /sys/bus/pci/drivers/vfio-pci/bind

After running those three commands, run the lspci -vnn command again to see if the kernel driver in use has changed to vfio-pci.
lspci -vnn

The kernel driver in use should now read vfio-pci. If it does, you can continue to the next step, writing the script. If it does not, one of the previous steps was forgotten or done incorrectly; go back now and check everything.
...

0d:00.0 0c03: 1033:0194 (rev 04) (prog-if 30 [XHCI])

 Subsystem: 17aa:21d2

 Flags: bus master, fast devsel, latency 0, IRQ 16

 Memory at f0c00000 (64-bit, non-prefetchable) [size=8K]

 Capabilities: 

 Kernel driver in use: vfio-pci

 Kernel modules: xhci_pci
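
The same check can be done without scanning the lspci output, because each bound PCI device has a "driver" symlink in sysfs that points at its current driver. This is only a sketch; the function name pci_driver_in_use and the optional second argument are assumptions made for illustration.

```shell
# Print the name of the kernel driver currently bound to a PCI device.
# Returns non-zero if no driver is bound.
# $1: PCI address, e.g. 0000:0d:00.0
# $2 (optional): directory to use instead of /sys/bus/pci/devices
#                (hypothetical test hook).
pci_driver_in_use() {
    link="${2:-/sys/bus/pci/devices}/$1/driver"
    [ -L "$link" ] || return 1          # no symlink: nothing is bound
    basename "$(readlink "$link")"
}
```

After the three commands above, "pci_driver_in_use 0000:0d:00.0" should print vfio-pci.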

Writing The Script
Now you are ready to start creating the script to start the VM. Create a file somewhere on the system that you wish to use as the start script for your VM and then run the following command to make it executable. Replace /home/user/startvm.sh with the path to your start script.
chmod +x /home/user/startvm.sh

Below is an example script that I have made; copy it and paste it into your VM start script. You will need to modify this script for it to work on your particular configuration. Start by changing and/or adding the PCI addresses and device IDs to match the devices you wish to attach to your VM. You may also need to change things like the networking configuration, VM display configuration, and boot drive paths. Read through the comments I have made in the script; they explain what each part of the script does.
/home/user/startvm.sh
#!/bin/bash

#-------------------------------------------------------------------------

# set a counter to increment each time the loop runs
try_counter=1


# load required kernel modules and get required devices ready for the VM
function initvm {

# load vfio-pci kernel module
modprobe vfio-pci

# unbind 0000:0d:00.0 from xhci_hcd kernel module
echo '0000:0d:00.0' > /sys/bus/pci/drivers/xhci_hcd/unbind

# bind 0000:0d:00.0 to vfio-pci kernel module
echo '0000:0d:00.0' > /sys/bus/pci/drivers/vfio-pci/bind

}

#-------------------------------------------------------------------------

# run the qemu VM
function runvm {

qemu-system-x86_64 \
    -enable-kvm \
    -cpu host \
    -smp sockets=1,cores=2,threads=1 \
    -m 4G \
    -rtc base=localtime,clock=host \
    -device vfio-pci,host=0d:00.0 \
    -drive file=/dev/sda,format=raw \
    -net nic,model=e1000 \
    -net user \
    -usbdevice tablet \
    -vga qxl

# Use the following options for PCIe GPU pass through
#   -device vfio-pci,host=00:00.0,multifunction=on,x-vga=on \
#   -display none \
#   -vga none \

# Use the following options for networking using a TAP interface
#   -net tap,ifname=tap0,script=no,downscript=no \

# The following will use the host's first physical drive as boot device
#   -drive file=/dev/sda,format=raw \

# Use the following to improve Windows guest performance 
#   -cpu host,hv_relaxed,hv_vapic,hv_spinlocks=0x1fff \

# For VMs with an Nvidia GPU attached, you must add the following
# options to bypass the Nvidia driver's virtualization check.
#   -cpu host,kvm=off,hv_vendor_id=null \
 
# For more information about QEMU and its configuration options, please 
# visit the following web address: https://www.qemu.org/documentation/ 

}

#-------------------------------------------------------------------------

# make devices usable again to the host
function unloadvm {

# unbind 0000:0d:00.0 from vfio-pci kernel module
echo '0000:0d:00.0' > /sys/bus/pci/drivers/vfio-pci/unbind

# bind 0000:0d:00.0 to xhci_hcd kernel module
echo '0000:0d:00.0' > /sys/bus/pci/drivers/xhci_hcd/bind

}

#-------------------------------------------------------------------------

# main loop

# The main loop will attempt to start the VM up to 4 times if it exits
# with exit status 1. If the VM fails to start 4 times in a row it will
# exit the script with exit status 1.
# This is useful for VM scripts that sometimes fail on the first attempt
# but work on the second or third attempt.

while true; do
    initvm
    runvm
    status=$?
    if [ "$status" -eq "1" ]
    then
        if [ "$try_counter" = "4" ]
        then
            echo "Failed to start VM 4 times in a row, exiting..."
            unloadvm
            exit 1
        else
            echo "VM failed to start, restarting..."
            let try_counter++
            echo "Attempt number $try_counter"
            continue
        fi
    else
        echo "done"
        unloadvm
        break
    fi
done

exit 0

Please refer to the QEMU documentation at https://www.qemu.org/documentation/; it will be very useful when writing your VM script.

You should now be ready to try running your script. Don't be frightened by any errors; they are your friend. Pay attention to what they say, as they usually tell you exactly what is broken or where to look to find out. A common mistake when spreading QEMU arguments across many lines in a script is leaving a space or other character after one of the "\" characters at the end of a line. The backslash then no longer escapes the newline, so the shell ends the command there and QEMU never receives the arguments on the following lines. I would love to hear about your success, or even your failures. Please feel free to share them with me and others in the comments, or by tweeting me.
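
The trailing-whitespace mistake described above is easy to miss by eye, so a quick grep can help. This is a minimal sketch with a made-up function name, find_bad_continuations; it only catches whitespace (including stray carriage returns) after a backslash, not other stray characters.

```shell
# List lines (with line numbers) where a backslash is followed by trailing
# whitespace, which breaks the line continuation.
# Returns non-zero if no such line is found.
# $1: path to the script to check
find_bad_continuations() {
    grep -nE '\\[[:space:]]+$' "$1"
}
```

Running "find_bad_continuations /home/user/startvm.sh" prints each offending line prefixed with its line number, and prints nothing if the continuations are clean.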

Comments are encouraged, but before commenting, please read 'readme.txt' first.

