Blog Article:

Bringing the best of NFS and LVM together in OpenNebula

Maxence Dunnewind

Feb 12, 2013

I am an engineer working for the European project BonFIRE, which:

gives researchers access to large-scale virtualised compute, storage and networking resources with the necessary control and monitoring services for detailed experimentation of their systems and applications.

In more technical terms, it provides access to a set of testbeds behind a common API, as well as a set of tools to monitor the cloud infrastructure at both the VM and hypervisor levels, enabling experimenters to diagnose cross-experiment effects on the infrastructure.

Some of these testbeds are running OpenNebula (3.6 with some patches for BonFIRE). As each testbed has its own "administration domain", the setups differ between sites. The way we use OpenNebula is:

  • we have some default images, used for almost all VMs
  • default images are not updated often
  • users can save their own images, but this does not happen often

The update of the BonFIRE software stack to the latest version was accompanied at our laboratory by an upgrade of our testbed. After some study and calculations, we bought the following hardware:

  • 1 server with 2+8 disks (RAID 1 for the system, RAID 5 over 8 SAS 10k 600GB disks), 6 cores, 48GB of RAM, and 2 network cards with 4 Gb ports each
  • 4 servers with 2 drives, 2 × 6 cores (24 threads in total), 64GB of RAM, and 2 Gb ports

We previously had 8 small worker nodes (4GB of RAM) that were configured to use LVM with a cache feature (snapshot based) to improve OpenNebula performance.

With this new hardware, the snapshot feature can't be used anymore, as it has disastrous performance when you have more than a few snapshots on the same source LV.
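
For illustration, the previous snapshot-based cache worked roughly like this (a rough sketch with hypothetical VG/LV/image names, not our actual cache scripts):

  # Cache the source image once in a local "golden" LV
  lvcreate -L 10G -n cache-debian6 vg_workers
  dd if=/srv/images/debian6.img of=/dev/vg_workers/cache-debian6 bs=1M

  # Each new VM disk is then a copy-on-write snapshot of that cached LV
  lvcreate -s -L 10G -n vm-42-disk-0 /dev/vg_workers/cache-debian6

  # Classic (non-thin) LVM snapshots degrade badly once more than a few
  # snapshots share the same origin LV, hence the "disastrous performance"
  # mentioned above.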

On the other hand, everyone reading this blog probably knows that putting dozens of VM volumes on shared (NFS) storage requires a really strong backend to achieve acceptable performance.

The idea of our setup is to make NFS and local storage work together, providing:

  • better management of image copies over the network (ssh has a huge performance impact)
  • good VM performance, as each image is copied from NFS to local storage before being used.

Before explaining the whole setup, let's look at some raw performance data (yes, I know, this is not a really relevant benchmark; it's just to give an idea of the hardware capacity).

  • The server's write performance (dd if=/dev/zero of=/dev/vg/somelv conv=fdatasync) is > 700MB/s and reads are > 1GB/s
  • The workers have 2 disks added as 2 PVs with striped LVs; performance is > 300MB/s for synchronous writes and ~400MB/s for reads (see the sketch after this list)
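
For reference, here is roughly how such numbers can be obtained and how the worker LVs are striped across the two PVs (a sketch with hypothetical device and VG names; the exact commands and sizes on our testbed may differ):

  # Worker: two disks as two PVs, LVs striped over both (-i 2)
  pvcreate /dev/sdb /dev/sdc
  vgcreate vg_workers /dev/sdb /dev/sdc
  lvcreate -i 2 -I 64 -L 20G -n bench vg_workers

  # Synchronous write test (as quoted above), then a simple read test
  dd if=/dev/zero of=/dev/vg_workers/bench bs=1M count=4096 conv=fdatasync
  dd if=/dev/vg_workers/bench of=/dev/null bs=1M count=4096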

As we use virtualisation a lot to host services, the ONE frontend is itself installed in a virtual machine. Our previous install was on CentOS 5, but to get more recent kernel/virtio drivers I installed it on CentOS 6.

To reduce network connections and improve disk performance, the VM is hosted on the disk server (it is almost the only VM there).

The default setup (ssh + LVM) didn't perform well, mostly due to the cost of ssh encryption. I then switched to netcat, which was much better (almost at the maximum Gb link speed) but has at least two drawbacks (a sketch of the netcat approach follows the list below):

  • it doesn't manage caching efficiently (no cache on the client, only FS cache on the server, so almost no cache benefit between two copies)
  • it needs a netcat listener to be set up on the worker for each copy
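
For the record, the netcat-based copy looked roughly like this (a simplified sketch, not the actual TM driver; host names, the port and paths are hypothetical, and nc option syntax varies between netcat flavours):

  # On the worker: listen once per copy, write straight into the target LV
  nc -l 5000 > /dev/vg_workers/vm-42-disk-0 &

  # On the image server: stream the image over the wire, unencrypted
  dd if=/srv/images/debian6.img bs=1M | nc worker-01 5000

  # Fast (near Gb line rate), but a listener must be set up for every single
  # copy and nothing useful is cached on the worker between two copies.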

So I finally set up NFS. To avoid extra I/O between the server, the network, and the VM, I put the NFS server on the hypervisor itself and mounted it on the OpenNebula frontend. The advantage of NFS is that it handles caching on both the client and the server pretty well (for static content at least). That way, we have a good solution to:

  • copy images when they are created on the frontend (they are just copied onto the NFS share mounted (synchronously) from the hypervisor)
  • copy images from NFS to the workers (a dd from the NFS mount to a local LV), which may benefit from the client cache when the same image is copied many times (remember, we mostly use the same set of 5/6 source images); a sketch of the NFS setup follows this list
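
A minimal sketch of the corresponding NFS setup (paths, network and mount options are hypothetical; the exact export options we use may differ):

  # On the hypervisor hosting the frontend VM: export the image repository
  # /etc/exports
  /srv/one/images  10.0.0.0/24(rw,sync,no_root_squash)

  # On the frontend and on each worker: mount it, e.g. via /etc/fstab
  hypervisor:/srv/one/images  /var/lib/one/images  nfs  defaults  0 0

  # Deploying a non-persistent VM then boils down to a local dd on the worker
  dd if=/var/lib/one/images/debian6.img of=/dev/vg_workers/vm-42-disk-0 bs=1M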

So, let's assume we have a good copy solution; the next bottleneck is the network. To avoid/limit it, I aggregated the Gb links between the NFS server and our switch (and between each worker node and the switch) to get a 4 Gbps capacity between the NFS server and the switch. Moreover, since the load-balancing algorithm decides which link carries a given transaction (a single transaction can't be spread over more than one link), I computed the IP addresses of the worker nodes so that each one uses a given, unique link. This does not mean that link won't be used for other things or for other workers, but it ensures that when you copy to the 4 nodes at the same time, the network is used optimally, with one Gb link per transaction. A possible bonding configuration is sketched below.
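
As an illustration, on CentOS 6 the aggregation can be done with the Linux bonding driver; the sketch below assumes 802.3ad (LACP) with an IP-based transmit hash, which is what makes the "computed IP addresses" trick select distinct links. The actual mode and hash policy depend on the switch configuration:

  # /etc/sysconfig/network-scripts/ifcfg-bond0 (NFS server, 4 slave ports)
  DEVICE=bond0
  IPADDR=10.0.0.10
  NETMASK=255.255.255.0
  ONBOOT=yes
  BOOTPROTO=none
  BONDING_OPTS="mode=802.3ad miimon=100 xmit_hash_policy=layer3+4"

  # Each slave interface (eth0..eth3) just points at the bond:
  #   DEVICE=ethX  MASTER=bond0  SLAVE=yes  ONBOOT=yes  BOOTPROTO=none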

Also, in order to reduce useless network transactions, I updated the TM drivers to:

  • handle the copy from NFS to LV
  • avoid ssh/scp when possible and just do a cp on the NFS share (for example when saving a VM, generating the context, etc.); a simplified clone sketch follows this list
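
To give an idea of what the modified drivers do, here is a heavily simplified sketch of the clone operation (this is not the actual BonFIRE TM driver; argument handling, error checking, the VG name and the paths are hypothetical):

  #!/bin/bash
  # Simplified clone: copy a source image from the NFS mount into a local LV.
  SRC_PATH=$1                 # e.g. /var/lib/one/images/debian6.img (on NFS)
  DST_LV=$2                   # e.g. /dev/vg_workers/vm-42-disk-0
  VG=vg_workers

  # Size of the source image, rounded up to whole megabytes for lvcreate
  SIZE_MB=$(( ( $(stat -c %s "$SRC_PATH") + 1048575 ) / 1048576 ))

  # Create the target LV on the worker, then copy straight from the NFS
  # mount: no ssh/scp/netcat in the data path, and the NFS client cache
  # helps when the same source image is deployed several times.
  lvcreate -L "${SIZE_MB}M" -n "$(basename "$DST_LV")" "$VG"
  dd if="$SRC_PATH" of="$DST_LV" bs=1M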

We first wished to make the NFS share read-only on the workers, but that would require an scp when saving a VM, which isn't optimal. So a few things are still written from the workers to NFS:

  • the deployment file (as it is sent over ssh with a 'cat > file')
  • saved images

This way, we:

  • have an efficient copy of images to the workers (no ssh tunneling)
  • may see significant improvements thanks to the NFS cache
  • don't suffer from concurrent write access to NFS, because VMs are booted from a local copy

Some quick benchmarks to finish this post:

  • From first submission (no cache, NFS share just mounted) to 100 VMs running (ssh-able): < 4 min
  • From first submission (no cache, NFS share just mounted) to 200 VMs running (ssh-able): > 8 min

When simultaneously deploying a large number of VMs, OpenNebula reveals some bottlenecks, as the monitoring of already-running VMs slows down the deployment of new ones. In more detail, when deploying many VMs, the monitoring threads may interfere with the deployment of new VMs. This is because OpenNebula enforces a single VMM task per host at a time, since in general hypervisors don't handle multiple concurrent operations robustly.

In our particular 200 VM deployment we noticed this effect: the deployment of new VMs was slowed down by the monitoring of already-running VMs. We ran the test with the default monitoring intervals but, to mitigate this issue, OpenNebula offers the possibility to adjust the monitoring and scheduling intervals, as well as to tune the number of VMs dispatched per cycle (see the configuration sketch below).
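
For reference, these are the kinds of knobs involved, shown here with the parameter names I believe OpenNebula 3.x uses (check the oned.conf and sched.conf of your own version; the values below are only indicative):

  # /etc/one/oned.conf -- monitoring intervals, in seconds
  HOST_MONITORING_INTERVAL = 600
  VM_POLLING_INTERVAL      = 600

  # /etc/one/sched.conf -- scheduling pace
  SCHED_INTERVAL = 30
  MAX_VM         = 300   # max number of VMs handled per scheduler cycle
  MAX_DISPATCH   = 30    # max number of VMs dispatched per cycle
  MAX_HOST       = 1     # max number of VMs dispatched to a single host per cycle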

Additionally, there is an ongoing effort to optimize the monitoring strategy to remove this effect, namely to: (i) move VM monitoring to the information system and thus prevent the overlap with VM control operations, and (ii) obtain the information for all running VMs in a single operation, as currently implemented when using Ganglia (see tracker 1739 for more details).

5 Comments

  1. PP

    Is there any chance that this deployment will be part of OpenNebula any time soon? Can you share the scripts required for this deployment type?

    Thanks in advance!

  2. Maxence

    Hi, sorry about the late reply. I plan to share these scripts in the next days/weeks, I just need to clean them up.
    I'll leave a comment here when the scripts are released.

    Maxence

  3. Ari Biel

    Hello Maxence,

    I am a teacher at a vocational school in Germany and am planning a cloud
    infrastructure for a students' networking lab.

    From what I have read in the documentation up to now, I understand that using
    the OpenNebula LVM drivers means that the worker nodes need either a SAN or
    clvm to mount the Logical Volumes from.

    If I have understood your article well, your setup is a shared datastore via
    NFS, with the NFS server on a central file server and the datastore NFS-mounted
    on the nodes.
    If a non-persistent VM is deployed on a node, the disk image the VM will be
    running from is copied from the NFS-mounted datastore on the node NOT to a
    file image in the system datastore but to a LOCAL (not SAN/clvm?) LVM volume on
    the node belonging to the system datastore, thus improving the performance of
    the VM compared to a file-based disk image.

    Does my understanding of your article describe your setup?

    If it does, will you be so kind as to send me your ds_mad and tm_mad drivers?

    The week after next I will start to set up our lab environment, and I would
    prefer to use images on LVM volumes for the VMs because of performance
    (my intended setup for the network lab is as follows: the VMs
    export their Spice screens to a node-local Spice client running on the
    hypervisor, which is the user interface to a VM for a student sitting in front of
    the worker node. That way we can provision a private VM for every student in
    different classes, and the performance of the VMs should not differ considerably
    from a bare metal installation).

    So as not to double the effort of setting up the lab environment (first with a file
    image based shared datastore and then a second try with a local LVM based shared
    datastore), it would be very kind of you to share your driver scripts.

    Kind regards Ari

  4. alfiechan

    Hi Maxence,
    is this script compatible with ONE 4.4?
    Would it be possible to give us a deployment example?

