Virtual Machines High Availability 4.4
OpenNebula delivers the availability required by most applications running in virtual machines. This guide explains how to prepare for failures of virtual machines or physical nodes, and how to recover from them. These failures are categorized depending on whether they come from the physical infrastructure (Host failures) or from the virtualized infrastructure (VM crashes). In both scenarios, OpenNebula provides a cost-effective failover solution to minimize downtime from server and OS failures.
If you are interested in setting up a highly available cluster for OpenNebula, check the High OpenNebula Availability Guide.
When OpenNebula detects that a host is down, a hook can be triggered to deal with the situation. OpenNebula comes with a script out-of-the-box that can act as a hook to be triggered when a host enters the ERROR state. This can be very useful to limit the downtime of a service due to a hardware failure, since it can redeploy the VMs on another host.
Let's see how to configure /etc/one/oned.conf to set up this Host hook, to be triggered in the ERROR state. The following should be uncommented in the mentioned configuration file:
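<xterm>
#-------------------------------------------------------------------------------
HOST_HOOK = [
    name      = "error",
    on        = "ERROR",
    command   = "host_error.rb",
    arguments = "$HID -r n",
    remote    = no ]
#-------------------------------------------------------------------------------
</xterm>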
We are defining a host hook, named “error”, that will execute the script 'host_error.rb' locally with the following arguments:
Argument | Description |
---|---|
Host ID | ID of the host containing the VMs to treat. It is compulsory and best left as $HID, which OpenNebula automatically replaces with the ID of the host that went down. |
Action | Defines the action to be performed on the VMs that were running in the host that went down. This can be -r (recreate) or -d (delete). |
DoSuspended | Tells the hook whether to perform Action on suspended VMs belonging to the host that went down (y), or to leave them untouched (n). |
More information on hooks here.
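To make the argument semantics concrete, the following Python sketch models how an argument string such as `$HID -r n` could be interpreted once OpenNebula has substituted the host ID. It only illustrates the table above and is not the code of host_error.rb; the names `HookArgs` and `parse_hook_args` are hypothetical.

```python
from typing import NamedTuple


class HookArgs(NamedTuple):
    host_id: int        # ID of the failed host ($HID after substitution)
    action: str         # "recreate" (-r) or "delete" (-d)
    do_suspended: bool  # True if suspended VMs are also treated (y)


# Flags documented for the Action argument.
ACTIONS = {"-r": "recreate", "-d": "delete"}


def parse_hook_args(argument_string: str) -> HookArgs:
    """Interpret '<host id> <-r|-d> <y|n>' as described in the table."""
    parts = argument_string.split()
    if len(parts) != 3:
        raise ValueError("expected: <host id> <-r|-d> <y|n>")
    host_id, action, suspended = parts
    if action not in ACTIONS:
        raise ValueError(f"unknown action {action!r}")
    if suspended not in ("y", "n"):
        raise ValueError(f"DoSuspended must be y or n, got {suspended!r}")
    return HookArgs(int(host_id), ACTIONS[action], suspended == "y")
```

With the configuration shown earlier and a failed host with ID 7, `parse_hook_args("7 -r n")` describes "recreate the VMs of host 7, leaving suspended VMs alone".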
Additionally, there is a corner case that should be taken into account in critical production environments. OpenNebula tolerates network errors up to a limit, so a spurious network error won't trigger the hook. But if the network error persists, the hook may be triggered and the VMs deleted and recreated. When (and if) the network comes back, there will be a potential clash between the old and the reincarnated VMs. To prevent this, a script can be placed in the cron of every host that detects the network error and shuts down the host completely (or deletes the VMs).
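A cron-driven watchdog of this kind could look roughly like the Python sketch below. It is an assumption-laden illustration, not a script shipped with OpenNebula: the front-end address, the counter file, the failure limit, and all function names are hypothetical, and the `ping` options assume a Linux host.

```python
import subprocess

FRONTEND = "frontend.example.com"         # hypothetical front-end address
STATE_FILE = "/var/tmp/one_net_failures"  # hypothetical counter file
LIMIT = 5                                 # consecutive failed runs tolerated


def next_state(failures: int, reachable: bool, limit: int = LIMIT):
    """Return (new_failure_count, must_fence) for one cron run."""
    if reachable:
        return 0, False
    failures += 1
    return failures, failures >= limit


def frontend_reachable(host: str = FRONTEND) -> bool:
    """Send one ICMP echo request; True if the front-end answered."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def read_failures(path: str = STATE_FILE) -> int:
    """Read the stored failure count; a missing or bad file counts as 0."""
    try:
        with open(path) as fh:
            return int(fh.read().strip() or 0)
    except (FileNotFoundError, ValueError):
        return 0


def run_once() -> None:
    """One watchdog pass, meant to be called from cron on each host."""
    failures, must_fence = next_state(read_failures(), frontend_reachable())
    # Reset the counter when fencing so the host does not power off
    # again immediately after it is brought back.
    with open(STATE_FILE, "w") as fh:
        fh.write(str(0 if must_fence else failures))
    if must_fence:
        # Power the host off so its VMs cannot clash with recreated copies.
        subprocess.run(["shutdown", "-h", "now"], check=False)
```

A cron entry would call `run_once()` at a fixed interval. The interval and `LIMIT` are placeholders; they would need to be tuned so that the host is fenced no later than the moment OpenNebula gives up on it and fires the hook.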
The Virtual Machine lifecycle management can fail in several points. The following two cases should cover them:
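The following VM hook can be enabled in /etc/one/oned.conf (restarting oned afterwards):
<xterm>
#-------------------------------------------------------------------------------
VM_HOOK = [
    name      = "on_failure_recreate",
    on        = "FAILURE",
    command   = "onevm delete --recreate",
    arguments = "$VMID" ]
#-------------------------------------------------------------------------------
</xterm>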
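As a rough illustration of what this hook amounts to, the Python sketch below expands the `$VMID` placeholder and joins it with the command, giving the command line that would effectively be run for a failed VM. The helper `hook_command_line` is hypothetical and only mimics the substitution; it is not how oned itself is implemented.

```python
import shlex
from string import Template


def hook_command_line(command: str, arguments: str, **values) -> list:
    """Expand $-placeholders in arguments and return the argv to execute."""
    expanded = Template(arguments).substitute(**values)
    return shlex.split(command) + shlex.split(expanded)


# VM 17 entered FAILURE: the VM hook corresponds to this invocation.
argv = hook_command_line("onevm delete --recreate", "$VMID", VMID=17)
print(" ".join(argv))  # onevm delete --recreate 17
```

The same helper applied to the host hook, `hook_command_line("host_error.rb", "$HID -r n", HID=3)`, yields `host_error.rb 3 -r n`.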