How to handle a (temporary or permanent) failure of a master server in XenServer

http://flickr.com/photos/kevincollins/74279815/

OK. So you logged in and found one of the servers down, but you don't see the virtual machines that were on that host anywhere in the resource pool. It's as if the entire system just freaked out and the server, along with the VMs running on it, disappeared without a trace.

You've verified your shared storage is okay, and thus you're stable enough to begin remediation.

Now you have a problem, but it's easy to fix. Begin by connecting to the remaining server as root with ssh.

The following commands will get you back to a known good state. First, make sure your master server is good and dead: unreachable and unlikely to come back without intervention of some sort. In my case it was a kernel panic, and a reboot and fsck -y / fixed the issue; this procedure works fine in that scenario.
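
If you want something slightly more rigorous than eyeballing it, here is a minimal sketch of an "is the old master really unreachable?" check. The function name master_is_unreachable and the PING_CMD override are my own inventions for illustration, not part of XenServer, and the -W flag assumes the Linux iputils ping. Keep in mind that a host that doesn't answer ping is not necessarily dead (a firewall or a network partition looks the same), so treat this as one piece of evidence, not proof.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) when the host answers none of the probes.
# PING_CMD can be overridden so the logic can be exercised without a network.
master_is_unreachable() {
    host="$1"
    # -c 3: send three probes; -W 2: wait up to two seconds for each reply.
    if ${PING_CMD:-ping} -c 3 -W 2 "$host" >/dev/null 2>&1; then
        return 1    # the host replied, so it is still alive
    fi
    return 0        # no reply at all
}
```

You would call it as master_is_unreachable old-master-hostname and only carry on with the recovery when it succeeds.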

Last login: Fri Mar 18 16:42:03 2011 from (hostname redacted)
Type "xsconsole" for access to the management console.
[root@xenserver2 ~]# xe pool-emergency-transition-to-master
Host agent will restart and transition to master in 10 seconds...
[root@xenserver2 ~]# xe pool-recover-slaves
[root@xenserver2 ~]#

You are now at a point where the cluster is usefully reporting a master server, and any remaining slaves are pointing at it.
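
If you'd rather script the two commands from the transcript, something like the following sketch would do. The two xe commands are the ones shown above; the run and promote_to_master helpers, the DRY_RUN switch and the 30 second SETTLE_SECONDS pause are assumptions of mine (the transcript only tells us the host agent restarts after 10 seconds, so the pause is a guess with some margin).

```shell
#!/bin/sh
# run: execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Hypothetical wrapper around the two recovery commands from the transcript.
promote_to_master() {
    # Promote this surviving host; stop here if the command fails.
    run xe pool-emergency-transition-to-master || return 1
    # Give the restarting host agent time to come back (assumed margin).
    run sleep "${SETTLE_SECONDS:-30}"
    # Point the remaining slaves at the new master.
    run xe pool-recover-slaves
}
```

Running it once with DRY_RUN=1 set prints the three steps without touching the pool, which is a cheap way to check you are on the right host before doing it for real.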

Now for the finishing touch: marking the powered-down VMs as dead, so you can restart them on other servers. Note that if you're wrong about this, and a VM you reset is in fact still running somewhere, really awful inconsistent things will happen. You have been warned.

[root@xenserver2 ~]# xe vm-reset-powerstate --force --multiple
operation failed on ae820fb7-8416-68ee-35c8-c37REDACTEDf18: The operation could not be performed because a domain still exists for the specified VM.
vm: ae820fb7-8416-68ee-35c8-c37REDACTEDdf18 (REDACTED)
domid: 51
operation failed on aa2bc87f-9ef4-dd8e-492e-bREDACTED4: The operation could not be performed because a domain still exists for the specified VM.
vm: aa2bc87f-9ef4-dd8e-492e-b4REDACTE14 (REDACTED)
domid: 54
operation failed on e4f76313-2be1-0d4e-1e78-78e1REDACTED75: The operation could not be performed because a domain still exists for the specified VM.
vm: e4f76313-2be1-0d4e-1e78-78e1REDACTED5 (REDACTED)
domid: 2
operation failed on 557c9030-b995-fbaa-ab0b-61REDACTED4df: The operation could not be performed because a domain still exists for the specified VM.
vm: 557c9030-b995-fbaa-ab0b-61e4abb074df (REDACTED)
domid: 21
[root@xenserver2 ~]#

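This marked every unreachable VM as "powered off", so each of them can now be restarted. You'll see it mercifully decided not to hurt my VMs that were confirmed to be operational: those are the "operation failed" lines above, where a domain still exists. You may want to be more cautious and limit the reset to the failed host (host-id=your-uuid-here; check the xe reference for the exact selector your version accepts) rather than relying on --multiple alone, but that's up to you.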
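
The most cautious variant I can think of is to reset VMs one at a time, by UUID, after looking at each command first. Here is a minimal sketch under that assumption: print_reset_commands is a hypothetical helper of mine that only prints the commands, and the UUIDs you feed it are whatever xe vm-list reports for the VMs you have confirmed are down.

```shell
#!/bin/sh
# Hypothetical helper: print one reset command per VM UUID given as an
# argument, so each can be reviewed before anything is actually executed.
print_reset_commands() {
    for vm_uuid in "$@"; do
        echo "xe vm-reset-powerstate uuid=$vm_uuid --force"
    done
}
```

Call it as print_reset_commands first-uuid second-uuid, read the output, and paste the lines you are happy with back into the shell. Nothing gets reset until you do.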

More information available here: http://docs.vmd.citrix.com/XenServer/4.0.1/reference/ch02s06.html
