Diagnosing the virtual data center cloud pt 1.

This post is the first in an ongoing series on techniques for diagnosing and resolving problems in a data center cloud built on the excellent Xen Cloud Platform (XCP) package. Many thanks to Dave Scott of Citrix for lending his expertise to smoothing out the little bumps in our cloud operation.

In this first example, we had taken the number 1 cloud down during the maintenance window for a major upgrade of the NFS storage servers. All storage for virtuals is provided by NFS servers over a dedicated storage network, and this shared storage is what allows virtuals to be live migrated between hosts in the cloud. On completion of the work, we restarted the NFS servers and verified they were running properly, then began restoring the cloud, starting with the pool manager. The pool manager restarted fine, all the hosts started and joined the pool, and everything looked normal until we started restoring virtuals.

Virtuals could be started on hosts 1 and 3, live migrated between 1 and 3, and live migrated from host 2’s console. However, when we tried a live migrate, host 2 didn’t even show up as an option to migrate to. Uh oh.

The first useful command in this instance is xe diagnostic-vm-status. This command shows whether a particular virtual can run on each host in the cloud and, if not, why not.
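As a rough sketch of how we'd run this check, the function below looks up a virtual's UUID by name and feeds it to xe diagnostic-vm-status. The function name vm_boot_report and the example virtual name are made up for illustration, and it assumes the name-label is unique in the pool.

```shell
# Hypothetical helper: report where a virtual can run, looked up by name-label.
# Assumes the name-label matches exactly one virtual in the pool.
vm_boot_report() {
    vm_name="$1"
    # --minimal prints only the matching UUID(s), comma separated
    vm_uuid=$(xe vm-list name-label="$vm_name" params=uuid --minimal)
    if [ -z "$vm_uuid" ]; then
        echo "no virtual named '$vm_name'" >&2
        return 1
    fi
    xe diagnostic-vm-status uuid="$vm_uuid"
}

# Example (virtual name is made up):
#   vm_boot_report web01
```

Run it from any host in the pool and read through the per-host results for the host that is refusing the migration.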

Second, check the output of xe host-list params=all and confirm that the enabled and host-metrics-live fields are both set to true for every host. This verifies that the host is joined to the resource pool properly.
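The params=all output is long, so here is a minimal sketch that prints just those two fields for each host. The function name host_health is our own invention, and the sketch assumes host-metrics-live can be read with xe host-param-get the same way it appears in the host-list output.

```shell
# Hypothetical helper: one line per host with the two fields that matter here.
host_health() {
    # --minimal returns the host UUIDs as a comma-separated list
    for host_uuid in $(xe host-list params=uuid --minimal | tr ',' ' '); do
        name=$(xe host-param-get uuid="$host_uuid" param-name=name-label)
        enabled=$(xe host-param-get uuid="$host_uuid" param-name=enabled)
        live=$(xe host-param-get uuid="$host_uuid" param-name=host-metrics-live)
        echo "$name enabled=$enabled host-metrics-live=$live"
    done
}

# Example:
#   host_health
```

Any host showing false in either column is worth a closer look before moving on to storage.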

Third, use xe pbd-list to verify that the storage repository is connected to the host. It was here that the problem came into focus. The NFS storage repository was showing attached to hosts 1 and 2 but not attached to host 3. Using the xe pbd-plug command, we attached the NFS storage to host 3 and it worked perfectly; we could then migrate virtuals to host 3.
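For illustration, the storage check and the fix can be sketched as two small functions. The names unattached_pbds and plug_unattached_pbds are hypothetical, and the sketch assumes pbd-list accepts a currently-attached=false filter; if it doesn't on your version, plain xe pbd-list shows the same field for each PBD.

```shell
# Hypothetical helper: list PBDs that are not currently attached, one UUID per line.
unattached_pbds() {
    xe pbd-list currently-attached=false params=uuid --minimal | tr ',' '\n'
}

# Hypothetical helper: plug every unattached PBD, reporting each one as it goes.
plug_unattached_pbds() {
    for pbd_uuid in $(unattached_pbds); do
        echo "plugging PBD $pbd_uuid"
        xe pbd-plug uuid="$pbd_uuid"
    done
}

# Example: look first, then plug.
#   unattached_pbds
#   plug_unattached_pbds
```

We'd suggest running unattached_pbds on its own first and checking which host and storage repository each PBD belongs to before plugging anything, since this sketch makes no attempt to limit itself to the NFS repository.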

Subsequent test reboots of host 3 never produced another failure to attach the NFS storage, so I have no idea why it happened that one time. Verifying the storage attachment is now a prominent item on the cloud host restart checklist.

And now you know how to get all the information you need to fix a host that XCP won’t allow you to migrate virtuals to!

Vern, SwiftWater Telecom

