Data Center Disaster – Lesson Learned

Well, this post is going to have some technical information, but it’s more of a document on “what happened and what should be done”. Our company has multiple “data center design and support” projects, and three days ago there was a disaster in one of them. The customer site consists of:

Last week:

We had a problem in this data center: one of the storage controllers restarted on its own and, because of this, the whole “virtual infrastructure” lost its connectivity to the storage.

“VMware Virtual Center” (yes, they are on VI, not vSphere) showed some VMs as still powered on, but they had no CPU or memory activity at all. When I browsed the datastores to check some files, I noticed there were no folders in any datastore; they were completely empty.
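(If you want to check the same thing from the ESX service console instead of the VI Client, something like this does the job; the datastore name below is just an example:)

# List all VMFS datastores visible to this host
ls /vmfs/volumes/
# Show used/free space on each VMFS volume
vdf -h
# Look for the VM folders inside a datastore (name is an example)
ls /vmfs/volumes/datastore1/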

So after searching for this problem and not finding anything similar, we decided to power everything off and then start it all again, as we were already in the worst situation and it couldn’t get worse than that (at least that’s what we thought, which was wrong). So we connected to the “Oracle Databases” and shut them down, as well as all the other physical servers, and since there was no way to interact with the “VMs”, we shut down the blade servers physically.

After about two minutes we powered on the storage (HP EVA) and waited for it to come up completely, then moved to the servers, first the DL servers and then the BL servers.

After watching the ESX servers’ boot process, it was time to power on the most critical parts, “Active Directory and DNS” and “Virtual Center”.

I connected to all the ESX servers using the KVM to find these two VMs with “vmware-vim-cmd vmsvc/getallvms” (the “vmware-cmd -l” command can be used as well); after finding them, I used “vmware-vim-cmd vmsvc/power.on VMID#” to start them. Once DNS came up correctly, and “Virtual Center” as well, we started monitoring for problems and powering up all the other VMs.
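For anyone who needs to do the same, this is roughly how it looks on the ESX 3.5 service console (the VM ID below is just an example):

# List all registered VMs and note the Vmid of AD/DNS and Virtual Center
vmware-vim-cmd vmsvc/getallvms
# Check the power state of a VM before starting it (48 is an example ID)
vmware-vim-cmd vmsvc/power.getstate 48
# Power the VM on
vmware-vim-cmd vmsvc/power.on 48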

We wrote a letter to our customer and informed them about their “Storage Controller Problem” so they could buy a new one soon, before other problems happen.

That’s everything that happened in last week’s disaster; now let’s look at the disaster from three days ago, which was worse than this one ;).

Three days ago, the latest disaster:

I was out of the data center for a meeting when one of our support team members called me and told me about a problem powering on some VMs.

He told me that one VM was down (for some reason) and he wanted to power it on but couldn’t, as he was getting a “not enough licenses” error. I told him that was rare, almost impossible, in our situation, as we have two more licenses than servers in case of expansion; so I asked him to double-check, power off one VM which is for our internal tests, and try to power the failed one on, but still the same error.

I asked him to check the VMs’ health (CPU and memory usage), as I was worried about last week’s disaster, and YES, it was the same (at least from the storage perspective): no data, no VMs, no folders, nothing on the datastores on the SAN. So he asked if we should go with the same last action we took before and shut everything down; “Sadly, I think yes”.

I called my boss and our customer’s contact person right away to inform them of what was happening, and during the conversation with my boss I asked to go to our office for this problem.

Anyway, time passed and I went to the data center to fix the problem. By then (20:50) it was only me and my friend, the one who had called me in the first place. After checking the state, we started shutting everything down: first the blade servers, then the Oracle DBs, the DL servers and finally the storage itself.

It was easy, as we had done it just last week. We powered on the storage and then moved to powering on the ESX servers, and after a while all of them were up and running fine. I connected to them to power on those two important VMs (AD/DNS and Virtual Center) so we could start our work faster and easier, but … I got the following error, which cost us 2~3 hours of extra work.

Sorry for the image quality; it was taken with my iPhone from the KVM … 😉

Powering on VM:
(vmodl.fault.NotEnoughLicenses) {
   dynamicType = <unset>,
   msg = "There are not enough licenses installed to perform the operation."
}

I searched for this error and came across these KB articles on the VMware site:

KB 1005265, KB 1003623 and KB 1005153

All of them point to licensing issues, so I tried to connect to each host using the “VMware Infrastructure Client (VI Client)” from the NOC to check the state, but there was no way to connect to these hosts. I tried to ping them, but again no luck, and no SSH either, so I went back to the data center itself to check whether the hosts could see each other and the gateway; everything worked fine from inside the hosts!
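For reference, these are the kinds of checks we ran from each host’s service console (all the IPs here are placeholders):

# Ping another ESX host and the default gateway
ping -c 3 192.168.1.11
ping -c 3 192.168.1.1
# Show the service console interfaces and the default route
esxcfg-vswif -l
esxcfg-route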

Guess what? The firewalls had blocked all connections between the different zones because of the DNS absence; we needed DNS in order to make any connection in the first place. Because of this, I installed the DNS role on one of our spare servers and added the records we needed most, like the “ESX hosts, Virtual Center, Firewalls, …”.
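On a Windows spare server this is just a few dnscmd commands; here is a minimal sketch, assuming a zone named domain.com and placeholder names/IPs:

rem Create a standard primary zone and add the records we needed most
dnscmd /zoneadd domain.com /primary /file domain.com.dns
dnscmd /recordadd domain.com esx01 A 192.168.1.11
dnscmd /recordadd domain.com esx02 A 192.168.1.12
dnscmd /recordadd domain.com vc A 192.168.1.20
dnscmd /recordadd domain.com fw01 A 192.168.1.1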

After adding this new DNS server to the firewall configurations, all connections were back to the way they should be, but we still had the license problem, and “Virtual Center” itself was installed in a VM inside this very environment.

You can see the fear of “virtualizing everything” here, but most of the time (if not always) there is a way to solve problems like this, like now…! 😉

I installed the “License Server for ESX 3.5” on a physical server, which was one of our test servers. After the installation completed, I added the license file we had for this environment and changed the “Virtual Center” record in the new DNS so that “vc.domain.com” points to the IP address of the new license server; I did this to make the hosts look to this server as their license server.
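The record swap itself looks like this (with 192.168.1.30 standing in for the new license server’s IP):

rem Point vc.domain.com at the new license server instead of the VC VM
dnscmd /recorddelete domain.com vc A /f
dnscmd /recordadd domain.com vc A 192.168.1.30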

I rebooted the first host and waited a little; it came up successfully with enough licenses, and luckily one of those two important VMs, AD/DNS, was registered on this host.

I powered on this VM and waited for the second host; it came up successfully as well, but without the license file. This host had “Virtual Center” on it, so I removed this VM from its inventory and registered it on the healthy host ;). I powered on “Virtual Center” and waited a while so all of its services could be up and running, then used RDP to connect to this VM.
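Moving a VM between hosts by hand comes down to two service console commands (the VM ID and the .vmx path below are placeholders):

# On the unlicensed host: remove the VC VM from the inventory
vmware-vim-cmd vmsvc/unregister 64
# On the healthy host: register it again from the shared VMFS datastore
vmware-vim-cmd solo/registervm /vmfs/volumes/datastore1/vc/vc.vmx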

I opened the Services console to check the VMware services and a shocking view was waiting there: not a single service had started automatically. I started them one by one and they all did what they should, except one … the License Service …!
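From a command prompt the same thing looks roughly like this; the display names below are what I remember from VirtualCenter 2.x, so double-check them on your own installation:

rem Check the state of the VirtualCenter service (service name: vpxd)
sc query vpxd
rem Start the services by display name
net start "VMware VirtualCenter Server"
net start "VMware License Server"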

I tried to start this service multiple times, but no luck, so I finally backed up my license file and uninstalled this component (yes, the license server can be installed and uninstalled separately), rebooted the VM and connected to it after it had come up completely. Now it was time to install the license server again, this time in its original place, so I installed the new license server, which is about 9.0 MB, and added the license file.

Now let’s check what would happen if a server rebooted while seeing a healthy “Virtual Center” and “License Server”: I rebooted all the hosts except the one running these VMs.

All of them came up correctly with proper licenses and we were able to power on all the VMs.

This time we had the same problem as before (a storage controller failure), but it was followed by another problem which made it different and even worse: the “License Service” had not started since the previous time, and that caused a lot of extra trouble this time.

Lessons Learned:

  1. It’s good and wise to have one DNS server outside of the “Virtual Environment”, especially when there are firewalls on the network which use DNS records for their policies.
  2. After any failure or disaster, always, always check the critical services, even if you are sure they will start automatically (there is a small check for this after the list).
  3. It’s good to have one license server up, or at least ready, outside of the “Virtual Environment”, if you are using “VMware Virtual Infrastructure”.
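For number 2, a one-liner that saves time on Windows boxes: it lists every service that is set to start automatically but is not actually running (works from Server 2003 onwards):

rem List auto-start services that are not running
wmic service where "StartMode='Auto' and State!='Running'" get Name,State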

I’d appreciate it if you leave a reply or some feedback.