I’ve yet to post details of my setup to the site as planned, but regardless, the site is half powered by some virtual machines running under Proxmox and half powered by a pair of Dell PowerEdge 1850s. This post will detail some of that setup, and some of the problems I’m currently experiencing.
I currently have Proxmox running on an 8-core Opteron 6212 @ 2.6GHz with 32GB of RAM and eight 2TB hard drives. This machine will eventually serve as my main file/backup server, but for now it acts as a mirror of my old file server in case something happens to that one, and as my VM server handling various services for my internal and external networks.
This Proxmox machine runs an OpenVZ container serving primary DNS for my various domains, as well as forward and reverse lookups for my LAN clients. It also runs multiple KVM-based VMs to handle the main content serving for this site. A Zen Load Balancer KVM balances the web site across the two 1850s, alongside a backup web node (also a KVM on the Proxmox machine) for when I want to serve everything from a single machine, or in case both 1850s fail. Also running as KVMs are a database machine and an NFS machine that serves a shared docroot for the web nodes.
This should give some idea of how important this Proxmox machine is to my network, and why it’s such a huge risk to have it as a single point of failure. If the Proxmox machine goes down, so do the websites it powers, DNS for internal/external clients, my domain email, and more.
Which brings me to the current issues…
This Proxmox install was set up back in March, and had been stable as a rock up until recently. Recently, as in when I moved services back to my local residence from the data center they were once housed in. It seems no matter how much testing you do beforehand, it’s not until things go live and you cut ties to your previous service that things start to go wrong. Every few days now, it seems, I wake up, or glance over at my laptop, only to find that the Proxmox machine has kernel panicked. Imagine how wonderful it feels to know that everything may have been down for one, two, or maybe even three hours before it’s noticed. It doesn’t negate the fact that it’s happening, but monitoring and notifications are something I am in the process of setting up.
On Monday, however, things got interesting. Shortly after midnight I found the machine in a kernel panic state. Once rebooted, everything was fine. Where it got interesting is that 12 hours later, almost to the minute, at around 12:20pm, it did it again. Now, I have a sync job set up to run twice a day between file servers. Can you guess the times? Yep, midnight and noon. Thinking this was the cause, the only logical conclusion was to wait for midnight, expect another kernel panic, and hopefully catch it in the act to determine a root cause.
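For context, a twice-daily sync job like the one described would typically live in cron; this is only an illustrative fragment, and the rsync paths and host name here are hypothetical placeholders, not my actual setup:

```
# /etc/crontab fragment: mirror the file server at midnight and noon
# (paths and "oldserver" host are hypothetical examples)
0 0  * * * root rsync -a --delete /srv/files/ oldserver:/srv/files/
0 12 * * * root rsync -a --delete /srv/files/ oldserver:/srv/files/
```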
Midnight came. Midnight went. The kernel panic…never came. Very suspicious indeed. The machine currently has an uptime of 1 day 20 hours, and I am no closer to figuring out what is going on. The uptime prior to Monday’s issues was just over 12 days, and I suspect that will be the best I get out of this machine for the time being. I will just have to keep an eye on it. At this point I suspect memory, but I cannot very well take the machine down for hours to run a memtest.
For now, it’s time to write a monitoring script to ping my phone if the machine crashes…
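As a starting point, a minimal sketch of such a monitor might look like the following. The host name is a placeholder, the notify() body is a stub (an email-to-SMS gateway or a push service would go there), and alerting only after several consecutive failed pings avoids paging a phone over a single dropped packet:

```python
import subprocess
import time

def host_is_up(host, timeout=2):
    """Return True if the host answers a single ICMP echo."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def should_alert(history, threshold=3):
    """Alert only after `threshold` consecutive failed checks."""
    return len(history) >= threshold and not any(history[-threshold:])

def notify(message):
    # Stub: replace with an email-to-SMS gateway, push service,
    # or whatever can actually reach a phone.
    print(message)

def monitor(host, interval=60):
    history = []
    alerted = False
    while True:
        history.append(host_is_up(host))
        history = history[-10:]  # keep a short rolling window
        if should_alert(history) and not alerted:
            notify(f"{host} appears to be down!")
            alerted = True  # don't page repeatedly for one outage
        elif history[-1]:
            alerted = False  # host is back; re-arm the alert
        time.sleep(interval)

# Usage (blocks forever, checking once a minute):
#   monitor("proxmox.example.lan")
```

The consecutive-failure check and the re-arm flag are the important parts: one missed ping shouldn’t wake anyone up, and a machine that stays down shouldn’t generate a page every minute.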