I went to the colo this morning and cleaned the remaining file systems. I found that they'd never been cleaned since I built the system because I hadn't configured them to clean themselves regularly. I've now set all volumes to run a filesystem check every 3 months. I replaced one of the old disks as well. I also paid for hosting through the end of 2007.
Last night I enabled a set of backup DNS servers and a set of backup mail servers in Reno. Now if the server fails, we still have DNS resolution and we control the queuing of the mail instead of depending on the sender.
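For anyone curious how the backup mail servers get used, here's a minimal Python sketch of how a sending server chooses among MX records: it tries the lowest preference number first and falls back to the next one. The hostnames and preference values below are made up for illustration; they are not our actual records, and real mail servers do more than this.

```python
def delivery_order(mx_records):
    """Return MX hosts in the order a sender would try them.

    mx_records is a list of (preference, hostname) tuples; a lower
    preference number means the host is tried first.
    """
    return [host for _pref, host in sorted(mx_records)]


def first_reachable(mx_records, is_up):
    """Return the first host that is_up(host) accepts, or None."""
    for host in delivery_order(mx_records):
        if is_up(host):
            return host
    return None  # nothing reachable: the sender queues the mail and retries


# Hypothetical records: a primary server and a backup at a second site.
records = [(20, "backup-mx.example.net"), (10, "mail.example.org")]

# Simulate the primary being down.
target = first_reachable(records, lambda host: host != "mail.example.org")
print(target)
```

With the primary down, the mail goes to the backup, which holds it in a queue we control until the main server is back.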
At this point I feel confident that the problems of the last week (other than the power failure) have been solved and we're back at the level of stability we had before (good, but not perfect).
I'm planning to migrate email to Google this week, but with time running short and me being out of town this coming weekend, it may not happen.
Wednesday, September 5, 2007
I went down and tried to figure out the problem. I had little luck determining what was wrong, but I did get the problem to manifest again. I then succeeded in cleaning the filesystem on the server's main volume, which had some corruption on it. I haven't seen the problem since, so hopefully that fixed it. I'm going to go back to the factory tomorrow morning and attempt to clean the remaining filesystems and replace some disks which are really old.
The server was power cycled this morning at 7:05am and came back up. Any email that senders tried to deliver to our users while it was down is queued on the sending side, and senders will keep retrying. Typically senders retry for up to 72 hours before giving up.
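To put the retry window in concrete terms, here's a small Python sketch that works out when a sender would give up on a message. It assumes the 72-hour figure above; the actual window is set by each sending server and varies, and the date in the example is hypothetical.

```python
from datetime import datetime, timedelta


def give_up_time(first_attempt, retry_window_hours=72):
    """Return the time at which a sender stops retrying a message.

    retry_window_hours defaults to the typical 72 hours; the real
    value depends on the sending server's configuration.
    """
    return first_attempt + timedelta(hours=retry_window_hours)


# Hypothetical first failed delivery attempt at 7:05am.
first_attempt = datetime(2007, 9, 4, 7, 5)
print(give_up_time(first_attempt))
```

As long as the server is back within that window, queued mail should arrive late rather than bounce.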