Wednesday, September 5, 2007

I went to the colo this morning and cleaned the remaining file systems. I found that they had never been cleaned since I built the system, because I hadn't configured them to clean themselves regularly. I've now set all volumes to run a filesystem check every 3 months. I replaced one of the old disks as well. I also paid for hosting through the end of 2007.

Last night I enabled a set of backup DNS servers and a set of backup mail servers in Reno. Now if the server fails, we still have DNS resolution and we control the queuing of the mail instead of depending on the sender.
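
For anyone curious how the backup mail servers come into play, here is a rough sketch in Python of how a sending server generally picks where to deliver: it tries the mail hosts in order of MX preference (lowest number first) and falls back to the next one if a host is unreachable. The hostnames, preference values, and the `is_reachable` check below are made up for illustration and are not our actual records or setup.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class MxRecord:
    preference: int  # lower values are tried first
    host: str        # hypothetical hostname, for illustration only


def delivery_order(records: Iterable[MxRecord]) -> List[str]:
    """Return mail hosts in the order a sender would try them."""
    return [r.host for r in sorted(records, key=lambda r: r.preference)]


def pick_mail_host(
    records: Iterable[MxRecord],
    is_reachable: Callable[[str], bool],
) -> Optional[str]:
    """Pick the first reachable host, or None if every host is down.

    None means the sender has to queue the message itself and retry later.
    """
    for host in delivery_order(records):
        if is_reachable(host):
            return host
    return None


# Assumed example records: one primary server and one backup.
records = [
    MxRecord(preference=20, host="backup-mx.example.net"),
    MxRecord(preference=10, host="mail.example.org"),
]

# Simulate the primary server being down.
primary_down = lambda host: host != "mail.example.org"
chosen = pick_mail_host(records, primary_down)
```

With the primary down, the sender hands the message to the backup, which holds it until the main server returns. Without a backup, the function returns `None` and the message waits in the sender's own queue, which is the situation we had during the outage.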

At this point I feel confident that the problems of the last week (other than the power failure) have been solved and we're back at the level of stability we had before (good, but not perfect).

I'm planning to migrate email to Google this week, but with time running short and me being out of town this coming weekend, it may not happen.

I went down and tried to figure out the problem. I had little luck determining what was wrong, but I did get the problem to manifest again. I then succeeded in cleaning the filesystem on the server's main volume, which had some corruption on it. I haven't seen the problem since, so hopefully that fixed it. I'm going to go back to the factory tomorrow morning and attempt to clean the remaining filesystems and replace some disks which are really old.

The server was power cycled this morning at 7:05am and came back up. All email that senders tried to deliver to our users while the server was down will be queued on the sending side, and senders will keep retrying. Senders typically retry for up to 72 hours before giving up.

Mail services started acting funny on Cementhorizon. I logged in and couldn't figure it out, so I attempted to reboot the server. The machine never came back up. Nobody had physical access to the server over the weekend, which is why it was down the entire time.

PG&E and AT&T finished getting everything fixed and our server is back online.

A semi truck drove into the power substation that serves the building our server is hosted in. This caused a loss of power to the building. It also caused electrical damage to the telecom equipment that serves the building. All sites are down.

Tuesday, September 4, 2007

The same mail problem occurred again. It appears that the cause of both problems is some hardware or disk issue. I'm going to be riding down to San Leandro to the server at around noon to try to figure out the problem.