It was 3 in the morning. I’d just gotten one of those pages (well, now text messages) that you dread. Site’s down. Shit. This is the 4th unexpected “Availability Event” in the past 6 months. What is it now?

I fire up the grid control panel on my laptop. Well, try to. The browser is just spinning. Grrr. Let’s see if I can SSH into the grid controller. Nope, just spinning there too. What the heck man? All the websites and applications are down. You’ve got to be kidding me!

I fire off a quick helpdesk ticket and then call our guys’ emergency line. Leave a voicemail. Get a callback from them in 5 minutes. Hmm, they’re awake already?

John Doe:  “Mike, just calling you back.”
Me: “What’s the story?”
John Doe: “Yeah, well you see, the Data Center is doing network maintenance.  And we just got the email 20 minutes ago, ourselves.”
Me: “?!??!?!!”
John Doe: “Yeah, that’s what we said.  There is nothing we can do but wait it out.”

You know, this one wasn’t their fault.  But it was the last straw.  It is time to cut our losses and run.  RUN!

In the past 12 months, for just one particular app, we had more downtime than in the entire 7 years before. During one “Availability Event,” our grid support team recommended we shut down the application for 48 hours to rebuild a corrupt filesystem. Yeah, all sorts of “are you fucking kidding me” statements come to mind. Needless to say, I didn’t put up with that and found another way around it. Even so, there were still 3 hours of downtime. Like, what the fuck, is this a dream? I can’t believe how bad things got.

If I or my company were this bad at something, I would close the fucking doors and take up selling tomatoes by the highway.

Don’t get me wrong. The crew over there includes some of the smartest people I have ever met (though never in person). I drank the Kool-Aid for quite some time. It was probably why we stayed with them for as long as we did. We even set up a consulting/implementation practice around the tech. We were working toward partner status with them. They had the makings to rule the “cloud infrastructure” world. Truly a unique way to think about operating systems and infrastructure. But something was missing. Maybe they grew too fast? Alas. I do hope they pull through in the future. We just won’t be along for the ride.

We have three major applications, plus DNS, email, our own websites, and our client websites all hosted there: 23 different “cloud applications” in total. It’s a relatively small “grid”: 9 servers with dual quad-core processors, 72 GB of RAM, and 18 TB of storage.

And now I have to migrate all of that away.

I’m about 3 weeks into it.  That’s the real reason for this post.  All of our apps (except for one) are moving to Joyent.  These guys have a great offering, especially when compared with Amazon and Google Apps.  For the first time in almost a decade, I am in a strange new world.

OpenSolaris.

And I love it.

I’ve decided to post about my experiences in moving from my home of 12+ years in Linux to OpenSolaris. It’s a slow process, and there are a lot of places to trip up. I hope I can document them here to help others. I’m sure OpenSolaris will continue to gain popularity as people discover how smartly it does things.

Stay tuned…  Coming up, I have some SMF scripts for Tomcat, SOLR, and Red5 to post.

Discussion

  1. mike says:

    Not only did the datacenter not inform all of their clients, but what datacenter in this day and age is not N+1!?

    Like if you have to do network maintenance, no big deal right? Just switch everyone over to the other network. You have two connections to the backbone, right? RIGHT??
