We are spinning up a new startup. It’s about time to go live. For this startup, we are using Rackspace and their cloud offerings. We already do quite a bit of business with them (up to $1K per month), so it was the easy choice.
I’m prepping for an internal final review of the first MVP (Minimum Viable Product). Among many tasks, I decide to prune all of our test data and then empty a few history tables we don’t need data for anymore. In the middle of a stack of “delete from X” queries, the database goes offline. Connection lost and I’m unable to reconnect via command line?!
Okay. Let’s log in to the control panel and see what it says:
Unexpected. Okay, let’s dial up Fanatical Support and see what is up…
It would be rude to post the transcript, so to cut a 30-minute chat short: the word from Rackspace is that the database became blocked because “you are almost out of disk space.”
Notice the screenshot above: it shows 1.9 GB used of 2 GB. A cloud database with a critical failure mode when disk space is close to, but not completely, used up is un-freakin-believable. Even if disk space were completely used up, it’s an arbitrary limit created for pricing; the database shouldn’t go down when we hit it.

It’s especially ridiculous that the failure happened while I was deleting records (not inserting!) to clear up disk space. (Likely because MySQL needed some additional space for referential integrity checks or something.)

Finally, this requires manual intervention from support personnel to remedy. My two options were to increase the disk space of the cloud database, or to give support the queries I needed to run to finish clearing up disk space. I’d be happy to resize disk space if it were temporary. Insanely, however, you cannot reduce the disk space of a cloud server after increasing it.
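For what it’s worth, here is how I’d prune test data next time to sidestep the problem. This is a sketch with hypothetical table names, and it assumes no foreign keys reference the truncated tables: TRUNCATE drops and recreates the table rather than deleting row by row, so it doesn’t balloon transaction/undo space the way a huge DELETE can.

```sql
-- Hypothetical history tables: TRUNCATE deallocates the pages outright
-- instead of logging every deleted row the way DELETE does.
TRUNCATE TABLE order_history;
TRUNCATE TABLE audit_log;

-- When only some rows should go, batch the DELETE so each transaction
-- (and the scratch space it needs) stays small.
DELETE FROM sessions WHERE created_at < '2012-01-01' LIMIT 10000;
```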
Even though we are still in testing, think forward to go-live and beyond. Database disk usage creeps to within 5% of the limit, and then the database goes dark. Complete and critical failure of the app. No warning.
Would you trust your 3 AMs to this? Would you trust your business to this? I’m appalled that Rackspace would bring this to market.
We have another startup on Heroku. What happens when you hit a pricing limit with Heroku databases? You get emailed for up to two weeks before they start denying inserts. Even then, the database is still responding to read requests.
Credit where credit is due. The support rep that was talking with me for 30 minutes was great. Even emailed me afterwards when I callously dismissed his last attempt at help. I don’t get that from Heroku.
Dammit, this one catches me every time. I set up new cloud server instances so infrequently that it never hits muscle memory. Open up that damn firewall port 80!
Add the following line to /etc/sysconfig/iptables:
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 80 -j ACCEPT
Save and restart iptables.
service iptables restart
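After the restart, it’s worth confirming the rule actually took. A quick sketch (the chain name assumes the stock CentOS RH-Firewall-1-INPUT setup, and the IP is a placeholder for your server):

```shell
# List the chain and confirm the port-80 ACCEPT sits above the final REJECT
iptables -L RH-Firewall-1-INPUT -n --line-numbers | grep -E 'dpt:80|REJECT'

# Or test from another machine (replace with your server's address)
curl -I http://203.0.113.10/
```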
I’m not sure if it is CentOS or Rackspace that is locking down the instances. I do know that I don’t have to do this for my local CentOS machines.
Setting up Solr was a bit different from the others. Solr’s distribution is multifaceted, containing multiple server examples, client code, as well as a simple production-ready server. So let me start off with a few steps I took in setting up Solr.
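As a baseline, this is roughly how to get the bundled example server running before touching a real deployment. A sketch; the paths assume the Solr 1.x tarball layout with the Jetty-based example:

```shell
# Unpack the distribution and start the bundled Jetty example server
tar xzf apache-solr-1.4.1.tgz
cd apache-solr-1.4.1/example
java -jar start.jar

# In another shell, confirm it's up; the admin UI lives under /solr/admin
curl 'http://localhost:8983/solr/admin/ping'
```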
As I was working on my SMF scripts for the migration from Linux, I found some fancy ways to trace what was happening with a service that failed to start with new configuration.
There is a command called truss which allows you to follow along as something is being executed. See: man truss.
What I found really cool is that you can modify your existing SMF script from the command line. No need to make changes to xml and then reimport.
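For example, repointing a service’s start method in place looks something like this. A sketch: the FMRI and script path are placeholders from my Red5 setup, not anything SMF ships with.

```shell
# Change the start method without editing the XML manifest and reimporting
svccfg -s svc:/network/red5:default \
  setprop start/exec = astring: '"/opt/red5/red5.sh start"'

# Tell the restarter to pick up the new property, then bounce the service
svcadm refresh svc:/network/red5:default
svcadm restart svc:/network/red5:default
```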
So back to the tip, if a service isn’t starting and you have eliminated all of the other possibilities with configuration, you need to see what the service script is doing. In my case, I was having a hell of a time with Red5. So here is what you can do:
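The general shape of it looks like this. A sketch; the service name, log path, and script path are from my Red5 setup and will differ for you:

```shell
# Ask SMF why it thinks the service is broken, and check its log
svcs -x red5
tail /var/svc/log/network-red5:default.log

# Run the start method by hand under truss, following child processes
# too (-f), and capture every system call to a file. The trace shows
# exactly where the script dies.
truss -f -o /tmp/red5.truss /opt/red5/red5.sh start
grep -i err /tmp/red5.truss | tail
```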
Red5 is the open source answer to Adobe’s Flash Streaming Server. I use it not for streaming, but for recording audio and video from a user’s web browser. Yeah, that is a really cool feature if you have the need for it.
It turns out that the Red5 team actually created an SMF configuration. However, you have to download the source to get to it. I was halfway through my own before I saw the change entry on the Red5 commit mailing list. However, it used some funky paths. I took this and fixed it for my Joyent-specific installation.
For my Tomcat SMF setup, I have two files. The first is the SMF configuration; the second is the script used to start and restart Tomcat. (My stop is simply a kill. Unfortunately, there are a few threads that aren’t shutting down properly.)
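Stripped down, the start/restart script is nothing fancy. A sketch with placeholder paths (CATALINA_HOME and JAVA_HOME are assumptions); the SMF manifest wires its start/exec method to this script:

```shell
#!/sbin/sh
# Hypothetical SMF method script for Tomcat; adjust paths to taste.
CATALINA_HOME=/opt/tomcat
JAVA_HOME=/usr/java
export JAVA_HOME

case "$1" in
  start)
    $CATALINA_HOME/bin/startup.sh
    ;;
  restart)
    # Stop is just a kill, since a few threads don't shut down cleanly
    pkill -f org.apache.catalina.startup.Bootstrap
    sleep 5
    $CATALINA_HOME/bin/startup.sh
    ;;
  *)
    echo "Usage: $0 {start|restart}"
    exit 1
    ;;
esac
exit 0
```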
I was having a horrendous time logging into one particular Linux server. It would take anywhere from 20 seconds to a minute to let me log in.
Turns out, there are two sshd config settings you should pay attention to:
VerifyReverseMapping and GssAuthentication
The latter only really helps if you are connecting from a Mac (like me). VerifyReverseMapping will tell the server to look up the host name for client IPs, and if the IP you’re SSH’ing from doesn’t have a reverse DNS entry, this will result in a DNS timeout. (The source of most of the delay in logging in.)
So open up your sshd config file: /etc/ssh/sshd_config
Remove VerifyReverseMapping, if it exists. The default value for it is no. If you don’t see it, this isn’t the source of your problem.
Remove GssAuthentication, if it exists. It also defaults to no.
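Putting it together, a sketch (note that on newer OpenSSH releases the equivalent knobs are spelled UseDNS and GSSAPIAuthentication):

```shell
# In /etc/ssh/sshd_config, make sure neither option is enabled.
# Both default to "no" when absent, so deleting the lines also works:
#   VerifyReverseMapping no     (UseDNS no, on newer OpenSSH)
#   GSSAPIAuthentication no

# Validate the config before restarting, then restart sshd
sshd -t && service sshd restart
```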
Following on from my post on switching from Linux to OpenSolaris, there are quite a few things I am in love with on OpenSolaris.
Since my first real world exposure to OpenSolaris is the process of migrating web applications away from Linux, the first thing I really had to get familiar with is the Service Management Facility framework. SMF for short. SMF is the Solaris equivalent of UNIX/Linux init.d, Apple’s launchd, or Windows services.
I’m coming from a decade-long Linux background, so let me compare and contrast init.d and SMF.
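To give a flavor of the difference before diving in, the everyday commands map roughly like this (a sketch, using "tomcat" as a stand-in service name):

```shell
# Linux (SysV init): per-service scripts under /etc/init.d, run directly
/etc/init.d/tomcat start
/etc/init.d/tomcat status

# OpenSolaris (SMF): one interface for every service, with supervision
svcadm enable tomcat    # start now and on every boot
svcs tomcat             # current state: online/offline/maintenance
svcs -x                 # explain anything that isn't healthy
svcadm clear tomcat     # clear maintenance state after fixing the cause
```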
It was 3 in the morning. I just got one of the pages (well, now text messages) that you dread. Site’s down. Shit. This is the 4th unexpected “Availability Event” in the past 6 months. What is it now?
I fire up the grid control panel on my laptop. Well, try to. The browser is just spinning. Grrr. Let’s see if I can SSH into the grid controller. Nope, just spinning there too. What the heck man? All the websites and applications are down. You’ve got to be kidding me!
I fire off a quick helpdesk ticket and then call the emergency line for our guys. Leave a voicemail. Get a callback in 5 minutes from them. Hmm, they are awake already?
John Doe: “Mike, just calling you back.”
Me: “What’s the story?”
John Doe: “Yeah, well you see, the Data Center is doing network maintenance. And we just got the email 20 minutes ago, ourselves.”
John Doe: “Yeah, that’s what we said. There is nothing we can do but wait it out.”
You know, this one wasn’t their fault. But it was the last straw. It is time to cut our losses and run. RUN!