Dot Star Question Mark and Other Mishaps
I’m in the process of rolling out a new feature for Raveal. Sometime during development of it, I noticed that suddenly a large portion of the portfolio slowed way down. In fact, pulling up the edit screen took about 3 seconds on my dev computer! That’s not quite right.
This new feature adds projects to the portfolio. In the context of Raveal, a project is a free form “document” that allows you to create just about anything to describe work you have done. It could be straight up text. Or it could be text with diagrams and illustrations. Or it could be images and video with short paragraphs explaining each. It truly is so powerful, I worry that people will have a hard time figuring out what they want to use it for!
Part of the feature set is to allow file attachments embedded into the document. I do this using a filter metaphor. Before the document is saved or viewed, it is run through a series of filters. Each filter can scan the content for a special tag and then do some sort of processing on it. Not revolutionary, but certainly extensible and powerful.
This project in my test case had about 4000 characters of latin text. That’s when I noticed the slowdown. My first fear was that what I was doing was simply t0o cpu intensive and I would have to cache the results. Adding a cache here wasn’t a big deal. But bearing the brunt of 3 seconds of real compute time, even once, was hard to swallow. So I started digging.
I pulled out my trusty tools:
long start = System.currentTimeMillis(); // code ... System.out.println("Process took: " + (System.currentTimeMillis() - start) + " MS"); // (err actually logger.debug)
And went to work. (While I wish I had access to some of those fancy profiling tools, what small dev shop can justify the cost?) I immediately started with the filtering system as there was heavy use of regular expressions there. Sure enough, I quickly saw that each filter was taking about 500 MS to run! There are 6 currently. That was 99.9999% of my performance problem.
I’ve been using regexp for quite some time, dating back to my Perl days. It’s one of those tools that you just can’t live without. I immediately threw suspicion towards some new code I added to this project. A ReplaceCallback. You can actually find an early version of it here:
It basically takes a regular expression and, for each match, it is passed to a function you provide. Quite handy! But it was unproven code in my eyes. Sure enough, I commented out the calls to this new fangled code and processing time dropped to a few MS. I found the culprit. (Much quicker then I thought I would, I was feeling quite proud of myself.)
So I dug deep into it, littering the code with timings and log statements. I was going to optimize the shit out of this thing. But after getting some timings on it, none of the ReplaceCallback code was signifincant! What the heck, man?
I took a step up the call stack and started looking at the actual regular expressions I was using in the ReplaceCallback. They looked like:
/.*?(\[tagName *[^\]]*?\]).*?/ /.*?(\[tagName( *[^\]]*?)\](.*?)\[/tagName\]).*?/
These patterns basically look for old UBB style tags [tag attr=”value”] and [tag]content[/tag]. Nothing spectacular there. But then I noticed something.
WHAT THE FUCK IS DOT STAR QUESTION MARK DOING SURROUNDING MY TAGS?
If your are not familiar with regular expressions, let me decipher this for you. Dot (.) means match any character. Star (*) means zero or more times. And question mark (?) means be greedy and keep going as long as the pattern still matches. Put it all together, it means try and match any character, zero or more times, and get as many as you can.
Having it at the beginning of the pattern means, after finding a match, move left for every single character you come across to match that. Also having it at the end of the pattern means then again match on every character you come across to the right. Translation: That sucks! Luckily, a simple modification and recompile and… yep. Things are the way the should be, a few MS and we are done.
I have no explanation of how .*? ended up in there. Only that I’m 30 years old, married, have a one year old, and a 4 week old. Programmers don’t loose their edge as they get older. There are just more distractions. :p