Thursday, June 2, 2011

Every word counts

You can't have a Bible website without a search function. Well you can, and it would be great for people who have the whole text of scripture memorized, and know where to find things, but the rest of us would just give up and go somewhere else.

MySQL offers a very advanced search algorithm, that allows you to do all kind of fancy things, like ignore very common words (stopwords), short words and a number of other things. The problem though, is that it doesn't only allow you to do fancy things. Unless you're on a dedicated server where you can edit the settings any way you want to, MySQL forces you to do all of those fancy things, whether you want to or not.

Now if you're building the next Google, than all of those functions are somewhat useful. But if you're building a Bible, every word counts. Which is why "SELECT * FROM Bible WHERE MATCH Text AGAINST 'search string'" just isn't going to cut it.

So is there way to disable the fanciness on a shared web host, and just do a plain old search? Not really. Most forums I've seen seem to suggest dropping the match function altogether, and use the LIKE function, like this:
 SELECT * FROM Bible WHERE Text LIKE '%search%' OR Text LIKE '%string%'
 And it works. The only problem is it matches partial words as well. Which is fine if that's what you intended it to do, but if your users put a short word like 'a' or 'I' in there, they're going to get a lot of matches. (Every word that contains the letter 'a' or 'i'.)

So for the Expositor's Bible I want a checkbox that sets whether or not to allow partial matches. Now what? Add a space to each side of the word? Close, but no. It won't match words at the beginning or end of the string, or words followed by a comma, period, or any other non-word character except for space.

So do we add an OR clause for every possible variation? That's one long query, and not a very pretty one either. But what we're looking for is starting to sound an awful lot like regular expressions. And lo and behold, MySQL indeed has a REGEXP operator. *cue calm, pleasant music* Some developers fear and hate them, but I for one love them.

Now if you have no idea what a regular expression is, that is beyond my scope, but the above link is a good starting point to see what they can do. I'll concentrate on what they can do specifically for what I want here and now, which is to match the exact words, and not produce partial matches. Thankfully the MySQL implementation even has a specific set of markers predefined for that exact purpose: [[:<:]] and [[:>:]]

So what does our query end up looking like? Not bad at all, really. Very clean and readable, and fast enough at least for my use. Drumroll please...
SELECT * FROM Bible WHERE Text REGEXP '[[:<:]]search[[:>:]]' OR Text REGEXP '[[:<:]]string[[:>:]]'
That's all, folks. Now go build excellent search functions.

1 comment:

  1. It's all Hebrew to me (I know Greek decently, so I can't say it's "all Greek to me" ;-). :-D

    Godspeed to you on this project, brother.

    ReplyDelete

Related Posts Plugin for WordPress, Blogger...