Full-Text Search on iOS with FMDB
Overview
When I started working on an iPhone app to play music from the phone’s library based on GPS location, I needed a way to index the song metadata and other textual content. SQLite is built into iOS, and I wanted to use its full-text module support (FTS3/4). Apple provides no full-text search index functionality for iOS, and other options, such as Lucene, are focused on Java-based environments.
Since I was working with SQLite, I knew the best approach was to work with the excellent FMDB library, which provides an Objective-C wrapper to the SQLite C API. My effort extends that library with additional Objective-C interfaces and protocols to simplify working with the FTS3 module.
Building the Index
The full-text index is a SQLite “virtual table”. You work with it like a regular database table, but it does not support all the features of SQLite. For example, virtual tables don’t support custom indices or triggers. You can create an index using FMDB just like creating a regular database table:
FMDatabase *db = self.database;
[db executeUpdate:@"CREATE VIRTUAL TABLE my_songs USING fts4(song_name, album_name, artist_name)"];
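Once the table exists, rows are added with ordinary INSERT statements. The following is a minimal sketch, assuming the my_songs table created above and FMDB's variadic executeUpdate: method; the helper name AddSongToIndex is made up for illustration and is not part of FMDB or this extension.

```objc
#import <Foundation/Foundation.h>
#import "FMDatabase.h"

// Hypothetical helper (not part of FMDB): adds one song to the full-text index.
static BOOL AddSongToIndex(FMDatabase *db, NSString *song, NSString *album, NSString *artist) {
    // Bind the values with "?" placeholders rather than building the SQL string by hand.
    BOOL ok = [db executeUpdate:@"INSERT INTO my_songs (song_name, album_name, artist_name) VALUES (?, ?, ?)",
               song, album, artist];
    if (!ok) {
        NSLog(@"Failed to index song: %@", [db lastErrorMessage]);
    }
    return ok;
}
```

It could be called as AddSongToIndex(self.database, @"Jump", @"1984", @"Van Halen").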
Tokenizers
One of the most important aspects of building the full-text index is how the words are “tokenized”, that is, how the source text is turned into the terms that are actually indexed. SQLite’s FTS3 module implements several tokenizers. The default tokenizer treats a word as a run of ASCII letters and digits (plus any non-ASCII characters, which it passes through unchanged), separated by whitespace or ASCII punctuation; it lowercases only the ASCII letters when putting words in the index. If you are working with any kind of non-ASCII text, you will want to use a tokenizer that is Unicode-aware. SQLite includes a Unicode-aware tokenizer, but I wanted to create a mechanism to provide a tokenizer that could be customized for a particular application. Thus, the FMDB extension provides a mechanism to implement a custom tokenizer via an Objective-C protocol: FMTokenizer.
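For comparison, a built-in tokenizer is selected with the tokenize= option when the table is created. This is only a sketch: the table names and the helper name are made up, and whether unicode61 is available depends on the SQLite version and compile options on the device, so the result of executeUpdate: should be checked.

```objc
#import <Foundation/Foundation.h>
#import "FMDatabase.h"

// Hypothetical helper: creates two example tables that use SQLite's built-in tokenizers.
static void CreateTablesWithBuiltInTokenizers(FMDatabase *db) {
    // "porter" applies English stemming on top of the default tokenizer.
    if (![db executeUpdate:@"CREATE VIRTUAL TABLE songs_porter USING fts4(song_name, tokenize=porter)"]) {
        NSLog(@"porter tokenizer unavailable: %@", [db lastErrorMessage]);
    }
    // "unicode61" is SQLite's Unicode-aware tokenizer; it may be missing from older builds.
    if (![db executeUpdate:@"CREATE VIRTUAL TABLE songs_unicode USING fts4(song_name, tokenize=unicode61)"]) {
        NSLog(@"unicode61 tokenizer unavailable: %@", [db lastErrorMessage]);
    }
}
```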
There are two tokenizer implementations provided in the extension: a “simple” tokenizer and a “stop word” tokenizer. You can use one or both of these in your projects, or you can use them as an example and create your own. If you have multiple full-text virtual tables, each can have its own named tokenizer implementation.
FMSimpleTokenizer
The simple tokenizer implements a Unicode-aware tokenizer using CFStringTokenizer. This tokenizer supports Unicode characters and has an associated CFLocale which influences what it considers to be words and word boundaries. This tokenizer is a good alternative to the built-in SQLite tokenizer since you can control the locale that’s used. You can also take the code in this class and create a new tokenizer implementation that customizes how CFStringTokenizer determines what is a “word”.
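To give a feel for what CFStringTokenizer does, here is a standalone sketch that splits a string into lowercased words. It is not the code from FMSimpleTokenizer; the function name WordsInString is hypothetical, and passing NULL for the locale is assumed to let Core Foundation pick a default.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the lowercased words that CFStringTokenizer finds in the text.
static NSArray<NSString *> *WordsInString(NSString *text, CFLocaleRef locale) {
    NSMutableArray<NSString *> *words = [NSMutableArray array];
    CFStringRef cfText = (__bridge CFStringRef)text;
    CFStringTokenizerRef tokenizer =
        CFStringTokenizerCreate(kCFAllocatorDefault,
                                cfText,
                                CFRangeMake(0, CFStringGetLength(cfText)),
                                kCFStringTokenizerUnitWord,
                                locale);
    // Advance until the tokenizer reports that no tokens remain.
    while (CFStringTokenizerAdvanceToNextToken(tokenizer) != kCFStringTokenizerTokenNone) {
        CFRange range = CFStringTokenizerGetCurrentTokenRange(tokenizer);
        NSString *word = [text substringWithRange:NSMakeRange(range.location, range.length)];
        [words addObject:[word lowercaseString]];
    }
    CFRelease(tokenizer);
    return words;
}
```

A custom tokenizer would typically change the unit option, the locale, or the post-processing applied to each token range.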
FMStopWordTokenizer
The stop-word tokenizer is used when you want to exclude words from being added to the index or used in a query match. For example, an index for English will often exclude the articles “a”, “an”, and “the” from the index. These are very common but are not useful in queries. The stop-word tokenizer can be initialized using an LF-delimited text file of words (one word per line), or via an NSSet instance containing the words to exclude.
Note that the stop-word tokenizer is designed to work with another tokenizer implementation that handles the basic tokenization (e.g. FMSimpleTokenizer). This pattern of “chaining” tokenizers is useful because it keeps the individual tokenizers simple but lets them be composed in various ways.
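The idea behind the chaining can be sketched without any of the library's classes: one step produces tokens, and a second step drops the ones found in a stop-word set. The helpers LoadStopWords and FilterStopWords below are hypothetical and only illustrate the concept; they are not how FMStopWordTokenizer is implemented.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: reads an LF-delimited word list into a set of lowercased words.
static NSSet<NSString *> *LoadStopWords(NSURL *fileURL, NSError **error) {
    NSString *contents = [NSString stringWithContentsOfURL:fileURL
                                                  encoding:NSUTF8StringEncoding
                                                     error:error];
    if (contents == nil) {
        return nil;
    }
    NSMutableSet<NSString *> *words = [NSMutableSet set];
    NSCharacterSet *blank = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    for (NSString *line in [contents componentsSeparatedByString:@"\n"]) {
        NSString *word = [[line stringByTrimmingCharactersInSet:blank] lowercaseString];
        if (word.length > 0) {
            [words addObject:word];
        }
    }
    return words;
}

// Hypothetical helper: keeps only the tokens that are not stop words.
static NSArray<NSString *> *FilterStopWords(NSArray<NSString *> *tokens, NSSet<NSString *> *stopWords) {
    NSMutableArray<NSString *> *kept = [NSMutableArray array];
    for (NSString *token in tokens) {
        if (![stopWords containsObject:[token lowercaseString]]) {
            [kept addObject:token];
        }
    }
    return kept;
}
```

With a stop-word set of “a”, “an”, and “the”, filtering the tokens “the”, “final”, “countdown” keeps only “final” and “countdown”.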
Querying the Index
You can execute queries against the full-text index just like queries for regular SQLite tables. The key element of the SQL WHERE clause is the MATCH operator, which takes the table name (or a column name) on the left and the query text on the right. The following selects songs from the full-text index table we created above:
FMDatabase *db = self.database;
FMResultSet *results = [db executeQuery:@"SELECT * FROM my_songs WHERE my_songs MATCH 'Jump'"];
You can combine terms with operators such as AND, OR, and NOT, much as in regular SQL, although they must be capitalized to be recognized within a MATCH expression. Which operators are available depends on whether SQLite was built with the standard or the enhanced query syntax; see the FTS3 documentation for more details on boolean operators inside queries.
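A few common query forms are sketched below, assuming the my_songs table from earlier; the function name and the search terms are made up. The examples stick to forms that FTS3 accepts in both query syntaxes (OR, column filters, phrases, and prefixes), and the query text is bound as a parameter rather than pasted into the SQL.

```objc
#import <Foundation/Foundation.h>
#import "FMDatabase.h"

// Hypothetical helper: runs several MATCH expressions and logs the matching song names.
static void RunExampleQueries(FMDatabase *db) {
    NSArray<NSString *> *expressions = @[
        @"jump OR leap",      // either term, in any column
        @"song_name:jump",    // term restricted to a single column
        @"\"van halen\"",     // exact phrase
        @"jum*",              // prefix match
    ];
    for (NSString *expression in expressions) {
        FMResultSet *results =
            [db executeQuery:@"SELECT song_name FROM my_songs WHERE my_songs MATCH ?", expression];
        while ([results next]) {
            NSLog(@"%@ -> %@", expression, [results stringForColumnIndex:0]);
        }
    }
}
```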
There are also some special functions which can be used with full-text queries to aid in determining how the text was matched in the query. These include offsets(), snippet(), and matchinfo().
Offsets
Offsets are used when you want to know what part of the text was matched by a particular query. This can be helpful if you want to prioritize search hits from one column over another, or from one part of a piece of text. Each offset contains three values:
- the column index in the table (not the index of the column in the SELECT);
- the zero-based number of the query term that matched, so always zero unless your query has multiple terms (i.e. multiple words);
- the range (NSRange) of the match in the column’s text content.
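For reference, the raw value that SQLite's offsets() function returns is a text string of space-separated integers, four per match: column number, term number, byte offset, and size in bytes. The sketch below parses that raw string by hand to show the layout; ParseRawOffsets is a hypothetical helper, not the library's implementation. Note that the raw offsets are counted in bytes rather than characters.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: walks the raw offsets() string, four integers per match.
static void ParseRawOffsets(NSString *raw,
                            void (^block)(NSInteger column, NSInteger term,
                                          NSInteger byteOffset, NSInteger byteLength)) {
    NSArray<NSString *> *parts = [raw componentsSeparatedByString:@" "];
    // Stop before any incomplete trailing group.
    for (NSUInteger i = 0; i + 3 < parts.count; i += 4) {
        block(parts[i].integerValue,
              parts[i + 1].integerValue,
              parts[i + 2].integerValue,
              parts[i + 3].integerValue);
    }
}
```

For example, the string "0 0 0 4 1 0 10 4" describes two matches of term 0: one in column 0 at byte 0, and one in column 1 at byte 10, each 4 bytes long.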
Snippet
The snippet function can be used to format the text around a match so that you can display “hit highlighting”. This library doesn’t add any special interfaces or methods for creating snippets because it is fairly straightforward to call the function inside the SELECT statement. See the FTS3 documentation for examples on the snippet function.
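As a rough illustration, snippet() can be selected like any other column. This sketch assumes the my_songs table from earlier; the function name and the bracket markers are arbitrary choices for the example.

```objc
#import <Foundation/Foundation.h>
#import "FMDatabase.h"

// Hypothetical helper: logs a highlighted snippet for each row matching the query text.
static void LogSnippets(FMDatabase *db, NSString *queryText) {
    // Arguments to snippet(): the table, the text placed before and after each match,
    // and the text used where the surrounding content is truncated.
    FMResultSet *results =
        [db executeQuery:@"SELECT snippet(my_songs, '[', ']', '...') FROM my_songs WHERE my_songs MATCH ?",
                         queryText];
    while ([results next]) {
        NSLog(@"%@", [results stringForColumnIndex:0]);
    }
}
```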
Match Info
This library doesn’t have any special interfaces or methods for working with matchinfo() data. This would be a welcome addition!
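Until such an interface exists, the raw data can be read directly. matchinfo() returns a blob of 32-bit unsigned integers in the machine's byte order; with the default format, the first two values are the number of phrases in the query and the number of columns in the table. The sketch below reads only those two values, and the helper name LogMatchInfo is made up.

```objc
#import <Foundation/Foundation.h>
#import "FMDatabase.h"

// Hypothetical helper: logs the phrase and column counts from the matchinfo() blob.
static void LogMatchInfo(FMDatabase *db, NSString *queryText) {
    FMResultSet *results =
        [db executeQuery:@"SELECT song_name, matchinfo(my_songs) FROM my_songs WHERE my_songs MATCH ?",
                         queryText];
    while ([results next]) {
        NSData *info = [results dataForColumnIndex:1];
        uint32_t header[2];
        if (info.length < sizeof(header)) {
            continue; // Not enough data to contain the two leading values.
        }
        // Copy the leading values out of the blob instead of casting the pointer.
        [info getBytes:header length:sizeof(header)];
        NSLog(@"%@: %u phrase(s), %u column(s)",
              [results stringForColumnIndex:0], header[0], header[1]);
    }
}
```

The remaining integers in the blob carry per-phrase, per-column hit counts; their layout is described in the FTS3 documentation.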
Examples
Using a custom tokenizer
FMDatabase *db = self.database;
FMSimpleTokenizer *simpleTok = [[FMSimpleTokenizer alloc] initWithLocale:NULL];
// This installs a tokenizer module named "fmdb"
[db installTokenizerModule];
// This registers the delegate using the name "simple", which should be used when creating the table (below).
[FMDatabase registerTokenizer:simpleTok withName:@"simple"];
// Create the table with the "simple" tokenizer
[db executeUpdate:@"CREATE VIRTUAL TABLE my_songs USING fts4(song_name, album_name, artist_name, tokenize=fmdb simple)"];
// Use a property to keep the tokenizer instance from being de-allocated.
self.tokenizer = simpleTok;
Using the stop word tokenizer
FMDatabase *db = self.database;
FMSimpleTokenizer *simpleTok = [[FMSimpleTokenizer alloc] initWithLocale:NULL];
// This assumes that there is a newline-delimited text file in the app's main bundle.
NSURL *wordFile = [[NSBundle mainBundle] URLForResource:@"StopWords" withExtension:@"txt"];
FMStopWordTokenizer *stopTok = [FMStopWordTokenizer tokenizerWithFileURL:wordFile baseTokenizer:simpleTok error:NULL];
[db installTokenizerModule];
[FMDatabase registerTokenizer:stopTok withName:@"stopper"];
// Create the table with the "stop word" tokenizer
[db executeUpdate:@"CREATE VIRTUAL TABLE my_songs USING fts4(song_name, album_name, artist_name, tokenize=fmdb stopper)"];
// Use a property to keep the tokenizer instance from being de-allocated.
self.tokenizer = stopTok;
Selecting offsets
FMDatabase *db = self.database;
FMResultSet *results = [db executeQuery:@"SELECT song_name, offsets(my_songs) FROM my_songs WHERE my_songs MATCH 'Jump'"];
while ([results next]) {
    NSString *songName = [results stringForColumnIndex:0];
    FMTextOffsets *offsets = [results offsetsForColumnIndex:1];
    [offsets enumerateWithBlock:^(NSInteger columnNumber, NSInteger termNumber, NSRange matchRange) {
        // NSInteger is 64-bit on current devices, so cast to long to match %ld.
        NSLog(@"Text match for \"%@\" in column %ld", songName, (long)columnNumber);
    }];
}