2018-05-11

Diacritic-insensitive String Search

I want to take the contents of search field, break it into words, and check the fields of a database to see which fields contain all of the words. The data in the database came from multiple outside sources. Much of it is in simple ASCII encoding, but some may be in UTF-8 with extended Latin characters. (Some could be in non-Latin character sets like Cyrillic or Devanagari or East Asian glyphs, but that is not relevant for this discussion.) For text using extended Latin characters, I would like the search to work as if any diacritics were removed from both the search text and the database text. For example, the following searches should all work:

Search for in string
canon Cañon
soupçon soupcon
ftacnik Ftačnik
mélange melange
cote côté

The NSString comparison functions have a useful option NSCaseInsensitiveSearch. It would be nice if there were a similar option NSDiacriticInsensitiveSearch to handle the situation here, but that does not exist. Instead Cocoa provides the opposite option NSLiteralSearch which prevents different composed character sequences from matching. So if I leave that option out, will it do what I want? No, the documentation slips in a little note: "Search and comparison are currently performed as if the NSLiteralSearch option were specified."

Non-literal searches are much more difficult. If you are not familiar with Unicode composition and canonicalization, read through this page. How much code would you need to handle every flavor of string comparison? What a headache! So how do I get what I need in spite of this? My problem is simpler than the general case. I need to reduce Latin character combinations to simpler forms without worrying about compatibility.


One approach that is deceptively simple is to let NSString do a lossy conversion:

- (NSData *)dataUsingEncoding:(NSStringEncoding)encoding allowLossyConversion:(BOOL)flag
	// If flag is YES and the receiver can’t be converted without losing some information, some 
	// characters may be removed or altered in conversion. For example, in converting a character
	// from NSUnicodeStringEncoding to NSASCIIStringEncoding, the character ‘Á’ becomes ‘A’, 
	// losing the accent.

Using that method, removing diacritics from a string can be done with two function calls:

// Remove diacritics from one string.

- (NSString *)removeDiacriticsFromString: (NSString *) input {
	NSData *data = [input dataUsingEncoding: NSASCIIStringEncoding allowLossyConversion: YES];
	return [[NSString alloc] initWithData: data encoding: NSASCIIStringEncoding];
}

The problem with that approach appears when you do have non-Latin text. A sequence of non-Latin characters will be converted to a sequence of question marks. If the search pattern consisted of three Chinese characters and the text to be searched contained a sequence of three Russian Cyrillic characters, you would get a false match.


A better path is to modify the character sequence yourself. You could request a canonical decomposed string to convert pre-composed characters to sequences of printable characters followed by combining marks. Copy that string's contents to an array of unichar. Iterate through the array to remove combining characters from the Unicode U+0300 block. Construct a new string from the remaining unichar array. Non-Latin characters would be returned unchanged. Here are helper methods that can be inserted into any class. (They could also be implemented as a category on NSString.)

// Remove diacritics from one string.

- (NSString *)removeDiacriticsFromString: (NSString *) input {
	NSData *data = [[input decomposedStringWithCanonicalMapping] dataUsingEncoding: NSUTF16LittleEndianStringEncoding];
	NSMutableData *data2 = nil;	// Delay allocation until you find a combining mark.
	unichar *ip = (unichar *) data.bytes;
	int i, j, n = data.length / sizeof(unichar);
	for (i = j = 0; i < n; i++) {
		if (ip[i] >= 0x300 && ip[i] < 0x0370) {
			if (0 == j) data2 = [NSMutableData dataWithCapacity: data.length];
			if (i > j) [data2 appendBytes: ip+j length: sizeof(unichar)*(i-j)];
			j = i+1;
		}
	}
	if (0 == j) return input;	// Did not find any combining characters, return original string.
	if (j < n) [data2 appendBytes: ip+j length: sizeof(unichar)*(n-j)];
	return [[NSString alloc] initWithData: data2 encoding: NSUTF16LittleEndianStringEncoding];
}

// Find a substring using a case-insensitive and diacritic-insensitive match.

-(BOOL) matchSubstring: (NSString *) pattern ignoringDiacriticsInString: (NSString *) input {
	NSString *input2 = [self removeDiacriticsFromString: input];
	NSString *pattern2 = [self removeDiacriticsFromString: pattern];

	NSRange matchRange = [input2 rangeOfString: pattern2 options: NSCaseInsensitiveSearch];
	return (NSNotFound != matchRange.location);
}

My original task can now be split into two methods. One method converts the contents of a text field into a set of search terms. The second method is used inside a search loop to match the search set against any number of database fields.

// Convert user-supplied string into a set of diacritic-free search words.

-(NSSet *) wordSetFromString: (NSString *) pattern {
	NSMutableSet *wordSet = [NSMutableSet new];
	NSString *pattern2 = [self removeDiacriticsFromString: pattern];

	NSCharacterSet *separators = [NSCharacterSet letterCharacterSet].invertedSet;
	NSArray *words = [pattern2 componentsSeparatedByCharactersInSet: separators];
	for (NSString *word in words) {
		if (word.length > 0) [wordSet addObject: word];
	}
	return wordSet;
}

// Return YES if every word in wordSet is a substring of input using a case-insensitive and diacritic-insensitive match.

-(BOOL) matchWordSet: (NSSet *) wordSet ignoringDiacriticsInString: (NSString *) input {
	NSString *input2 = [self removeDiacriticsFromString: input];

	for (NSString *word in wordSet) {
		NSRange matchRange = [input2 rangeOfString: word options: NSCaseInsensitiveSearch];
		if (NSNotFound == matchRange.location) return NO;
	}
	return YES;
}