“Some people, when confronted with a problem, think ‘I know, I’ll use NSRegularExpression.’ Now they have three problems.”
Regular expressions fill a controversial role in the programming world. Some find them impenetrably incomprehensible, thick with symbols and adornments, more akin to a practical joke than part of a reasonable code base. Others rely on their brevity and their power, wondering how anyone could possibly get along without such a versatile tool in their arsenal.
Happily, on one thing we can all agree. In NSRegularExpression, Cocoa has the most long-winded and byzantine regular expression interface you’re ever likely to come across. Don’t believe me? Let’s try extracting the links from this snippet of HTML, first using Ruby:
htmlSource = "Questions? Corrections? <a href=\"https://twitter.com/NSHipster\">@NSHipster</a> or <a href=\"https://github.com/NSHipster/articles\">on GitHub</a>."
linkRegex = /<a\s+[^>]*href="([^"]*)"[^>]*>/i
links = htmlSource.scan(linkRegex)
puts(links)
# https://twitter.com/NSHipster
# https://github.com/NSHipster/articles
Two or three lines, depending on how you count. Not bad. Now we’ll try the same thing in Swift using NSRegularExpression:
let htmlSource = "Questions? Corrections? <a href=\"https://twitter.com/NSHipster\">@NSHipster</a> or <a href=\"https://github.com/NSHipster/articles\">on GitHub</a>."
let linkRegexPattern = "<a\\s+[^>]*href=\"([^\"]*)\"[^>]*>"
let linkRegex = try! NSRegularExpression(pattern: linkRegexPattern, options: .caseInsensitive)
let matches = linkRegex.matches(in: htmlSource, range: NSMakeRange(0, htmlSource.utf16.count))
let links = matches.map { result -> String in
    let hrefRange = result.rangeAt(1)
    let start = String.UTF16Index(hrefRange.location)
    let end = String.UTF16Index(hrefRange.location + hrefRange.length)
    return String(htmlSource.utf16[start..<end])!
}
print(links)
// ["https://twitter.com/NSHipster", "https://github.com/NSHipster/articles"]
The prosecution rests.
This article won’t get into the ins and outs of regular expressions themselves (you may need to learn about wildcards, backreferences, lookaheads and the rest elsewhere), but read on to learn about NSRegularExpression, NSTextCheckingResult, and a particularly sticky point when bringing it all together in Swift.
NSString Methods
The simplest way to use regular expressions in Cocoa is to skip NSRegularExpression altogether. The range(of:...) method on NSString (which is bridged to Swift’s native String type) switches into regular expression mode when given the .regularExpression option, so lightweight searches can be written easily:
let source = "For NSSet and NSDictionary, the breaking..."
// Matches anything that looks like a Cocoa type:
// UIButton, NSCharacterSet, NSURLSession, etc.
let typePattern = "[A-Z]{3,}[A-Za-z0-9]+"
if let typeRange = source.range(of: typePattern, options: .regularExpression) {
    print("First type: \(source[typeRange])")
    // First type: NSSet
}

NSString *source = @"For NSSet and NSDictionary, the breaking...";
// Matches anything that looks like a Cocoa type:
// UIButton, NSCharacterSet, NSURLSession, etc.
NSString *typePattern = @"[A-Z]{3,}[A-Za-z0-9]+";
NSRange typeRange = [source rangeOfString:typePattern options:NSRegularExpressionSearch];
if (typeRange.location != NSNotFound) {
    NSLog(@"First type: %@", [source substringWithRange:typeRange]);
    // First type: NSSet
}
Replacement is also a snap using replacingOccurrences(of:with:...) with the same option. Watch how we surround each type name in our text with Markdown-style backticks using this one weird trick:
let markedUpSource = source.replacingOccurrences(of: typePattern, with: "`$0`", options: .regularExpression)
print(markedUpSource)
// "For `NSSet` and `NSDictionary`, the breaking..."

NSString *markedUpSource = [source stringByReplacingOccurrencesOfString:typePattern withString:@"`$0`" options:NSRegularExpressionSearch range:NSMakeRange(0, source.length)];
NSLog(@"%@", markedUpSource);
// "For `NSSet` and `NSDictionary`, the breaking..."
This approach to regular expressions can even handle subgroup references in the replacement template. Lo, a quick and dirty Pig Latin transformation:
let ourcesay = source.replacingOccurrences(of: "([bcdfghjklmnpqrstvwxyz]*)([a-z]+)", with: "$2$1ay", options: [.regularExpression, .caseInsensitive])
print(ourcesay)
// "orFay etNSSay anday ictionaryNSDay, ethay eakingbray..."

NSString *ourcesay = [source stringByReplacingOccurrencesOfString:@"([bcdfghjklmnpqrstvwxyz]*)([a-z]+)" withString:@"$2$1ay" options:NSRegularExpressionSearch | NSCaseInsensitiveSearch range:NSMakeRange(0, source.length)];
NSLog(@"%@", ourcesay);
// "orFay etNSSay anday ictionaryNSDay, ethay eakingbray..."
These two methods will suffice for many places you might want to use regular expressions, but for heavier lifting, we’ll need to work with NSRegularExpression itself. First, though, let’s sort out a minor complication when using this class from Swift.
NSRange and Swift
Swift provides a more comprehensive, more complex interface to a string’s characters and substrings than does Foundation’s NSString. The Swift standard library provides four different views into a string’s data, giving you quick access to the elements of a string as characters, Unicode scalar values, or UTF-8 or UTF-16 code units.
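To see why the choice of view matters, here is a small sketch that counts the same string four ways. It assumes Swift 4 or later, where a String is itself a collection of characters; the sample string is made up for illustration.

```swift
import Foundation

// A string with a combining accent and an emoji outside the Basic Multilingual Plane.
let sample = "cafe\u{301} \u{1F44D}"  // renders as "café 👍"

let characterCount = sample.count                // user-perceived characters
let scalarCount = sample.unicodeScalars.count    // Unicode scalar values
let utf16Count = sample.utf16.count              // UTF-16 code units
let utf8Count = sample.utf8.count                // UTF-8 code units

// NSString measures its length in UTF-16 code units, which is why
// NSRange values line up with the utf16 view and not with characters.
let nsLength = (sample as NSString).length

print(characterCount, scalarCount, utf16Count, utf8Count, nsLength)
// 6 7 8 11 8
```

The four counts disagree as soon as the text leaves plain ASCII, so an integer offset is only meaningful once you know which view it was measured in.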
How does this relate to NSRegularExpression? Well, many NSRegularExpression methods use NSRanges, as do the NSTextCheckingResult instances that store a match’s data. NSRange, in turn, uses integers for its location and length, while none of String’s views use integers as an index:
let range = NSRange(location: 4, length: 5)

// Not one of these will compile:
source[range]
source.characters[range]
source.substring(with: range)
source.substring(with: range.toRange()!)
Confusion. Despair.
But don’t give up! Everything isn’t as disconnected as it seems: the utf16 view on a Swift String is meant specifically for interoperability with Foundation’s NSString APIs. As long as Foundation has been imported, you can create new indices for a utf16 view directly from integers:
let start = String.UTF16Index(range.location)
let end = String.UTF16Index(range.location + range.length)
let substring = String(source.utf16[start..<end])!
// substring is now "NSSet"
With that in mind, here are a few additions to String that will make straddling the Swift/Objective-C divide a bit easier:
extension String {
    /// An `NSRange` that represents the full range of the string.
    var nsrange: NSRange {
        return NSRange(location: 0, length: utf16.count)
    }

    /// Returns a substring with the given `NSRange`,
    /// or `nil` if the range can't be converted.
    func substring(with nsrange: NSRange) -> String? {
        guard let range = nsrange.toRange()
            else { return nil }
        let start = UTF16Index(range.lowerBound)
        let end = UTF16Index(range.upperBound)
        return String(utf16[start..<end])
    }

    /// Returns a range equivalent to the given `NSRange`,
    /// or `nil` if the range can't be converted.
    func range(from nsrange: NSRange) -> Range<Index>? {
        guard let range = nsrange.toRange() else { return nil }
        let utf16Start = UTF16Index(range.lowerBound)
        let utf16End = UTF16Index(range.upperBound)

        guard let start = Index(utf16Start, within: self),
            let end = Index(utf16End, within: self)
            else { return nil }

        return start..<end
    }
}
We’ll put these to use in the next section, where we’ll finally see NSRegularExpression in action.
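The extension above is written against the Swift 3 APIs used throughout this article. If you are on Swift 4 or later, Foundation provides conversion initializers that cover the same ground, so a sketch along these lines may be all you need (the sample string is the one from earlier):

```swift
import Foundation

let source = "For NSSet and NSDictionary, the breaking..."

// An NSRange covering the whole string, measured in UTF-16 code units.
let fullRange = NSRange(source.startIndex..<source.endIndex, in: source)

// Convert an NSRange, such as one reported by a match, back into a Swift range.
// The initializer returns nil if the NSRange does not fit the string.
let nsrange = NSRange(location: 4, length: 5)
let substring = Range(nsrange, in: source).map { String(source[$0]) }

print(substring ?? "no match")
// NSSet
```

Because the failable initializer returns nil for ranges that fall outside the string, the optional result plays the same role as the guard statements in the hand-written extension.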
NSRegularExpression & NSTextCheckingResult
If you’re doing more than just searching for the first match or replacing all the matches in your string, you’ll need to build an NSRegularExpression to do your work. Let’s build a miniature text formatter that can handle *bold* and _italic_ text.
Pass a pattern and, optionally, some options to create a new instance. miniPattern looks for an asterisk or an underscore to start a formatted sequence, one or more characters to format, and finally a matching character to end the formatted sequence. The initial character and the string to format are both captured:
let miniPattern = "([*_])(.+?)\\1"
let miniFormatter = try! NSRegularExpression(pattern: miniPattern, options: .dotMatchesLineSeparators)
// the initializer throws an error if the pattern is invalid

NSString *miniPattern = @"([*_])(.+?)\\1";
NSError *error = nil;
NSRegularExpression *miniFormatter = [NSRegularExpression regularExpressionWithPattern:miniPattern options:NSRegularExpressionDotMatchesLineSeparators error:&error];
The initializer throws an error if the pattern is invalid. Once constructed, you can use an NSRegularExpression as often as you need with different strings.
let text = "MiniFormatter handles *bold* and _italic_ text."
let matches = miniFormatter.matches(in: text, options: [], range: text.nsrange)
// matches.count == 2

NSString *text = @"MiniFormatter handles *bold* and _italic_ text.";
NSArray<NSTextCheckingResult *> *matches = [miniFormatter matchesInString:text options:kNilOptions range:NSMakeRange(0, text.length)];
// matches.count == 2
Calling matches(in:options:range:) fetches an array of NSTextCheckingResult, the type used as the result for a variety of text handling classes, such as NSDataDetector and NSSpellChecker. The resulting array has one NSTextCheckingResult for each match.
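The try! used when constructing miniFormatter is reasonable for a hard-coded pattern, but a pattern that arrives at runtime deserves real error handling. Here is one possible sketch; makeRegex is a hypothetical helper name, not part of Foundation.

```swift
import Foundation

/// Hypothetical helper: returns nil instead of crashing when the pattern is invalid.
func makeRegex(_ pattern: String) -> NSRegularExpression? {
    do {
        return try NSRegularExpression(pattern: pattern, options: [])
    } catch {
        // The thrown error describes what was wrong with the pattern.
        print("Invalid pattern \(pattern): \(error.localizedDescription)")
        return nil
    }
}

let valid = makeRegex("([*_])(.+?)\\1")
let invalid = makeRegex("(unclosed")  // missing closing parenthesis
```

Returning an optional keeps the call site simple; if callers need to show the reason to a user, rethrowing the error instead would be the more informative design.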
The information we’re most interested in is the range of the match, stored as range in each result, and the ranges of any capture groups in the regular expression. You can use the numberOfRanges property and the rangeAt(_:) method to find the captured ranges: range 0 is always the full match, with the ranges at indexes 1 up to, but not including, numberOfRanges covering each capture group.
Using the NSRange-based substring method we declared above, we can use these ranges to extract the capture groups:
for match in matches {
    let stringToFormat = text.substring(with: match.rangeAt(2))!
    switch text.substring(with: match.rangeAt(1))! {
    case "*":
        print("Make bold: '\(stringToFormat)'")
    case "_":
        print("Make italic: '\(stringToFormat)'")
    default:
        break
    }
}
// Make bold: 'bold'
// Make italic: 'italic'

for (NSTextCheckingResult *match in matches) {
    NSString *delimiter = [text substringWithRange:[match rangeAtIndex:1]];
    NSString *stringToFormat = [text substringWithRange:[match rangeAtIndex:2]];
    if ([delimiter isEqualToString:@"*"]) {
        NSLog(@"Make bold: '%@'", stringToFormat);
    } else if ([delimiter isEqualToString:@"_"]) {
        NSLog(@"Make italic: '%@'", stringToFormat);
    }
}
// Make bold: 'bold'
// Make italic: 'italic'
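When you don’t know in advance how many capture groups a pattern has, numberOfRanges lets you walk all of them. The sketch below assumes Swift 4 or later, where rangeAt(_:) is spelled range(at:); the captured array is just an illustrative way to collect the results.

```swift
import Foundation

let text = "MiniFormatter handles *bold* and _italic_ text."
let regex = try! NSRegularExpression(pattern: "([*_])(.+?)\\1", options: [])

var captured: [[String]] = []
let wholeText = NSRange(text.startIndex..., in: text)
for match in regex.matches(in: text, options: [], range: wholeText) {
    var groups: [String] = []
    // Index 0 is the whole match; 1..<numberOfRanges are the capture groups.
    for index in 0..<match.numberOfRanges {
        let nsrange = match.range(at: index)
        // A group that took no part in the match reports NSNotFound.
        guard nsrange.location != NSNotFound,
              let range = Range(nsrange, in: text) else {
            groups.append("")
            continue
        }
        groups.append(String(text[range]))
    }
    captured.append(groups)
}

print(captured)
// [["*bold*", "*", "bold"], ["_italic_", "_", "italic"]]
```

Checking for NSNotFound matters for patterns with optional groups or alternations, where some groups may not participate in a given match.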
For basic replacement, head straight to stringByReplacingMatches(in:options:range:withTemplate:), the long-winded version of String.replacingOccurrences(of:with:options:). In this case, we need to use different replacement templates for different matches (bold vs. italic), so we’ll loop through the matches ourselves (moving in reverse order, so we don’t mess up the ranges of later matches):
var formattedText = text
Format: for match in matches.reversed() {
    let template: String
    switch text.substring(with: match.rangeAt(1)) ?? "" {
    case "*":
        template = "<strong>$2</strong>"
    case "_":
        template = "<em>$2</em>"
    default:
        break Format
    }
    let matchRange = formattedText.range(from: match.range)! // see above
    let replacement = miniFormatter.replacementString(for: match, in: formattedText, offset: 0, template: template)
    formattedText.replaceSubrange(matchRange, with: replacement)
}
// 'formattedText' is now:
// "MiniFormatter handles <strong>bold</strong> and <em>italic</em> text."

NSMutableString *formattedText = [NSMutableString stringWithString:text];
for (NSTextCheckingResult *match in [matches reverseObjectEnumerator]) {
    NSString *delimiter = [text substringWithRange:[match rangeAtIndex:1]];
    NSString *template = [delimiter isEqualToString:@"*"] ? @"<strong>$2</strong>" : @"<em>$2</em>";
    NSString *replacement = [miniFormatter replacementStringForResult:match inString:formattedText offset:0 template:template];
    [formattedText replaceCharactersInRange:[match range] withString:replacement];
}
// 'formattedText' is now:
// @"MiniFormatter handles <strong>bold</strong> and <em>italic</em> text."
Calling miniFormatter.replacementString(for:in:...) generates a replacement string specific to each NSTextCheckingResult instance with our customized template.
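For comparison, when a single template suits every match, the loop isn’t needed at all. This sketch handles only the bold case, using a simplified pattern and sample text chosen for illustration, and assumes Swift 4 or later for the NSRange initializer.

```swift
import Foundation

let text = "MiniFormatter handles *bold* and *strong* text."
let boldOnly = try! NSRegularExpression(pattern: "\\*(.+?)\\*", options: [])

// One call replaces every match, substituting $1 with the first capture group.
let html = boldOnly.stringByReplacingMatches(
    in: text,
    options: [],
    range: NSRange(text.startIndex..., in: text),
    withTemplate: "<strong>$1</strong>"
)

print(html)
// MiniFormatter handles <strong>bold</strong> and <strong>strong</strong> text.
```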
Expression and Matching Options
NSRegularExpression is highly configurable: you can pass different sets of options when creating an instance or when calling any method that performs matching.
NSRegularExpression.Options
Pass one or more of these as options when creating a regular expression.
.caseInsensitive: Turns on case insensitive matching. Equivalent to the i flag.
.allowCommentsAndWhitespace: Ignores any whitespace and comments between a # and the end of a line, so you can format and document your pattern in a vain attempt at making it readable. Equivalent to the x flag.
.ignoreMetacharacters: The opposite of the .regularExpression option in String.range(of:options:). This essentially turns the regular expression into a plain text search, ignoring any regular expression metacharacters and operators.
.dotMatchesLineSeparators: Allows the . metacharacter to match line breaks as well as other characters. Equivalent to the s flag.
.anchorsMatchLines: Allows the ^ and $ metacharacters (beginning and end) to match the beginnings and ends of lines instead of just the beginning and end of the entire input string. Equivalent to the m flag.
.useUnixLineSeparators, .useUnicodeWordBoundaries: These last two opt into more specific line and word boundary handling: UNIX line separators and Unicode word boundaries, respectively.
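A short sketch of two of these options in use, assuming Swift 5 for the multi-line string literal. The commented pattern is the Cocoa-type pattern from earlier, and the variable names are illustrative.

```swift
import Foundation

// .allowCommentsAndWhitespace: the pattern can be spread out and annotated.
let verbosePattern = """
[A-Z]{3,}      # a run of three or more capital letters
[A-Za-z0-9]+   # the remainder of the type name
"""
let typeRegex = try! NSRegularExpression(pattern: verbosePattern,
                                         options: [.allowCommentsAndWhitespace])
let source = "For NSSet and NSDictionary, the breaking..."
let typeNames = typeRegex
    .matches(in: source, options: [], range: NSRange(source.startIndex..., in: source))
    .compactMap { Range($0.range, in: source) }
    .map { String(source[$0]) }
print(typeNames)
// ["NSSet", "NSDictionary"]

// .anchorsMatchLines: ^ and $ apply to each line, not just the whole input.
let lines = "alpha\nbeta\ngamma"
let linesRange = NSRange(lines.startIndex..., in: lines)
let wholeInput = try! NSRegularExpression(pattern: "^\\w+$", options: [])
let perLine = try! NSRegularExpression(pattern: "^\\w+$", options: [.anchorsMatchLines])
let wholeInputCount = wholeInput.numberOfMatches(in: lines, options: [], range: linesRange)
let perLineCount = perLine.numberOfMatches(in: lines, options: [], range: linesRange)
print(wholeInputCount, perLineCount)
// 0 3
```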
NSRegularExpression.MatchingOptions
Pass one or more of these as options to any matching method on an NSRegularExpression instance.
.anchored: Only match at the start of the search range.
.withTransparentBounds: Allows the regex to look past the search range for lookahead, lookbehind, and word boundaries (though not for actual matching characters).
.withoutAnchoringBounds: Makes the ^ and $ metacharacters match only the beginning and end of the string, not the beginning and end of the search range.
.reportCompletion, .reportProgress: These only have an effect when passed to the method detailed in the next section. Each option tells NSRegularExpression to call the enumeration block additional times, when searching is complete or as progress is being made on long-running matches, respectively.
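The effect of the bounds-related options is easiest to see on a tiny input. The following sketch, with a made-up sample string and illustrative names, compares .anchored and .withoutAnchoringBounds against the default behavior:

```swift
import Foundation

let text = "abc 123"
let whole = NSRange(text.startIndex..., in: text)
let digits = try! NSRegularExpression(pattern: "\\d+", options: [])

// Default: the search scans forward and finds "123" at location 4.
let anywhere = digits.firstMatch(in: text, options: [], range: whole)
// .anchored: a match must begin at the start of the search range, so none is found.
let atStart = digits.firstMatch(in: text, options: [.anchored], range: whole)

// Restrict the search to the last three UTF-16 units, "123".
let tail = NSRange(location: 4, length: 3)
let caret = try! NSRegularExpression(pattern: "^123", options: [])

// Default: ^ matches at the start of the search range.
let bounded = caret.firstMatch(in: text, options: [], range: tail)
// .withoutAnchoringBounds: ^ matches only at the true start of the string.
let unbounded = caret.firstMatch(in: text, options: [.withoutAnchoringBounds], range: tail)

print(anywhere != nil, atStart != nil, bounded != nil, unbounded != nil)
// true false true false
```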
Partial Matching
Finally, one of the most powerful features of NSRegularExpression
is the ability to scan only as far into a string as you need. This is especially valuable on a large string, or when using an pattern that is expensive to run.
Instead of using the firstMatch(in:...) or matches(in:...) methods, call enumerateMatches(in:options:range:using:) with a closure to handle each match. The closure receives three parameters: the match, a set of flags, and a pointer to a Boolean that acts as an out parameter, so you can stop enumerating at any time.
We can use this method to find the first several names in Dostoevsky’s Brothers Karamazov, where names follow a first and patronymic middle name style (e.g., “Ivan Fyodorovitch”):
let nameRegex = try! NSRegularExpression(pattern: "([A-Z]\\S+)\\s+([A-Z]\\S+(vitch|vna))")
let bookString = ...
var names: Set<String> = []
nameRegex.enumerateMatches(in: bookString, range: bookString.nsrange) { (result, _, stopPointer) in
    guard let result = result else { return }
    let name = nameRegex.replacementString(for: result, in: bookString, offset: 0, template: "$1 $2")
    names.insert(name)

    // stop once we've found six unique names
    stopPointer.pointee = ObjCBool(names.count == 6)
}
// names.sorted():
// ["Adelaïda Ivanovna", "Alexey Fyodorovitch", "Dmitri Fyodorovitch",
//  "Fyodor Pavlovitch", "Pyotr Alexandrovitch", "Sofya Ivanovna"]

NSString *namePattern = @"([A-Z]\\S+)\\s+([A-Z]\\S+(vitch|vna))";
NSRegularExpression *nameRegex = [NSRegularExpression regularExpressionWithPattern:namePattern options:kNilOptions error:&error];
NSString *bookString = ...
NSMutableSet *names = [NSMutableSet set];
[nameRegex enumerateMatchesInString:bookString
                            options:kNilOptions
                              range:NSMakeRange(0, [bookString length])
                         usingBlock:^(NSTextCheckingResult *result, NSMatchingFlags flags, BOOL *stop) {
    if (result == nil) return;
    NSString *name = [nameRegex replacementStringForResult:result inString:bookString offset:0 template:@"$1 $2"];
    [names addObject:name];

    // stop once we've found six unique names
    *stop = (names.count == 6);
}];
With this approach we only need to look at the first 45 matches, instead of nearly 1300 in the entirety of the book. Not bad!
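If you’d like to watch the early exit happen without a whole novel on hand, here is a self-contained sketch of the same idea. The sample sentence is a made-up stand-in for the book text, the visited counter is added purely for illustration, and the code assumes Swift 4 or later.

```swift
import Foundation

// A short, invented stand-in for the book text.
let sampleText = "Alexey Fyodorovitch was the third son of Fyodor Pavlovitch. Dmitri Fyodorovitch was the eldest, and Sofya Ivanovna came later."
let sampleRange = NSRange(sampleText.startIndex..., in: sampleText)
let nameRegex = try! NSRegularExpression(pattern: "([A-Z]\\S+)\\s+([A-Z]\\S+(vitch|vna))", options: [])

var names: [String] = []
var visited = 0
nameRegex.enumerateMatches(in: sampleText, options: [], range: sampleRange) { result, _, stop in
    guard let result = result,
          let range = Range(result.range, in: sampleText) else { return }
    visited += 1
    let name = String(sampleText[range])
    if !names.contains(name) { names.append(name) }
    // Setting the out-parameter ends the enumeration early.
    if names.count == 2 { stop.pointee = true }
}

// For comparison: how many matches a full scan would have produced.
let total = nameRegex.numberOfMatches(in: sampleText, options: [], range: sampleRange)

print(names, visited, total)
// ["Alexey Fyodorovitch", "Fyodor Pavlovitch"] 2 4
```

The closure runs only twice even though the text contains four matches, which is the same saving as above on a much smaller scale.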
Once you get to know it, NSRegularExpression can be a truly useful tool. In fact, you may have used it already to find dates, addresses, or phone numbers in user-entered text: NSDataDetector is an NSRegularExpression subclass with patterns baked in to identify useful info. Indeed, as we’ve come to expect of text handling throughout Foundation, NSRegularExpression is thorough, robust, and has surprising depth beneath its tricky interface.