Quantcast
Channel: NSHipster
Viewing all articles
Browse latest Browse all 382

NSRegularExpression

$
0
0

“Some people, when confronted with a problem, think ‘I know, I’ll use NSRegularExpression.’ Now they have three problems.”

Regular expressions fill a controversial role in the programming world. Some find them impenetrably incomprehensible, thick with symbols and adornments, more akin to a practical joke than part of a reasonable code base. Others rely on their brevity and their power, wondering how anyone could possibly get along without such a versatile tool in their arsenal.

Happily, on one thing we can all agree. In NSRegularExpression, Cocoa has the most long-winded and byzantine regular expression interface you’re ever likely to come across. Don’t believe me? Let’s try extracting the links from this snippet of HTML, first using Ruby:

htmlSource="Questions? Corrections? <a href=\"https://twitter.com/NSHipster\">@NSHipster</a> or <a href=\"https://github.com/NSHipster/articles\">on GitHub</a>."linkRegex=/<a\s+[^>]*href="([^"]*)"[^>]*>/ilinks=htmlSource.scan(linkRegex)puts(links)# https://twitter.com/NSHipster# https://github.com/NSHipster/articles

Two or three lines, depending on how you count—not bad. Now we’ll try the same thing in Swift using NSRegularExpression:

lethtmlSource="Questions? Corrections? <a href=\"https://twitter.com/NSHipster\">@NSHipster</a> or <a href=\"https://github.com/NSHipster/articles\">on GitHub</a>."letlinkRegexPattern="<a\\s+[^>]*href=\"([^\"]*)\"[^>]*>"letlinkRegex=try!NSRegularExpression(pattern:linkRegexPattern,options:.caseInsensitive)letmatches=linkRegex.matches(in:htmlSource,range:NSMakeRange(0,htmlSource.utf16.count))letlinks=matches.map{result->StringinlethrefRange=result.rangeAt(1)letstart=String.UTF16Index(hrefRange.location)letend=String.UTF16Index(hrefRange.location+hrefRange.length)returnString(htmlSource.utf16[start..<end])!}print(links)// ["https://twitter.com/NSHipster", "https://github.com/NSHipster/articles"]

The prosecution rests.

This article won’t get into the ins and outs of regular expressions themselves (you may need to learn about wildcards, backreferences, lookaheads and the rest elsewhere), but read on to learn about NSRegularExpression, NSTextCheckingResult, and a particularly sticky point when bringing it all together in Swift.


NSString Methods

The simplest way to use regular expressions in Cocoa is to skip NSRegularExpression altogether. The range(of:...) method on NSString (which is bridged to Swift’s native String type) switches into regular expression mode when given the .regularExpression option, so lightweight searches can be written easily:

letsource="For NSSet and NSDictionary, the breaking..."// Matches anything that looks like a Cocoa type: // UIButton, NSCharacterSet, NSURLSession, etc.lettypePattern="[A-Z]{3,}[A-Za-z0-9]+"iflettypeRange=source.range(of:typePattern,options:.regularExpression){print("First type: \(source[typeRange])")// First type: NSSet}
NSString*source=@"For NSSet and NSDictionary, the breaking...";// Matches anything that looks like a Cocoa type: // UIButton, NSCharacterSet, NSURLSession, etc.NSString*typePattern=@"[A-Z]{3,}[A-Za-z0-9]+";NSRangetypeRange=[sourcerangeOfString:typePatternoptions:NSRegularExpressionSearch];if(typeRange.location!=NSNotFound){NSLog(@"First type: %@",[sourcesubstringWithRange:typeRange]);// First type: NSSet}

Replacement is also a snap using replacingOccurrences(of:with:...) with the same option. Watch how we surround each type name in our text with Markdown-style backticks using this one weird trick:

letmarkedUpSource=source.replacingOccurrences(of:typePattern,with:"`$0`",options:.regularExpression)print(markedUpSource)// "For `NSSet` and `NSDictionary`, the breaking...""
NSString*markedUpSource=[sourcestringByReplacingOccurrencesOfString:typePatternwithString:@"`$0`"options:NSRegularExpressionSearchrange:NSMakeRange(0,source.length)];NSLog(@"%@",markedUpSource);// "For `NSSet` and `NSDictionary`, the breaking...""

This approach to regular expressions can even handle subgroup references in the replacement template. Lo, a quick and dirty Pig Latin transformation:

letourcesay=source.replacingOccurrences(of:"([bcdfghjklmnpqrstvwxyz]*)([a-z]+)",with:"$2$1ay",options:[.regularExpression,.caseInsensitive])print(ourcesay)// "orFay etNSSay anday ictionaryNSDay, ethay eakingbray..."
NSString*ourcesay=[sourcestringByReplacingOccurrencesOfString:@"([bcdfghjklmnpqrstvwxyz]*)([a-z]+)"withString:@"$2$1ay"options:NSRegularExpressionSearch|NSCaseInsensitiveSearchrange:NSMakeRange(0,source.length)];NSLog(@"%@",ourcesay);// "orFay etNSSay anday ictionaryNSDay, ethay eakingbray..."

These two methods will suffice for many places you might want to use regular expressions, but for heavier lifting, we’ll need to work with NSRegularExpression itself. First, though, let’s sort out a minor complication when using this class from Swift.

NSRange and Swift

Swift provides a more comprehensive, more complex interface to a string’s characters and substrings than does Foundation’s NSString. The Swift standard library provides four different views into a string’s data, giving you quick access to the elements of a string as characters, Unicode scalar values, or UTF-8 or UTF-16 code units.

How does this relate to NSRegularExpression? Well, many NSRegularExpression methods use NSRanges, as do the NSTextCheckingResult instances that store a match’s data. NSRange, in turn, uses integers for its location and length, while none of String’s views use integers as an index:

letrange=NSRange(location:4,length:5)// Not one of these will compile:source[range]source.characters[range]source.substring(with:range)source.substring(with:range.toRange()!)

Confusion. Despair.

But don’t give up! Everything isn’t as disconnected as it seems—the utf16 view on a Swift String is meant specifically for interoperability with Foundation’s NSString APIs. As long as Foundation has been imported, you can create new indices for a utf16 view directly from integers:

letstart=String.UTF16Index(range.location)letend=String.UTF16Index(range.location+range.length)letsubstring=String(source.utf16[start..<end])!// substring is now "NSSet"

With that in mind, here are a few additions to String that will make straddling the Swift/Objective-C divide a bit easier:

extensionString{/// An `NSRange` that represents the full range of the string.varnsrange:NSRange{returnNSRange(location:0,length:utf16.count)}/// Returns a substring with the given `NSRange`, /// or `nil` if the range can't be converted.funcsubstring(withnsrange:NSRange)->String?{guardletrange=nsrange.toRange()else{returnnil}letstart=UTF16Index(range.lowerBound)letend=UTF16Index(range.upperBound)returnString(utf16[start..<end])}/// Returns a range equivalent to the given `NSRange`,/// or `nil` if the range can't be converted.funcrange(fromnsrange:NSRange)->Range<Index>?{guardletrange=nsrange.toRange()else{returnnil}letutf16Start=UTF16Index(range.lowerBound)letutf16End=UTF16Index(range.upperBound)guardletstart=Index(utf16Start,within:self),letend=Index(utf16End,within:self)else{returnnil}returnstart..<end}}

We’ll put these to use in the next section, where we’ll finally see NSRegularExpression in action.

NSRegularExpression& NSTextCheckingResult

If you’re doing more than just searching for the first match or replacing all the matches in your string, you’ll need to build an NSRegularExpression to do your work. Let’s build a miniature text formatter that can handle *bold* and _italic_ text.

Pass a pattern and, optionally, some options to create a new instance. miniPattern looks for an asterisk or an underscore to start a formatted sequence, one or more characters to format, and finally a matching character to end the formatted sequence. The initial character and the string to format are both captured:

letminiPattern="([*_])(.+?)\\1"letminiFormatter=try!NSRegularExpression(pattern:miniPattern,options:.dotMatchesLineSeparators)// the initializer throws an error if the pattern is invalid
NSString*miniPattern=@"([*_])(.+?)\\1";NSError*error=nil;NSRegularExpression*miniFormatter=[NSRegularExpressionregularExpressionWithPattern:miniPatternoptions:NSRegularExpressionDotMatchesLineSeparatorserror:&error];

The initializer throws an error if the pattern is invalid. Once constructed, you can use an NSRegularExpression as often as you need with different strings.

lettext="MiniFormatter handles *bold* and _italic_ text."letmatches=miniFormatter.matches(in:text,options:[],range:text.nsrange)// matches.count == 2
NSString*text=@"MiniFormatter handles *bold* and _italic_ text.";NSArray<NSTextCheckingResult*>*matches=[miniFormattermatchesInString:textoptions:kNilOptionsrange:NSMakeRange(0,text.length)];// matches.count == 2

Calling matches(in:options:range:) fetches an array of NSTextCheckingResult, the type used as the result for a variety of text handling classes, such as NSDataDetector and NSSpellChecker. The resulting array has one NSTextCheckingResult for each match.

The information we’re most interested are the range of the match, stored as range in each result, and the ranges of any capture groups in the regular expression. You can use the numberOfRanges property and the rangeAt(_:)method to find the captured ranges—range 0 is always the full match, with the ranges at indexes 1 up to, but not including, numberOfRanges covering each capture group.

Using the NSRange-based substring method we declared above, we can use these ranges to extract the capture groups:

formatchinmatches{letstringToFormat=text.substring(with:match.rangeAt(2))!switchtext.substring(with:match.rangeAt(1))!{case"*":print("Make bold: '\(stringToFormat)'")case"_":print("Make italic: '\(stringToFormat)'")default:break}}// Make bold: 'bold'// Make italic: 'italic'
for(NSTextCheckingResult*matchinmatches){NSString*delimiter=[textsubstringWithRange:[matchrangeAtIndex:1]];NSString*stringToFormat=[textsubstringWithRange:[matchrangeAtIndex:2]];if([delimiterisEqualToString:@"*"]){NSLog(@"Make bold: '%@'",stringToFormat);}elseif([delimiterisEqualToString:@"_"]){NSLog(@"Make italic: '%@'",stringToFormat);}}// Make bold: 'bold'// Make italic: 'italic'

For basic replacement, head straight to stringByReplacingMatches(in:options:range:with:), the long-winded version of String.replacingOccurences(of:with:options:). In this case, we need to use different replacement templates for different matches (bold vs. italic), so we’ll loop through the matches ourselves (moving in reverse order, so we don’t mess up the ranges of later matches):

varformattedText=textFormat:formatchinmatches.reversed(){lettemplate:Stringswitchtext.substring(with:match.rangeAt(1))??""{case"*":template="<strong>$2</strong>"case"_":template="<em>$2</em>"default:breakFormat}letmatchRange=formattedText.range(from:match.range)!// see aboveletreplacement=miniFormatter.replacementString(for:match,in:formattedText,offset:0,template:template)formattedText.replaceSubrange(matchRange,with:replacement)}// 'formattedText' is now:// "MiniFormatter handles <strong>bold</strong> and <em>italic</em> text."
NSMutableString*formattedText=[NSMutableStringstringWithString:text];for(NSTextCheckingResult*matchin[matchesreverseObjectEnumerator]){NSString*delimiter=[textsubstringWithRange:[matchrangeAtIndex:1]];NSString*template=[delimiterisEqualToString:@"*"]?@"<strong>$2</strong>":@"<em>$2</em>";NSString*replacement=[miniFormatterreplacementStringForResult:matchinString:formattedTextoffset:0template:template];[formattedTextreplaceCharactersInRange:[matchrange]withString:replacement];}// 'formattedText' is now:// @"MiniFormatter handles <strong>bold</strong> and <em>italic</em> text."

Calling miniFormatter.replacementString(for:in:...) generates a replacement string specific to each NSTextCheckingResult instance with our customized template.

Expression and Matching Options

NSRegularExpression is highly configurable—you can pass different sets of options when creating an instance or when calling any method that performs matching.

NSRegularExpression.Options

Pass one or more of these as options when creating a regular expression.

  • .caseInsensitive: Turns on case insensitive matching. Equivalent to the i flag.
  • .allowCommentsAndWhitespace: Ignores any whitespace and comments between a # and the end of a line, so you can format and document your pattern in a vain attempt at making it readable. Equivalent to the x flag.
  • .ignoreMetacharacters: The opposite of the .regularExpression option in String.range(of:options:)—this essentially turns the regular expression into a plain text search, ignoring any regular expression metacharacters and operators.
  • .dotMatchesLineSeparators: Allows the . metacharacter to match line breaks as well as other characters. Equivalent to the s flag.
  • .anchorsMatchLines: Allows the ^ and $ metacharacters (beginning and end) to match the beginnings and ends of lines instead of just the beginning and end of the entire input string. Equivalent to the m flag.
  • .useUnixLineSeparators, .useUnicodeWordBoundaries: These last two opt into more specific line and word boundary handling: UNIX line separators
NSRegularExpression.MatchingOptions

Pass one or more of these as options to any matching method on an NSRegularExpression instance.

  • .anchored: Only match at the start of the search range.
  • .withTransparentBounds: Allows the regex to look past the search range for lookahead, lookbehind, and word boundaries (though not for actual matching characters).
  • .withoutAnchoringBounds: Makes the ^ and $ metacharacters match only the beginning and end of the string, not the beginning and end of the search range.
  • .reportCompletion, .reportProgress: These only have an effect when passed to the method detailed in the next section. Each option tells NSRegularExpression to call the enumeration block additional times, when searching is complete or as progress is being made on long-running matches, respectively.

Partial Matching

Finally, one of the most powerful features of NSRegularExpression is the ability to scan only as far into a string as you need. This is especially valuable on a large string, or when using an pattern that is expensive to run.

Instead of using the firstMatch(in:...) or matches(in:...) methods, call enumerateMatches(in:options:range:using:) with a closure to handle each match. The closure receives three parameters: the match, a set of flags, and a pointer to a Boolean that acts as an out parameter, so you can stop enumerating at any time.

We can use this method to find the first several names in Dostoevsky’s Brothers Karamazov, where names follow a first and patronymic middle name style (e.g., “Ivan Fyodorovitch”):

letnameRegex=try!NSRegularExpression(pattern:"([A-Z]\\S+)\\s+([A-Z]\\S+(vitch|vna))")letbookString=...varnames:Set<String>=[]nameRegex.enumerateMatches(in:bookString,range:bookString.nsrange){(result,_,stopPointer)inguardletresult=resultelse{return}letname=nameRegex.replacementString(for:result,in:bookString,offset:0,template:"$1 $2")names.insert(name)// stop once we've found six unique namesstopPointer.pointee=ObjCBool(names.count==6)}// names.sorted(): // ["Adelaïda Ivanovna", "Alexey Fyodorovitch", "Dmitri Fyodorovitch", //  "Fyodor Pavlovitch", "Pyotr Alexandrovitch", "Sofya Ivanovna"]
NSString*namePattern=@"([A-Z]\\S+)\\s+([A-Z]\\S+(vitch|vna))";NSRegularExpression*nameRegex=[NSRegularExpressionregularExpressionWithPattern:namePatternoptions:kNilOptionserror:&error];NSString*bookString=...NSMutableSet*names=[NSMutableSetset];[nameRegexenumerateMatchesInString:bookStringoptions:kNilOptionsrange:NSMakeRange(0,[bookStringlength])usingBlock:^(NSTextCheckingResult*result,NSMatchingFlagsflags,BOOL*stop){if(result==nil)return;NSString*name=[nameRegexreplacementStringForResult:resultinString:bookStringoffset:0template:@"$1 $2"];[namesaddObject:name];// stop once we've found six unique names*stop=(names.count==6);}];

With this approach we only need to look at the first 45 matches, instead of nearly 1300 in the entirety of the book. Not bad!


Once you get to know it, NSRegularExpression can be a truly useful tool. In fact, you may have used it already to find dates, addresses, or phone numbers in user-entered text—NSDataDetector is an NSRegularExpression subclass with patterns baked in to identify useful info. Indeed, as we’ve come to expect of text handling throughout Foundation, NSRegularExpression is thorough, robust, and has surprising depth beneath its tricky interface.


Viewing all articles
Browse latest Browse all 382

Trending Articles