Wednesday, March 25, 2009

Perl Compatible Regular Expressions with Cocoa

If you want Perl Compatible Regular Expressions with Cocoa, Christopher Bess has created ObjPCRE, a library that makes PCRE easy in Cocoa.

However, there isn't much in the way of documentation, so I thought I'd at least show how to get started. Implementation is pretty straightforward. First, you need to add the following files to your Xcode project:

libpcre.a
pcre.h
objpcre.h
objpcre.m

You can find the libpcre.a file in the pcre static lib download, and the other three files are in the source file download. I created a new "PCRE" group folder in my Xcode project and dropped them all in there.

Now it's just a matter of using it. First, let's create a one-liner that searches and replaces text in a string. We'll search for "string" and replace it with "foobar".

#import "objpcre.h"

NSString *myText = @"This is my string of text.";
NSLog(@"text before: %@", myText);
[[ObjPCRE regexWithPattern:@"string"] replaceAll:&myText replacement:@"foobar"];
NSLog(@"text after: %@", myText);


There you go: your first one-liner to search and replace a string of text in place with Perl regular expressions. This isn't very interesting yet, since the pattern is just a literal word and uses no regex syntax. So let's try something useful. How about a regular expression that removes all HTML tags from the string? Let's try to think of a regex that will match every HTML tag:

<.*>

OK, that one is pretty basic. It says: match <, followed by zero or more of ANY character, followed by >. This can cause a problem because .* is greedy and can match too much, such as multiple HTML tags along with any text between them. So we'll go with something a bit more restrictive:

<\w+[^>]*>

Now we will only match <, followed by one or more word characters (letter, number, underscore), followed by zero or more characters that are NOT >, followed by >.
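
To see the difference between the two patterns, here is a minimal sketch in plain C (which also compiles inside an Objective-C file). It is an illustration only: it uses the POSIX regex.h functions as a stand-in rather than ObjPCRE, the helper name first_match_length is made up for this example, and [[:alnum:]_] replaces \w because POSIX extended syntax does not define \w.

```c
#include <regex.h>

/* Hypothetical helper: returns the length of the first match of
 * `pattern` in `subject`, or -1 if there is no match or the
 * pattern fails to compile. Uses POSIX extended regex syntax. */
int first_match_length(const char *pattern, const char *subject)
{
    regex_t re;
    regmatch_t m[1];
    int len = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* m[0] receives the start and end offsets of the whole match. */
    if (regexec(&re, subject, 1, m, 0) == 0)
        len = (int)(m[0].rm_eo - m[0].rm_so);

    regfree(&re);
    return len;
}
```

Against the subject "<b>bold</b> text", the pattern <.*> should cover everything from the first < to the last >, while <[[:alnum:]_]+[^>]*> should stop at the end of the first tag.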


We still have a problem, though: this will not match closing HTML tags.

</?\w+[^>]*>

There, now we match tags with zero or one "/" right after the opening <.

Notice that backslashes must be escaped inside @"double quotes", so we use two of them in the string.

#import "objpcre.h"

NSString *myText = @"<title>This is my <b>string</b> of <class name=\"foo\">text</class>.</title>";
NSLog(@"text before: %@", myText);
[[ObjPCRE regexWithPattern:@"</?\\w+[^>]*>"] replaceAll:&myText replacement:@""];
NSLog(@"text after: %@", myText);
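
The backslash rule is the same for plain C string literals as for @"..." literals, and a small sketch can make it concrete. The names TAG_PATTERN and count_backslashes below are invented for this illustration: the source code contains two backslashes, but the string handed to the regex engine contains only one.

```c
#include <string.h>

/* Written with two backslashes in source; the compiler stores one.
 * The stored text is exactly: </?\w+[^>]*>                        */
static const char TAG_PATTERN[] = "</?\\w+[^>]*>";

/* Hypothetical helper: counts the backslash characters that are
 * actually present in a string at run time. */
int count_backslashes(const char *s)
{
    int n = 0;
    for (; *s != '\0'; s++) {
        if (*s == '\\')
            n++;
    }
    return n;
}
```

Calling count_backslashes(TAG_PATTERN) should report a single backslash, which is what the regex engine needs in order to read \w as "word character".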


And now for something a bit trickier. Let's try extracting all words within [brackets] in the text. This is where ObjPCRE could use some more features! But for now, here is how we accomplish this task. First, the regular expression that matches the bracketed words:

\[\w+\]

The brackets have special meaning to PCRE, so we have to escape them. This matches [, followed by one or more word characters, followed by ]. But let's say we want to capture just the text, without the brackets. We put parentheses around each subpattern we want to capture. These have no effect on what the regex matches.

\[(\w+)\]

And now we put this into code. Remember to escape backslashes.

NSString *myText = @"This is [some] more [text] to parse.";

ObjPCRE *pcre = [ObjPCRE regexWithPattern:@"\\[(\\w+)\\]"];

int start = 0;
int len = 0;
int offset = 0;
int i = 0;
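while ([pcre regexMatches:myText options:0 startOffset:offset]) {
    // Index 0 is the whole match; index 1 is the captured word.
    for (i = 0; i < [pcre matchCount]; i++) {
        NSLog(@"match %d: %@", i, [pcre match:myText atMatchIndex:i]);
    }
    // Resume the search just past the end of the current match.
    [pcre match:&start length:&len atMatchIndex:0];
    offset = start + len;
}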


We call regexMatches once for each [bracket] pattern in the text. For each match we loop over the subpatterns and log them. The first subpattern is the entire match, followed by each parenthesized subpattern (of which we have only one).
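
The same "match, read the captured group, then advance past the whole match" loop can be sketched in plain C. This is only an illustration of the technique, not ObjPCRE's implementation: it uses POSIX regex.h instead of PCRE, collect_bracketed and MAX_WORD are hypothetical names, and [[:alnum:]_] again stands in for \w.

```c
#include <regex.h>
#include <string.h>

#define MAX_WORD 32  /* hypothetical cap on a captured word, incl. NUL */

/* Hypothetical helper: copies each word found inside [brackets] in
 * `subject` into `words` and returns how many were found, or -1 if
 * the pattern fails to compile. */
int collect_bracketed(const char *subject, char words[][MAX_WORD], int max_words)
{
    regex_t re;
    regmatch_t m[2];           /* m[0] = whole match, m[1] = captured word */
    const char *cursor = subject;
    int count = 0;

    /* The closing ] is left unescaped: outside a bracket expression
     * it is an ordinary character in POSIX extended syntax. */
    if (regcomp(&re, "\\[([[:alnum:]_]+)]", REG_EXTENDED) != 0)
        return -1;

    while (count < max_words && regexec(&re, cursor, 2, m, 0) == 0) {
        int len = (int)(m[1].rm_eo - m[1].rm_so);
        if (len >= MAX_WORD)
            len = MAX_WORD - 1;   /* truncate overly long words */
        memcpy(words[count], cursor + m[1].rm_so, (size_t)len);
        words[count][len] = '\0';
        count++;

        /* Offsets are relative to cursor, so step past the whole match. */
        cursor += m[0].rm_eo;
    }

    regfree(&re);
    return count;
}
```

Run on "This is [some] more [text] to parse.", this should collect the two words "some" and "text". Advancing by the end of the whole match plays the same role as offset = start + len in the loop above.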

Alright, so that's a start! To continue, check all the functions available in objpcre.h, and also see the documentation on PCRE for all the good regex stuff.
