Sunday, March 22, 2009

Stripping HTML with Objective-C/Cocoa

I was looking for a simple way to strip HTML from an NSString. Since NSString has no native regular expression support, I had to resort to other means. I found many posts requiring regex support libs and/or libxml2 support. Bleh. Then I found this simple solution using NSScanner. It worked well for me.

It's not super smart though. It will bork on any stray < or > characters in the text that are not part of HTML markup: everything from a stray < up to the next > gets treated as a tag and replaced. Make sure they are escaped (as &lt; and &gt;) first.

I made my own small addition, optionally trimming whitespace too.

- (NSString *)flattenHTML:(NSString *)html trimWhiteSpace:(BOOL)trim {

    // the scanner keeps walking the original string; html is only reassigned
    NSScanner *theScanner = [NSScanner scannerWithString:html];

    while ([theScanner isAtEnd] == NO) {

        NSString *text = nil;

        // find start of tag
        [theScanner scanUpToString:@"<" intoString:NULL];

        // find end of tag; skip the replace if nothing was scanned
        if ([theScanner scanUpToString:@">" intoString:&text]) {

            // replace the found tag with a space
            // (you can filter multi-spaces out later if you wish)
            html = [html stringByReplacingOccurrencesOfString:
                       [NSString stringWithFormat:@"%@>", text]
                                                   withString:@" "];
        }

    } // while //

    // trim off whitespace
    return trim ? [html stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]] : html;
}
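
To show how the method might be called, here is a minimal, self-contained sketch. HTMLStripper is a hypothetical host class that I made up for the example; the method could just as well sit in whatever class holds your strings, or in an NSString category. The method body is a lightly tidied copy of the one above, and the second call illustrates the stray-bracket limitation mentioned earlier.

```objective-c
#import <Foundation/Foundation.h>

// HTMLStripper is a hypothetical class used only to host the method.
@interface HTMLStripper : NSObject
- (NSString *)flattenHTML:(NSString *)html trimWhiteSpace:(BOOL)trim;
@end

@implementation HTMLStripper

- (NSString *)flattenHTML:(NSString *)html trimWhiteSpace:(BOOL)trim {
    // The scanner walks the original string; html is only reassigned.
    NSScanner *scanner = [NSScanner scannerWithString:html];

    while (![scanner isAtEnd]) {
        NSString *text = nil;

        // Move to the next "<" (no-op if we are already on one).
        [scanner scanUpToString:@"<" intoString:NULL];

        // Capture everything up to (not including) the next ">".
        if ([scanner scanUpToString:@">" intoString:&text]) {
            html = [html stringByReplacingOccurrencesOfString:
                       [NSString stringWithFormat:@"%@>", text]
                                                   withString:@" "];
        }
    }

    NSCharacterSet *ws = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    return trim ? [html stringByTrimmingCharactersInSet:ws] : html;
}

@end

int main(void) {
    @autoreleasepool {
        HTMLStripper *stripper = [[HTMLStripper alloc] init];

        // Ordinary markup: the tags become spaces, which the trim removes.
        NSLog(@"[%@]", [stripper flattenHTML:@"<p>Hello</p>" trimWhiteSpace:YES]);

        // Unescaped comparison operators: "< 2 and 3 >" is mistaken for a tag
        // and replaced, so the words in between are lost.
        NSLog(@"[%@]", [stripper flattenHTML:@"1 < 2 and 3 > 2" trimWhiteSpace:NO]);
    }
    return 0;
}
```

On a Mac this should build with something like clang -fobjc-arc -framework Foundation on the source file. In an app you would call the method at the point where you already have the HTML string in hand, just before displaying it.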



malaki1974 said...

Great snippet. Where would this code go? Would I put it in my xml parser or where I add the data to the tableview?

Amit Kumar Battan said...

Hi, I am using this code and it works OK, but it fails if the HTML contains special characters, like " written as “:
it then shows “, when the correct form is that it should show ".

For example, in my case it shows "NO" as “NO“.
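
Assuming the problem described in the comment above is HTML entities (such as &quot; or &ldquo;) being left undecoded, one possible approach is a small replacement pass run after the tags have been stripped. DecodeBasicEntities is a hypothetical helper, not part of the original post, and it covers only a handful of entities; this is a sketch, not a complete decoder.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: decodes a few common named entities.
// &amp; is handled last so that "&amp;lt;" becomes "&lt;" and not "<".
static NSString *DecodeBasicEntities(NSString *text) {
    NSArray *pairs = @[
        @[@"&quot;",  @"\""],
        @[@"&ldquo;", @"\u201C"],
        @[@"&rdquo;", @"\u201D"],
        @[@"&lt;",    @"<"],
        @[@"&gt;",    @">"],
        @[@"&amp;",   @"&"]
    ];

    for (NSArray *pair in pairs) {
        text = [text stringByReplacingOccurrencesOfString:pair[0]
                                               withString:pair[1]];
    }
    return text;
}

int main(void) {
    @autoreleasepool {
        // Entity-encoded quotes become plain quotation marks.
        NSLog(@"%@", DecodeBasicEntities(@"&quot;NO&quot;"));
    }
    return 0;
}
```

Decoding after stripping matters: if &lt; and &gt; were turned back into < and > first, the tag scanner would treat them as markup.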