comp.lang.python
news:1180965792.757685.132580@q75g2000hsh.googlegroups.com...
> still a one-liner. Untested, but I think this should work:
> import re
> from htmlentitydefs import name2codepoint
> def htmlentitydecode(s):
>     return re.sub('&(%s);' % '|'.join(name2codepoint), lambda m:
>         name2codepoint[m.group(1)], s)

HTML (unlike XML) allows you to skip the semicolon after the entity if it's
followed by whitespace (IIRC). Should this be respected, the pattern looks
more like this:
r'&(%s)([;\s]|$)'

Also, this completely ignores the non-named (numeric) entities that are also
found in XML (e.g. &#x20; for ' ' or so). Maybe some part of the HTMLParser
module is useful, I wouldn't know. IMHO, these particular batteries aren't
too commonly needed.

Regards,
Thomas Jollans
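
For what it's worth, the two points above (optional semicolon, numeric entities) could be combined roughly as follows. This is only a sketch, not a complete HTML decoder. It assumes Python 3, where htmlentitydefs lives at html.entities. The helper name _replace and the exact pattern are my own choices. Two deliberate differences from the snippets above: the replacement function returns chr(...) rather than the bare integer code point (re.sub needs a string back), and the trailing whitespace is checked with a lookahead so that it is not swallowed by the match.

```python
import re
from html.entities import name2codepoint  # Python 3 name for htmlentitydefs

# Alternation of all known entity names (amp, lt, eacute, ...).
_NAMED = '|'.join(map(re.escape, name2codepoint))

# Group 1: decimal reference (&#65;), group 2: hex reference (&#x20;),
# group 3: named entity. The terminator is a semicolon, or (assumption,
# following the IIRC above) upcoming whitespace or end of string, which
# is only looked at, not consumed.
_ENTITY = re.compile(
    r'&(?:#(\d+)|#[xX]([0-9a-fA-F]+)|(%s))(?:;|(?=\s)|$)' % _NAMED
)


def _replace(m):
    """Return the character for one matched entity (hypothetical helper)."""
    dec, hexa, name = m.groups()
    if dec:
        code = int(dec)
    elif hexa:
        code = int(hexa, 16)
    else:
        code = name2codepoint[name]
    try:
        return chr(code)
    except (ValueError, OverflowError):
        # Code point outside the Unicode range: leave the text untouched.
        return m.group(0)


def htmlentitydecode(s):
    """Decode named and numeric entities in s; unknown names are left as-is."""
    return _ENTITY.sub(_replace, s)


print(htmlentitydecode('1 &lt 2 &amp; caf&eacute; &#x41;&#66;'))
```

On Python 3.4 and later, html.unescape in the standard library already handles named and numeric references, so a hand-rolled regex like this is mostly of interest when you want control over which forms are accepted.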