How do you htmlentities in Python
Newsgroups: comp.lang.python
From: "js " <ebgs...@gmail.com>
Date: Mon, 4 Jun 2007 22:31:56 +0900
Local: Mon, Jun 4 2007 11:31 pm
Subject: How do you htmlentities in Python
Hi list.

If I'm not mistaken, in Python there's no standard library function to convert
HTML entities, like &amp; or &gt;, into their applicable characters.

htmlentitydefs provides maps that help with this conversion,
but it's not a function, so you have to write your own function
that makes use of htmlentitydefs, probably using a regex or something.

To me this seemed odd because Python is known as
a 'Batteries Included' language.

So my questions are
1. Why doesn't python have/need entity encoding/decoding?
2. Is there any idiom to do entity encode/decode in python?

Thank you in advance...
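
For the encoding half of question 2, here is a minimal sketch. It assumes Python 3, where the htmlentitydefs module is named html.entities. The function name htmlentityencode is made up for illustration; it replaces every character that has a named entity with &name; and leaves everything else alone.

```python
from html.entities import codepoint2name

def htmlentityencode(text):
    """Replace each character that has a named HTML entity with &name;."""
    pieces = []
    for ch in text:
        # codepoint2name maps a code point (int) to an entity name (str).
        name = codepoint2name.get(ord(ch))
        pieces.append("&%s;" % name if name else ch)
    return "".join(pieces)

print(htmlentityencode("caf\u00e9 & <b>"))  # caf&eacute; &amp; &lt;b&gt;
```

Characters without a named entity pass through unchanged, so the output is only pure ASCII if the input sticks to characters the table knows about.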


Newsgroups: comp.lang.python
From: Adam Atlas <a...@atlas.st>
Date: Mon, 04 Jun 2007 07:03:12 -0700
Local: Tues, Jun 5 2007 12:03 am
Subject: Re: How do you htmlentities in Python
As far as I know, there isn't a standard idiom to do this, but it's
still a one-liner. Untested, but I think this should work:

import re
from htmlentitydefs import name2codepoint
def htmlentitydecode(s):
    return re.sub('&(%s);' % '|'.join(name2codepoint), lambda m:
name2codepoint[m.group(1)], s)


Newsgroups: comp.lang.python
From: cla...@lairds.us (Cameron Laird)
Date: Mon, 4 Jun 2007 15:41:37 +0000
Local: Tues, Jun 5 2007 1:41 am
Subject: Re: How do you htmlentities in Python
In article <1180965792.757685.132...@q75g2000hsh.googlegroups.com>,
Adam Atlas  <a...@atlas.st> wrote:

>As far as I know, there isn't a standard idiom to do this, but it's
>still a one-liner. Untested, but I think this should work:

>import re
>from htmlentitydefs import name2codepoint
>def htmlentitydecode(s):
>    return re.sub('&(%s);' % '|'.join(name2codepoint), lambda m:
>name2codepoint[m.group(1)], s)

How strange that this doesn't appear in the Cookbook!  I'm
curious about how others think:  does such an item better
belong in the Cookbook, or the Wiki?

Newsgroups: comp.lang.python
From: "Thomas Jollans" <tho...@jollans.NOSPAM.com>
Date: Mon, 4 Jun 2007 17:14:56 +0100
Local: Tues, Jun 5 2007 2:14 am
Subject: Re: How do you htmlentities in Python
"Adam Atlas" <a...@atlas.st> wrote in message

news:1180965792.757685.132580@q75g2000hsh.googlegroups.com...

> As far as I know, there isn't a standard idiom to do this, but it's
> still a one-liner. Untested, but I think this should work:

> import re
> from htmlentitydefs import name2codepoint
> def htmlentitydecode(s):
>    return re.sub('&(%s);' % '|'.join(name2codepoint), lambda m:
>         name2codepoint[m.group(1)], s)

'&(%s);' won't quite work: HTML (and, I assume, SGML, but not XHTML being
XML) allows you to skip the semicolon after the entity if it's followed by a
white space (IIRC). Should this be respected, it looks more like this:
r'&(%s)([;\s]|$)'

Also, this completely ignores numeric (non-name) character references, as
also found in XML (e.g. &#x20; for ' '). Maybe some part of the HTMLParser
module is useful, I wouldn't know. IMHO, these particular batteries aren't
too commonly needed.

Regards,
Thomas Jollans
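
A rough sketch of the optional-semicolon idea, assuming Python 3 (html.entities instead of htmlentitydefs). The name decode_lenient is hypothetical. One caveat with a pattern like r'&(%s)([;\s]|$)' is that the second group also consumes the whitespace character, so this version uses a lookahead to leave the whitespace in place.

```python
import re
from html.entities import name2codepoint

# "&name" followed by ";" (consumed) or by whitespace / end of string (not consumed).
_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*)(?:;|(?=\s|$))")

def decode_lenient(text):
    def repl(match):
        codepoint = name2codepoint.get(match.group(1))
        # Unknown names are left exactly as they were found.
        return chr(codepoint) if codepoint is not None else match.group(0)
    return _ENTITY.sub(repl, text)

print(decode_lenient("fish &amp chips &gt; peas &lt"))  # fish & chips > peas <
```

Numeric character references are deliberately not handled here.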


Newsgroups: comp.lang.python
From: cla...@lairds.us (Cameron Laird)
Date: Mon, 4 Jun 2007 16:54:53 +0000
Local: Tues, Jun 5 2007 2:54 am
Subject: Re: How do you htmlentities in Python
In article <1180965792.757685.132...@q75g2000hsh.googlegroups.com>,
Adam Atlas  <a...@atlas.st> wrote:

>As far as I know, there isn't a standard idiom to do this, but it's
>still a one-liner. Untested, but I think this should work:

>import re
>from htmlentitydefs import name2codepoint
>def htmlentitydecode(s):
>    return re.sub('&(%s);' % '|'.join(name2codepoint), lambda m:
>name2codepoint[m.group(1)], s)

A.  I *think* you meant
        import re
        from htmlentitydefs import name2codepoint
        def htmlentitydecode(s):
            return re.sub('&(%s);' % '|'.join(name2codepoint), lambda m: chr(name2codepoint[m.group(1)]), s)
    We're stretching the limits of what's comfortable
    for me as a one-liner.
B.  How's it happen this isn't in the Cookbook?  I'm
    curious about what other Pythoneers think:  is
    this better memorialized in the Cookbook or the
    Wiki?
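
Point A can be sketched for Python 3 as follows, assuming html.entities as the renamed htmlentitydefs. The callback has to return a string, which is why the code point is wrapped in chr. In Python 2, chr only accepts values below 256, so entities such as &mdash; would need unichr there instead.

```python
import re
from html.entities import name2codepoint

# One alternation of all known entity names, compiled once.
_NAMED = re.compile("&(%s);" % "|".join(map(re.escape, name2codepoint)))

def htmlentitydecode(s):
    # Only named entities with a trailing semicolon are decoded.
    return _NAMED.sub(lambda m: chr(name2codepoint[m.group(1)]), s)

print(htmlentitydecode("&lt;p&gt; &amp; &mdash;"))
```

Anything the pattern does not recognise, including numeric references like &#38;, is passed through untouched.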

Newsgroups: comp.lang.python
From: Matimus <mccre...@gmail.com>
Date: Mon, 04 Jun 2007 17:17:27 -0000
Local: Tues, Jun 5 2007 3:17 am
Subject: Re: How do you htmlentities in Python
On Jun 4, 6:31 am, "js " <ebgs...@gmail.com> wrote:

I think this is the standard idiom:

>>> import xml.sax.saxutils as saxutils
>>> saxutils.escape("&")
'&amp;'
>>> saxutils.unescape("&gt;")
'>'
>>> saxutils.unescape("A bunch of text with entities: &amp; &gt; &lt;")
'A bunch of text with entities: & > <'

Notice there is an optional parameter (a dict) that can be used to
define additional entities as well.

Matt
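
The optional dict mentioned above could be used like this; a small sketch that should run unchanged on Python 3. The sample string and the choice of &quot; as the extra entity are just for illustration. Note that the dict for escape maps characters to entities, while the dict for unescape maps entities back to characters.

```python
from xml.sax.saxutils import escape, unescape

text = 'Tom & "Jerry" <3'

# escape() handles &, < and > itself; the dict adds the double quote.
escaped = escape(text, {'"': "&quot;"})
print(escaped)  # Tom &amp; &quot;Jerry&quot; &lt;3

# unescape() needs the reverse mapping to undo the extra entity.
restored = unescape(escaped, {"&quot;": '"'})
print(restored == text)  # True
```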


Newsgroups: comp.lang.python
From: "js " <ebgs...@gmail.com>
Date: Tue, 5 Jun 2007 08:14:34 +0900
Local: Tues, Jun 5 2007 9:14 am
Subject: Re: How do you htmlentities in Python
Thank you, Matimus.
That's exactly what I'm looking for!
Easy, clean and customizable.
I love python :)

On 6/5/07, Matimus <mccre...@gmail.com> wrote:


Newsgroups: comp.lang.python
From: cla...@lairds.us (Cameron Laird)
Date: Tue, 5 Jun 2007 18:36:26 +0000
Local: Wed, Jun 6 2007 4:36 am
Subject: Re: How do you htmlentities in Python
In article <1180977447.745432.109...@q19g2000prn.googlegroups.com>,

                        .
                        .
                        .
Good points; I like your mention of the optional entity dictionary.

It's possible that your solution is to a different problem than the original
poster intended.  <URL: http://wiki.python.org/moin/EscapingHtml > has
details about HTML entities vs. XML entities.
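
To make the HTML-versus-XML difference concrete, here is a short comparison. It assumes Python 3.4 or later, where the standard library has html.unescape; that function did not exist when this thread was written. saxutils.unescape only knows the XML entities &amp;, &lt; and &gt; unless given a dict, so HTML-only names survive untouched.

```python
import html
from xml.sax.saxutils import unescape as xml_unescape

sample = "&eacute; &amp; &lt;"

xml_result = xml_unescape(sample)    # HTML-only entity is left alone
html_result = html.unescape(sample)  # HTML named entities are decoded too

print(repr(xml_result))   # '&eacute; & <'
print(repr(html_result))  # 'é & <'
```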


Newsgroups: comp.lang.python
From: j...@pobox.com (John J. Lee)
Date: Wed, 06 Jun 2007 22:07:36 GMT
Local: Thurs, Jun 7 2007 8:07 am
Subject: Re: How do you htmlentities in Python

Here's one that handles numeric character references, and chooses to
leave entity references that are not defined in standard library
module htmlentitydefs intact, rather than throwing an exception.

It ignores the missing semicolon issue (and note also that IE can cope
with even a missing space, like "tr&eacutes mal", so you'll see that
in the wild).  Probably it could be adapted to handle that (possibly
the presumably-slower htmllib-based recipe on the python.org wiki
already does handle that, not sure).

import htmlentitydefs
import re
import unittest

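def unescape_charref(ref):
    # ref is a numeric character reference such as "&#38;" or "&#x2014;";
    # strip the leading "&#" and the trailing ";".
    name = ref[2:-1]
    base = 10
    if name.startswith("x"):
        # Hexadecimal form: drop the "x" and parse as base 16.
        name = name[1:]
        base = 16
    return unichr(int(name, base))

def replace_entities(match):
    ent = match.group()
    if ent[1] == "#":
        return unescape_charref(ent)

    # Named entity: look it up, and leave unknown names untouched.
    repl = htmlentitydefs.name2codepoint.get(ent[1:-1])
    if repl is not None:
        repl = unichr(repl)
    else:
        repl = ent
    return repl

def unescape(data):
    # Matches "&name;", "&#123;" and "&#x1f;"; the semicolon is required.
    return re.sub(r"&#?[A-Za-z0-9]+?;", replace_entities, data)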

class UnescapeTests(unittest.TestCase):

    def test_unescape_charref(self):
        self.assertEqual(unescape_charref(u"&#38;"), u"&")
        self.assertEqual(unescape_charref(u"&#x2014;"), u"\N{EM DASH}")
        self.assertEqual(unescape_charref(u"&#8212;"), u"\N{EM DASH}")

    def test_unescape(self):
        self.assertEqual(
            unescape(u"&amp; &lt; &mdash; &#8212; &#x2014;"),
            u"& < %s %s %s" % tuple(u"\N{EM DASH}"*3)
            )
        self.assertEqual(unescape(u"&a&amp;"), u"&a&")
        self.assertEqual(unescape(u"a&amp;"), u"a&")
        self.assertEqual(unescape(u"&nonexistent;"), u"&nonexistent;")

if __name__ == "__main__":
    unittest.main()

John
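
For comparison, the same test inputs can be fed to html.unescape, assuming Python 3.4 or later (the function is not available in the Python 2 this thread targets). This is only a sketch of how the cases above behave there; the samples list is taken from the tests plus the missing-semicolon example.

```python
import html

samples = [
    "&amp; &lt; &mdash; &#8212; &#x2014;",  # named and numeric references
    "&a&amp;",                               # stray ampersand before an entity
    "&nonexistent;",                         # unknown name stays intact
    "tr&eacutes mal",                        # missing semicolon
]

decoded = [html.unescape(s) for s in samples]
for before, after in zip(samples, decoded):
    print("%r -> %r" % (before, after))
```

html.unescape follows the HTML5 parsing rules, so it also decodes the semicolon-less &eacute in the last sample.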

