I'm trying to strip things like \n, &gt;, etc. from comments I retrieved from the Reddit API. What is the standard way to achieve this?
Comments:
rz2yoj:
manifold360:The standard way to get comments from the Reddit API without the escaping is to use the "raw_json" query string parameter.
For example: https://www.reddit.com/user/kjnkjnkjnkjnkjnkjn.json?raw_json=1
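A minimal Go sketch of adding that parameter to a request URL before fetching it. The helper name withRawJSON is hypothetical, and the second URL in the usage note is only an illustration; the sketch builds the URL and leaves the HTTP request itself out.

```go
package main

import (
	"fmt"
	"net/url"
)

// withRawJSON (hypothetical helper) returns endpoint with raw_json=1 set in
// its query string, keeping any query parameters that are already present.
func withRawJSON(endpoint string) (string, error) {
	u, err := url.Parse(endpoint)
	if err != nil {
		return "", err
	}
	q := u.Query()
	q.Set("raw_json", "1")
	u.RawQuery = q.Encode() // Encode sorts parameters by key
	return u.String(), nil
}

func main() {
	s, err := withRawJSON("https://www.reddit.com/user/kjnkjnkjnkjnkjnkjn.json")
	if err != nil {
		fmt.Println("bad URL:", err)
		return
	}
	fmt.Println(s)
}
```

Going through net/url rather than appending "?raw_json=1" by hand keeps the result valid when the endpoint already has a query string.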
nesigma:
1lann:Is there any reason to use bluemonday when html/template already escapes dangerous characters?
nesigma:html/template escapes everything; bluemonday only sanitizes against XSS. The differences are detailed in its README.
1lann:Okay, so since html/template also covers XSS, there's no reason to use bluemonday on top.
broady:Yes, it would be pointless to use bluemonday on top of html/template. You would only use bluemonday by itself to sanitize HTML (and then possibly pass the result to html/template as type template.HTML).
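To make the distinction concrete, here is a small sketch using only the standard library. It shows that html/template escapes a plain string but passes a template.HTML value through untouched, which is why anything wrapped in template.HTML should already have been sanitized (for instance by bluemonday, which is not called here). The template text and the helper names render, renderEscaped and renderTrusted are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"html/template"
	"strings"
)

// commentTmpl is a hypothetical one-line template used only for this demo.
var commentTmpl = template.Must(template.New("comment").Parse("<p>{{.}}</p>"))

// render executes the template with value and returns the output.
func render(value interface{}) (string, error) {
	var sb strings.Builder
	if err := commentTmpl.Execute(&sb, value); err != nil {
		return "", err
	}
	return sb.String(), nil
}

// renderEscaped passes a plain string, so html/template escapes it.
func renderEscaped(s string) (string, error) { return render(s) }

// renderTrusted wraps s in template.HTML, which disables escaping.
// Only do this with markup that has already been sanitized.
func renderTrusted(s string) (string, error) { return render(template.HTML(s)) }

func main() {
	escaped, err := renderEscaped("<b>hi</b>")
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	trusted, err := renderTrusted("<b>hi</b>")
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Println(escaped) // <p>&lt;b&gt;hi&lt;/b&gt;</p>
	fmt.Println(trusted) // <p><b>hi</b></p>
}
```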
magpiecub:Use the html/template package.
If you don't need templating, you can just use the html package. (html.EscapeString)
If you mean to strip tags, then you probably want to check out x/net/html or goquery.
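A short sketch of the html package route, assuming the goal is either to escape text for output or to turn entity references such as &gt; back into plain characters. The wrapper names escapeText and decodeEntities are hypothetical; html.EscapeString and html.UnescapeString are the standard library calls doing the work. Tag stripping with x/net/html or goquery is not shown.

```go
package main

import (
	"fmt"
	"html"
)

// escapeText (hypothetical wrapper) escapes <, >, &, ' and " for safe output.
func escapeText(s string) string { return html.EscapeString(s) }

// decodeEntities (hypothetical wrapper) turns entity references such as
// &gt; or &amp; back into the characters they stand for.
func decodeEntities(s string) string { return html.UnescapeString(s) }

func main() {
	fmt.Println(escapeText("<b>1 & 2</b>"))                // &lt;b&gt;1 &amp; 2&lt;/b&gt;
	fmt.Println(decodeEntities("a &gt; b &amp;&amp; c &lt; d")) // a > b && c < d
}
```

Note that decoding is not the same as stripping: decodeEntities keeps the > character, whereas the original question asks to remove it.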
arp242:If you're just removing arbitrary strings then I think you want strings.Replace or regexp.
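A minimal sketch of the plain string replacement approach, assuming a small fixed list of substrings to remove. The list below and the name cleanComment are illustrative guesses, not something given in the thread. It uses strings.NewReplacer, a close relative of strings.Replace that handles several pairs in one pass.

```go
package main

import (
	"fmt"
	"strings"
)

// commentCleaner replaces newlines with a space and drops a few entity
// references. The exact list is an assumption; adjust it to the real input.
var commentCleaner = strings.NewReplacer(
	"\n", " ",
	"&gt;", "",
	"&lt;", "",
	"&amp;", "",
)

// cleanComment (hypothetical helper) applies all replacements in one pass.
func cleanComment(s string) string {
	return commentCleaner.Replace(s)
}

func main() {
	fmt.Printf("%q\n", cleanComment("&gt; quoted\nreply")) // " quoted reply"
}
```

This only covers the strings listed explicitly, so anything not in the list passes through unchanged.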
magpiecub:Probably not a good idea to "roll your own" HTML escaping code, especially not if the input is untrusted (like Reddit comments).
arp242:Sanitizing and escaping are two totally different concepts. OP is asking about sanitizing.
If they only need to strip newlines and things that look like HTML character entity references, then regexp should work fine.
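A rough sketch of the regexp variant described above, assuming "things that look like HTML character entity references" means named (&gt;), decimal (&#39;) or hexadecimal (&#x27;) forms ending in a semicolon. The pattern and the name stripEntitiesAndNewlines are assumptions for illustration. It removes matches rather than decoding them, and it is not an HTML sanitizer.

```go
package main

import (
	"fmt"
	"regexp"
)

// entityOrNewline matches named, decimal and hexadecimal entity references,
// plus LF and CRLF line breaks. It is compiled once at package level so the
// compilation cost is not paid on every call.
var entityOrNewline = regexp.MustCompile(
	`&(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);|\r?\n`)

// stripEntitiesAndNewlines (hypothetical helper) deletes every match.
func stripEntitiesAndNewlines(s string) string {
	return entityOrNewline.ReplaceAllString(s, "")
}

func main() {
	fmt.Printf("%q\n", stripEntitiesAndNewlines("a &gt; b\nc &#39;d&#x27;")) // "a  bc d"
}
```

A bare ampersand without a trailing semicolon, as in "R&D", is left alone by this pattern.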
relvae:OP isn't 100% clear on what the input looks like, but the mention of &gt; makes it sound like there could be embedded HTML in there.
I would expect that the Reddit API removes most of the truly harmful onload=.. stuff, but you can never be sure.
Either way, I don't really see a downside to using an established library, so why not use it, just to be on the safe side?
And regexp is super slow.
