XSS Sanitizing for HTML Input in PHP

May 13, 2009 | 12:25 p.m. CDT

For years I’ve been working on a PHP framework that I keep maintained because I use it on several projects. I’m currently working on getting the framework to a point where I can open source it so that others can use it. So I’m open to name suggestions!

One aspect of the framework that I’ve been working on in particular is sanitizing user input in fields where I want users to be allowed a limited subset of HTML. The sanitization is needed so that you can protect your website from XSS exploits. I’ve stayed up until the early morning hours way too many times researching this, and I still have not been able to find a solution that I’m happy with.

There are already libraries out there that you can use to sanitize against XSS attacks, but from what I’ve found, there are drawbacks to all of them.

The htmlentities/htmlspecialchars/filter Extension Non-Solution

It is true that you can use htmlentities / htmlspecialchars to sanitize your HTML. However, this only works if you don’t want to allow any HTML at all. So if you want to allow users to use tags like <strong> or <em> in a comment field, this won’t work, because instead of seeing bold, visitors would literally see <strong>bold</strong> when the comment is rendered.
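A quick sketch of why escaping alone doesn’t cut it here (plain PHP, no framework assumed):

```php
<?php
// Escaping turns the markup itself into visible text, so users
// can't use *any* HTML, safe or not -- the tags show up literally.
$comment = 'This is <strong>bold</strong>';
echo htmlspecialchars($comment, ENT_QUOTES);
// Output: This is &lt;strong&gt;bold&lt;/strong&gt;
// The browser then displays the tags as text instead of bold text.
```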

The Regex Blacklist Method

There are libraries out there that try to remove all unsafe data by using lots of regular expression replacements. The problem with this is that hackers are smart, or at least they can be, and if you go this route you have to update your list of regex replacements every time a new type of XSS exploit is discovered. And by the time you discover the exploit, the damage may already have been done. Also, if you’re like me, you don’t really want to spend time constantly updating your XSS sanitizer.
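To make the drawback concrete, here’s a rough sketch of what a blacklist pass looks like. The patterns are illustrative only, not a complete or safe list:

```php
<?php
// Naive blacklist: each pattern strips one known-bad construct,
// and anything the patterns don't anticipate slips through.
function naive_blacklist($html) {
    // Remove <script> blocks
    $html = preg_replace('#<script\b[^>]*>.*?</script>#is', '', $html);
    // Remove javascript: URL schemes
    $html = preg_replace('#javascript\s*:#i', '', $html);
    // Remove inline event handlers like onclick="..."
    $html = preg_replace('#\son\w+\s*=\s*("[^"]*"|\'[^\']*\'|[^\s>]+)#i', '', $html);
    return $html;
}

echo naive_blacklist('<script>alert(1)</script>ok');  // "ok"
// But an entity-encoded payload like href="jav&#x0A;ascript:alert(1)"
// sails straight through all three patterns.
```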

The Regex Whitelist Method

The regex whitelist method is where you choose which tags and attributes you want to allow and then remove any tag or attribute that isn’t in the whitelist. This method seems the most logical to me, and probably the safest, because instead of looking at all the possibilities of what you need to remove, you just remove everything you don’t accept. The HTML Purifier library looks like it probably does this very well. However, looking at the library makes me think twice about using it, because it’s freaking huge! HTML Purifier is 22,025 lines of code. I don’t know about you, but I’m always thinking about speed and scalability, and to me this library is just not usable because of its size.
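A bare-bones version of the whitelist idea can be sketched with strip_tags() plus an attribute-stripping pass. Note this sketch drops all attributes, including legitimate href values, so it’s only a starting point; a real whitelist like HTML Purifier’s validates attributes instead of discarding them:

```php
<?php
// Minimal whitelist sketch: keep only a few tags, drop every attribute.
function whitelist_clean($html) {
    // Remove every tag not in the allowed set.
    $html = strip_tags($html, '<p><strong><em><a>');
    // strip_tags keeps attributes on allowed tags, so strip them all.
    $html = preg_replace('#<(\w+)[^>]*>#', '<$1>', $html);
    return $html;
}

echo whitelist_clean('<a href="javascript:alert(1)">link</a><script>x</script>');
// Output: <a>link</a>x
```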

The Plain Text to HTML Method

The plain text to HTML method is where, in areas where you want to allow users a subset of HTML to style their content, you use strip_tags() to strip out all HTML and then convert the plain text to HTML using a library like Markdown or Textile. I’ve personally been using Markdown for years and love how simple it is for users to pick up, and how content written in Markdown is more readable for people who aren’t web-developer savvy.
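The pipeline is roughly the following. The Markdown() call shown in the comment is the function from the PHP Markdown library; swap in whatever converter you use:

```php
<?php
// Step 1: remove every HTML tag from the raw input.
$raw   = "I <script>alert(1)</script> *really* like this";
$plain = strip_tags($raw);
// $plain is now: "I alert(1) *really* like this"

// Step 2: let the converter generate the only HTML that ever
// reaches the page (library call shown for illustration):
// $html = Markdown($plain);
echo $plain;
```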

However, one major drawback that I’ve found is that Markdown still allows things like [evil link](javascript:alert('XSS')), which gets converted into a link that, when rendered and clicked on, displays a JavaScript alert. This of course is benign and harmless; however, there is the potential for an attacker to use it as an actual XSS exploit. So this leaves us back where we would need to use either the regex blacklist method or the regex whitelist method to take out the bad code.
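One way to plug that specific hole is a post-processing pass over the converter’s output that neutralizes dangerous URL schemes. This is a hypothetical sketch (clean_link_schemes is my name, not part of any library), it only handles double-quoted href attributes, and it assumes PHP 5.3+ for the closure:

```php
<?php
// Rewrite href attributes whose decoded, de-whitespaced URL starts
// with a dangerous scheme. Sketch only, not a complete defense.
function clean_link_schemes($html) {
    return preg_replace_callback(
        '#href\s*=\s*"([^"]*)"#i',
        function ($m) {
            $url = html_entity_decode($m[1]);
            // Browsers ignore whitespace/control chars inside the scheme,
            // so collapse them before checking.
            $scheme = strtolower(preg_replace('/[\s\x00-\x1f]+/', '', $url));
            if (preg_match('#^(javascript|vbscript|data):#', $scheme)) {
                return 'href="#"';
            }
            return $m[0];
        },
        $html
    );
}

echo clean_link_schemes('<a href="javascript:alert(\'XSS\')">evil link</a>');
// Output: <a href="#">evil link</a>
```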

No Perfect Solution

After exploring all my possibilities, it seems that there is no perfect solution. I think ultimately the best way to handle areas where you want to allow HTML is to use something like Markdown or Textile to convert text to HTML, and then pick a library that uses either the whitelist or the blacklist method for filtering and sanitizing bad code. But as I have already mentioned, you will have to settle for the drawbacks. If you go with the blacklist method, you will probably end up having to update your library constantly to keep up with XSS exploits. If you choose the whitelist method, you will have to settle for bloated, sluggish libraries like HTML Purifier, unless you can either find a better library or write your own.

I’m personally leaning toward modifying CodeIgniter’s xss_clean method from its Input class for my framework and trying to keep it maintained, or writing my own whitelist-method library that allows the same tags as Markdown but cleans out the harmful code that Markdown doesn’t filter out. Please leave a comment if you have a better method or if you know of a better HTML sanitization library that I didn’t mention.


Simon Willison
1.   At 9:49 a.m. CDT on May 14, 2009, Simon Willison wrote:

The problem with HTML sanitisation is that you're not just writing code to sanitise based on the HTML spec... you need to take into account the way actual browsers work. If a browser (such as IE) accepts malformed HTML and executes it, it might be a vector for an XSS attack that you couldn't have figured out from just reading the spec.

For example, there have been XSS attacks in the past that place a newline character in the middle of the string "javascript:" (the infamous Samy is my Hero MySpace worm abused that one).
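A small sketch of that bypass: a naive substring check misses the split-up scheme, while collapsing whitespace first (as browsers effectively do) catches it:

```php
<?php
// The Samy-worm-style bypass: a newline inside "javascript:".
$payload = "<a href=\"java\nscript:alert(1)\">click</a>";

// Naive check: no match, the payload sails through.
var_dump(preg_match('#javascript:#i', $payload));    // int(0)

// Collapse whitespace first, then check: the attack is caught.
$collapsed = preg_replace('/\s+/', '', $payload);
var_dump(preg_match('#javascript:#i', $collapsed));  // int(1)
```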

If you want to sanitise HTML properly, you need a full HTML parser that behaves like real browsers do. In the Python world, that means html5lib (the HTML 5 spec attempts to reverse engineer and specify the real-world parsing behaviour of existing browsers). I imagine that's why the HTML Purifier library for PHP is 22,000 lines of code.

Your best bet is to take advantage of someone else's work, provided they have a comprehensive unit test suite to back it up (which it looks like HTML Purifier does). If you roll your own solution you are very, very likely to miss something.

I suggest you benchmark HTML Purifier before writing it off for being "too bloated" - for the vast majority of web apps it's the database, not the application code that's the bottleneck. You may well find that its performance is perfectly fine.
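The benchmark is only a few lines. This sketch assumes HTML Purifier with its standard autoloader; the include path and the 'HTML.Allowed' directive come from its documentation and may differ in your install:

```php
<?php
require_once 'HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.Allowed', 'p,strong,em,a[href]');
$purifier = new HTMLPurifier($config);

// A reasonably sized dirty document to purify.
$dirty = str_repeat('<p>Hi <strong>there</strong> <script>evil()</script></p>', 100);

$start = microtime(true);
$clean = $purifier->purify($dirty);
printf("Purified %d bytes in %.4f seconds\n", strlen($dirty), microtime(true) - $start);
```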

Brent O'Connor
2.   At 6:43 p.m. CDT on May 14, 2009, Brent O'Connor wrote:

Thanks, Simon, for your thoughts. You are absolutely correct that I should benchmark HTML Purifier and not just assume that it's bloated based on its size. I do remember reading that someone else thought it was slow because of the way it relied so heavily on caching, or that they didn't like the way it cached its tests to speed things up.

I'm also going to need to look at licensing if I'm including libraries like HTML Purifier in my framework.

It would be nice if PHP had a native library that worked well. Also, one of the things that I haven't done yet is really dig in and look at how Django sanitizes its input and output at the lower level. All I know is that it seems to at least sanitize things at the output/template layer, and then if you want to use HTML you can use the safe filter in the template.

Simon Willison
3.   At 12:38 p.m. CDT on May 17, 2009, Simon Willison wrote:

Django doesn't have HTML sanitization built in - it's secure by default only because it defaults to escaping all entities that are output in the templates (the equivalent of htmlentities). These days I'd look to the Python html5lib library for sanitization.

Glenn J. Melton
4.   At 3:44 p.m. CDT on Aug. 20, 2009, Glenn J. Melton wrote:

I appreciate the candor, and I wonder: are there any examples of whitelist regex patterns out there?

Brent O'Connor
5.   At 9:25 a.m. CDT on Aug. 21, 2009, Brent O'Connor wrote:

@Glenn, I've pretty much shared all my research; I can't think of any other examples off the top of my head. I would just do a Google search. :)

Comments are closed.
