
As I am new to htmLawed, I have five newbie questions about sanitizing certain HTML tags.

Question n°1:

In q2a, file qa-htmLawed.php on line 20 we have:

$e = array('a'=>1, 'abbr'=>1, 'acronym'=>1, 'address'=>1, 'applet'=>1, 'area'=>1, 'b'=>1, 'bdo'=>1, 'big'=>1, 'blockquote'=>1, 'br'=>1, 'button'=>1, 'caption'=>1, 'center'=>1, 'cite'=>1, 'code'=>1, 'col'=>1, 'colgroup'=>1, 'dd'=>1, 'del'=>1, 'dfn'=>1, 'dir'=>1, 'div'=>1, 'dl'=>1, 'dt'=>1, 'em'=>1, 'embed'=>1, 'fieldset'=>1, 'font'=>1, 'form'=>1, 'h1'=>1, 'h2'=>1, 'h3'=>1, 'h4'=>1, 'h5'=>1, 'h6'=>1, 'hr'=>1, 'i'=>1, 'iframe'=>1, 'img'=>1, 'input'=>1, 'ins'=>1, 'isindex'=>1, 'kbd'=>1, 'label'=>1, 'legend'=>1, 'li'=>1, 'map'=>1, 'menu'=>1, 'noscript'=>1, 'object'=>1, 'ol'=>1, 'optgroup'=>1, 'option'=>1, 'p'=>1, 'param'=>1, 'pre'=>1, 'q'=>1, 'rb'=>1, 'rbc'=>1, 'rp'=>1, 'rt'=>1, 'rtc'=>1, 'ruby'=>1, 's'=>1, 'samp'=>1, 'script'=>1, 'select'=>1, 'small'=>1, 'span'=>1, 'strike'=>1, 'strong'=>1, 'sub'=>1, 'sup'=>1, 'table'=>1, 'tbody'=>1, 'td'=>1, 'textarea'=>1, 'tfoot'=>1, 'th'=>1, 'thead'=>1, 'tr'=>1, 'tt'=>1, 'u'=>1, 'ul'=>1, 'var'=>1); // 86/deprecated+embed+ruby

Do I have to change 1 to 0 to disallow certain elements? → No, use the config parameters for that; there is no need to change the source of qa-htmLawed.php.


Question n°2: How can I remove all style="..." attributes except those that use only allowed properties?

Question n°3: Are empty style attributes removed automatically?

Question n°4: How can class="" and id="" attributes be removed completely (what settings do we need)?

Question n°5: How can we remove empty tags, such as <b></b> or <p></p>?

 

Related question: http://www.question2answer.org/qa/17798/has-somebody-used-htmlawed-to-clean-user-input-in-q2a

Q2A version: 1.5.3

Digging deeper into Q2A: qa-plugin/wysiwyg-editor/qa-wysiwyg-editor.php takes care of the posted content. On line 228 it calls:

'content' => qa_sanitize_html($html, false, true), // qa_sanitize_html() is ESSENTIAL for security

qa_sanitize_html(), which you find in qa-include/qa-base.php from line 690, is what calls the htmLawed cleaner. The array passed there in $safe=htmLawed($html, array(...) is the config for htmLawed.

The htmLawed config documentation is here: http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed/htmLawed_README.htm#s2.2

1 Answer


Answered by the developer of htmLawed, thanks patnaik!

Restricting elements, attributes, and attribute values / 5 questions

Q1. I would use config: 
'elements' => 'img, a, p, br, span, b, strong, i, em, u, sub, sup, strike, table, caption, tbody, tr, td',

******** Q1. Yes, the config is correct.

******** Q2. There are a couple of ways to achieve this. One uses the 'spec' argument of htmLawed, and the other a 'hook_tag' function. For the latter, see this web-page or this forum topic. Below is an example that uses 'spec':

// For these elements ('p', 'span', 'b'...), the 'style' attribute value must not match our pattern.
// The pattern looks for a CSS property name (like 'text-align') followed by a colon in the 'style' value.
// The whole 'style' attribute (name and value) is removed if the value includes a property whose name
//   does not end in 'color', 'font-weight', 'text-decoration' or 'background-color'.

$spec = '
  p, span, b, i, em, strong, sub, sup, strike, table, caption, tbody, tr, td = 
    style(nomatch=%"("?<!background-color"|"color"|"font-weight"|"text-decoration")"\s*:%i);
';

$out = htmLawed($in, $config, $spec);
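
To see which style values this rule lets through, the regular expression can be tried on its own in plain PHP. This is only a sketch of the pattern's behaviour, not of htmLawed itself. It assumes, going by the htmLawed documentation on special characters in 'spec' rules, that the double quotes around "(", "|" and ")" are htmLawed's escaping and are not part of the expression. The helper name style_is_rejected() is made up for illustration.

```php
<?php
// The pattern from the 'spec' rule, with htmLawed's double-quote escaping removed (assumption).
$pattern = '%(?<!background-color|color|font-weight|text-decoration)\s*:%i';

// Hypothetical helper: true if a 'nomatch' test with this pattern would reject the style value.
function style_is_rejected(string $style, string $pattern): bool
{
    return preg_match($pattern, $style) === 1;
}

$samples = array(
    'color:red',                                   // only allowed property names
    'background-color:#ff0',                       // name ends in 'color'
    'font-weight:bold; text-decoration:underline', // two allowed names
    'text-align:center',                           // name not in the list
    'color:red; margin:0',                         // one disallowed name is enough
    'color : red',                                 // white space before the colon
);

foreach ($samples as $style) {
    printf("%-46s %s\n", $style, style_is_rejected($style, $pattern) ? 'removed' : 'kept');
}
```

Two things stand out when running this. The rule is all or nothing: one disallowed property drops the whole style attribute, including the allowed properties next to it. Also, with this expression alone, a space between the property name and the colon ('color : red') counts as a match, so such a value would be removed as well; whether that matters depends on what the editor produces.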

******** Q3. With either the 'hook_tag' or the 'spec' option, both the 'style' attribute name and the attribute value will be removed from the attribute string of the element.

******** Q4. There are a number of options, including using 'hook_tag'. But the simplest is to use the 'deny_attribute' config parameter:

$config['deny_attribute'] = 'class, id';
$out = htmLawed($in, $config, $spec);

******** Q5. I assume you mean that you want to trim the end of content to remove all white space and empty elements (elements without non-white-space content). htmLawed does not have a direct functionality for this. I suggest using some regular expression-based search-replace operation on the content before it is htmLawed-filtered.
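
A minimal sketch of such a search-replace step, to be run on the content before it goes to htmLawed. The function name remove_empty_elements() and the default tag list are assumptions for illustration; neither comes from htmLawed or Q2A. It treats an element as empty when it contains nothing or only white space, and it repeats the replacement so that nested empties such as <p><b></b></p> disappear too.

```php
<?php
// Hypothetical helper: strip elements that contain nothing but white space.
// $tags must be plain tag names (letters and digits only), since they go into the pattern as-is.
function remove_empty_elements(string $html, array $tags = array('b', 'i', 'em', 'strong', 'u', 'span', 'p')): string
{
    $names = implode('|', $tags);

    // Opening tag with optional attributes, optional white space, then the matching closing tag.
    // \b keeps 'b' from matching <br> and 'p' from matching <pre>.
    $pattern = '%<(' . $names . ')\b[^>]*>\s*</\1\s*>%i';

    // Repeat until nothing changes: removing <b></b> may leave an empty <p></p> behind.
    do {
        $html = preg_replace($pattern, '', $html, -1, $count);
    } while ($count > 0);

    return $html;
}

echo remove_empty_elements('<p><b></b></p>text<p> </p>'), "\n";
```

This is regular-expression matching, not HTML parsing, so it will misjudge tags whose attribute values contain a ">" character. If the editor inserts <p>&nbsp;</p> for blank lines, the pattern would need to accept &nbsp; as well; that is left out here on purpose.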

from: http://www.bioinformatics.org/phplabware/forum/viewtopic.php?pid=695

 
