+3 votes
1.3k views
in Plugins by
edited by

To PHP developers:

I am developing a caching plugin for Q2A. My problem is that I cannot remove empty lines (LF?) with a regular expression in XAMPP (Windows). Ultimately, I want to remove comments, tabs, and empty lines. By "empty line" I mean a row that contains only CR, LF, or CRLF, with no other character before it.

Processing result:

<!DOCTYPE html>
<html>
    <<<=== New line (LF?) remains.
<head>

Expected results:

<!DOCTYPE html>
<html>
<head>

Failing code 1 (the empty line (LF?) is not removed):

private function compress_html($html) {
    $searchs = array(
        '/<!--[\s\S]*?-->/s', // remove comment
        '/\t/', // remove tab
        '/^(\r\n|\n\r|\n|\r)/', // remove only new line
    );
    $replaces = array(
        '',
        '',
        '',
    );
    return preg_replace($searchs, $replaces, $html);
}

Failing code 2 (the empty line (LF?) is not removed):

private function compress_html($html) {
    $searchs = array(
        '/<!--[\s\S]*?-->/s', // remove comment
        '/\t/', // remove tab
        '/^(\r\n|\n\r|\n|\r)/', // remove only new line
    );
    $replaces = array(
        '',
        '',
        '',
    );
    foreach($searchs as $key => $search)
        $html = preg_replace($search, $replaces[$key], $html);
    return $html;
}

Thanks.

Q2A version: 1.7
by
Thanks Leo for your tips. I will investigate and try it.

2 Answers

+2 votes
by
selected by
 
Best answer

If you want to remove all line breaks, then you should replace them with the empty string. The catch is that you'll end up with one long line. In your expected result, you want something like this:

<!DOCTYPE html>\n
<html>\n
<head>

So what you want to do is actually turn all repeated line breaks into a maximum of one. I guess something like this should get you your expected result:

$html = "<!DOCTYPE html>
<html>

<!--xyz blah-->Something else
        
<!--xyz

blah-->Again...

 

<head>";
$searchs = array(
    '/<!--.*?-->/s',
    '/\t/',
    '/[\r\n]+/',
);
$replaces = array(
    '',
    '',
    "\n",
);
echo preg_replace($searchs, $replaces, $html);

A better approach would be to use already existing libraries: https://github.com/mrclay/minify/blob/master/min/lib/Minify/HTML.php

Having said that, there are some things this leaves out. For example, if the HTML looks like "word\tword", you would see "word word" in the browser. However, if you remove the tabs, you'll see "wordword". The same happens with line breaks, so it would make more sense to match runs of tabs too and replace each run with a single whitespace character.
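As a rough illustration of that point, here is a minimal sketch that collapses runs of tabs into one space and runs of line breaks into one line break, instead of deleting them. The function name `collapse_whitespace` is made up for this example and is not part of Q2A or Minify.

```php
<?php
// Hypothetical helper (not from Q2A): collapse whitespace runs instead of deleting them.
function collapse_whitespace($html)
{
    $searches = array(
        '/\t+/',     // one or more consecutive tabs
        '/[\r\n]+/', // one or more consecutive line-break characters
    );
    $replacements = array(
        ' ',  // keep one space so "word\tword" stays two words
        "\n", // keep one line break
    );
    return preg_replace($searches, $replacements, $html);
}
```

With this, `collapse_whitespace("word\t\tword")` gives "word word" rather than "wordword". It still touches whitespace inside <pre>, so the caveats further down apply.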

Also note that you're trying to parse HTML with a regular expression. Regular expressions don't understand the hierarchical structure of HTML, so comments inside comments would again break things. It is better to use an HTML parser (which will, obviously, degrade performance).

Finally, take into account you are not considering <pre> tags. This will absolutely destroy their content so, again, you need an HTML parser to avoid them.
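If a full parser is too heavy, one possible stopgap is to split the page on <pre> blocks and only compress the text outside them. The sketch below is regex-based, not a parser: it assumes <pre> elements are not nested and it ignores other whitespace-sensitive elements such as <textarea> or <script>. The function name `compress_outside_pre` is hypothetical.

```php
<?php
// Hypothetical helper: collapse line breaks everywhere except inside <pre>...</pre>.
// Assumption: <pre> blocks are well-formed and not nested.
function compress_outside_pre($html)
{
    // With PREG_SPLIT_DELIM_CAPTURE the result alternates:
    // even index = text outside <pre>, odd index = a whole <pre> block.
    $parts = preg_split('#(<pre\b.*?</pre>)#is', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    if ($parts === false) {
        return $html; // on a PCRE error, leave the HTML untouched
    }
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            $parts[$i] = preg_replace('/[\r\n]+/', "\n", $part);
        }
    }
    return implode('', $parts);
}
```

The <pre> blocks are passed through byte for byte, so their internal blank lines survive while the rest of the page is compressed.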

Conclusion: try the library. If it doesn't do the trick, it is better not to touch the HTML. If it does, then measure the time it consumes to make sure the result really ends up being faster. Don't overlook enabling gzip compression for HTML: it might be all that is actually needed, and you wouldn't have to worry about compression at all because it happens in a different layer.

by
Thanks pupi. Since I was not able to solve this problem even after trying several approaches, I removed the empty lines with different logic.

private function remove_newline($html) {
    $lines = explode("\n", $html);
    $lines = array_filter($lines, 'strlen');
    $html = implode("\n", $lines);
    return $html;
}
However, since your suggestions are correct, I decided to use Minify. Your suggestion about gzip is also good, but since the plugin works on the generated HTML, I will not apply it.

Thanks.
by
minify is a great idea :) @pupi1985

Doing $lines = array_filter($lines, 'strlen'); is also an elegant/fast solution @sama55
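One caveat with the explode()/array_filter() approach: it assumes LF-only line endings. With CRLF input (plausible on Windows), an "empty" line still contains "\r", so strlen() returns 1 and the line is kept. A minimal sketch of a variant that splits on any line-break sequence follows; the function name `remove_empty_lines_any_eol` is made up for illustration, and note that it normalizes all line endings to LF.

```php
<?php
// Hypothetical variant of remove_newline() that also handles CRLF and CR.
function remove_empty_lines_any_eol($html)
{
    // \R matches any line-break sequence (\r\n, \n or \r) as one unit.
    $lines = preg_split('/\R/', $html);
    // Drop zero-length lines; whitespace-only lines are kept.
    $lines = array_filter($lines, 'strlen');
    return implode("\n", $lines);
}

$crlf = "<html>\r\n\r\n<head>";

// explode("\n") leaves "\r" on the empty line, so it survives the filter.
$exploded = implode("\n", array_filter(explode("\n", $crlf), 'strlen'));

// Splitting on \R removes the empty line regardless of the line-ending style.
$fixed = remove_empty_lines_any_eol($crlf);
```

Here `$exploded` is identical to the input, while `$fixed` is "<html>\n<head>".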
+1 vote
by
I think you may have a single character, new line or cr, which is then not removed since your regex requires two characters.

You could try '/^(\r|\n)/'

easier...try: '/^(\s)/'

which removes spaces/new lines/cr from the beginning of the expression.
by
To remove spaces/CR/LF from around the string, you could also use the trim() function. I am guessing that is what you are doing.

So after $html = preg_replace($search, $replaces[$key], $html);
you could do  $html = trim( $html );
by
Thanks. Unfortunately, neither of them produced the expected result.
by
This is a note for the future, as there is already a solution.

Without modifiers, '^' anchors to the beginning of the whole HTML string. If you want it to anchor to the beginning of each line, you need the 'm' modifier: '/^(\r|\n)/m'
For 'm', see the PCRE pattern modifiers: http://php.net/manual/en/reference.pcre.pattern.modifiers.php
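To make the difference concrete, here is a small sketch that runs the same pattern from the question with and without the 'm' modifier. The helper name `remove_empty_lines` is hypothetical, and the sample input uses LF line endings only.

```php
<?php
// Hypothetical helper: remove line terminators that sit at the start of a line,
// i.e. drop empty lines.
function remove_empty_lines($html)
{
    // 'm' makes ^ match at the start of every line, not only at offset 0.
    return preg_replace('/^(\r\n|\n|\r)/m', '', $html);
}

$html = "<!DOCTYPE html>\n<html>\n\n<head>";

// Without 'm', ^ only matches at offset 0, so nothing is removed here.
$without_m = preg_replace('/^(\r\n|\n|\r)/', '', $html);

// With 'm', the empty line between <html> and <head> is dropped.
$with_m = remove_empty_lines($html);
```

In this run `$without_m` equals the input unchanged, while `$with_m` is "<!DOCTYPE html>\n<html>\n<head>". Lines that contain only spaces are not matched by this pattern; lines that contained only tabs become empty once the tab-removal pattern has run first, as in the question's pattern order.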
by
Thanks steven. I got it.
...