Broken 🔗 Link Checker Using PHP and cURL

Whether you operate a commercial site, a directory, or a personal site, it is important to ensure you do not have ‘dead’ links on your website. Broken links (links that point to inactive domains or 404 pages) are of little use to your site visitors and may jeopardise any good search engine rankings you have, as broken links can suggest that your site is not well maintained.

Running a script that periodically checks the links on your pages lets you quickly alter or remove links that are no longer active or useful.

The following script will do this task for you, using PHP and cURL, with a simple HTML parser to find links on a page. Simply enter a URL into the form, and the results will appear in an iframe on the same page.
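The form and iframe mentioned above could look something like the following. This is only a minimal sketch, assuming the field names that the handler code reads (`url`, `choice`, and the `iframe` query parameter); the helper `render_form`, the file name `checker.php`, and the iframe name `results` are hypothetical and not taken from the original script.

```php
<?php
// Hypothetical helper: builds the input form and the results iframe.
// Assumption: the script posts to itself and targets an iframe named "results".
function render_form(string $action): string
{
    // Escape the script name before placing it in HTML attributes
    $action = htmlspecialchars($action, ENT_QUOTES);

    return <<<HTML
<form method="post" action="{$action}" target="results">
    <input type="text" name="url" placeholder="http://www.example.com/">
    <input type="submit" name="choice" value="Check Links">
    <input type="submit" name="choice" value="Check Files">
</form>
<iframe name="results" src="{$action}?iframe=1" width="100%" height="400"></iframe>
HTML;
}
```

Calling `echo render_form('checker.php');` would print the form. Because the form targets the iframe, the POST response is rendered inside it, while the initial `?iframe=1` request only shows the placeholder text.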

class html_parser {

    // A function to convert relative links to absolute links
    public function rel2abs($rel,$base) {
        @$p = parse_url($rel);
        if(!$rel)
            return $base;
        if(isset($p['scheme']) && $p['scheme']) {
            if(!isset($p['path'])) {
                if(isset($p['query']))
                    $rel = preg_replace("'\?'",'/?',$rel,1);
                else
                    $rel .= '/';
            }
            return $rel; /* return if already absolute URL */
        }
        if($rel[0]=='#' || $rel[0]=='?')
            return $base.$rel; /* queries and anchors */
        $path = ''; /* default in case the base URL has no path */
        extract(parse_url($base)); /* parse base URL and convert to local variables: $scheme, $host, $path */
        $path = preg_replace('#/[^/]*$#', '', $path); /* remove non-directory element from path */
        if ($rel[0] == '/')
            $path = ''; /* destroy path if relative url points to root */
        $abs = "$host$path/$rel"; /* dirty absolute URL */
        $re = array('#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#'); /* replace '//' or '/./' or '/foo/../' with '/' */
        for($n=1; $n>0; $abs=preg_replace($re, '/', $abs, -1, $n))
            ;
        return $scheme.'://'.$abs; /* absolute URL is ready! */
    }

 

    // DOM functions used to find URLs
    function parse_for_links($dom,$url,$tag,$attr,&$i) {
        foreach($dom->getElementsByTagName($tag) as $link) {
            $href = $link->getAttribute($attr);
            // Skip empty links, in-page anchors and javascript: links
            if(!strlen($href) || $href[0] == '#' || preg_match("'^javascript'i",$href))
                continue;
            // Make the link absolute, then strip the scheme and any fragment
            $href = preg_replace(array("'^[^:]+://'","'#.+$'"),'',$this->rel2abs($href,$url));
            if(isset($done[$href]))
                continue;
            $anchor = $link->nodeValue;
            // Prepare a HEAD request for command-line cURL
            $string = 'curl -I -A "Broken Link Checker" -s --max-redirs 5 -m 5 --retry 1 --retry-delay 10 -w "%{url_effective}\t%{http_code}\t%{time_total}" -o temp2.txt '.escapeshellarg($href);
            // Run the command and split the -w output into URL, HTTP code and total time
            $string = explode("\t",(string) shell_exec($string));
            if($string[1][0] == '2')
                $color = 'green';
            elseif($string[1][0] == '3')
                $color = 'yellow';
            else
                $color = 'red';
            echo (++$i).'. '.$string[1].' '.$string[2].' '.str_pad($string[0],50,' ',STR_PAD_RIGHT)."\n";
            $done[$href] = true;
            if($i > 100) // Limiting to 100 URLs, you can change this to suit your needs.
                break;
            flush();
        }
    }
}
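The class above shells out to the `curl` command-line tool. Where shell access is unavailable, a similar check could be done with PHP's bundled cURL extension instead. The following is only a sketch of that alternative, not the original author's method; the helper names `status_colour` and `check_link` are hypothetical, and the options simply mirror the command-line flags used above.

```php
<?php
// Hypothetical helper: maps an HTTP status code to the colours used by the script.
function status_colour(int $code): string
{
    if ($code >= 200 && $code < 300) {
        return 'green';
    }
    if ($code >= 300 && $code < 400) {
        return 'yellow';
    }
    return 'red'; // 4xx, 5xx, and 0 (no HTTP response received)
}

// Hypothetical helper: sends a HEAD request and reports what came back.
// Assumes the cURL extension is installed and enabled.
function check_link(string $url): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true, // HEAD request, like `curl -I`
        CURLOPT_RETURNTRANSFER => true, // return the response instead of printing it
        CURLOPT_USERAGENT      => 'Broken Link Checker',
        CURLOPT_MAXREDIRS      => 5,    // only applies if CURLOPT_FOLLOWLOCATION is enabled
        CURLOPT_TIMEOUT        => 5,    // seconds, like `-m 5`
    ]);
    curl_exec($ch);

    $code = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);

    return [
        'url'    => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),
        'code'   => $code,
        'time'   => curl_getinfo($ch, CURLINFO_TOTAL_TIME),
        'colour' => status_colour($code),
    ];
}
```

A call such as `check_link('http://www.example.com/')` would return an array with the effective URL, the status code, the total time in seconds, and the display colour. Keeping the colour mapping in its own function means it can be tested without any network access.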

 

// Loads the iframe with some default text
    if(isset($_GET['iframe'])) {
        echo 'Results will appear here';
        exit(0);
    }

// You have submitted a URL to check
    if(isset($_POST['url'],$_POST['choice'])) {
        @$url = parse_url($_POST['url']);
        if(!isset($url['host']))
            echo 'The URL you provided was invalid, please submit a valid URL';
        else {
            // Prepare the command to send to cURL (on the command line)
            $string = 'curl -A "Broken Link Checker" -s --max-redirs 5 -m 5 --retry 1 --retry-delay 10 -w "%{url_effective}\t%{http_code}\t%{size_download}\t%{time_total}" -o temp.txt '.escapeshellarg($_POST['url']);

            // Run the command; the page body is saved to temp.txt and the -w output is returned
            // Check the HTTP response type
            $string = explode("\t",(string) shell_exec($string));

            if($string[1][0] == '2')
                $color = 'green';
            elseif($string[1][0] == '3')
                $color = 'yellow';
            else
                $color = 'red';

            echo 'Fetched '.$string[0].' ('.$string[2].' bytes) in '.$string[3].' seconds, it returned a '.$string[1].' response';

            echo "\n";

            $_html_parser = new html_parser;
            $dom = new DOMDocument;
            @$dom->loadHTML(file_get_contents('temp.txt'));
            $i = 0;

            if($_POST['choice'] == 'Check Links') {
                // Checking <a> and <area> references
                $_html_parser->parse_for_links($dom,$_POST['url'],'a','href',$i);
                $_html_parser->parse_for_links($dom,$_POST['url'],'area','href',$i);
            } elseif($_POST['choice'] == 'Check Files') {
                // Checking ,