php - simple_html_dom: 403 Access denied


I implemented this function to parse HTML pages using two different "methods". As you can see, both use the very handy class called simple_html_dom. The difference is that the first method also uses curl to load the HTML, while the second does not.

Both methods work fine on a lot of pages, but I'm struggling with this specific call: searchThroughDOM('https://fr.shopping.rakuten.com/offer/buy/3458931181/new-york-1997-4k-ultra-hd-blu-ray-blu-ray-bonus-edition-boitier-steelbook.html', 'simple_html_dom');

In both cases, I end up with a 403 access denied response. Did I do something wrong? Or is there another method in order to avoid this type of denial?

function searchThroughDOM ($url, $method)
{
    echo '$url = '.$url.'<br>'.'$method = '.$method.'<br><br>';
    $time_start = microtime(true);

    switch ($method) {
        case 'curl':
            $curl = curl_init();
            curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
            curl_setopt($curl, CURLOPT_HEADER, false);
            curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
            curl_setopt($curl, CURLOPT_URL, $url);
            curl_setopt($curl, CURLOPT_REFERER, $url);
            curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36');
            $str = curl_exec($curl);
            curl_close($curl);

            // Create a DOM object
            $html = new simple_html_dom();
            // Load HTML from a string
            $html->load($str);
            break;

        case 'simple_html_dom':
            $html = new simple_html_dom();
            $html->load_file($url);
            break;
    }

    $collection = $html->find('h1');

    foreach($collection as $x => $x_value) {
        echo 'x = '.$x.' => value = '.$x_value.'<br>';
    }

    $html->save('result.htm');
    $html->clear();

    $time_end = microtime(true);
    echo 'Elapsed Time (DOM) = '.($time_end - $time_start).'<br><br>';
}

Answer

Solution:

From my point of view, there is nothing wrong with simple_html_dom. You can remove the simple_html_dom part of the code and keep only the cURL request, which I assume is the source of the problem. There are many reasons why cURL might not work on a given page. First of all, I can see that you set

curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false); you could also try setting CURLOPT_SSL_VERIFYHOST to false (0). Note that both settings turn off certificate checks, so use them only for testing.

Secondly, check your cURL version and see whether it is too old. Third, if neither of the above works, you may want to enable cookies: with cookies disabled, the website may detect that the request comes from a machine rather than a real person. Lastly, if all of the above attempts fail, try another library or even file_get_contents. cURL is not your only option, though of course it is the most powerful one.
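The cookie suggestion above could be sketched roughly as follows. This is a minimal sketch, not a guaranteed fix: the helper names (buildBrowserHeaders, isForbidden, fetchWithCookies), the header values and the cookie file path are all assumptions made for illustration, and a site with stricter bot protection may still answer 403. It also reads the HTTP status code, so a denied request is noticed before the body is handed to simple_html_dom.

```php
<?php
// Hypothetical helpers for illustration; none of these names come from
// simple_html_dom or from the code in the question.

/** Headers a desktop browser typically sends (illustrative values). */
function buildBrowserHeaders(): array
{
    return [
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language: fr-FR,fr;q=0.9,en;q=0.8',
    ];
}

/** True when the server refused the request with 403 Forbidden. */
function isForbidden(int $status): bool
{
    return $status === 403;
}

/**
 * Fetch a page with cookies enabled and report the HTTP status.
 * Returns ['status' => int, 'body' => string|false, 'error' => string].
 */
function fetchWithCookies(string $url, string $cookieFile): array
{
    $curl = curl_init($url);
    curl_setopt_array($curl, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_ENCODING       => '',          // accept any encoding cURL supports
        CURLOPT_HTTPHEADER     => buildBrowserHeaders(),
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
        CURLOPT_COOKIEFILE     => $cookieFile, // send cookies stored in this file
        CURLOPT_COOKIEJAR      => $cookieFile, // save received cookies on close
    ]);

    $body   = curl_exec($curl);
    $status = (int) curl_getinfo($curl, CURLINFO_HTTP_CODE);
    $error  = curl_error($curl);
    curl_close($curl);

    return ['status' => $status, 'body' => $body, 'error' => $error];
}

/** Version of the libcurl build PHP is linked against. */
function curlVersionString(): string
{
    return curl_version()['version'];
}
```

A caller might run fetchWithCookies($url, sys_get_temp_dir().'/cookies.txt'), check isForbidden($result['status']), and only pass $result['body'] to $html->load() when the request was not denied. Keeping the fetch separate from the parsing also makes it easier to confirm that the 403 comes from the HTTP request rather than from simple_html_dom.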
