php - How to create a sitemap with page relationships

I'm trying to write a script (preferably PHP) that crawls a site and creates a sitemap. In addition to the standard listing of pages, I'd like the script to keep track of which pages link to which other pages.

Example pages

A
B
C
D

I'd like the output to give me something like the following.

Page Name: A

Pages linking to Page A:

  • B
  • C
  • D

Page Name: B

Pages linking to Page B:

  • A
  • C

etc...

I've come across multiple standard sitemap scripts, but nothing that really accomplishes what I am looking for.


EDIT: It seems I didn't give enough info. Sorry about the lack of clarity. Here is the code I currently have. I'm using simple_html_dom.php to parse and search through the HTML for me.

<?php

include("simple_html_dom.php");

$url = 'page_url';

$html = new simple_html_dom(); 
$html->load_file($url);

$linkmap = array();

foreach($html->find('a') as $link):
    // keep only links whose href contains the substring of interest
    if(strpos((string) $link->href, "cms/education") !== false):
        if(!isset($linkmap[$link->href])):
            $linkmap[$link->href] = array();
        endif;
    endif;
endforeach;

?>

Note: My little foreach loop just filters based on a specific substring in the url.

So, I have the necessary first level pages. Where I am stuck is in creating a loop that will not run indefinitely, while keeping track of the pages you have already visited.

Answer

Solution:

Basically, you need two arrays to control the flow here. The first will keep track of the pages you need to look at and the second will track the pages you have already looked at. Then you just run your existing code on each page until there are none left:

<?php

include("simple_html_dom.php");

$urlsToCheck = array('page_url'); // pages still to look at
$urlsChecked = array();           // pages already looked at
$linkmap = array();               // href => list of pages that link to it

while(count($urlsToCheck) > 0)
{
   $url = array_pop($urlsToCheck);
   if (!in_array($url, $urlsChecked))
   {
      $urlsChecked[] = $url;

      $html = new simple_html_dom(); 
      $html->load_file($url);

      foreach($html->find('a') as $link):
          // work with the href string, not the element object
          $href = (string) $link->href;
          if(strpos($href, "cms/education") !== false):
              if((!in_array($href, $urlsToCheck)) && (!in_array($href, $urlsChecked)))
                 $urlsToCheck[] = $href;

              if(!isset($linkmap[$href])):
                  $linkmap[$href] = array();
              endif;

              // record that the current page links to $href
              if(!in_array($url, $linkmap[$href])):
                  $linkmap[$href][] = $url;
              endif;
          endif;
      endforeach;
   }
}

?>
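
To get from the crawl to the "Pages linking to Page X" listing in the question, the same two-array loop can also record, for every link it sees, which page the link was found on. The following is only a minimal sketch of that idea, not a drop-in crawler: buildInboundLinks(), formatInboundLinks(), and the in-memory $site array are hypothetical names made up for illustration, and the $getLinks callback stands in for whatever actually fetches and filters the links of a page (for example the simple_html_dom loop above). It assumes every link is already an absolute, comparable URL.

```php
<?php

// Crawl from $startUrl and return an array of page => pages linking to it.
// $getLinks is a hypothetical callback: given a URL, it returns the list of
// link targets found on that page.
function buildInboundLinks($startUrl, callable $getLinks)
{
    $toCheck = array($startUrl);            // pages still to look at
    $checked = array();                     // set of visited pages: url => true
    $inbound = array($startUrl => array()); // page => pages linking to it

    while (count($toCheck) > 0) {
        $url = array_shift($toCheck);       // first in, first out
        if (isset($checked[$url])) {
            continue;                       // already crawled, so the loop terminates
        }
        $checked[$url] = true;

        foreach ($getLinks($url) as $href) {
            if (!isset($inbound[$href])) {
                $inbound[$href] = array();
            }
            // Record the relationship "$url links to $href" once.
            if (!in_array($url, $inbound[$href], true)) {
                $inbound[$href][] = $url;
            }
            if (!isset($checked[$href])) {
                $toCheck[] = $href;
            }
        }
    }

    ksort($inbound);                        // list pages in a stable order
    return $inbound;
}

// Render the map in the layout asked for in the question.
function formatInboundLinks(array $inbound)
{
    $out = '';
    foreach ($inbound as $page => $sources) {
        $out .= "Page Name: $page\n\nPages linking to Page $page:\n\n";
        foreach ($sources as $source) {
            $out .= "  • $source\n";
        }
        $out .= "\n";
    }
    return $out;
}

// Made-up example site: page => links found on that page.
$site = array(
    'A' => array('B', 'C', 'D'),
    'B' => array('A'),
    'C' => array('A', 'B'),
    'D' => array('A'),
);

$inbound = buildInboundLinks('A', function ($url) use ($site) {
    return isset($site[$url]) ? $site[$url] : array();
});

echo formatInboundLinks($inbound);
```

With the example data, Page A is listed with B, C and D as the pages linking to it, and Page B with A and C. Keeping the visited pages as array keys and checking them with isset() avoids scanning the whole list with in_array() on every link, which starts to matter on larger sites.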
