Screen scraping your way into RSS
By Dennis Pallett
Introduction

RSS is one of the hottest technologies at the moment, and even big web publishers (such as the New York Times) are getting into RSS. However, there are still a lot of websites that do not have RSS feeds.
If you still want to check those websites in your favourite aggregator, you need to create your own RSS feed for them. This can be done automatically with PHP, using a method called screen scraping. Screen scraping is usually frowned upon, as it's mostly used to steal content from other websites.
I personally believe that in this case, where it is used to automatically generate an RSS feed, screen scraping is not a bad thing. Now, on to the code!
Getting the content

For this article, we'll use PHPit as an example, despite the fact that PHPit already has RSS feeds.
We'll want to generate an RSS feed from the content listed on the front page. The first step in screen scraping is getting the complete page. In PHP this can be done very easily, by using implode("", file("[the url here]")), if your web host allows it. If you can't use file(), you'll have to use a different method of getting the page, e.g. the cURL library.
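If file() cannot open URLs on your host, a fallback along the following lines might work. This is only a sketch: fetch_page() is a hypothetical helper name that is not part of the original script, and the 15 second timeout is an arbitrary choice. It assumes the cURL extension is installed whenever allow_url_fopen is switched off.

```php
<?php
// Hypothetical helper (not from the original script): fetch a page as one string.
// Returns the page contents, or false when the page could not be fetched.
function fetch_page(string $url)
{
    // file() and file_get_contents() can only open URLs when allow_url_fopen is on.
    if (filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
        return file_get_contents($url);
    }

    // Otherwise fall back to cURL, if that extension is loaded.
    if (function_exists('curl_init')) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
        curl_setopt($ch, CURLOPT_TIMEOUT, 15);          // assumed timeout, in seconds
        return curl_exec($ch);                          // string on success, false on failure
    }

    return false;
}
```

With this in place, the page could be fetched with $data = fetch_page("http://www.phpit.net/"); and the result checked against false before parsing.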
Now that we have the page available, we can parse it for the content using some regular expressions. The key to screen scraping is looking for patterns that match the content, e.g. are all the content items wrapped in <div>s or something else? If you can successfully discover a pattern, then you can use preg_match_all() to get all the content items.
For PHPit, the pattern that matches the content is <div class="contentitem">[Content Here]</div>. You can verify this yourself by going to the main page of PHPit and viewing the source.
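To see what preg_match_all() returns for such a pattern, here is a small sketch run against made-up markup. The sample HTML, titles, and author names are assumptions for illustration only and are not taken from the real PHPit page.

```php
<?php
// Made-up sample markup standing in for the downloaded front page.
$data = '<div class="contentitem"><h3><a href="/article/one/">First article</a></h3>'
      . '<p>Intro text one. <span class="byline">By Alice</span></p></div>' . "\n"
      . '<div class="contentitem"><h3><a href="/article/two/">Second article</a></h3>'
      . '<p>Intro text two. <span class="byline">By Bob</span></p></div>';

// [^`] matches any character except a backtick, including newlines,
// so an item may span several lines. The ? makes the match lazy.
$pattern = '/<div class="contentitem">([^`]*?)<\/div>/';

// $count is the number of items found.
// $matches[0] holds each item including the wrapping <div>,
// $matches[1] holds only the HTML inside the <div>.
$count = preg_match_all($pattern, $data, $matches);
```

Because the match is lazy, it ends at the first closing </div> it meets, so an item that contains a nested <div> would be cut short. That is worth checking in the source of the page you want to scrape.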
Now that we have a matching pattern, we can get all the content items. The next step is to retrieve the individual information, i.e. url, title, author, text. This can be done by using some more regular expressions and str_replace() on each content item.
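The extraction of the individual fields could be sketched as follows. first_capture() is a hypothetical helper, and the sample item is made up. The only thing it adds is a check that the pattern actually matched before the result is used, which is useful when a site changes its markup.

```php
<?php
// Hypothetical helper: return the first capture group, trimmed,
// or null when the pattern does not match.
function first_capture(string $pattern, string $subject): ?string
{
    if (preg_match($pattern, $subject, $temp) === 1) {
        return trim($temp[1]);
    }
    return null;
}

// Made-up content item, shaped like the markup the patterns expect.
$match = '<h3><a href="/article/one/">First article</a></h3>'
       . '<p>Intro text one. <span class="byline">By Alice</span></p>';

$title  = first_capture('/">([^`]*?)<\/a><\/h3>/', $match);
$url    = first_capture('/<a href="([^`]*?)">/', $match);
$text   = first_capture('/<p>([^`]*?)<span class="byline">/', $match);
$author = first_capture('/<span class="byline">By ([^`]*?)<\/span>/', $match);

// An item without a title or url is probably not a real content item.
$usable = ($title !== null && $url !== null);
```

Items for which $usable is false can simply be skipped when the feed is generated.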
By now we have the following code:

<?php

// Get page
$url = "http://www.phpit.net/";
$data = implode("", file($url));

// Get content items
preg_match_all('/<div class="contentitem">([^`]*?)<\/div>/', $data, $matches);

Like I said, the next step is to retrieve the individual information, but first let's make a beginning on our feed, by setting the appropriate header (text/xml) and printing the channel information, etc.

// Begin feed
header("Content-Type: text/xml; charset=ISO-8859-1");
echo '<?xml version="1.0" encoding="ISO-8859-1" ?>' . "\n";
?>
<rss version="2.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:admin="http://webns.net/mvcb/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<channel>
    <title>PHPit Latest Content</title>
    <description>The latest content from PHPit (http://www.phpit.net), screen scraped!</description>
    <link>http://www.phpit.net</link>
    <language>en-us</language>

Now it's time to loop through the items and print their RSS XML. We first loop through each item and get all the information we need, by using more regular expressions and preg_match(). After that the RSS for the item is printed.

<?php
// Loop through each content item
foreach ($matches[0] as $match) {
    // First, get title
    preg_match('/">([^`]*?)<\/a><\/h3>/', $match, $temp);
    $title = $temp[1];
    $title = strip_tags($title);
    $title = trim($title);

    // Second, get url
    preg_match('/<a href="([^`]*?)">/', $match, $temp);
    $url = $temp[1];
    $url = trim($url);

    // Third, get text
    preg_match('/<p>([^`]*?)<span class="byline">/', $match, $temp);
    $text = $temp[1];
    $text = trim($text);

    // Fourth, and finally, get author
    preg_match('/<span class="byline">By ([^`]*?)<\/span>/', $match, $temp);
    $author = $temp[1];
    $author = trim($author);

    // Echo RSS XML
    echo "<item>\n";
    echo "  <title>" . strip_tags($title) . "</title>\n";
    echo "  <link>http://www.phpit.net" . strip_tags($url) . "</link>\n";
    echo "  <description>" . strip_tags($text) . "</description>\n";
    echo "  <content:encoded><![CDATA[\n";
    echo $text . "\n";
    echo "  ]]></content:encoded>\n";
    echo "  <dc:creator>" . strip_tags($author) . "</dc:creator>\n";
    echo "</item>\n";
}
?>

And finally, the RSS file is closed off.

</channel>
</rss>

That's all. If you put all the code together, like in the demo script, then you'll have a perfect RSS feed.
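One caveat: strip_tags() removes markup but does not escape characters such as &, so a title like "Q&A" would make the feed invalid XML. A possible way to guard against that is sketched below. Both xml_text() and xml_cdata() are hypothetical helpers that do not appear in the original script, and ENT_XML1 requires PHP 5.4 or later.

```php
<?php
// Hypothetical helper: turn scraped HTML into text that is safe inside an XML element.
function xml_text(string $value): string
{
    // Remove tags, then decode HTML entities so they are not escaped twice.
    $plain = html_entity_decode(strip_tags($value), ENT_QUOTES, 'ISO-8859-1');

    // Escape &, <, > and quotes for XML.
    return htmlspecialchars($plain, ENT_QUOTES | ENT_XML1, 'ISO-8859-1');
}

// Hypothetical helper: wrap HTML in a CDATA section.
// A literal "]]>" would end the section early, so it is split across two sections.
function xml_cdata(string $value): string
{
    return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $value) . ']]>';
}
```

In the loop, the title line would then read echo "  <title>" . xml_text($title) . "</title>\n"; and the content:encoded element could be printed with xml_cdata($text).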
Conclusion

In this tutorial I have shown you how to create an RSS feed from a website that does not have an RSS feed of its own yet. Though the regular expressions are different for each website, the principle is exactly the same.
One thing I should mention is that you shouldn't immediately screen scrape a website's content. E-mail them first about an RSS feed. Who knows, they might set one up themselves, and that would be even better.
Download sample script