Screen-Scraping With CSS Selectors in PHP
Tuesday, Jul 7, 2009, 7:36 pm | In General, Programming
When it comes to programming techniques, screen-scraping can be complicated and annoying to deal with. I've dealt with it quite a bit in various projects, most recently my TV Library plugin for boxee. I use PHP for most of my screen scraping, and it's got a pretty great arsenal. However, if ease of use and simplicity are your goals, the built-in tools and techniques won't be much help; they can be pretty convoluted and confusing. That's why I'm going to recommend a library called phpQuery.
phpQuery is, among other things, a PHP port of the JavaScript library jQuery. You front-end developers out there may already know where I'm going with this, but in case you don't: jQuery is known for many things, and is probably best known for its incredible support of CSS Selectors. CSS Selectors allow you to select HTML elements using the same syntax you'd use to style elements in a CSS stylesheet. (Some of you more experienced readers will no doubt be jumping to the comments section to complain that I'm not even mentioning XPath. Well, I think XPath is overly complicated and can be extremely confusing to beginners. CSS Selectors are far more approachable. Also, I am of the mindset that if at any time we can standardize on some type of well-tested technique in web development, we should.)
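As a rough illustration of what those selectors look like in practice, here is a minimal sketch that runs a few of them against an inline HTML string. The markup, IDs, and class names are made up for this example, and it assumes phpQuery is installed at the same path used in the roster example in this post.

```php
<?php
// Assumption: phpQuery lives at this path, as in the roster example in this post.
require("phpQuery/phpQuery.php");

// Hypothetical markup, invented purely for illustration.
$html = '<div id="nav"><ul class="menu">'
      . '<li><a href="/home">Home</a></li>'
      . '<li class="active"><a href="/blog">Blog</a></li>'
      . '</ul></div>';

$doc = phpQuery::newDocumentHTML($html);

// Tag selector: every <a> element in the document.
$linkCount = $doc->find('a')->size();

// ID, class, and child combinators, written exactly as you would in CSS.
$active = $doc->find('#nav > ul.menu > li.active > a');
$activeText = trim($active->text());
$activeHref = $active->attr('href');

print "{$linkCount} links, active: {$activeText} ({$activeHref})\n";
```

If a selector would match an element in your stylesheet, the same string should pick out that element here.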
There could be entire books written about what CSS Selectors are and how to utilize them best, so I'm not going to go into too much detail here. If you want to learn about all the intricacies of CSS and jQuery-style selectors, first look here: jQuery Docs: Selectors, and if you're still confused, a simple Google search will likely yield all the information you'll need. But for now, all you need to know is this: it's the easiest way I've found to screen scrape anything. Ever. phpQuery will let you turn this:
<?php
// Fetch the raw roster page.
$url = "http://www.nfl.com/teams/dallascowboys/roster?team=DAL";
$raw = file_get_contents($url);
// Strip tabs, newlines, and double spaces so the patterns below can match across lines.
$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
$content = str_replace($newlines, "", html_entity_decode($raw));
// Cut out the roster table by searching for its opening and closing tags.
$start = strpos($content,'<table cellpadding="2" class="standard_table"');
$end = strpos($content,'</table>',$start) + 8; // 8 = strlen('</table>')
$table = substr($content,$start,$end-$start);
// Pull out each row, skip header rows, then pull out each cell.
preg_match_all("|<tr(.*)</tr>|U",$table,$rows);
foreach ($rows[0] as $row){
    if ((strpos($row,'<th')===false)){
        preg_match_all("|<td(.*)</td>|U",$row,$cells);
        $number = strip_tags($cells[0][0]);
        $name = strip_tags($cells[0][1]);
        $position = strip_tags($cells[0][2]);
        echo "{$position} - {$name} - Number {$number} <br>\n";
    }
}
?>
Into this:
<?php
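require("phpQuery/phpQuery.php");
// Fetch the page, then hand the loaded document to the success1() callback.
phpQuery::browserGet('http://www.nfl.com/teams/dallascowboys/roster?team=DAL', 'success1');
function success1($browser) {
    // Each <tr> in the body of the table with id "result" is one player.
    foreach($browser['#result > tbody > tr'] as $player) {
        // Collect the text of every <td> in the row.
        $player = pq($player)->find('td')->getStrings();
        print "Player: #" . $player[0] . " - " . $player[1] . " - Position: " . $player[2] . "<br />\n";
    }
}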
?>
Much nicer, right? Yeah, I thought so too. Give it a shot and let me know how it goes for you in the comments.
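If you'd like to try the selector approach without depending on a live page, the sketch below runs the same '#result > tbody > tr' selector against an inline HTML string. The table contents are invented placeholders (not real roster data), the require path is assumed to match the one above, and phpQuery::newDocumentHTML() stands in for the network fetch.

```php
<?php
// Assumption: phpQuery is installed at the same path as in the example above.
require("phpQuery/phpQuery.php");

// Made-up roster rows for illustration; not real data from nfl.com.
$html = '<table id="result">'
      . '<thead><tr><th>No</th><th>Name</th><th>Pos</th></tr></thead>'
      . '<tbody>'
      . '<tr><td>1</td><td>Player One</td><td>QB</td></tr>'
      . '<tr><td>2</td><td>Player Two</td><td>WR</td></tr>'
      . '</tbody></table>';

// Parse the string instead of fetching a URL.
$doc = phpQuery::newDocumentHTML($html);

$lines = array();
foreach ($doc['#result > tbody > tr'] as $row) {
    // Gather the trimmed text of each <td> in this row.
    $cells = array();
    foreach (pq($row)->find('td') as $cell) {
        $cells[] = trim(pq($cell)->text());
    }
    $lines[] = "Player: #{$cells[0]} - {$cells[1]} - Position: {$cells[2]}";
}

print implode("\n", $lines) . "\n";
```

Because the header cells sit in the thead, the tbody selector skips them without the manual '<th' check the string-based version needs.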

I like the concept, but the example looks a bit confusing to me. Gonna check it out.
But as a developer, I must say that CRUD is the most annoying thing EVER! So maybe a technique like this can help, thanks for sharing!
Comment by Antonio Max — July 9th, 2009 at 4:14 pm #
This is a test.
Comment by Jake Marsh — July 11th, 2009 at 12:18 am #
Test.
Comment by Jake Marsh — July 11th, 2009 at 12:25 am #
Test 2.
Comment by Jake Marsh — July 11th, 2009 at 12:26 am #
Jake,
Thanks for the great article on how to use phpQuery. I am currently working on a project where I am having trouble determining the correct selector to use in order to grab an item from an unordered list. For the most part, I am able to use the Simple HTML Dom Parser to complete the task but I would rather use a simpler, more elegant solution like phpQuery.
I am trying to grab individual line items from the div#modelSettings on this page: http://www.tweaktv.com/tweak-my-tv/calibration-guide/samsung-ln-40b630-2.html
I can’t seem to find the correct selector or possibly my syntax is incorrect. I tried using your example above with no success and the phpQuery docs tend to be sparse. Do you have any suggestions or ideas which could help me? Thanks in advance, I appreciate your time – Keep up the nice work on your site!
Comment by Brandon — September 19th, 2009 at 10:58 am #
I thought the post made some good points on screen scrapers. For simple things I use Python, but for larger projects I used screen scrapers, which worked great; they build quick custom screen scraper and web scraper programs.
Comment by Rachel — October 29th, 2009 at 3:23 pm #
Interesting points on screen scraping. For simple stuff I use Python to web crawl, but for larger projects I used screen scraping software, which worked great; they build quick custom screen scraper and web scraper programs.
Comment by Amanda — October 29th, 2009 at 3:52 pm #