Scrapelist The art of content scraping

25 Feb 2011

Useful PHP functions for scraping content

file_get_contents()

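Reads an entire file or URL into a string. Some hosts disable URL access for this function (the allow_url_fopen setting), in which case you'll have to use cURL instead. cURL also gives you more control over the request, such as headers, cookies, and timeouts.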

Example:

$page_contents = file_get_contents("http://example.com/page.html");
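If you want one call that works on either kind of host, a small wrapper can try file_get_contents() first and fall back to cURL. This is only a rough sketch: fetch_page() is a made-up helper name, and the 10 second timeout is an arbitrary assumption.

```php
<?php
// Hypothetical helper (not a built-in): returns the body as a string, or null on failure.
function fetch_page(string $url): ?string
{
    // allow_url_fopen controls whether file_get_contents() may open URLs.
    if (filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
        $body = @file_get_contents($url);
        return $body === false ? null : $body;
    }

    // Fall back to cURL when the extension is available.
    if (!function_exists('curl_init')) {
        return null;
    }
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // assumed timeout in seconds
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}
```

You would then call it as $page_contents = fetch_page("http://example.com/page.html"); and check for null before parsing.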

preg_match()

Used to capture a piece of content within a matched regular expression.

Example:

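preg_match('/href="(.*?)"\s+title="(.*?)"/', $page_contents, $match);

$match[0] will hold the full matched text, and $match[1], $match[2] and so on will hold the content captured by each (.*?), in order. Note that a lazy (.*?) at the very end of a pattern matches nothing, so each group needs something after it to stop at, here the closing quote.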
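To see what ends up in the match array, here is a small self-contained sketch. The HTML snippet is invented sample data, and the pattern assumes double-quoted attributes with href directly before title.

```php
<?php
// Invented sample markup standing in for a downloaded page.
$page_contents = '<a href="/games/42" title="Example Game">Example Game</a>';

// preg_match() returns 1 on a match, 0 on no match, false on error.
if (preg_match('/href="(.*?)"\s+title="(.*?)"/', $page_contents, $match)) {
    $link  = $match[1]; // first capture group: the href value
    $title = $match[2]; // second capture group: the title value
    echo $link . ' => ' . $title . "\n";
}
```

Index 0 of the array always holds the whole matched string, so the captures start at index 1.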

preg_match_all()

Matches your regular expression as many times as it occurs in your content and produces an array of all the matches. Useful for getting a list of items that share the same surrounding pattern, such as your Steam games list.

Example:

preg_match_all("/<h4>(.*?)<\/h4>/", $page_contents, $matches);

...will fill $matches[1] with the text of every h4 heading.
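A quick sketch of the result structure, using invented sample markup. Here # is used as the pattern delimiter so the slash in the closing tag needs no escaping.

```php
<?php
// Invented sample markup with two headings.
$page_contents = "<h4>First</h4>\n<p>text</p>\n<h4>Second</h4>";

// Returns the number of full matches found.
$count = preg_match_all('#<h4>(.*?)</h4>#s', $page_contents, $matches);

// With the default flags, $matches[0] lists the full matches
// and $matches[1] lists what the first capture group caught.
$headings = $matches[1];

foreach ($headings as $heading) {
    echo $heading . "\n";
}
```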

print_r() or var_dump()

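Display your data and arrays to check whether your code is getting the content you want. print_r() gives a compact, readable listing, while var_dump() also shows the type and length of each value.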
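For instance, with a made-up match array, the two functions can be used like this. Passing true as the second argument to print_r() returns the text instead of printing it, which is handy for logging.

```php
<?php
// Made-up data standing in for a preg_match() result.
$match = ['<h4>First</h4>', 'First'];

// Capture the readable listing as a string, then print it.
$dump = print_r($match, true);
echo $dump;

// var_dump() prints directly and includes types and string lengths.
var_dump($match);
```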

str_replace() or preg_replace()

Replaces strings in content. Good for cleaning up extra junk or modifying content. str_replace() matches literal strings, while preg_replace() matches regular expressions.

Example:

$content = str_replace("<h4>","<h3>",$content);
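The difference shows up as soon as the markup varies. In this sketch (sample markup invented for illustration), the literal replacement only touches the exact string "<h4>", while the regular expression also handles closing tags and tags with attributes.

```php
<?php
// Invented sample markup: one plain h4 and one with an attribute.
$content = '<h4>News</h4><h4 class="big">More</h4>';

// Literal replacement: only the exact string "<h4>" changes.
$literal = str_replace('<h4>', '<h3>', $content);

// Regex replacement: matches "<h4" or "</h4" and keeps the optional slash via $1.
$regex = preg_replace('#<(/?)h4\b#', '<$1h3', $content);

echo $literal . "\n";
echo $regex . "\n";
```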

Other useful features of PHP

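For HTML or XML that is too irregular for regular expressions, you can traverse the DOM with PHP's DOM extension or parse well-formed XML with the SimpleXML parser.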
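As a rough sketch of the DOM approach, assuming the DOM extension is enabled (it usually is), the h4 headings from the earlier example could be collected like this. The markup is invented sample data.

```php
<?php
// Invented sample markup.
$html = '<html><body><h4>First</h4><p>text</p><h4>Second</h4></body></html>';

// Collect parser warnings internally instead of printing them;
// real-world pages are rarely valid HTML.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Walk every <h4> element and keep its text.
$headings = [];
foreach ($doc->getElementsByTagName('h4') as $node) {
    $headings[] = trim($node->textContent);
}

print_r($headings);
```

Unlike a regular expression, this keeps working when the tags carry attributes or span several lines.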
