Using regular expressions to extract content – php extract texts from html content

PHP provides a number of really neat regular expression functions. You can find the full list of regex functions on the PHP site.

But the one that I’ve had the most fun with is the preg_match_all() function, which I’ve been using to extract content from HTML pages.

I’m not going to explain what a regular expression (regex) is in this post. There are whole books on just this one topic alone; I would be crazy to think I can explain it all in just a few paragraphs. But in order to understand how to use the regex functions, you need to have a basic understanding of regular expressions.

If you think back to your childhood days, you may remember a toy where you match shaped blocks to the corresponding holes. Well, a regular expression is very much like that toy, except that you have to define your own ‘shape’ (or pattern, as it’s known) and apply your content to it. Any text that matches the pattern will ‘fall’ through it.
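To make the ‘shape’ idea a bit more concrete, here is a minimal sketch. The sample sentence and the pattern (a run of digits) are made up purely for illustration; they are not part of the link example used in the rest of this post.

```php
<?php
// Hypothetical sample text, used only to illustrate what a pattern does.
$text = 'Order 66 shipped 3 boxes on day 12.';

// The 'shape': one or more digits in a row.
$pattern = '/\d+/';

// preg_match_all() returns the number of matches; every piece of text
// that fits the shape ends up in $found[0].
$count = preg_match_all($pattern, $text, $found);

echo $count . "\n";   // 3
print_r($found[0]);   // 66, 3 and 12
```

Everything that is not a run of digits simply does not fit the shape and is left behind.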

Let’s say you have a block of text like the one below and you want to extract all the links from it. You can use preg_match_all() to do just that.

$content = "He's goin' everywhere,
<a href=\"\">B.J. McKay</a> and his
best friend Bear. Rollin' down to
<a href=\"\">Dallas</a>, who's providin'
my palace, off to New Orleans or who knows where.";

The pattern you want to look for is the link anchor pattern, like <a href="(something)">(something)</a>. The actual regular expression might look something like this:

$regex_pattern = "/<a href=\"(.*)\">(.*)<\/a>/";
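One thing to watch out for: (.*) is greedy, meaning it grabs as much text as it can. That is fine for the sample text above because each link sits on its own line and the dot does not match a newline by default. The sketch below uses a made-up one-line string (my own, not from the example above) to show what happens when two links share a line, and how the lazy form (.*?) avoids it.

```php
<?php
// Hypothetical one-line sample; the hrefs are left empty as in the post.
$line = '<a href="">B.J. McKay</a> and <a href="">Dallas</a>';

// Greedy version, same shape as the pattern in the post.
$greedy = '/<a href="(.*)">(.*)<\/a>/';

// Lazy version: .*? stops at the first point where the rest can match.
$lazy = '/<a href="(.*?)">(.*?)<\/a>/';

$greedy_count = preg_match_all($greedy, $line, $greedy_matches);
$lazy_count   = preg_match_all($lazy, $line, $lazy_matches);

echo $greedy_count . "\n";   // 1: both links are swallowed into one match
echo $lazy_count . "\n";     // 2: one match per link
print_r($lazy_matches[2]);   // B.J. McKay and Dallas
```

If your HTML might have several links on one line, the lazy form is usually the safer choice.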

Once you have your pattern, you pass $regex_pattern and $content to preg_match_all() like this:

preg_match_all($regex_pattern, $content, $matches);

preg_match_all() will store all the matches in the array $matches, so if you output the array, you’ll see something like this.
Array
(
    [0] => Array
        (
            [0] => <a href="">B.J. McKay</a>
            [1] => <a href="">Dallas</a>
        )

    [1] => Array
        (
            [0] =>
            [1] =>
        )

    [2] => Array
        (
            [0] => B.J. McKay
            [1] => Dallas
        )

)

From this array, $matches, you should be able to loop through and get the information you need.
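As a rough sketch of what that loop could look like, the snippet below pairs each link text with its href. The $links array and its 'href' and 'text' keys are names I made up for illustration; the content and pattern are the same ones used above.

```php
<?php
// Same sample content and pattern as above (the href values are empty).
$content = "He's goin' everywhere,
<a href=\"\">B.J. McKay</a> and his
best friend Bear. Rollin' down to
<a href=\"\">Dallas</a>, who's providin'
my palace, off to New Orleans or who knows where.";

$regex_pattern = "/<a href=\"(.*)\">(.*)<\/a>/";
preg_match_all($regex_pattern, $content, $matches);

// $matches[1] holds the href values and $matches[2] the link texts;
// entries with the same index belong to the same link.
$links = array();
foreach ($matches[2] as $i => $text) {
    $links[] = array('href' => $matches[1][$i], 'text' => $text);
}

foreach ($links as $link) {
    echo $link['text'] . ' -> ' . $link['href'] . "\n";
}
```

If you would rather have one sub-array per link straight away, preg_match_all() also accepts the PREG_SET_ORDER flag as a fourth argument, which groups the results by match instead of by capture group.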

I hope this has been useful to you. I know it doesn’t cover all the things this function can do, but for first-timers, it should be a simple look at a very powerful PHP function.

Incidentally, PHP also provides the function preg_match(). The difference is that preg_match() stops after the first match of the pattern, whereas preg_match_all() tries to find all matching instances within the content.
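Here is a small side-by-side sketch of the two functions. The one-line sample string and the slightly stricter pattern are my own choices for illustration, not taken from the example above.

```php
<?php
// Hypothetical sample with two links on one line.
$content = '<a href="">B.J. McKay</a> and <a href="">Dallas</a>';

// [^"]* and [^<]* keep each match inside a single link.
$pattern = '/<a href="[^"]*">([^<]*)<\/a>/';

// preg_match() stops at the first match and fills a flat array:
// $first[0] is the whole match, $first[1] the first capture group.
$found_one = preg_match($pattern, $content, $first);

// preg_match_all() keeps going and fills a nested array:
// $all[1] lists the first capture group of every match.
$found_all = preg_match_all($pattern, $content, $all);

echo $found_one . ': ' . $first[1] . "\n";             // 1: B.J. McKay
echo $found_all . ': ' . implode(', ', $all[1]) . "\n"; // 2: B.J. McKay, Dallas
```

Note that the shape of the result array differs between the two, so code written for one usually needs a small adjustment to work with the other.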


