PHP DOMDocument Example
Parse HTML String using DOMDocument with ease.
In this post I will show you how to parse an HTML string and extract the data you need so it can be displayed in a different way.
For example, suppose you have many static HTML files/web pages that share a similar layout, and you want to make them dynamic.
First, grab only the data you need from those files and save it somewhere (preferably a database).
You can then create a single file that reads the saved data and displays it on the page.
Quick Explanation
PHP DOMDocument
Here's a brief explanation of the PHP DOMDocument classes and methods that we will be using.
new DomDocument();
Returns a PHP DOMDocument instance
DOMDocument::loadHTMLFile( PathToFileName );
- PathToFileName [Required] - Location of the HTML file to be loaded.
Returns true on success or false on failure. Malformed HTML does not stop the parsing, but it makes PHP emit warnings.
new DOMXPath( DOMDocumentInstance )
- DOMDocumentInstance [Required] - A PHP DOMDocument instance loaded with an HTML string.
Returns a DOMXPath instance.
DOMXPath::query( PathToElement , context )
- PathToElement [Required] - An XPath expression targeting one or more elements.
DOMXPath::query( '//table/tr/td' )
The above path matches every <td> whose parent is a <tr> placed directly inside a <table>.
- context [Optional] - A DOMNode that the query is evaluated relative to. The path must be relative (start it with a dot); a path that starts with // always searches the whole document, even when a context is given.
$td = DOMXPath::query( '//table/tr/td' );
$links = DOMXPath::query( './/a' , $td->item(0) );
The second DOMXPath::query will only search for <a> tags inside the first <td> found.
Returns a DOMNodeList.
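To see how the context parameter interacts with the path, here is a minimal sketch. The markup is a made-up sample (not someone.html), and loadHTML is used on an inline string only to keep the example self-contained.

```php
<?php
// Hypothetical sample markup, used only for this demonstration.
$html = '<table>'
      . '<tr><td><a href="#one">One</a></td></tr>'
      . '<tr><td><a href="#two">Two</a> <a href="#three">Three</a></td></tr>'
      . '</table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$tds = $xpath->query('//table/tr/td');
$firstTd = $tds->item(0);

// Path starting with //: evaluated from the document root, so every <a> matches.
$allLinks = $xpath->query('//a', $firstTd);

// Path starting with .//: only <a> elements below the context node match.
$scopedLinks = $xpath->query('.//a', $firstTd);

echo $allLinks->length, "\n";                // 3
echo $scopedLinks->length, "\n";             // 1
echo $scopedLinks->item(0)->nodeValue, "\n"; // One
```

The leading dot is what makes the query relative to the context node.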
DOMNodeList::item( index )
- index [Required] - Index of the item inside the DOMNodeList.
Returns a DOMNode.
DOMNode::nodeValue
- Returns the text inside the node as a String (for an element, the combined text of everything inside it).
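DOMNodeList::item returns null when the index is out of range, so chaining ->nodeValue onto it can fail on pages that lack the expected element. The sketch below wraps the lookup in a small helper; the function name nodeText and its parameters are hypothetical, not part of the DOM extension, and the nullable type hints assume PHP 7.1 or later.

```php
<?php
/**
 * Hypothetical helper: returns the trimmed text of the node at $index,
 * or $default when the query matched fewer nodes than that.
 */
function nodeText(DOMXPath $xpath, string $query, int $index = 0, ?string $default = null): ?string
{
    $nodes = $xpath->query($query);
    // query() returns false when the expression is invalid.
    if ($nodes === false || $index < 0 || $index >= $nodes->length) {
        return $default;
    }
    return trim($nodes->item($index)->nodeValue);
}

// Made-up one-row sample so the example runs on its own.
$doc = new DOMDocument();
$doc->loadHTML('<div class="BasicInfo"><table><tr><th>Name</th><td>Someone S. Name</td></tr></table></div>');
$xpath = new DOMXPath($doc);

$name = nodeText($xpath, '//div[@class="BasicInfo"]/table/tr/td', 0);
$age  = nodeText($xpath, '//div[@class="BasicInfo"]/table/tr/td', 1, 'unknown');

echo $name, "\n"; // Someone S. Name
echo $age, "\n";  // unknown
```

Returning a default keeps the calling code free of repeated length checks, at the cost of hiding which field was missing.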
Example HTML:
For example, we have an HTML file (someone.html) with the following content:
<html>
<head>
<link href="myCss.css" rel="stylesheet" />
</head>
<body>
<div class="container">
<h1>Someones's BIO</h1>
<div class="BasicInfo">
<h4>My Info</h4>
<table class="table">
<tr>
<th>Name</th>
<td>Someone S. Name</td>
</tr>
<tr>
<th>Age</th>
<td>23</td>
</tr>
<tr>
<th>Gender</th>
<td>Male</td>
</tr>
</table>
</div>
<div class="WorkInfo">
<h4>My Experience</h4>
<table class="table">
<thead>
<tr>
<th>Job Title</th>
<th>Job Description</th>
</tr>
</thead>
<tbody>
<tr>
<th>CEO Manager President at Owner inc.</th>
<td>Worked here for 10 years.</td>
</tr>
<tr>
<th>Vice Assistant at Sub & Right Hand.</th>
<td>Worked here for 50 years.</td>
</tr>
</tbody>
</table>
</div>
<div class="ContactInfo">
<h4>Contact Info</h4>
<table class="table">
<tr>
<th>Email</th>
<td>someone_s_name@email.com</td>
</tr>
<tr>
<th>Skype</th>
<td>someone_s_name</td>
</tr>
</table>
</div>
</div>
</body>
</html>
Load someone.html in the PHP DOMDocument so we can start parsing it for the data we need.
$doc = new DomDocument();
@$doc->loadHTMLFile( 'someone.html' );
Note: the @ in @$doc->loadHTMLFile(...) suppresses the warnings PHP would otherwise print on the page if the HTML is malformed.
We will be using DOMXPath, as it makes querying the loaded document easier.
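The @ operator hides the warnings completely. If you would rather inspect them, one alternative is libxml_use_internal_errors(), which collects the parser messages instead of printing them. The sketch below is only an illustration: it uses loadHTML on a made-up inline string so it can run on its own, and the same approach should apply to loadHTMLFile. Which messages are reported (if any) depends on the libxml version installed.

```php
<?php
// Hypothetical malformed markup: a bare ampersand and a stray closing tag.
$html = '<html><body><h1>Sub & Right Hand</h1></span></body></html>';

// Collect parser problems internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadHTML($html);

// Each entry is a LibXMLError object with message, line and column properties.
$errors = libxml_get_errors();
libxml_clear_errors();

// Restore whatever setting was active before.
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}

// The document is still usable even though the markup was not clean.
$heading = $doc->getElementsByTagName('h1')->item(0)->nodeValue;
echo $heading, "\n";
```

This way malformed pages can be logged or skipped instead of silently accepted.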
$xpath = new DOMXPath($doc);
Looking at the HTML, we can get the Name, Age, Gender, Jobs and Contact info.
Here's how each can be grabbed with DOMXPath:
// Basic Info
$name = $xpath->query('//div[@class="BasicInfo"]/table/tr/td')->item(0)->nodeValue;
$age = $xpath->query('//div[@class="BasicInfo"]/table/tr/td')->item(1)->nodeValue;
$gender = $xpath->query('//div[@class="BasicInfo"]/table/tr/td')->item(2)->nodeValue;
// Work Info
$job1Title = $xpath->query('//div[@class="WorkInfo"]/table/tbody/tr/th')->item(0)->nodeValue;
$job1Desc = $xpath->query('//div[@class="WorkInfo"]/table/tbody/tr/td')->item(0)->nodeValue;
$job2Title = $xpath->query('//div[@class="WorkInfo"]/table/tbody/tr/th')->item(1)->nodeValue;
$job2Desc = $xpath->query('//div[@class="WorkInfo"]/table/tbody/tr/td')->item(1)->nodeValue;
// Contact Info
$email = $xpath->query('//div[@class="ContactInfo"]/table/tr/td')->item(0)->nodeValue;
$skype = $xpath->query('//div[@class="ContactInfo"]/table/tr/td')->item(1)->nodeValue;
Also note that there can be one or more Work Info or Contact Info entries, so take that into account when you design your database.
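Since the number of jobs can vary from page to page, hard-coding item(0) and item(1) will not scale. A possible approach is to loop over the rows and collect each job into an array, which can later be saved as one database row per job. The sketch below embeds a trimmed copy of the WorkInfo block (with the & escaped as &amp;) so it runs on its own; the $jobs array and its keys are assumptions for illustration.

```php
<?php
// Trimmed copy of the WorkInfo block from someone.html.
$html = '<div class="WorkInfo"><table class="table">'
      . '<thead><tr><th>Job Title</th><th>Job Description</th></tr></thead>'
      . '<tbody>'
      . '<tr><th>CEO Manager President at Owner inc.</th><td>Worked here for 10 years.</td></tr>'
      . '<tr><th>Vice Assistant at Sub &amp; Right Hand.</th><td>Worked here for 50 years.</td></tr>'
      . '</tbody></table></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$jobs = [];
// One iteration per body row, however many rows the page has.
foreach ($xpath->query('//div[@class="WorkInfo"]/table/tbody/tr') as $row) {
    // Relative paths keep each lookup inside the current row.
    $title = $xpath->query('./th', $row)->item(0);
    $desc  = $xpath->query('./td', $row)->item(0);
    if ($title === null || $desc === null) {
        continue; // skip rows that do not have both cells
    }
    $jobs[] = [
        'title'       => trim($title->nodeValue),
        'description' => trim($desc->nodeValue),
    ];
}

print_r($jobs);
```

The same loop should work for the Contact Info table once the path is adjusted to its structure (it has no tbody).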
Tips:
- Before using DOMNodeList::item, it is best to check DOMNodeList::length to make sure the node exists before accessing the DOMNode. Example:
$tds = $xpath->query('//div[@class="BasicInfo"]/table/tr/td');
if( $tds->length > 0 ){
$name = $tds->item(0)->nodeValue;
.......
}
.....
- Make good use of the context in DOMXPath::query. For example, instead of:
// Basic Info
$name = $xpath->query('//div[@class="BasicInfo"]/table/tr/td')->item(0)->nodeValue;
$age = $xpath->query('//div[@class="BasicInfo"]/table/tr/td')->item(1)->nodeValue;
$gender = $xpath->query('//div[@class="BasicInfo"]/table/tr/td')->item(2)->nodeValue;
The above will search the whole document 3 times for the targeted <td>'s.
We can do it with 2 DOMXPath::query calls to perform better:
// Basic Info
$BasicInfo = $xpath->query('//div[@class="BasicInfo"]');
$BasicInfoTd = $xpath->query('./table/tr/td', $BasicInfo->item(0));
$name = $BasicInfoTd->item(0)->nodeValue;
$age = $BasicInfoTd->item(1)->nodeValue;
$gender = $BasicInfoTd->item(2)->nodeValue;
Since we are passing a context to DOMXPath::query and using a relative path (note the leading dot), we are telling it to look only for <td>'s inside div.BasicInfo instead of searching the whole document. A path that starts with // would still search the entire document, even with a context.
- In case the element has multiple classes, like this HTML:
<div class="InfoContainers BasicInfo">.....</div>
The query below will not work:
// Basic Info
$BasicInfo = $xpath->query('//div[@class="BasicInfo"]');
The query above only matches a div whose class attribute is exactly BasicInfo.
We can make it work by using contains():
// Basic Info
$BasicInfo = $xpath->query('//div[contains(@class, "BasicInfo")]');
Or, if you want to be more specific, do it like this:
// Basic Info
$BasicInfo = $xpath->query('//div[contains(@class, "BasicInfo") and contains(@class, "InfoContainers")]');
You should also take the class names into account. For example, '//div[contains(@class, "BasicInfo")]' will also target <div class="BasicInfo_1">......</div>, because contains() is a plain substring match.
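One common way to avoid matching BasicInfo_1 is to pad the class attribute with spaces and look for the class name surrounded by spaces, so that only whole class names match. The sketch below uses a made-up two-div sample to compare both queries; it is an illustration of the idiom, not the only way to do it.

```php
<?php
// Hypothetical sample: one div carries the real class, the other only looks similar.
$html = '<div class="InfoContainers BasicInfo">exact</div>'
      . '<div class="BasicInfo_1">lookalike</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Substring match: also catches BasicInfo_1.
$loose = $xpath->query('//div[contains(@class, "BasicInfo")]');

// Whole-word match: normalize-space() collapses extra whitespace, concat() pads
// both ends, so " BasicInfo " can only match a complete class name.
$exact = $xpath->query(
    '//div[contains(concat(" ", normalize-space(@class), " "), " BasicInfo ")]'
);

echo $loose->length, "\n";             // 2
echo $exact->length, "\n";             // 1
echo $exact->item(0)->nodeValue, "\n"; // exact
```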