Introduction
Modern web applications often embed additional machine-readable information in their HTML5 pages. Several technologies are available for this today, such as Microdata, RDFa, and JSON-LD.
Microdata lets you annotate the elements of a web page with custom, machine-readable properties. At a high level, microdata consists of groups of name-value pairs. The groups are called items, and each name-value pair is a property. Items and properties are represented by regular elements. More information is available on Schema.org.
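To make the terms "item" and "property" concrete, here is a minimal sketch that reads the name-value pairs of one item. It deliberately uses the standard System.Xml.Linq API rather than Aspose.HTML, so the markup has to be well-formed XML (hence itemscope="" instead of a bare attribute). The sample markup and the names MicrodataConcept and ReadProperties are assumptions made for illustration, and nested items are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class MicrodataConcept
{
    // Illustrative markup (an assumption, not taken from the demo page).
    public const string Sample = @"<div itemscope="""" itemtype=""http://schema.org/JobPosting"">
  <h1 itemprop=""title"">Software Engineer</h1>
  <p>Type: <span itemprop=""employmentType"">full-time</span></p>
</div>";

    // Returns the properties of the item as name/value pairs in document order.
    public static List<KeyValuePair<string, string>> ReadProperties(string markup)
    {
        XElement item = XElement.Parse(markup);
        return item.Descendants()
                   .Where(e => e.Attribute("itemprop") != null)
                   .Select(e => new KeyValuePair<string, string>(
                       e.Attribute("itemprop").Value, e.Value.Trim()))
                   .ToList();
    }

    public static void Main()
    {
        foreach (var property in ReadProperties(Sample))
            Console.WriteLine($"{property.Key} = {property.Value}");
    }
}
```

Here the div element is the item, and each element carrying an itemprop attribute contributes one property whose value is its text content.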
There are several ways to extract data from a document, for example CSS selector queries or XPath queries. This article starts with CSS selector queries and then shows the XPath equivalent.
Extracting data from an HTML5 document with microdata
In this article we use a demo page that contains a JobPosting entity. The original example can be found here. Please note that we are not attempting to write a full-featured scraper, only to demonstrate how you can use the Aspose.HTML library.
Loading document
// Read the page with the JobPosting data
try
{
    _document = new HTMLDocument(@"http://asposedemo20170904120448.azurewebsites.net/home/jobposting");
}
catch (Exception ex)
{
    Console.WriteLine($"Error: {ex.Message}");
    return;
}
Gathering JobPosting entities
Using CSS selector queries
The HTMLDocument class has a QuerySelectorAll method, which returns a collection of all elements in the document that match the given selector.
A JobPosting microdata entity is marked with the itemscope and itemtype attributes. The itemtype attribute specifies the URL of the vocabulary that defines the item properties (itemprop attributes) used in the data structure. For a JobPosting entity this URL is http://schema.org/JobPosting. So, we need to find all nodes whose itemtype attribute value ends with "JobPosting":
var jobPostings = _document.QuerySelectorAll("[itemtype$=JobPosting]");
To store the data of one entity, we declare a JobPosting class:
public class JobPosting
{
    // The base salary of the job or of an employee in an EmployeeRole.
    public string baseSalary;
    // Publication date for the job posting.
    public DateTime datePosted;
    // Educational background needed for the position.
    public string educationRequirements;
    // Type of employment (e.g. full-time, part-time, contract, temporary, seasonal, internship).
    public string employmentType;
    // A property describing the estimated salary for a job posting.
    public decimal estimatedSalary;
    // Description of skills and experience needed for the position.
    public string experienceRequirements;
    // Organization offering the job position.
    public string hiringOrganization;
    // Description of bonus and commission compensation aspects of the job. Supersedes incentives.
    public string incentiveCompensation;
    // The industry associated with the job position.
    public string industry;
    // Description of benefits associated with the job.
    public string jobBenefits;
    // A (typically single) geographic location associated with the job position.
    public string jobLocation;
    // Category or categories describing the job.
    public string occupationalCategory;
    // Specific qualifications required for this role.
    public string qualifications;
    // Responsibilities associated with this role.
    public string responsibilities;
    // The currency (coded using ISO 4217) used for the main salary information in this job posting or for this employee.
    public string salaryCurrency;
    // Skills required to fulfill this role.
    public string skills;
    // Any special commitments associated with this job posting. Valid entries include VeteranCommit, MilitarySpouseCommit, etc.
    public string specialCommitments;
    // The title of the job.
    public string title;
    // The date after which the item is no longer valid. For example the end of an offer, salary period, or a period of opening hours.
    public DateTime validThrough;
    // The typical working hours for this job (e.g. 1st shift, night shift, 8am-5pm).
    public string workHours;
    //......
}
Please note that the field names exactly match the corresponding property names of the schema.org JobPosting type. This is deliberate: it lets us use reflection to look up a field by name at runtime.
To get the elements with itemprop attributes, we call QuerySelectorAll on each element in the jobPostings collection.
var listOfJobs = new List<JobPosting>();
var jbpType = typeof(JobPosting);
// Iterate as Element so that QuerySelectorAll and GetAttribute are available
foreach (Element jobPosting in jobPostings)
{
    var item = new JobPosting();
    foreach (Element jobPostingChild in jobPosting.QuerySelectorAll("[itemprop]"))
    {
        // TODO: handle elements with itemprop
    }
    listOfJobs.Add(item);
}
Handling elements with itemprop attributes
The idea is to read the property name, look it up among the fields of our class, and set the appropriate value.
foreach (Element jobPostingChild in jobPosting.QuerySelectorAll("[itemprop]"))
{
    var itemprop = jobPostingChild.GetAttribute("itemprop");
    var fieldInfo = jbpType.GetField(itemprop);
    if (fieldInfo == null) continue; // we found an itemprop that is not listed in our class
    switch (Type.GetTypeCode(fieldInfo.FieldType))
    {
        case TypeCode.String:
            fieldInfo.SetValue(item, jobPostingChild.TextContent.Trim());
            break;
        case TypeCode.DateTime:
            fieldInfo.SetValue(item, DateTime.Parse(jobPostingChild.TextContent));
            break;
        case TypeCode.Decimal:
            fieldInfo.SetValue(item, decimal.Parse(jobPostingChild.TextContent));
            break;
    }
}
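The reflection step can also be tried in isolation, without loading a document. The following is a minimal sketch under stated assumptions: JobPostingSample is a trimmed-down stand-in for the JobPosting class, and MicrodataMapper.TrySetField is a hypothetical helper, not part of Aspose.HTML. Unlike the loop above, it uses TryParse with the invariant culture, so malformed text is skipped instead of throwing and the result does not depend on the machine's regional settings.

```csharp
using System;
using System.Globalization;
using System.Reflection;

// Trimmed-down stand-in for the JobPosting class (three fields only).
public class JobPostingSample
{
    public string title;
    public DateTime datePosted;
    public decimal estimatedSalary;
}

public static class MicrodataMapper
{
    // Hypothetical helper: copies one itemprop name/value pair into the
    // matching public field of target. Returns false when no such field
    // exists or the text cannot be parsed; the field is left unchanged then.
    public static bool TrySetField(object target, string itemprop, string rawText)
    {
        FieldInfo field = target.GetType().GetField(itemprop); // case-sensitive lookup
        if (field == null) return false;

        string text = (rawText ?? string.Empty).Trim();
        switch (Type.GetTypeCode(field.FieldType))
        {
            case TypeCode.String:
                field.SetValue(target, text);
                return true;
            case TypeCode.DateTime:
                if (!DateTime.TryParse(text, CultureInfo.InvariantCulture,
                        DateTimeStyles.None, out DateTime date)) return false;
                field.SetValue(target, date);
                return true;
            case TypeCode.Decimal:
                if (!decimal.TryParse(text, NumberStyles.Number,
                        CultureInfo.InvariantCulture, out decimal number)) return false;
                field.SetValue(target, number);
                return true;
            default:
                return false; // field type not supported by this sketch
        }
    }

    public static void Main()
    {
        var item = new JobPostingSample();
        TrySetField(item, "title", "  Software Engineer ");
        TrySetField(item, "datePosted", "2017-09-04");
        TrySetField(item, "estimatedSalary", "not a number"); // rejected, field keeps its default
        Console.WriteLine($"{item.title} | {item.datePosted:yyyy-MM-dd} | {item.estimatedSalary}");
    }
}
```

Because GetField is case-sensitive by default, an itemprop such as "Title" would not match the title field; this is why the field names have to mirror the schema.org property names exactly.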
Using XPath queries
We get the same result with XPath. To select all nodes whose itemtype attribute identifies a JobPosting, we need to evaluate the expression //*[@itemtype="http://schema.org/JobPosting"]. Unlike the CSS selector above, this expression matches the attribute value exactly.
HTMLDocument inherits methods for creating and evaluating XPath expressions from its base class. The Evaluate method evaluates an XPath expression string and, if possible, returns a result of the specified type. It takes five parameters (expression, contextNode, resolver, type, result), but for now we only need to set expression and contextNode.
var jobPostingNodes = _document.Evaluate("//*[@itemtype=\"http://schema.org/JobPosting\"]", _document, null, XPathResultType.Any, null);
The result of this evaluation is an object that holds an iterator over the matching nodes. So, we loop over it with while and handle the inner nodes.
Node node;
while ((node = jobPostingNodes.IterateNext()) != null)
{
    var item = new JobPosting();
    // The leading dot keeps the search inside the current node;
    // an expression starting with "//" would search the whole document
    var itempropNodes = _document.Evaluate(".//*[@itemprop]", node, null, XPathResultType.Any, null);
    Element jobPostingChild;
    while ((jobPostingChild = (Element) itempropNodes.IterateNext()) != null)
    {
        // TODO: handle elements with itemprop
    }
    listOfJobs.Add(item);
}
The code for handling elements with itemprop attributes is the same as in the previous example and is omitted here.
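One XPath detail matters when an expression is evaluated against a context node: an expression that starts with // is absolute and searches from the document root, while a leading dot (.//) restricts the search to the descendants of the context node. The sketch below illustrates the difference with the standard System.Xml.XPath extension methods instead of Aspose.HTML; the sample markup and the names XPathScopeDemo and CountFromFirstItem are assumptions made for illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class XPathScopeDemo
{
    // Two items; XML requires itemscope="" where HTML allows a bare attribute.
    const string Sample = @"<body>
  <div itemscope="""" itemtype=""http://schema.org/JobPosting"">
    <span itemprop=""title"">Developer</span>
  </div>
  <div itemscope="""" itemtype=""http://schema.org/JobPosting"">
    <span itemprop=""title"">Tester</span>
    <span itemprop=""industry"">Software</span>
  </div>
</body>";

    // Evaluates the expression with the first JobPosting item as context
    // node and returns how many elements it selects.
    public static int CountFromFirstItem(string expression)
    {
        XDocument doc = XDocument.Parse(Sample);
        XElement first = doc
            .XPathSelectElements("//*[@itemtype='http://schema.org/JobPosting']")
            .First();
        return first.XPathSelectElements(expression).Count();
    }

    public static void Main()
    {
        // Absolute: also picks up the properties of the second item.
        Console.WriteLine(CountFromFirstItem("//*[@itemprop]"));
        // Relative: only the properties inside the first item.
        Console.WriteLine(CountFromFirstItem(".//*[@itemprop]"));
    }
}
```

With the absolute form, every JobPosting would end up with the properties of all items on the page, so the relative form is the one to use inside the per-item loop.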