Navigating through a document using the DocumentTraversal interface

Traversing the Document Object Model (DOM) tree is a common task for web developers. The W3C defines two ways of representing the nodes of a document subtree and a position within the nodes they present: NodeIterator and TreeWalker.

The DocumentTraversal interface defines methods for creating NodeIterator and TreeWalker objects. In conforming implementations of Document Traversal, every object that implements Document must also implement the DocumentTraversal interface. This article demonstrates how to use the DocumentTraversal interface.

Navigating through a document using NodeIterator

To navigate with a NodeIterator, we need to complete the following steps:

Select a node as the root for navigation
Create a NodeIterator with appropriate parameters
Run a loop using the NextNode() method

Let's start with a simple example:

    try
    {
        _document = new HTMLDocument(@"http://asposedemo20170904120448.azurewebsites.net/home/documenttraversal");
    }
    catch (Exception ex)
    {
        Console.WriteLine($"Error: {ex.Message}");
        return;
    }

    var node = _document.GetElementById("demo02");    
    var iterator = _document.CreateNodeIterator(node, NodeFilter.SHOW_ELEMENT, default(INodeFilter));
    while (iterator.NextNode() != null)
        Console.WriteLine(iterator.ReferenceNode.NodeName);

In this example, we used a demo page with the following content:

    <div>
        <ul id="demo02">
            <li>List <span>item 1</span></li>
            <li>List item 2</li>
            <li>List item 3</li>
        </ul>    
    </div>

In step 2, we called the CreateNodeIterator method with the following parameters:

root: the node returned by _document.GetElementById("demo02")
whatToShow: the constant NodeFilter.SHOW_ELEMENT
filter: null (no filter is used yet)

The whatToShow parameter is a bit mask built from the NodeFilter.SHOW_* constants; the W3C specification defines 13 of them. We’ll mostly use the following:

NodeFilter.SHOW_ALL (for selecting all the nodes)
NodeFilter.SHOW_ELEMENT (for selecting only the element nodes)
NodeFilter.SHOW_ATTRIBUTE (for selecting only the attribute nodes)
NodeFilter.SHOW_TEXT (for selecting only the text nodes)

In step 3 we got the following output:

    UL
    LI
    SPAN
    LI
    LI

As you can see, we got five nodes: UL, LI, SPAN, LI, LI (the SPAN is nested in the first list item). Suppose we need to handle only the LI nodes. To do so, we must define a custom filter: a user-defined class that implements the INodeFilter interface. In that class we define the filter function AcceptNode, which takes a node as its only parameter and indicates whether the node is accepted, rejected, or skipped.

The following class accepts only nodes whose NodeName is "LI" and skips all others.

    internal class ListItemFilter : INodeFilter
    {
        public short AcceptNode(Node n)
        {               
            return n.NodeName.Equals("LI") ? NodeFilter.FILTER_ACCEPT : NodeFilter.FILTER_SKIP;
        }
    }

With filtering added, the previous example changes as follows:

    var node = _document.GetElementById("demo02");
    INodeFilter listItemFilter = new ListItemFilter();                        
    var iterator = _document.CreateNodeIterator(node, NodeFilter.SHOW_ELEMENT, listItemFilter);            
    
    while (iterator.NextNode() != null)
        Console.WriteLine(iterator.ReferenceNode.NodeName);

After execution we got the following output:

    LI
    LI
    LI
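
The whatToShow constants described earlier are bit flags, so several of them can be combined with the bitwise OR operator. The sketch below is a minimal illustration rather than part of the original demo; it assumes the same _document field and demo page as above, and that the constants keep the flag values of the DOM specification. It visits both element nodes and text nodes:

    var node = _document.GetElementById("demo02");
    // Combine two flags: show element nodes and text nodes
    var whatToShow = NodeFilter.SHOW_ELEMENT | NodeFilter.SHOW_TEXT;
    var iterator = _document.CreateNodeIterator(node, whatToShow, default(INodeFilter));

    var current = iterator.NextNode();
    while (current != null)
    {
        // Text nodes report "#text" as their NodeName
        Console.WriteLine(current.NodeName);
        current = iterator.NextNode();
    }

Text nodes that consist only of whitespace (the line breaks and indentation between tags) are reported too, so the output is usually longer than the markup suggests.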

Navigating through a document using TreeWalker

The TreeWalker object is a powerful DOM Level 2 object that lets you easily filter the nodes in a document and build custom collections from them. To navigate with a TreeWalker, we follow nearly the same steps as for the previous method:

Select a node as the root for navigation
Create a TreeWalker with appropriate parameters
Run a loop using the NextNode() method

Let's start with a simple example again.

    var node = _document.GetElementById("demo01");
    var treeWalker = _document.CreateTreeWalker(node, NodeFilter.SHOW_ELEMENT, null);
    while (treeWalker.NextNode() != null)
    {
        Console.WriteLine(treeWalker.CurrentNode.NodeName);
    }

For example, if the element with id "demo01" has the following content:

    <div id="demo01">
    <h1>Horses for sale</h1>

    <section>
        <h2>Mares</h2>

        <article>
            <h3>Pink Diva</h3>
            <p>Pink Diva has given birth to three Grand National winners.</p>
        </article>

        <article>
            <h3>Ring a Rosies</h3>
            <p>Ring a Rosies has won the Derby three times.</p>
        </article>

        <article>
            <h3>Chelsea’s Fancy</h3>
            <p>Chelsea’s Fancy has given birth to three Gold Cup winners.</p>
        </article>
    </section>

    <section>
        <h2>Stallions</h2>

        <article>
            <h3>Korah’s Fury</h3>
            <p>Korah’s Fury has fathered three champion race horses.</p>
        </article>

        <article>
            <h3>Sea Pioneer</h3>
            <p>Sea Pioneer has won The Oaks three times.</p>
        </article>

        <article>
            <h3>Brown Biscuit</h3>
            <p>Brown Biscuit has fathered nothing of any note.</p>
        </article>
    </section>

    <p>All our horses come with full paperwork and a family tree.</p>
    </div>

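then we get the following output. Note that the root DIV itself is not listed: a TreeWalker starts positioned on its root, so the first call to NextNode() already moves to H1.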

    H1
    SECTION
    H2
    ARTICLE
    H3
    P
    ARTICLE
    H3
    P
    ARTICLE
    H3
    P
    SECTION
    H2
    ARTICLE
    H3
    P
    ARTICLE
    H3
    P
    ARTICLE
    H3
    P
    P

We can also use a custom filter, as in the previous example:

    internal class SectionItemFilter : INodeFilter
    {
        public short AcceptNode(Node n)
        {
            return n.NodeName.Equals("SECTION") ? NodeFilter.FILTER_ACCEPT : NodeFilter.FILTER_SKIP;
        }
    }
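
The filter is passed to CreateTreeWalker in the same way as it was passed to the iterator. A minimal sketch, assuming the same _document field and the "demo01" markup shown above:

    var node = _document.GetElementById("demo01");
    // Only nodes accepted by SectionItemFilter are returned by the walker
    var treeWalker = _document.CreateTreeWalker(node, NodeFilter.SHOW_ELEMENT, new SectionItemFilter());
    while (treeWalker.NextNode() != null)
    {
        Console.WriteLine(treeWalker.CurrentNode.NodeName);
    }

With that markup, only the two SECTION elements pass the filter. For a TreeWalker the choice of return value matters: FILTER_SKIP hides the node but still visits its children, whereas FILTER_REJECT hides the node together with its whole subtree.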

Having created a TreeWalker with _document.CreateTreeWalker, we can then process the filtered nodes using the TreeWalker's traversal methods:

TreeWalker traversal methods
Method	Description
FirstChild()	Travels to and returns the first child of the current node.
LastChild()	Travels to and returns the last child of the current node.
NextNode()	Travels to and returns the next node within the filtered collection of nodes.
NextSibling()	Travels to and returns the next sibling of the current node.
ParentNode()	Travels to and returns the current node's parent node.
PreviousNode()	Travels to and returns the previous node of the current node.
PreviousSibling()	Travels to and returns the previous sibling of the current node.

In the next example, let's see how to use the traversal methods to walk through the returned nodes. For demo purposes, our simplified copy of the original table (the element with id="table_demo") contains only cells with plain text.

Step 1: Preparing for walking.

        var root = _document.GetElementById("table_demo");
        var walker = _document.CreateTreeWalker(root, NodeFilter.SHOW_ELEMENT, null);
        var element = walker.NextNode();

Step 2: Searching for the element with NodeName == "TBODY". We call walker.NextSibling() to move to the next sibling element.

        while (element != null)
        {
            if (element.NodeName == "TBODY")
            {
                //TODO: next step here
            }
            else
                element = walker.NextSibling();
        }

Step 3: Dive into the TBODY element and look for TR elements. We call walker.FirstChild() to enter the first row; moving on to the next row happens at the end of step 4.

        element = walker.FirstChild(); // Get the first nested node; we expect it to be a TR element...
        while (element != null && element.NodeName == "TR") // ...but check it anyway
        {
            //TODO: next step here
        }

Step 4: Moving along a table row. To move from one cell to another we call walker.NextSibling(). Eventually we reach the last cell in the row and walker.NextSibling() returns null. In this case the TreeWalker stays on the last cell, and we can travel to the parent node by calling walker.ParentNode(). From the parent node we can travel to the next row by calling walker.NextSibling().

        element = walker.FirstChild(); //Getting first nested node
        while (element != null && element.NodeName == "TD") 
        {
            Console.Write(element.TextContent+'\t');
            element = walker.NextSibling();
        }
        Console.WriteLine();               
        walker.ParentNode();
        element = walker.NextSibling();

Because we used a simple table in this example, we can also replace walker.ParentNode(); element = walker.NextSibling(); with element = walker.NextNode();. This replacement relies on the assumption that TBODY contains only TR and TD elements, so calling NextNode() moves from the last TD element of a row to the next TR element.
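
Put together, steps 1 to 4 might look like the sketch below. It assumes the same _document field and a table whose TBODY contains only TR rows with TD cells. The row and cell variable names and the guard for empty rows are additions for illustration, not part of the steps above.

    var root = _document.GetElementById("table_demo");
    var walker = _document.CreateTreeWalker(root, NodeFilter.SHOW_ELEMENT, null);

    // Steps 1-2: enter the table and move across siblings until TBODY is found
    var element = walker.NextNode();
    while (element != null && element.NodeName != "TBODY")
        element = walker.NextSibling();

    // Step 3: descend into TBODY; FirstChild() returns null if it has no child elements
    var row = element != null ? walker.FirstChild() : null;
    while (row != null && row.NodeName == "TR")
    {
        // Step 4: print every TD of the current row
        var cell = walker.FirstChild();
        if (cell != null)
        {
            while (cell != null && cell.NodeName == "TD")
            {
                Console.Write(cell.TextContent + '\t');
                cell = walker.NextSibling();
            }
            walker.ParentNode(); // return from the cell level to the TR
        }
        Console.WriteLine();
        row = walker.NextSibling(); // next TR, or null after the last row
    }

The guard around the inner loop matters because FirstChild() leaves the walker on the TR when a row has no child elements; calling ParentNode() in that case would climb to TBODY instead of staying at row level.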

Conclusion

The DocumentTraversal interface provides two ways of traversing the DOM. Depending on the case, we can:

use the NodeIterator and get a flat list of nodes,
or use the TreeWalker and get a tree-oriented view of the nodes.