TThier/Languages/xml/xpath
Dan Bikle


This page contains a few simple XPath demonstrations.

XPath needs something to chew on, and a simple HTML file works pretty well for this.
One such file is displayed below.

Rendered:
simple.html

Text:
simple-html.txt

XPath has an easier time chewing on HTML if we force the HTML to be "well formed".

We may use Tidy.jar to transform an HTML file into a well formed HTML file.

Here is a command line which demonstrates how to use Tidy.jar:
$JAVA_HOME/bin/java -classpath Tidy.jar org.w3c.tidy.Tidy -indent -wrap 999 -asxml -utf8 simple.html > simpleWellformed.html

Here is a demo shell script which wraps the above command line:
Tidy-sh.txt
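
If you would rather script this step in Python than in a shell script, a minimal sketch might look like the one below. It is not the contents of Tidy-sh.txt. The function names tidy_command and run_tidy are made up for this illustration, and the sketch assumes Tidy.jar sits in the current directory.

```python
import os
import subprocess

def tidy_command(source, jar="Tidy.jar"):
    """Build the argument list for the Tidy.jar command line shown above."""
    java_home = os.environ.get("JAVA_HOME")
    # Fall back to whatever "java" is on the PATH when JAVA_HOME is not set.
    java = os.path.join(java_home, "bin", "java") if java_home else "java"
    # The main class is written in the usual dotted form.
    return [java, "-classpath", jar, "org.w3c.tidy.Tidy",
            "-indent", "-wrap", "999", "-asxml", "-utf8", source]

def run_tidy(source, target, jar="Tidy.jar"):
    """Run Tidy on source and write its standard output to target."""
    with open(target, "wb") as out:
        # The exit status is handed back to the caller instead of raising.
        return subprocess.run(tidy_command(source, jar), stdout=out).returncode

# Show the command without running it (running it needs Java and Tidy.jar).
print(" ".join(tidy_command("simple.html")))
```

Calling run_tidy("simple.html", "simpleWellformed.html") would then play the role of the shell redirection.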

Here are some links to Tidy.jar:
Tidy.jar
http://jazilla.sourceforge.net/products/jars/Tidy.jar
http://prdownloads.sourceforge.net/httpunit/httpunit-1.5.4.zip?download (Tidy.jar is inside this .zip)

After I made simple.html well formed with Tidy.jar,
I ended up with this:

Rendered:
simpleWellformed.html

Text:
simpleWellformed-html.txt

The demonstrations below help you translate XPath statements into
English.  If you already know some HTML authoring, the demonstrations
might pack more punch since they connect with some of your HTML syntax knowledge.


English to XPath translations
English: Give me a list of all the <a> elements. XPath:
//xhtml:a

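Notice that we need to specify the namespace via the "xhtml" namespace prefix. Be aware that XPath is very picky about this: Tidy.jar placed every element in the XHTML namespace, and an unprefixed name such as a normally matches only elements which have no namespace.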
The above XPath expression would return the following syntax to XSLT or XQuery: allA.txt
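
To see the namespace pickiness in action, here is a minimal sketch using Python's standard xml.etree.ElementTree module, which supports only a subset of XPath. The sample document is a made-up stand-in, not the real simpleWellformed.html, and the NS mapping is my own assumption about how the "xhtml" prefix gets bound.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a Tidy-produced XHTML file.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <table>
      <tr><td><a href="http://www.stanford.edu">Stanford</a></td></tr>
      <tr><td><a href="http://www.nasa.gov">NASA</a></td></tr>
    </table>
  </body>
</html>"""

root = ET.fromstring(XHTML)

# Bind the "xhtml" prefix to the namespace URI declared in the document.
NS = {"xhtml": "http://www.w3.org/1999/xhtml"}

# ElementTree paths are relative to an element, so //xhtml:a is written .//xhtml:a
with_prefix = root.findall(".//xhtml:a", NS)

# Without the prefix, the path looks for <a> elements in no namespace.
without_prefix = root.findall(".//a")

print(len(with_prefix), len(without_prefix))  # 2 0
```

The unprefixed search comes back empty even though the document plainly contains <a> tags.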
English: Give me a list of all the <a> elements which reside in a <td> element which resides inside a <tr> element which resides in a <table> element which resides in a <body> element which resides in an <html> element. XPath:
/xhtml:html/xhtml:body/xhtml:table/xhtml:tr/xhtml:td/xhtml:a
The above XPath expression would return the syntax below to XSLT or XQuery:
h-b-t-tr-td-a.txt

We can see that the addition of the xhtml namespace by Tidy.jar complicates both the XPath expression and the elements returned by it. In order to simplify this demonstration, I did some editing, by hand, to remove the xhtml namespace from simpleWellformed.html:

Rendered:
simpleWellformedNoNamespace.html

Text:
simpleWellformedNoNamespace-html.txt

Now, we return to the translations.


More English to XPath translations (without namespaces)
English: Give me a list of all the <a> elements which reside in a <td> element which resides inside a <tr> element which resides in a <table> element which resides in a <body> element which resides in an <html> element. XPath:
/html/body/table/tr/td/a
The above XPath expression would return the following syntax to XSLT or XQuery: h-b-t-tr-td-a-nns.txt
English: Give me a list of all the "href" attributes inside of <a> tags. XPath:
//a/@href
The above XPath expression would return the following syntax to XSLT or XQuery: allHref.txt
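
Here is a rough Python sketch of the same request, using the standard xml.etree.ElementTree module. ElementTree supports only part of XPath and cannot hand back attribute nodes, so the sketch approximates //a/@href in two steps. The sample document is made up for illustration and is not the real simpleWellformedNoNamespace.html.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (no namespace).
HTML = """<html><body><table>
  <tr><td><a id="su" href="http://www.stanford.edu">Stanford</a></td></tr>
  <tr><td><a href="http://www.nasa.gov">NASA</a></td></tr>
  <tr><td><a id="top">no link here</a></td></tr>
</table></body></html>"""

root = ET.fromstring(HTML)

# Step 1: select the <a> elements which carry an href attribute.
# Step 2: read the attribute value from each of them.
hrefs = [a.get("href") for a in root.findall(".//a[@href]")]

print(hrefs)  # ['http://www.stanford.edu', 'http://www.nasa.gov']
```

The third <a> element has no "href", so it contributes nothing to the list.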
English: Give me a list of all <a> elements which contain an "id" attribute. XPath:
//a[./@id]
The above XPath expression would return the following syntax to XSLT or XQuery: aWithId.txt. I like to think of the [ ] syntax as a box. I use this thought pattern: "Look inside all the boxes; if any of the boxes contain candy, return those boxes and ignore the rest." Here is another way to put it: "/box/candy" returns candy with no box, while "/box[candy]" returns the boxes which contain candy.
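
The box-and-candy idea can be tried out with a small Python sketch. It uses the standard xml.etree.ElementTree module, which understands a subset of XPath and wants paths written relative to an element (.//a rather than //a). The sample document is a made-up stand-in for simpleWellformedNoNamespace.html.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: three <td> boxes, two of which hold an <a>.
HTML = """<html><body><table>
  <tr><td><a id="su" href="http://www.stanford.edu">Stanford</a></td><td>no link</td></tr>
  <tr><td><a href="http://www.nasa.gov">NASA</a></td></tr>
</table></body></html>"""

root = ET.fromstring(HTML)

# "/box/candy": the path ends at the candy, so we get the <a> elements.
candy = root.findall(".//td/a")

# "/box[candy]": the predicate only filters, so we get the <td> elements.
boxes = root.findall(".//td[a]")

# The same idea with an attribute as the candy: //a[./@id] becomes .//a[@id]
a_with_id = root.findall(".//a[@id]")

print([e.tag for e in candy])            # ['a', 'a']
print([e.tag for e in boxes])            # ['td', 'td']
print([e.get("id") for e in a_with_id])  # ['su']
```

The empty <td> box is skipped by the predicate, and the <a> without an "id" is skipped in the last query.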
English: Give me a list of all <a> elements which do not contain an "id" attribute. XPath:
//a[not(./@id)]
The above XPath expression would return the following syntax to XSLT or XQuery: aNotWithId.txt
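
The sketch below shows that not(./@id) simply picks the <a> elements which the [./@id] predicate leaves behind. Python's standard xml.etree.ElementTree has no not() function, so the negation is done with an ordinary Python test; that substitution, and the sample document, are my own assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not the real simpleWellformedNoNamespace.html.
HTML = """<html><body><table>
  <tr><td><a id="su" href="http://www.stanford.edu">Stanford</a></td></tr>
  <tr><td><a href="http://www.nasa.gov">NASA</a></td></tr>
  <tr><td><a href="http://www.berkeley.edu">Berkeley</a></td></tr>
</table></body></html>"""

root = ET.fromstring(HTML)

all_a = root.findall(".//a")

# //a[./@id]: the boxes which contain an "id".
with_id = root.findall(".//a[@id]")

# //a[not(./@id)]: ElementTree lacks not(), so the test is applied in Python.
without_id = [a for a in all_a if a.get("id") is None]

print([a.text for a in with_id])     # ['Stanford']
print([a.text for a in without_id])  # ['NASA', 'Berkeley']
```

Every <a> element lands in exactly one of the two lists.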
English: Give me a list of all <td> elements which contain an <a> element. XPath:
//td[a]
The above XPath expression would return the following syntax to XSLT or XQuery: tdWithA.txt
English: Give me the text from all <td> elements. XPath:
//td/text()
The above XPath expression would return the following syntax to XSLT or XQuery: tdText.txt. Some text in an element is often referred to as a "text node".
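
To make "text node" a bit more concrete, here is a small Python sketch. The standard xml.etree.ElementTree module does not support text() in paths; it stores text nodes in the .text and .tail attributes instead, so the helper text_nodes (a name I made up) collects them by hand. The sample document is also made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document with one plain cell and one mixed cell.
HTML = """<html><body><table>
  <tr><td>plain cell</td></tr>
  <tr><td>before <a href="http://www.nasa.gov">NASA</a> after</td></tr>
</table></body></html>"""

root = ET.fromstring(HTML)

def text_nodes(element):
    """Return the text-node children of element, in document order."""
    nodes = []
    if element.text:          # text before the first child element
        nodes.append(element.text)
    for child in element:
        if child.tail:        # text which follows a child element
            nodes.append(child.tail)
    return nodes

# Rough equivalent of //td/text()
td_text = [t for td in root.findall(".//td") for t in text_nodes(td)]

print(td_text)  # ['plain cell', 'before ', ' after']
```

Notice that "NASA" is missing from the result. It is a text node of the <a> element, not of the <td> element, so //td/text() does not reach it.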
English: Give me a list of all <a> elements which contain an "href" attribute which ends with "gov". XPath:
//a[ ends-with(./@href,"gov") ]
The above XPath expression would return the following syntax to XSLT or XQuery: endsWithGov.txt.
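
Be aware that ends-with() arrived with XPath 2.0, so an XPath 1.0 processor will reject it. Python's standard xml.etree.ElementTree does not offer it either, so the sketch below applies the same test with Python's str.endswith. The sample document is made up, and the XPath 1.0 workaround appears only as a comment.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not the real simpleWellformedNoNamespace.html.
HTML = """<html><body><table>
  <tr><td><a href="http://www.nasa.gov">NASA</a></td></tr>
  <tr><td><a href="http://www.stanford.edu">Stanford</a></td></tr>
  <tr><td><a href="http://www.usa.gov/">USA.gov</a></td></tr>
  <tr><td><a id="top">no href</a></td></tr>
</table></body></html>"""

root = ET.fromstring(HTML)

# XPath 2.0:  //a[ ends-with(./@href, "gov") ]
# XPath 1.0:  //a[ substring(@href, string-length(@href) - 2) = "gov" ]
gov_links = [a for a in root.findall(".//a[@href]")
             if a.get("href").endswith("gov")]

print([a.text for a in gov_links])  # ['NASA']
```

The USA.gov link is left out because its "href" ends with a slash, not with "gov".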
English: Give me a list of all <a> elements which do not contain an "id" attribute and also do not contain an "href" attribute which ends with "gov". XPath:
//a[not(./@id) and not(ends-with(./@href,"gov")) ]
The above XPath expression would return the following syntax to XSLT or XQuery: notEndsWith.txt.
English: Give me the text nodes from all the <a> elements whose text does not start with "http". XPath:
//a[ not (starts-with(./text(), "http")) ]/text()
The above XPath expression would return the following syntax to XSLT or XQuery: notStartsWithText.txt
English: Give me a list of all <a> elements which contain a text node which is exactly equal to "Stanford". XPath:
//a[ ./text() = "Stanford" ]
The above XPath expression would return the following syntax to XSLT or XQuery: aWithTextEquals.txt
English: Give me a list of all text nodes which reside in <a> elements which contain a text node which is exactly equal to "Stanford". In other words, if any <a> elements contain a text node equal to "Stanford", give me those text nodes. XPath:
//a[ ./text() = "Stanford" ]/text()
The above XPath expression would return the following syntax to XSLT or XQuery: textOfaWithTextEquals.txt
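
Here is a minimal Python sketch of the last two translations, using the standard xml.etree.ElementTree module (Python 3.7 or newer). ElementTree does not support text() inside a predicate, so the sketch uses its [.='...'] form, which compares the complete text of the element. For simple links like the made-up ones below, that gives the same answer.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not the real simpleWellformedNoNamespace.html.
HTML = """<html><body><table>
  <tr><td><a href="http://www.stanford.edu">Stanford</a></td></tr>
  <tr><td><a href="http://www.stanford.edu/about">Stanford University</a></td></tr>
  <tr><td><a href="http://www.nasa.gov">NASA</a></td></tr>
</table></body></html>"""

root = ET.fromstring(HTML)

# Rough equivalent of //a[ ./text() = "Stanford" ]: the boxes.
stanford_links = root.findall(".//a[.='Stanford']")

# Rough equivalent of //a[ ./text() = "Stanford" ]/text(): the candy.
stanford_text = [a.text for a in stanford_links]

print([a.get("href") for a in stanford_links])  # ['http://www.stanford.edu']
print(stanford_text)                            # ['Stanford']
```

"Stanford University" is not returned, since the comparison is for exact equality and not for a prefix.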
I typed up a few more translations and linked them below in a text file: moreTranslations.txt yetMore.txt