Parsing HTML/XML (GNU Emacs Lisp Reference Manual)

31.26 Parsing HTML and XML

When Emacs is compiled with libxml2 support, the following functions are available to parse HTML or XML text into Lisp object trees.
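Because these functions exist only in builds with libxml2 support, code that depends on them may want to test for that support first. The following is a minimal sketch, assuming Emacs 27.1 or later for libxml-available-p; the name my-libxml-usable-p is hypothetical and used only for illustration.

```elisp
;; Sketch: guard libxml2-dependent code behind an availability check.
;; `my-libxml-usable-p' is a hypothetical helper, not part of Emacs.
(defun my-libxml-usable-p ()
  "Return non-nil if the libxml2 parsing functions appear usable."
  (if (fboundp 'libxml-available-p)
      ;; Emacs 27.1 and later: ask Emacs directly.
      (libxml-available-p)
    ;; Older versions (assumption): fall back to checking whether
    ;; the parsing function itself is defined in this build.
    (fboundp 'libxml-parse-html-region)))
```

A caller could then wrap its parsing code in (when (my-libxml-usable-p) ...) and fall back to some other strategy otherwise.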

Function: libxml-parse-html-region start end &optional base-url

This function parses the text between start and end as HTML, and returns a list representing the HTML parse tree. It attempts to handle “real world” HTML by robustly coping with syntax mistakes.

The optional argument base-url, if non-nil, should be a string specifying the base URL for relative URLs occurring in links.

In the parse tree, each HTML node is represented by a list in which the first element is a symbol representing the node name, the second element is an alist of node attributes, and the remaining elements are the subnodes.

The following example demonstrates this. Given this (malformed) HTML document:

<html><head></head><body width=101><div class=thing>Foo<div>Yes

A call to libxml-parse-html-region returns this:

(html ()
  (head ())
  (body ((width . "101"))
   (div ((class . "thing"))
    "Foo"
    (div ()
      "Yes"))))
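As a sketch of how such a tree might be produced and taken apart, the code below inserts the document above into a temporary buffer, parses it, and extracts the parts of the body node using ordinary list functions. The helper name my-parse-html-string is hypothetical, and the code assumes an Emacs built with libxml2 support.

```elisp
;; Sketch: parse an HTML string and inspect one node of the result.
;; `my-parse-html-string' is a hypothetical helper, not part of Emacs.
(defun my-parse-html-string (html)
  "Parse the string HTML and return its parse tree."
  (with-temp-buffer
    (insert html)
    (libxml-parse-html-region (point-min) (point-max))))

(let* ((tree (my-parse-html-string
              "<html><head></head><body width=101><div class=thing>Foo<div>Yes"))
       ;; The subnodes of a node start at its third element.
       ;; `assq' skips string subnodes and finds the `body' node.
       (body (assq 'body (cddr tree))))
  (list (car body)                       ; node name
        (cadr body)                      ; alist of attributes
        (cdr (assq 'width (cadr body)))  ; value of one attribute
        (cddr body)))                    ; list of subnodes
```

Looking up the node by name with assq, rather than by position, keeps the code independent of how many subnodes precede it.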

Function: shr-insert-document dom

This function renders the parsed HTML in dom into the current buffer. The argument dom should be a list as generated by libxml-parse-html-region. It is used, for example, by EWW; see The Emacs Web Wowser Manual.
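A minimal sketch of combining the two functions follows: it parses an HTML string in one temporary buffer and renders the resulting tree in another. The helper name my-render-html-string is hypothetical, and the code assumes libxml2 support and that the shr library is loadable.

```elisp
;; Sketch: render an HTML string to text with shr.
;; `my-render-html-string' is a hypothetical helper, not part of Emacs.
(require 'shr)

(defun my-render-html-string (html)
  "Parse and render the string HTML, returning the rendered text."
  (let ((dom (with-temp-buffer
               (insert html)
               (libxml-parse-html-region (point-min) (point-max)))))
    (with-temp-buffer
      ;; Rendering inserts into the current buffer, here a scratch one.
      (shr-insert-document dom)
      (buffer-string))))

(my-render-html-string
 "<html><body><p>Hello, <b>world</b>!</p></body></html>")
```

The returned string may carry text properties such as faces, since the rendering is meant for display in a buffer.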

Function: libxml-parse-xml-region start end &optional base-url

This function is the same as libxml-parse-html-region, except that it parses the text as XML rather than HTML (so it is stricter about syntax).
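The call pattern is the same as for HTML. The sketch below parses a small well-formed XML string; the helper name my-parse-xml-string and the sample document are made up for illustration, and the code assumes libxml2 support.

```elisp
;; Sketch: parse a well-formed XML string into a Lisp tree.
;; `my-parse-xml-string' is a hypothetical helper, not part of Emacs.
(defun my-parse-xml-string (xml)
  "Parse the string XML and return its parse tree."
  (with-temp-buffer
    (insert xml)
    (libxml-parse-xml-region (point-min) (point-max))))

(let ((tree (my-parse-xml-string
             "<note id=\"1\"><to>Alice</to><body>Hi</body></note>")))
  (list (car tree)                     ; name of the root node
        (cdr (assq 'id (cadr tree)))   ; value of its id attribute
        (assq 'to (cddr tree))))       ; the `to' subnode
```

The resulting tree uses the same node representation as the HTML parser: a symbol, an alist of attributes, then the subnodes.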