HTMLTag

HTMLTag.py

HTMLTag defines a class of the same name that represents HTML content. An additional HTMLReader class kicks off the process of reading an HTML file into a set of tags:

from WebUtils.HTMLTag import HTMLReader
reader = HTMLReader()
tag = reader.readFileNamed('foo.html')
tag.pprint()

Tags have attributes and children, which makes them hierarchical. See HTMLTag class docs for more info.

Note that you imported HTMLReader instead of HTMLTag. You only need the latter if you plan on creating tags directly.

You can discard the reader immediately if you like:

tag = HTMLReader().readFileNamed('foo.html')

The point of reading HTML into tag objects is so that you have a concrete, Pythonic data structure to work with. The original motivation for such a beast was in building automated regression test suites that wanted granular, structured access to the HTML output by the web application.

See the doc string for HTMLTag for examples of what you can do with tags.

exception WebUtils.HTMLTag.HTMLNotAllowedError(msg, **values)

Bases: HTMLTagError

HTML tag not allowed here error

__init__(msg, **values)
args
with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

class WebUtils.HTMLTag.HTMLReader(emptyTags=None, extraEmptyTags=None, fakeRootTagIfNeeded=True)

Bases: HTMLParser

Reader class for representing HTML as tag objects.

NOTES

  • Special attention is required regarding tags like <p> and <li> which sometimes are closed and sometimes not. HTMLReader can deal with both situations (closed and not) provided that:

    • the file doesn’t change conventions for a given tag

    • the reader knows ahead of time what to expect

By default, HTMLReader assumes that <p> and <li> will be closed with </p> and </li> as the official HTML spec encourages.

But if your files leave out closing tags that are supposed to be required, you can do this:

HTMLReader(extraEmptyTags=['p', 'li'])

or:

reader.extendEmptyTags(['p', 'li'])

or just set them entirely:

HTMLReader(emptyTags=['br', 'hr', 'p'])
reader.setEmptyTags(['br', 'hr', 'p'])

Setting them entirely means listing quite a few tags, though: the global DefaultEmptyTags list, which is used to initialize the reader’s empty tags, contains about 16 tag names.

If an HTML file doesn’t conform to the reader’s expectation, you will get an exception (see more below for details).

If your HTML file doesn’t contain root <html> ... </html> tags wrapping everything, a fake root tag will be constructed for you, unless you pass in fakeRootTagIfNeeded=False.

Besides fixing your reader manually, you could conceivably loop through the permutations of the various empty tags to see if one of them resulted in a correct read.

Or you could fix the HTML.

  • The reader ignores extra preceding and trailing whitespace by stripping it from strings. I suppose this is a little harsher than reducing spans of preceding and trailing whitespace down to one space, which is what really happens in an HTML browser.

  • The reader will not read past the closing </html> tag.

  • The reader is picky about the correctness of the HTML you feed it. If tags are not closed, overlap (instead of nest) or are left unfinished, an exception is raised. These exceptions include HTMLTagUnbalancedError, HTMLTagIncompleteError and HTMLNotAllowedError, which all inherit from HTMLTagError.

    This pickiness can be quite useful for the validation of the HTML of your own applications.
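The effect of the empty tag list and of whitespace stripping can be illustrated with a toy reader built on the standard library's html.parser.HTMLParser. This is a minimal sketch under stated assumptions, not Webware's implementation: MiniReader, UnbalancedError, DEFAULT_EMPTY and read() are hypothetical names, and the short DEFAULT_EMPTY set only stands in for the real DefaultEmptyTags list.

```python
from html.parser import HTMLParser

# Hypothetical stand-in, not Webware's DefaultEmptyTags list.
DEFAULT_EMPTY = {'br', 'hr', 'img', 'input', 'meta', 'link'}


class UnbalancedError(Exception):
    """Stand-in for an unbalanced tag error."""


class MiniReader(HTMLParser):
    """Toy reader that builds (name, children) nodes; illustration only."""

    def __init__(self, extra_empty=()):
        super().__init__()
        self.empty = DEFAULT_EMPTY | set(extra_empty)
        self.root = ('#root', [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)
        if tag not in self.empty:
            self.stack.append(node)  # a matching end tag is expected later

    def handle_endtag(self, tag):
        if tag in self.empty:
            return  # empty tags never sit on the stack
        if self.stack[-1][0] != tag:
            raise UnbalancedError(f'</{tag}> closes <{self.stack[-1][0]}>')
        self.stack.pop()

    def handle_data(self, data):
        text = data.strip()  # drop preceding and trailing whitespace
        if text:
            self.stack[-1][1].append(text)


def read(html, extra_empty=()):
    """Parse html and return the list of top-level nodes."""
    reader = MiniReader(extra_empty)
    reader.feed(html)
    reader.close()
    return reader.root[1]


# The same input fails or succeeds depending on what the reader expects.
print(read('<div><p> one <p> two </div>', extra_empty=['p']))
```

With 'p' treated as empty, each <p> becomes a childless node and the text lands in the enclosing <div>. Without it, the toy reader is still waiting for </p> when </div> arrives and raises UnbalancedError.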

CDATA_CONTENT_ELEMENTS = ('script', 'style')
__init__(emptyTags=None, extraEmptyTags=None, fakeRootTagIfNeeded=True)

Initialize and reset this instance.

If convert_charrefs is True (the default), all character references are automatically converted to the corresponding Unicode characters.

check_for_whole_start_tag(i)
clear_cdata_mode()
close()

Handle any buffered data.

computeTagContainmentConfig()
emptyTags()

Return a list of empty tags.

See also: class docs and setEmptyTags().

error(message)
extendEmptyTags(tagList)

Extend the current list of empty tags with the given list.

feed(data)

Feed data to the parser.

Call this as often as you want, with as little or as much text as you want (may include ‘\n’).
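Since feed() is inherited from the standard library's HTMLParser, incremental feeding can be shown with that base class alone. The sketch below uses a hypothetical ChunkLogger subclass (not part of WebUtils) and splits the input in the middle of a start tag and in the middle of the text.

```python
from html.parser import HTMLParser


class ChunkLogger(HTMLParser):
    """Hypothetical helper that records tag events and text pieces."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(('start', tag))

    def handle_endtag(self, tag):
        self.tags.append(('end', tag))

    def handle_data(self, data):
        # Text may arrive in several pieces, so collect and join later.
        self.text.append(data)


parser = ChunkLogger()
for chunk in ('<p cla', 'ss="x">Hel', 'lo\nworld</p>'):
    parser.feed(chunk)  # an incomplete start tag is buffered until complete
parser.close()          # handle any data still buffered

print(parser.tags)
print(repr(''.join(parser.text)))
```

The start tag is reported once even though it was split across two calls. Text can be delivered in more than one handle_data() call, which is why the sketch joins the pieces.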

filename()

Return the name of the file if one has been read, otherwise None.

get_starttag_text()

Return full source of start tag: ‘<…>’.

getpos()

Return current line number and offset.

goahead(end)
handle_charref(name)
handle_comment(data)
handle_data(data)
handle_decl(decl)
handle_endtag(tag)
handle_entityref(name)
handle_pi(data)
handle_startendtag(tag, attrs)
handle_starttag(tag, attrs)
main(args=None)

The command line equivalent of readFileNamed().

Invoked when HTMLTag is run as a program.

parse_bogus_comment(i, report=1)
parse_comment(i, report=1)
parse_declaration(i)
parse_endtag(i)
parse_html_declaration(i)
parse_marked_section(i, report=1)
parse_pi(i)
parse_starttag(i)
pprint(out=None)

Pretty prints the tag, its attributes and all its children.

Indentation is used for subtags. Print ‘Empty.’ if there is no root tag.

printsStack()
readFileNamed(filename, retainRootTag=True, encoding='utf-8')

Read the given file.

Relies on readString(). See that method for more information.

readString(string, retainRootTag=True)

Read the given string, store the results and return the root tag.

You could continue to use the HTMLReader object or disregard it and simply use the root tag.

reset()

Reset this instance. Loses all unprocessed data.

rootTag()

Return the root tag.

May return None if no HTML has been read yet, or if the last invocation of one of the read methods was passed retainRootTag=False.

setEmptyTags(tagList)

Set the HTML tags that are considered empty such as <br> and <hr>.

The default is found in the global, DefaultEmptyTags, and is fairly thorough, but does not include <p>, <li> and some other tags that HTML authors often use as empty tags.

setPrintsStack(flag)

Set the boolean value of the “prints stack” option.

This is a debugging option which will print the internal tag stack during HTML processing. The default value is False.

set_cdata_mode(elem)
tagContainmentConfig = {'body': 'cannotHave  html head body', 'head': 'cannotHave  html head body', 'html': 'canOnlyHave head body', 'select': 'canOnlyHave option', 'table': 'canOnlyHave tr thead tbody tfoot a', 'td': 'cannotHave  td tr', 'tr': 'canOnlyHave th td'}
unescape(s)
unknown_decl(data)
updatepos(i, j)
usage()
class WebUtils.HTMLTag.HTMLTag(name, lineNumber=None)

Bases: object

Container class for representing HTML as tag objects.

Tags essentially have 4 major attributes:

  • name

  • attributes

  • children

  • subtags

Name is simple:

print(tag.name())

Attributes are dictionary-like in nature:

print(tag.attr('color'))  # throws an exception if no color
print(tag.attr('bgcolor', None))  # returns None if no bgcolor
print(tag.attrs())

Children are all the leaf parts of a tag, consisting of other tags and strings of character data:

print(tag.numChildren())
print(tag.childAt(0))
print(tag.children())

Subtags is a convenient list of only the tags in the children:

print(tag.numSubtags())
print(tag.subtagAt(0))
print(tag.subtags())
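The relationship between children and subtags can be sketched with a small stand-in class. Tag below is hypothetical and far simpler than HTMLTag; it only assumes that subtags are the children that are themselves tags, as described above.

```python
class Tag:
    """Hypothetical minimal tag holding child tags and strings in order."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def subtags(self):
        # Keep only the children that are tags, dropping character data.
        return [c for c in self.children if isinstance(c, Tag)]


# Roughly <td>Total: <b>42</b> items</td>
cell = Tag('td', ['Total: ', Tag('b', ['42']), ' items'])
print(len(cell.children))                # 3 children: text, <b>, text
print([t.name for t in cell.subtags()])  # ['b']
```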

You can search a tag and all the tags it contains for a tag with a particular attribute matching a particular value:

print(tag.tagWithMatchingAttr('width', '100%'))

An HTMLTagAttrLookupError is raised if no matching tag is found. You can avoid this by providing a default value:

print(tag.tagWithMatchingAttr('width', '100%', None))

Looking for specific ‘id’ attributes is common in regression testing (it allows you to zero in on logical portions of a page), so a convenience method is provided:

tag = htmlTag.tagWithId('accountTable')
__init__(name, lineNumber=None)
addChild(child)

Add a child to the receiver.

The child will be another tag or a string (CDATA).

attr(name, default=<class 'MiscUtils.NoDefault'>)
attrs()
childAt(index)
children()
closedBy(name, lineNumber)
hasAttr(name)
name()
numAttrs()
numChildren()
numSubtags()
pprint(out=None, indent=0)
readAttr(name, value)

Set an attribute of the tag with the given name and value.

An HTMLTagAttrLookupError is raised if an attribute is set twice.

subtagAt(index)
subtags()
tagWithId(id_, default=<class 'MiscUtils.NoDefault'>)

Search for tag with a given id.

Finds and returns the tag with the given id. As in:

<td id=foo> bar </td>

This is just a cover for:

tagWithMatchingAttr('id', id_, default)

But searching for ids is so popular (at least when regression testing web sites) that this convenience method is provided. Why is it so popular? Because by attaching ids to logical portions of your HTML, your regression test suite can quickly zero in on them for examination.

tagWithMatchingAttr(name, value, default=<class 'MiscUtils.NoDefault'>)

Search for tag with matching attributes.

Performs a depth-first search for a tag with an attribute that matches the given value. The search includes the tag itself and all the tags it contains. If no such tag can be found, an HTMLTagAttrLookupError (a LookupError subclass) is raised unless a default value was specified, which is then returned.

Example:

tag = tag.tagWithMatchingAttr('bgcolor', '#FFFFFF', None)
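A depth-first search with an optional default can be sketched as follows. This is an illustration under stated assumptions, not the library's code: Node, find_with_attr() and the local NoDefault class are hypothetical stand-ins, and the sketch raises a plain LookupError where the library raises its own error class.

```python
class NoDefault:
    """Sentinel meaning 'no default given' (stand-in for MiscUtils.NoDefault)."""


class Node:
    """Hypothetical minimal tag: a name, attributes and child nodes/strings."""

    def __init__(self, name, attrs=None, children=()):
        self.name = name
        self.attrs = dict(attrs or {})
        self.children = list(children)


def find_with_attr(node, name, value, default=NoDefault):
    """Depth-first search for the first node whose attribute equals value."""
    stack = [node]
    while stack:
        current = stack.pop()
        if name in current.attrs and current.attrs[name] == value:
            return current
        # Push subtags in reverse so they are visited in document order.
        stack.extend(c for c in reversed(current.children)
                     if isinstance(c, Node))
    if default is NoDefault:
        raise LookupError(f'no tag with {name}={value!r}')
    return default


cell = Node('td', {'id': 'foo'}, ['bar'])
table = Node('table', {'width': '100%'}, [Node('tr', children=[cell])])
page = Node('html', children=[Node('body', children=[table])])

print(find_with_attr(page, 'id', 'foo').name)
print(find_with_attr(page, 'id', 'missing', None))
```

Using a sentinel class instead of None as the default lets callers pass None as a legitimate fallback value, which matches the calling pattern shown in the example above.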
exception WebUtils.HTMLTag.HTMLTagAttrLookupError(msg, **values)

Bases: HTMLTagError, LookupError

HTML tag attribute lookup error

__init__(msg, **values)
args
with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
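Because of its two bases, this error can be caught either as a general tag error or as a lookup error. The sketch below mirrors the documented bases with local stand-in classes (they are not imported from WebUtils, and caught_as() is a hypothetical helper) to show which except clauses match.

```python
class HTMLTagError(Exception):
    """Stand-in mirroring the documented base: Exception."""


class HTMLTagAttrLookupError(HTMLTagError, LookupError):
    """Stand-in mirroring the documented bases: HTMLTagError, LookupError."""


def caught_as(exc_type):
    """Return True if the stand-in lookup error is caught as exc_type."""
    try:
        raise HTMLTagAttrLookupError('no matching tag')
    except exc_type:
        return True
    except Exception:
        return False


print(caught_as(HTMLTagError))  # True
print(caught_as(LookupError))   # True
print(caught_as(KeyError))      # False: KeyError is a sibling, not a base
```

Code that wants to handle a failed tag search should therefore catch HTMLTagAttrLookupError, HTMLTagError or LookupError; an except KeyError clause would not match a class with these bases.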

exception WebUtils.HTMLTag.HTMLTagError(msg, **values)

Bases: Exception

General HTML tag error

__init__(msg, **values)
args
with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

exception WebUtils.HTMLTag.HTMLTagIncompleteError(msg, **values)

Bases: HTMLTagError

HTML tag incomplete error

__init__(msg, **values)
args
with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

exception WebUtils.HTMLTag.HTMLTagProcessingInstructionError(msg, **values)

Bases: HTMLTagError

HTML tag processing instruction error

__init__(msg, **values)
args
with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

exception WebUtils.HTMLTag.HTMLTagUnbalancedError(msg, **values)

Bases: HTMLTagError

Unbalanced HTML tag error

__init__(msg, **values)
args
with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

class WebUtils.HTMLTag.TagCanOnlyHaveConfig(name, tags)

Bases: TagConfig

__init__(name, tags)
encounteredTag(tag, lineNum)
class WebUtils.HTMLTag.TagCannotHaveConfig(name, tags)

Bases: TagConfig

__init__(name, tags)
encounteredTag(tag, lineNum)
class WebUtils.HTMLTag.TagConfig(name, tags)

Bases: object

__init__(name, tags)
encounteredTag(tag, lineNum)
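The rule strings in HTMLReader.tagContainmentConfig suggest two kinds of checks, which the config classes above appear to implement. The following is one plausible reading only: parse_rule() and allowed() are hypothetical helpers, and the assumption that a rule applies to tags directly inside the named parent is not confirmed by this reference. The dictionary is copied from the tagContainmentConfig attribute listed above.

```python
CONFIG = {
    'body': 'cannotHave  html head body',
    'head': 'cannotHave  html head body',
    'html': 'canOnlyHave head body',
    'select': 'canOnlyHave option',
    'table': 'canOnlyHave tr thead tbody tfoot a',
    'td': 'cannotHave  td tr',
    'tr': 'canOnlyHave th td',
}


def parse_rule(rule):
    """Split a rule like 'canOnlyHave head body' into (kind, tag set)."""
    kind, *tags = rule.split()
    return kind, set(tags)


def allowed(config, parent, child):
    """Return True if child may appear inside parent (assumed semantics)."""
    if parent not in config:
        return True  # no rule for this parent
    kind, tags = parse_rule(config[parent])
    if kind == 'canOnlyHave':
        return child in tags
    if kind == 'cannotHave':
        return child not in tags
    raise ValueError(f'unknown rule kind: {kind}')


print(allowed(CONFIG, 'html', 'div'))  # False: html can only have head, body
print(allowed(CONFIG, 'td', 'p'))      # True: td only excludes td and tr
```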