Extracting HREF attributes from HTML files

The Problem

We want to build a program to extract all the hyper-references from an HTML file.

The references to be extracted occur inside HTML elements. Elements are delimited by angle brackets (< and >), and the references themselves are enclosed in double quotes (") or single quotes ('). The implementation below also accepts unquoted references, which run from the = sign to the following space.

The only references we are interested in are those specified in HREF attributes. As an example, consider:

<A href="file:/etc/passwd">

HTML files may contain whitespace (zero or more spaces, tabs, and newlines) in and around elements. References must be printed without their delimiters. For simplicity, assume that the " and ' delimiters cannot occur inside references. Since HTML is not case sensitive, the HREF attribute can also be written as href, Href, hREF, and so on.
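The matching rules above can be illustrated with a minimal Python sketch. It is not the implementation discussed in this document, only one way to read the rules: it walks the attributes of a single element, treats the attribute name case-insensitively, tolerates whitespace around the = sign, and returns the value without its delimiters. The names ATTR_RE and hrefs_in_element are hypothetical, chosen for this illustration.

```python
import re

# Hypothetical pattern (an assumption, not taken from the original text):
# one attribute name, optionally followed by '=' and a value.
ATTR_RE = re.compile(
    r"""([^\s=<>"'/]+)          # attribute (or element) name
        (?:\s*=\s*              # optional equals sign, whitespace allowed around it
           (?:"([^"]*)"         # double-quoted value
             |'([^']*)'         # single-quoted value
             |([^\s>]+)         # unquoted value: runs to the next whitespace or >
           )
        )?""",
    re.VERBOSE,
)

def hrefs_in_element(element: str) -> list:
    """Return the HREF values found in one element, without delimiters."""
    refs = []
    for match in ATTR_RE.finditer(element):
        name, double_quoted, single_quoted, unquoted = match.groups()
        if name.lower() != "href":
            continue  # values of other attributes are consumed and ignored
        for value in (double_quoted, single_quoted, unquoted):
            if value is not None:
                refs.append(value)
                break
    return refs

print(hrefs_in_element('<A href="file:/etc/passwd">'))
```

Because each attribute value is consumed together with its name, text that merely looks like a reference inside another attribute's string (for example alt="href=x") is skipped rather than reported.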

References inside comments (delimited by the sequences <!-- and -->, possibly nested) should not be collected. Likewise, references inside strings other than those described above should be ignored.
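One way to honor the comment rule is to remove comments before looking for elements. The Python sketch below assumes the comment delimiters are the usual HTML sequences <!-- and -->, and uses a depth counter so that nested comments are skipped as a whole. The function name strip_comments is hypothetical, and the sketch does not try to recognize delimiters that appear inside quoted strings.

```python
def strip_comments(html: str) -> str:
    """Return html with every (possibly nested) <!-- ... --> comment removed."""
    kept = []
    depth = 0  # number of comments currently open
    i = 0
    while i < len(html):
        if html.startswith("<!--", i):
            depth += 1            # entering a comment, possibly a nested one
            i += 4
        elif depth > 0 and html.startswith("-->", i):
            depth -= 1            # leaving the innermost open comment
            i += 3
        else:
            if depth == 0:
                kept.append(html[i])  # keep only text outside all comments
            i += 1
    return "".join(kept)

print(strip_comments('<a href="kept"><!-- <a href="dropped"> -->'))
```

With this approach an unterminated comment swallows the rest of the input, and a stray --> outside any comment is left in place as ordinary text.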

A Possible Implementation

Note that this implementation is simplified and may not handle every intricacy of the HTML specification.