Extracting HREF attributes from HTML files

The Problem

We want to build a program to extract all the hyper-references from an HTML file.

The references to be extracted occur inside HTML elements. Elements are delimited by angle brackets (< and >), and the references themselves are enclosed in double quotes (") or single quotes ('). The implementation below also accepts unquoted references, which run from the = sign to the next white space or the closing >.

The only references we are interested in are those specified in HREF attributes. For example, the following element contains the reference file:/etc/passwd:

<A href="file:/etc/passwd">
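The three delimiter forms can be sketched in plain C++ as follows. This is only an illustration, not part of the flex implementation: `reference_value` is a hypothetical helper, and it assumes that its argument is the text that immediately follows `href=` inside an element.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: `s` is the text that follows "href=" inside a tag.
// Returns the reference with its delimiters removed.
std::string reference_value(const std::string &s) {
  if (s.empty()) return "";
  if (s[0] == '"' || s[0] == '\'') {
    // Quoted form: the value runs up to the matching closing quote.
    std::size_t end = s.find(s[0], 1);
    // If the closing quote is missing, take the rest of the string.
    return s.substr(1, end == std::string::npos ? end : end - 1);
  }
  // Unquoted form: the value ends at the first white space or '>'.
  return s.substr(0, s.find_first_of(" \t\n>"));
}
```

Calling `reference_value` on the text `"file:/etc/passwd">` yields `file:/etc/passwd`, which is what the program is expected to print for the element above.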

HTML files may contain any amount of white space (spaces, tabs, and newlines) in and around elements. References must be printed without their delimiters. For simplicity, assume that the " and ' delimiters cannot occur inside references. Since HTML attribute names are not case sensitive, the HREF attribute can also be written as href, Href, hREF, or any other combination of upper and lower case.
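The case-insensitivity requirement can be illustrated with a small C++ sketch. The name `is_href` is hypothetical and does not appear in the implementation below, which instead lists both cases of each letter in a character class.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: true if `name` spells "href" in any letter case.
bool is_href(const std::string &name) {
  static const std::string target = "href";
  return name.size() == target.size() &&
         std::equal(name.begin(), name.end(), target.begin(),
                    [](unsigned char a, unsigned char b) {
                      // Lower-case the candidate; `target` is already lower case.
                      return std::tolower(a) == b;
                    });
}
```

With this helper, `href`, `Href`, and `hREF` are all accepted, while other attribute names such as `src` are rejected.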

References inside comments (delimited by the sequences <!-- and -->, possibly nested) must not be collected. Likewise, references that appear inside strings other than those described above must be ignored.
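Nested comments can be tracked with a depth counter, which plays the same role as a lexer state stack. The sketch below is a plain C++ illustration under the assumption stated in the problem, namely that `<!--` and `-->` may nest. `strip_comments` is a hypothetical name, not part of the implementation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns `html` with every (possibly nested)
// <!-- ... --> comment removed, using a depth counter in place of
// a lexer state stack.
std::string strip_comments(const std::string &html) {
  std::string out;
  int depth = 0;  // number of currently open comments
  std::size_t i = 0;
  while (i < html.size()) {
    if (html.compare(i, 4, "<!--") == 0) {
      ++depth;  // entering a (possibly nested) comment
      i += 4;
    } else if (depth > 0 && html.compare(i, 3, "-->") == 0) {
      --depth;  // leaving the innermost open comment
      i += 3;
    } else {
      if (depth == 0) out += html[i];  // keep only text outside comments
      ++i;
    }
  }
  return out;
}
```

Text is copied only while the depth is zero, so a reference that appears anywhere inside a comment, at any nesting level, never reaches the output.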

Implementation

Note that this implementation is simplified and may not handle all the intricacies of the HTML specification.

File hrefs.l
%option stack 8bit noyywrap yylineno

%{

#include <iostream>
#include <vector>
#include <string>

static std::vector<std::string> hrefs;

/*

  Sample input. The first three references must be collected
  (unquoted, single-quoted, and double-quoted forms):

  <any-tag any-attr="ldkj" other='mjsdhfd' href=http://bla.bla.bla more="">
  <any-tag any-attr="ldkj" other='mjsdhfd' HrEf = 'http://bla.bla.bla' more="">
  <any-tag any-attr="ldkj" other='mjsdhfd' hREf="http://bla.bla.bla" more="">

  The reference inside the comment must be ignored:

  <!-- href=http://nao.e.para.apanhar.este/ -->

*/

// Report errors on stderr so they do not mix with the list of references.
inline void yyerror(const char *msg) { std::cerr << msg << std::endl; }

%}

SPACE   [ \t]
ID      [[:alpha:]]([[:alnum:]]|:|-|_)*
HREF    [Hh][Rr][Ee][Ff]{SPACE}*={SPACE}*
ATTR    {ID}{SPACE}*={SPACE}*

%x X_COMMENT X_TAG
%x X_DONTCARE X_DCA X_DCP
%x X_REF1 X_REFA X_REFP

%%

<INITIAL,X_COMMENT>"<!--"       yy_push_state(X_COMMENT);
<X_COMMENT>"-->"                yy_pop_state();
<X_COMMENT>.|\n                 ;

"<"{ID}                         yy_push_state(X_TAG);
<X_TAG>">"                      yy_pop_state();
<X_TAG>{HREF}                   yy_push_state(X_REF1);
<X_TAG>{ATTR}                   yy_push_state(X_DONTCARE);
<X_TAG>.|\n                     ;

<X_REF1>\"                      yy_push_state(X_REFA);
<X_REF1>\'                      yy_push_state(X_REFP);
<X_REF1>{SPACE}|\n              {
  yyless(yyleng-1);
  hrefs.push_back(std::string(yytext));
  yy_pop_state();
}
<X_REF1>">"                     {
  yyless(yyleng-1);
  hrefs.push_back(std::string(yytext));
  yy_pop_state();
}

<X_REFA>\"                      {
  yyless(yyleng-1);
  hrefs.push_back(std::string(yytext));
  yy_pop_state();
  yy_pop_state();
}
<X_REFP>\'                      {
  yyless(yyleng-1);
  hrefs.push_back(std::string(yytext));
  yy_pop_state();
  yy_pop_state();
}
<X_REF1,X_REFA,X_REFP>.         yymore();
<X_REFA,X_REFP>\n               yyerror("ERROR: newline inside quoted reference");

<X_DONTCARE>{SPACE}|\n          yy_pop_state();
<X_DONTCARE>\"                  yy_push_state(X_DCA);
<X_DONTCARE>\'                  yy_push_state(X_DCP);
<X_DONTCARE>">"                 yyless(yyleng-1); yy_pop_state();
<X_DONTCARE>.                   ;

<X_DCA>\"                       yy_pop_state(); yy_pop_state();
<X_DCP>\'                       yy_pop_state(); yy_pop_state();
<X_DCA,X_DCP>.                  ;
<X_DCA,X_DCP>\n                 yyerror("ERROR: newline inside quoted attribute value");

.|\n                            ;

%%

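int main() {
  yylex();  // scan standard input, collecting references in hrefs
  // Use the vector's own index type to avoid a signed/unsigned comparison.
  for (std::vector<std::string>::size_type i = 0; i < hrefs.size(); i++)
    std::cout << "REF " << i << ": " << hrefs[i] << std::endl;
  return 0;
}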

How to compile?

 prompt% flex hrefs.l
 prompt% g++ -o href lex.yy.c

Note that we provide our own main function and use the noyywrap option, so we need not link with libfl.a, which supplies default versions of main and yywrap.