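Last week I spent a good while tinkering with the FLEX lexical analyzer. I played with it many years ago but had forgotten all of it, so I picked it up again and am jotting down a few notes to make the next restart faster.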
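What is FLEX? A tool that splits text into tokens according to rules written as regular expressions and runs an action for each match, typically printing the text in a new format. Think of it as regular expressions applied in a loop.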
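To make the "regular expressions in a loop" idea concrete, here is a minimal sketch in plain C. It is not what flex generates: it assumes the POSIX `regex.h` API (not available with MSVC), and the function name `count_numbers` is made up for illustration. It applies one pattern repeatedly and resumes right after each match, which is roughly what `yylex()` does with a whole set of rules at once.

```c
#include <regex.h>   /* POSIX regex API; an assumption, not part of flex */
#include <stddef.h>

/* Hypothetical helper: count the runs of digits in `text` by applying
 * one regular expression over and over, resuming after each match.
 * Returns -1 if the pattern fails to compile. */
int count_numbers(const char *text)
{
    regex_t re;
    regmatch_t m;
    int count = 0;

    if (regcomp(&re, "[0-9]+", REG_EXTENDED) != 0)
        return -1;

    /* regexec() finds the next match in the remaining text. */
    while (regexec(&re, text, 1, &m, 0) == 0) {
        count++;
        text += m.rm_eo;   /* continue right after this token */
    }

    regfree(&re);
    return count;
}
```

A generated scanner differs in two ways worth remembering: it tries all rules at the current position and takes the longest match, and it compiles the patterns into a table ahead of time instead of calling a regex library at run time.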
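What is it for? Usually for reformatting data, for example parsing source code or other structured text.
How do you use it? Write a .l template file that describes the rules, run flex.exe on it to generate the C source of the scanner (lex.yy.c by default), copy that into your project, and you are done.
Any problems? Chinese text is awkward to handle. I have no need for it right now, so I will dig into that when the need comes up.
What is the payoff? It opens up new ideas, and it is one more tool for being lazy.
Sample template, adapted from the sgml.l lexer used for parsing HTML. It is pleasant to run on simple input.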
test.l:
%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h> /* stricmp (MSVC); on POSIX use strcasecmp from <strings.h> */
int num_num=0,num_id=0;
#define fileno _fileno
int bShow = 0;
%}
/* $Id: sgml.l,v 1.9 1996/02/07 15:32:28 connolly Exp $ */
/* sgml.l -- a lexical analyzer for Basic+/- SGML Documents
* See: "A Lexical Analyzer for HTML and Basic SGML"
*/
/*
* NOTE: We assume the locale used by lex and the C compiler
* agrees with ISO-646-IRV; for example: '1' == 0x31.
*/
/* Figure 1 -- Character Classes: Abstract Syntax */
Digit [0-9]
LCLetter [a-z]
Special ['()_,\-\./:=?]
UCLetter [A-Z]
/* Figure 2 -- Character Classes: Concrete Syntax */
LCNMCHAR [\.-]
/* LCNMSTRT [] */
UCNMCHAR [\.-]
/* UCNMSTRT [] */
/* @# hmmm. sgml spec says \015 */
RE \n
/* @# hmmm. sgml spec says \012 */
RS \r
SEPCHAR \011
SPACE \040
/* Figure 3 -- Reference Delimiter Set: General */
COM "--"
CRO "&#"
DSC "]"
DSO "["
ERO "&"
ETAGO "</"
LIT \"
LITA "'"
MDC ">"
MDO "<!"
MSC "]]"
NET "/"
PERO "%"
PIC ">"
PIO "<?"
REFC ";"
STAGO "<"
TAGC ">"
/* 9.2.1 SGML Character */
/*name_start_character {LCLetter}|{UCLetter}|{LCNMSTRT}|{UCNMSTRT}*/
name_start_character {LCLetter}|{UCLetter}
name_character {name_start_character}|{Digit}|{LCNMCHAR}|{UCNMCHAR}|[\xa1-\xff]
/* 9.3 Name */
name {name_start_character}{name_character}*
number {Digit}+
number_token {Digit}{name_character}*
name_token {name_character}+
/* 6.2.1 Space */
s {SPACE}|{RE}|{RS}|{SEPCHAR}
ps ({SPACE}|{RE}|{RS}|{SEPCHAR})+
/* trailing white space */
ws ({SPACE}|{RE}|{RS}|{SEPCHAR})*
/* 9.4.5 Reference End */
reference_end ({REFC}|{RE})
/*
* 10.1.2 Parameter Literal
* 7.9.3 Attribute Value Literal
* (we leave recognition of character references and entity references,
* and whitespace compression to further processing)
*
* @# should split this into minimum literal, parameter literal,
* @# and attribute value literal.
*/
literal ({LIT}[^\"]*{LIT})|({LITA}[^\']*{LITA})
/* 9.6.1 Recognition modes */
/*
* Recognition modes are represented here by start conditions.
* The default start condition, INITIAL, represents the
* CON recognition mode. This condition is used to detect markup
* while parsing normal data characters (mixed content).
*
* The CDATA start condition represents the CON recognition
* mode with the restriction that only end-tags are recognized,
* as in elements with CDATA declared content.
* (@# no way to activate it yet: need hook to parser.)
*
* The TAG recognition mode is split into two start conditions:
* ATTR, for recognizing attribute value list sub-tokens in
* start-tags, and TAG for recognizing the TAGC (">") delimiter
* in end-tags.
*
* The MD start condition is used in markup declarations. The COM
* start condition is used for comment declarations.
*
* The DS condition is an approximation of the declaration subset
* recognition mode in SGML. As we only use this condition after signalling
* an error, it is merely a recovery device.
*
* The CXT, LIT, PI, and REF recognition modes are not separated out
* as start conditions, but handled within the rules of other start
* conditions. The GRP mode is not represented here.
*/
/* EXCERPT ACTIONS: START */
/* %x CON == INITIAL */
%x CDATA
%x TAG
%x ATTR
%x ATTRVAL
%x NETDATA
%x ENDTAG
/* this is only to be permissive with bad end-tags: */
%x JUNKTAG
%x MD
%x COM
%x DS
/* EXCERPT ACTIONS: STOP */
%%
  int *types = NULL;
  char **strings = NULL;
  size_t *lengths = NULL;
  int qty = 0;
  /*
   * See sgml_lex.c for description of
   * ADD, CALLBACK, ERROR, TOK macros.
   */
  /* <name -- start tag */
{STAGO}{name}{ws} {
//printf("TAG:[%s]\r\n",yytext);
//ADDCASE(SGML_START, yytext, yyleng);
//BEGIN(ATTR);
//Sleep(200);
}
  /* <a ^href = "xxx"> -- attribute name */
{name}{s}*={ws} {
if (stricmp(yytext,"href=")==0)
{
printf("ATTR:[%s]\r\n",yytext);
bShow = 1;
}
}
  /* <a name = ^xyz> -- name token */
{name_token}{ws} {
//printf("ATTR2:[%s]\r\n",yytext);
}
  /* <a href = ^"a b c"> -- literal */
{literal}{ws} {
if (bShow)
{
printf("VALUE:[%s]\r\n",yytext);
bShow = 0;
}
}
.|\r|\n {}
%%
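int main(int argc, char* argv[])
{
    yyin = fopen("./test.html", "r");
    if (yyin == NULL)
    {
        /* Without this check yylex() would read from a NULL stream. */
        perror("./test.html");
        return 1;
    }
    yylex();
    fclose(yyin);
    printf("num=%d,id=%d\n", num_num, num_id);
    return 0;
}
int yywrap() /* must be supplied by the user unless %option noyywrap is set */
{
    return 1; /* 1 = no more input files */
}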
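
One thing to watch in the attribute-name rule: the pattern `{name}{s}*={ws}` also consumes the whitespace around `=`, so `yytext` can be `href = ` rather than exactly `href=`, and the `stricmp(yytext,"href=")` test then misses it. A possible workaround, sketched below in standard C, is a small comparison helper. The name `attr_name_is` is hypothetical and is not part of flex or of sgml.l.

```c
#include <ctype.h>

/* Hypothetical helper: returns 1 when `text` consists of `name`
 * (compared case-insensitively), optional whitespace, '=', and
 * optional trailing whitespace; returns 0 otherwise. */
int attr_name_is(const char *text, const char *name)
{
    /* Match the attribute name itself, ignoring case. */
    while (*name != '\0') {
        if (tolower((unsigned char)*text) != tolower((unsigned char)*name))
            return 0;
        text++;
        name++;
    }
    /* Skip whitespace between the name and '='. */
    while (isspace((unsigned char)*text))
        text++;
    if (*text != '=')
        return 0;
    text++;
    /* Only whitespace may follow the '='. */
    while (isspace((unsigned char)*text))
        text++;
    return *text == '\0';
}
```

In the rule action, the condition would then read `if (attr_name_is(yytext, "href"))`. Because the helper uses only `<ctype.h>`, it also avoids the MSVC-specific `stricmp`.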