Most of the changes have been to the parser, located at: /phc-0.1.7rc2/generated_src/php_parser.ypp. There was very little to add to the lexer, just a couple of lines shown below. The modified lexer is located at: /phc-0.1.7rc2/generated_src/php_scanner.lex.
The following alternatives were added to the definition of a statement:
    | '<' { $1 = NEW(AST_xml_element, ()); } xml_element_name statement
        {
            CAST_AST(new_element, $1, AST_xml_element);
            CAST_AST(xml_element_name, $3, AST_xml_element_name);
            CAST_STATEMENT_VECTOR(tag_stat, $4);
            new_element->xml_element_name = xml_element_name;
            new_element->statements = tag_stat;
            $$ = new_element;
        }
    | '<' '?' { $2 = NEW(AST_xml_processing_instruction, ()); } xml_element_name statement
        {
            CAST_AST(new_element, $2, AST_xml_processing_instruction);
            CAST_AST(xml_element_name, $4, AST_xml_element_name);
            CAST_STATEMENT_VECTOR(tag_stat, $5);
            new_element->xml_element_name = xml_element_name;
            new_element->statements = tag_stat;
            $$ = new_element;
        }
    | '&' { $1 = NEW(AST_xml_element_attribute, ()); } xml_element_name atribute_assignment expr ';'
        {
            CAST_AST(new_att, $1, AST_xml_element_attribute);
            CAST_AST(xml_element_name, $3, AST_xml_element_name);
            CAST_AST(expr, $5, AST_expr);
            new_att->xml_element_name = xml_element_name;
            new_att->expr = expr;
            $$ = new_att;
        }
    | '?' escaped_echo_expr_list ';'
        { $$ = $2; }
Since the added instructions use existing symbols, there was no need to modify the lexer definitions for them. The statements added were < for tags, <? for XML processing instructions, & for XML element attributes and ? for escaped print (or echo).
The CAST_AST macro declares an object named by its first argument, holding the data of the second argument (usually a parsed element), of the class given in the third argument.
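To make the description concrete, here is a minimal sketch of how a macro of this kind could be written. It is not the actual phc definition: the name CAST_AST_SKETCH, the AST_node base class and the stripped-down AST_xml_element_name below are hypothetical stand-ins, and the use of dynamic_cast is an assumption.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-ins for the AST classes; the real phc classes are richer.
struct AST_node {
    virtual ~AST_node() = default;
};

struct AST_xml_element_name : AST_node {
    std::string tag_name;
    explicit AST_xml_element_name(std::string n) : tag_name(std::move(n)) {}
};

// Assumed shape of the macro: declare `name` as a `type*` pointing at `value`,
// and check at run time that the node really is of that class.
#define CAST_AST_SKETCH(name, value, type)      \
    type* name = dynamic_cast<type*>(value);    \
    assert(name != nullptr && "unexpected node class")

// Example of use, mirroring what a grammar action does with $3.
inline std::string name_of(AST_node* parsed) {
    CAST_AST_SKETCH(element_name, parsed, AST_xml_element_name);
    return element_name->tag_name;
}
```

After the macro line, element_name can be used as an ordinary typed pointer, which is how the actions above use new_element, xml_element_name and expr.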
A < is expected to be followed by an xml_element_name (defined below) and a statement, which is already defined.
A <? also requires an xml_element_name and can have statements. The definition is much like the previous one, except that a different kind of node is created: an AST_xml_processing_instruction instead of an AST_xml_element.
A & statement requires an attribute name, which is also an xml_element_name, an expression and an ending ;. The ; has to be stated explicitly here because, in the previous two cases, statement already accounts for its own ending ;.
There is also an atribute_assignment, which may be either an equal sign or empty, as shown below. This merely allows an optional equal sign between the attribute name and its value; it is otherwise ignored:
    atribute_assignment:
          /* empty */
        | '='
        ;
Finally, the ? instruction is followed by an escaped_echo_expr_list, a definition I barely modified from the echo_expr_list that was already part of the original parser.
An xml_element_name is defined as one or two xml_element_name_fragments separated by a colon:
    xml_element_name:
          xml_element_name_fragment
        | xml_element_name_fragment ':' xml_element_name_fragment
Before analyzing the actions associated with each, let us see what an xml_element_name_fragment is:
    xml_element_name_fragment:
          VARIABLE
            {
                CAST_STR(name, $1, Token_tag_name);
                $$ = NEW(AST_xml_element_name, (false, NULL, true, name));
            }
        | IDENT      { MAKE_ELEMENT_NAME($$,$1) }
        | XML_IDENT  { MAKE_ELEMENT_NAME($$,$1) }
        | K_AND      { MAKE_ELEMENT_NAME($$,$1) }
        | K_OR       { MAKE_ELEMENT_NAME($$,$1) }
        | K_XOR      { MAKE_ELEMENT_NAME($$,$1) }
        | K___FILE__ { MAKE_ELEMENT_NAME($$,$1) }
        | K___LINE__ { MAKE_ELEMENT_NAME($$,$1) }
        ........
        ;
A tag name can be either a variable or an identifier, both already defined elsewhere. Both create an AST_xml_element_name tree node, the first with the is_var argument (the third argument to the constructor) set to true. The variable name or identifier is stored in the node as a Token_tag_name object, the fourth argument.
The first two arguments of the constructor are for the namespace: the first is a flag indicating whether it is a variable or an identifier, the second is the name of the variable or the identifier itself. For individual fragments, we make the namespace value NULL, indicating there is no namespace.
Notice that there are many alternatives for a name fragment. The first is for a variable, the rest are alternatives for literals. XML admits many more names than PHP does. First of all, it accepts more valid characters, especially the hyphen and the dot. Besides, XML does not care about PHP reserved words, so we have to accept PHP keywords such as AND, OR, while or any other. So we accept a plain PHP IDENT, we also accept an XML_IDENT, which is basically the same as an IDENT but contains dots or hyphens, and we also accept PHP keywords. I have not listed all the PHP keywords; the list is over 50 tokens.
I would have liked to change the state of the lexer right after detecting any of the symbols that precede an XML element name, so that the lexical analyzer would recognize only a VARIABLE or an XML_IDENT. Unfortunately, the lexer is already one step ahead of the parser: by the time the parser recognizes a '<' or a '&', the lexer has already read the next token, which I have no alternative but to accept. Thus, the only changes to the lexer are these two additions:
    XML_NAME [a-zA-Z_\x7F-\xFF][a-zA-Z0-9_\x7F-\xFF\.\-]*
    .....
    %%
    ....
    <PHP>{XML_NAME} { semantic_value = new String(yytext); RETURN(XML_IDENT); }
The first, in the definitions section, declares the pattern for an XML_NAME, which is the same as that for an IDENT with the dot and hyphen added (each escaped with a '\'). The second goes in the rules section, after the rule for an IDENT. It is important that it goes after that one; otherwise it would rob it of all the identifiers and keywords.
Notice that an IDENT is used, not a string literal or a generic expression: only variables and plain, unquoted, valid identifiers are accepted.
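The effect of the rule order can be checked with a small sketch. It relies on how flex resolves conflicts (the longest match wins and, on a tie, the rule listed first wins) and approximates that by classifying a whole token at a time. The classify function and the Token enum are hypothetical names, the patterns are ASCII-only simplifications that leave out the \x7F-\xFF range, and keywords, which have their own rules in the real scanner, are not modelled.

```cpp
#include <cassert>
#include <regex>
#include <string>

enum class Token { IDENT, XML_IDENT, NONE };

// Classify one complete token, trying the patterns in the same order as the
// rules appear in the scanner: IDENT first, XML_NAME second.
inline Token classify(const std::string& text) {
    assert(!text.empty());  // a scanner never hands over an empty token
    static const std::regex ident("[a-zA-Z_][a-zA-Z0-9_]*");
    static const std::regex xml_name("[a-zA-Z_][a-zA-Z0-9_.-]*");
    // Both patterns match a plain identifier with the same length, so the
    // earlier rule (IDENT) takes it.
    if (std::regex_match(text, ident)) return Token::IDENT;
    // Only XML_NAME can cover a name containing a dot or a hyphen.
    if (std::regex_match(text, xml_name)) return Token::XML_IDENT;
    return Token::NONE;
}
```

With this order, classify("title") yields IDENT and classify("xsl-template") yields XML_IDENT. Swapping the two checks would make every plain identifier come back as XML_IDENT, which is the "robbing" described above.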
So, now that we know the actions for an xml_element_name_fragment, we can see the actions associated with an xml_element_name:
    xml_element_name:
          xml_element_name_fragment
            { $$ = $1; }
        | xml_element_name_fragment ':' xml_element_name_fragment
            {
                CAST_AST(xml_namespace, $1, AST_xml_element_name);
                CAST_AST(xml_name, $3, AST_xml_element_name);
                $$ = NEW(AST_xml_element_name,
                    ( xml_namespace->is_var
                    , xml_namespace->tag_name
                    , xml_name->is_var
                    , xml_name->tag_name
                    )
                );
            }
        ;
An xml_element_name may be a single xml_element_name_fragment. If so, the fragment is the whole name. It can also be a pair of fragments separated by a colon. In that case I create a new AST_xml_element_name object and use the values of each fragment, the first as the namespace, the second as the element name.
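The way two fragments are folded into one qualified name can be sketched outside the parser. ElementName below is a hypothetical, simplified stand-in for AST_xml_element_name: it keeps the four-argument constructor order used in the actions, but stores plain strings instead of Token_tag_name objects, and the has_ns flag and the qualify and qualified helpers are assumptions made for this illustration.

```cpp
#include <cassert>
#include <string>

// Simplified stand-in for AST_xml_element_name.
struct ElementName {
    bool ns_is_var;        // the namespace part is a variable
    bool has_ns;           // false when the namespace argument was null
    std::string ns_name;   // namespace text, empty when there is none
    bool is_var;           // the local name is a variable
    std::string tag_name;  // local name text

    // Same argument order as in the grammar actions; a null ns means
    // "no namespace", as in (false, NULL, true, name).
    ElementName(bool ns_is_var_, const std::string* ns,
                bool is_var_, const std::string& name)
        : ns_is_var(ns_is_var_), has_ns(ns != nullptr),
          ns_name(ns ? *ns : std::string()),
          is_var(is_var_), tag_name(name) {}
};

// Combine two single fragments: the first supplies the namespace half,
// the second supplies the element name half.
inline ElementName qualify(const ElementName& ns, const ElementName& name) {
    assert(!ns.has_ns && !name.has_ns);  // fragments carry no namespace yet
    return ElementName(ns.is_var, &ns.tag_name, name.is_var, name.tag_name);
}

// Render the name as written in the source, for inspection.
inline std::string qualified(const ElementName& n) {
    return n.has_ns ? n.ns_name + ":" + n.tag_name : n.tag_name;
}
```

Note that each half keeps its own is_var flag, so a fixed namespace can be combined with a variable element name, or the other way round.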
Finally, an escaped_echo_expr_list:
    escaped_echo_expr_list:
          echo_expr_list ',' expr
            {
                CAST_AST(echo_list, $1, AST_statement_list);
                CAST_AST(param, $3, AST_expr);
                AST_method_invocation* fn = NEW(AST_method_invocation, ("?", param));
                echo_list->push_back(NEW(AST_eval_expr, (fn)));
                $$ = echo_list;
            }
        | expr
            {
                CAST_AST(param, $1, AST_expr);
                AST_statement_list* echo_list = new AST_statement_list;
                AST_method_invocation* fn = NEW(AST_method_invocation, ("?", param));
                echo_list->push_back(NEW(AST_eval_expr, (fn)));
                $$ = echo_list;
            }
        ;
As with the original echo_expr_list, it converts an echo (which is a statement in PHP and thus allows echo 'hello', 'world!'; without parentheses around the arguments) into a series of plain function calls: echo('hello'); echo('world!');. In this case it converts them to calls to a non-existent ? function, which I can process later on.
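The idea of the rewrite, one call per argument, can be illustrated independently of the parser. This is only a sketch: Call and desugar are hypothetical names, and expressions are kept as source text rather than as AST_expr nodes.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One plain function call with a single argument.
struct Call {
    std::string function;  // "echo", or "?" for the escaped variant
    std::string argument;  // expression kept as source text in this sketch
};

// Turn a comma-separated argument list into one call per argument,
// preserving the original order.
inline std::vector<Call> desugar(const std::string& function,
                                 const std::vector<std::string>& exprs) {
    assert(!exprs.empty());  // the grammar requires at least one expression
    std::vector<Call> calls;
    for (const std::string& e : exprs) {
        calls.push_back(Call{function, e});
    }
    return calls;
}
```

For example, desugar("?", {"'hello'", "'world!'"}) produces two calls to "?", one for each string, in the same order as they were written.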