Parsing

Most of the changes were made to the parser, located at /phc-0.1.7rc2/generated_src/php_parser.ypp. Very little had to be added to the lexer, located at /phc-0.1.7rc2/generated_src/php_scanner.lex: just a couple of lines, shown further down.

The following definitions were added to the definition of a statement:

	| '<' {$1 = NEW(AST_xml_element,());} xml_element_name statement
		{
			CAST_AST(new_element,$1,AST_xml_element);
			CAST_AST(xml_element_name, $3, AST_xml_element_name);
			CAST_STATEMENT_VECTOR(tag_stat, $4);
			
			new_element->xml_element_name = xml_element_name;
			new_element->statements = tag_stat;
			
			$$ = new_element;
		}
	| '<' '?'  {$2 = NEW(AST_xml_processing_instruction,());} xml_element_name statement
		{
			CAST_AST(new_element,$2,AST_xml_processing_instruction);
			CAST_AST(xml_element_name, $4, AST_xml_element_name);
			CAST_STATEMENT_VECTOR(tag_stat, $5);
			
			new_element->xml_element_name = xml_element_name;
			new_element->statements = tag_stat;
			
			$$ = new_element;
		}
	| '&' {$1 = NEW(AST_xml_element_attribute,());} xml_element_name atribute_assignment expr ';'
		{
			CAST_AST(new_att,$1,AST_xml_element_attribute);
			CAST_AST(xml_element_name, $3, AST_xml_element_name);
			CAST_AST(expr, $5, AST_expr);
			
			new_att->xml_element_name = xml_element_name;
			new_att->expr = expr;
			
			$$ = new_att;
		}
	| '?' escaped_echo_expr_list ';'
		{ $$ = $2; }

Since the added instructions use existing symbols, there was no need to modify the lexer definitions for these. The statements added were < for tags, <? for XML processing instructions, & for XML element attributes and ? for escaped print (or echo).

The CAST_AST macro declares a variable named by its first argument, of the class given in its third argument, initialized from its second argument (usually a parsed element).
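To make the effect of the macro concrete, here is a minimal sketch of what a CAST_AST-style macro could expand to. The macro body, the Node base class, the cut-down node classes and the tag_of helper are all assumptions made for illustration; the real phc definitions may well differ.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for phc's node classes (not the real definitions).
struct Node { virtual ~Node() = default; };
struct AST_xml_element_name : Node { std::string tag_name; };
struct AST_expr : Node {};

// Assumed shape of the macro: declare a typed pointer called `name`,
// initialised from the untyped semantic value `value`.
#define CAST_AST_SKETCH(name, value, type) \
    type* name = dynamic_cast<type*>(value)

// Mimics a grammar action; `raw` plays the role of $3.
std::string tag_of(Node* raw) {
    CAST_AST_SKETCH(xml_element_name, raw, AST_xml_element_name);
    assert(xml_element_name != nullptr);  // a node of another class would give nullptr
    return xml_element_name->tag_name;
}
```

After the macro line, xml_element_name can be used like any other local variable, which is how the actions above use new_element, xml_element_name and expr.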

A < is expected to be followed by an xml_element_name (defined below) and a statement, which is already defined.

A <? also requires an xml_element_name and can have statements. The definition is much like the previous one, except that a different kind of node is created: an AST_xml_processing_instruction instead of an AST_xml_element.

A & statement requires an attribute name (which is also an xml_element_name), an expression and an ending ; . The ending ; has to be stated explicitly here because, in the previous two definitions, statement already includes it. There is also an atribute_assignment, which may be either an equal sign or empty, as shown below. It only allows for an optional equal sign between the attribute name and its value, and is otherwise ignored:

atribute_assignment:
    /* empty */
    | '='
    ;

Finally, the ? instruction is followed by an escaped_echo_expr_list, a definition I modified only slightly from the echo_expr_list that was already part of the original parser.

An xml_element_name is defined as one or two xml_element_name_fragments separated by a colon:

xml_element_name:
	  xml_element_name_fragment
	| xml_element_name_fragment ':' xml_element_name_fragment

Before analyzing the actions associated with each alternative, let us see what an xml_element_name_fragment is:

xml_element_name_fragment:
	 VARIABLE
	 	{
			CAST_STR(name,$1,Token_tag_name);
			$$ = NEW(AST_xml_element_name, (false,NULL,true,name));
		}
	| IDENT 	{ MAKE_ELEMENT_NAME($$,$1) }
	| XML_IDENT 	{ MAKE_ELEMENT_NAME($$,$1) }
	| K_AND		{ MAKE_ELEMENT_NAME($$,$1) }
	| K_OR		{ MAKE_ELEMENT_NAME($$,$1) }
	| K_XOR		{ MAKE_ELEMENT_NAME($$,$1) }
	| K___FILE__	{ MAKE_ELEMENT_NAME($$,$1) }
	| K___LINE__	{ MAKE_ELEMENT_NAME($$,$1) }
	........
	
	;

A tag name can be either a variable or an identifier, both already defined elsewhere. Both create an AST_xml_element_name tree node; for a variable, the is_var argument (the third argument to the constructor) is set to true. The variable name or identifier is stored in the node as a Token_tag_name object, the fourth argument.

The first two arguments of the constructor describe the namespace: the first is a flag indicating whether it is a variable or an identifier, and the second is the name of the variable or the identifier itself. For individual fragments the namespace value is NULL, indicating that there is no namespace.

Notice that there are many alternatives for a name fragment. The first is for a variable; the rest are alternatives for literals. XML admits many more names than PHP does. First of all, it accepts more characters, especially the hyphen and the dot. Besides, XML does not care about PHP reserved words, so we have to accept PHP keywords such as AND, OR, while and the like. So we accept a plain PHP IDENT, but we also accept an XML_IDENT, which is basically the same as an IDENT but may contain dots or hyphens, and we also accept PHP keywords. I have not listed all the PHP keywords here; the list is over 50 tokens.

I would have liked to switch the state of the lexer right after detecting any of the symbols that precede an XML element name, so that the lexical analyzer would recognize only a VARIABLE or an XML_IDENT at that point. Unfortunately, the lexer is already one step ahead of the parser: by the time the parser recognizes a '<' or a '&', the lexer has already read the next token, which I have no alternative but to accept. Thus, the only changes to the lexer are these two additions:

XML_NAME	 	[a-zA-Z_\x7F-\xFF][a-zA-Z0-9_\x7F-\xFF\.\-]*
.....
%%
....
<PHP>{XML_NAME}		{ semantic_value = new String(yytext);	RETURN(XML_IDENT);}

The first, in the definitions section, declares the pattern for an XML_NAME, which is the same as the one for an IDENT with the dot and hyphen added (each escaped with a '\'). The second goes in the rules section, after the rule for an IDENT. It is important that it goes after that rule: when two rules match the same text, the lexer picks the one listed first, so placed earlier it would rob the IDENT rule of all the identifiers and keywords.
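The ordering constraint comes from the way flex-style lexers choose between rules: the longest match wins, and a tie goes to the rule listed first. The sketch below imitates that selection for the two patterns. It is a simplification under stated assumptions: it is ASCII-only (the \x7F-\xFF range is left out), it ignores the separate rules the real lexer has for keywords, and the function names match_len and pick_token are made up for this example.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Length of the longest prefix of `s` that matches a name pattern.
// `extra` lists the additional characters allowed after the first one.
static std::size_t match_len(const std::string& s, const std::string& extra) {
    if (s.empty()) return 0;
    unsigned char first = static_cast<unsigned char>(s[0]);
    if (!(std::isalpha(first) || first == '_')) return 0;
    std::size_t n = 1;
    while (n < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[n]);
        bool ok = std::isalnum(c) || c == '_' ||
                  extra.find(s[n]) != std::string::npos;
        if (!ok) break;
        ++n;
    }
    return n;
}

// Picks a token name the flex way: longest match wins, ties go to the
// rule that appears first in the list.
std::string pick_token(const std::string& input, bool xml_rule_first) {
    assert(!input.empty());
    std::vector<std::pair<std::string, std::string>> rules = {
        {"IDENT", ""}, {"XML_IDENT", ".-"}};
    if (xml_rule_first) std::swap(rules[0], rules[1]);
    std::string best = "NONE";
    std::size_t best_len = 0;
    for (const auto& rule : rules) {
        std::size_t len = match_len(input, rule.second);
        // Strictly greater: an earlier rule keeps the token on a tie.
        if (len > best_len) { best_len = len; best = rule.first; }
    }
    return best;
}
```

With the IDENT rule first, a word such as while stays with the existing rule and only names like my-tag, where the XML pattern matches more characters, come back as XML_IDENT. With the order reversed, every plain word would be returned as XML_IDENT.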

Notice that a name must be a variable or a plain, unquoted, valid identifier; a string literal or a generic expression is not accepted.

So, now that we know the actions for an xml_element_name_fragment, we can see the actions associated with an xml_element_name:

xml_element_name:
	  xml_element_name_fragment
	  	{
			$$ = $1;
		}
	| xml_element_name_fragment ':' xml_element_name_fragment
		{
			CAST_AST(xml_namespace, $1, AST_xml_element_name);
			CAST_AST(xml_name, $3, AST_xml_element_name);

			$$ =  NEW(AST_xml_element_name , (
				  xml_namespace->is_var
				, xml_namespace->tag_name
				, xml_name->is_var
				, xml_name->tag_name
				)
			);
		}
	;

An xml_element_name may be a single xml_element_name_fragment. If so, the fragment is the whole name. It can also be a pair of fragments separated by a colon. In that case I create a new AST_xml_element_name object and use the values of each fragment, the first as the namespace and the second as the element name.
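The way the two fragments are folded into one node can be sketched outside the parser as follows. NameSketch is a hypothetical stand-in for AST_xml_element_name, with a plain string where the real node holds a Token_tag_name; the helper names fragment and qualify are also invented for this example.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical mirror of the four constructor arguments.
struct NameSketch {
    bool ns_is_var;                        // 1st: is the namespace a variable?
    std::shared_ptr<std::string> ns_name;  // 2nd: namespace text, null = none
    bool is_var;                           // 3rd: is the name a variable?
    std::string tag_name;                  // 4th: variable name or identifier
};

// A lone fragment: no namespace, as in (false, NULL, is_var, name).
NameSketch fragment(bool is_var, const std::string& text) {
    return NameSketch{false, nullptr, is_var, text};
}

// Two fragments around a colon: the first supplies the namespace half,
// the second the element-name half.
NameSketch qualify(const NameSketch& ns, const NameSketch& name) {
    assert(!ns.ns_name && !name.ns_name);  // fragments carry no namespace
    return NameSketch{ns.is_var,
                      std::make_shared<std::string>(ns.tag_name),
                      name.is_var,
                      name.tag_name};
}
```

For instance, qualify(fragment(false, "xsl"), fragment(true, "tag")) describes a name whose namespace is the literal xsl and whose element name comes from a variable called tag.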


Finally, an escaped_echo_expr_list is defined as follows:

escaped_echo_expr_list:
	  escaped_echo_expr_list ',' expr	/* recurse on this rule so that every item is wrapped in a "?" call */
		{
			CAST_AST(echo_list, $1, AST_statement_list);
			CAST_AST(param, $3, AST_expr);
			
			AST_method_invocation* fn = NEW(AST_method_invocation, ("?", param));
			echo_list->push_back(NEW(AST_eval_expr, (fn)));
			
			$$ = echo_list;
		}
	| expr
		{
			CAST_AST(param, $1, AST_expr);
			AST_statement_list* echo_list = new AST_statement_list;
			
			AST_method_invocation* fn = NEW(AST_method_invocation, ("?", param));
			echo_list->push_back(NEW(AST_eval_expr, (fn)));
			
			$$ = echo_list;
		}
	;

As with the original echo_expr_list, it converts an echo (which is a statement in PHP and thus allows echo 'hello', 'world!'; without parentheses around the arguments) into a series of plain function calls: echo('hello'); echo('world!');. In this case it converts them to calls to a nonexistent ? function, which I can process later on.
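The conversion itself is simple enough to sketch on its own. In the snippet below, CallSketch stands in for an AST_method_invocation wrapped in an AST_eval_expr, expressions are plain strings, and the desugar function is a name made up for this example; none of this is phc code.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for one generated call statement.
struct CallSketch {
    std::string function;
    std::string argument;
};

// Turns a comma-separated expression list into one call per expression,
// in source order, the way the grammar actions append to the list.
std::vector<CallSketch> desugar(const std::string& function,
                                const std::vector<std::string>& exprs) {
    assert(!exprs.empty());  // the grammar requires at least one expression
    std::vector<CallSketch> calls;
    for (const auto& e : exprs) {
        calls.push_back(CallSketch{function, e});
    }
    return calls;
}
```

Calling desugar with "?" and the two expressions 'hello' and 'world!' gives two calls to ?, one per expression, which is the shape a later pass can pick up and rewrite.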
