I'm trying to write an HTML parser with Boost Spirit X3, and I wrote the parsers below.
The problem is that this code doesn't compile. The error is:
fatal error C1202: recursive type or function dependency context too complex
I know this error occurs because my parser html_element_ references tag_block_, and tag_block_ references html_element_, but I don't know how to make it work.
#include <boost/spirit/home/x3.hpp>
#include <boost/fusion/include/adapt_struct.hpp>
#include <boost/spirit/home/x3/support/ast/position_tagged.hpp>
#include <boost/spirit/home/x3/support/ast/variant.hpp>
#include <iostream>
using namespace boost::spirit::x3;
struct tag_name{};
struct html_tag;
struct html_comment;
struct attribute_data : boost::spirit::x3::position_tagged {
std::string name;
boost::optional<std::string> value;
};
struct tag_header : boost::spirit::x3::position_tagged {
std::string name;
std::vector<attribute_data> attributes;
};
struct self_tag: boost::spirit::x3::position_tagged {
tag_header header;
};
struct html_element : boost::spirit::x3::position_tagged, boost::spirit::x3::variant< std::string, self_tag, boost::recursive_wrapper<html_tag>>{
using base_type::base_type;
using base_type::operator=;
};
struct html_tag: boost::spirit::x3::position_tagged {
tag_header header;
std::vector<html_element> children;
};
BOOST_FUSION_ADAPT_STRUCT(attribute_data, name, value);
BOOST_FUSION_ADAPT_STRUCT(tag_header, name, attributes);
BOOST_FUSION_ADAPT_STRUCT(self_tag, header);
BOOST_FUSION_ADAPT_STRUCT(html_tag,header,children);
// These are the attributes parser, seems fine
struct attribute_parser_id;
auto attribute_identifier_= rule<attribute_parser_id, std::string>{"AttributeIdentifier"} = lexeme[+(char_ - char_(" /=>"))];
auto attribute_value_= rule<attribute_parser_id, std::string>{"AttributeValue"} =
lexeme["\"" > +(char_ - char_("\"")) > "\""]|lexeme["'" > +(char_ - char_("'")) > "'"]|
lexeme[+(char_ - char_(" />"))];
auto single_attribute_ = rule<attribute_parser_id, attribute_data>{"SingleAttribute"} = attribute_identifier_ > -("="> attribute_value_);
auto attributes_ = rule<attribute_parser_id, std::vector<attribute_data>>{"Attributes"} = (*single_attribute_);
struct tag_parser_id;
auto tag_name_begin_func = [](auto &ctx){
get<tag_name>(ctx) = _attr(ctx).name;
//_val(ctx).header.name = _attr(ctx);
std::cout << typeid(_val(ctx)).name() << std::endl;
};
auto tag_name_end_func = [](auto &ctx){
_pass(ctx) = get<tag_name>(ctx) == _attr(ctx);
};
auto self_tag_name_action = [](auto &ctx){
_val(ctx).header.name = _attr(ctx);
};
auto self_tag_attribute_action = [](auto &ctx){
_val(ctx).header.attributes = _attr(ctx);
};
auto inner_text = lexeme[+(char_-'<')];
auto tag_name_ = rule<tag_parser_id, std::string>{"HtmlTagName"} = lexeme[*(char_ - char_(" />"))];
auto self_tag_ = rule<tag_parser_id, self_tag>{"HtmlSelfTag"} = '<' > tag_name_[self_tag_name_action] > attributes_[self_tag_attribute_action] > "/>";
auto tag_header_ = rule<tag_parser_id, tag_header>{"HtmlTagBlockHeader"} = '<' > tag_name_ > attributes_ > '>';
rule<tag_parser_id, html_tag> tag_block_;
rule<tag_parser_id, html_element> html_element_ = "HtmlElement";
auto tag_block__def = with<tag_name>(std::string())[tag_header_[tag_name_begin_func] > (*html_element_) > "</" > omit[tag_name_[tag_name_end_func]] > '>'];
auto html_element__def = inner_text | self_tag_ | tag_block_ ;
BOOST_SPIRIT_DEFINE(tag_block_, html_element_);
int main()
{
std::string source = "<div data-src=\"https://www.google.com\" id='hello world'></div>";
html_element result;
auto const parser = html_element_;
auto parse_result = phrase_parse(source.begin(), source.end(), parser, ascii::space, result);
}
I tried reading the Boost Spirit Qi example in the official documentation, as well as the official X3 documentation. The Qi example only parses tags, not attributes. The example in the X3 documentation is different, and I think my case is harder.
On reading, the first thing I notice is that self_tag_ uses expectation points. That won't fly because it is ordered before other things that can legally start with <, like tag_block_. And due to the expectation points it will never backtrack to reach that.
Many places use operator+ where operator* is required. All those charset differences can be phrased as inverse sets, e.g. ~char_(" />") instead of char_ - char_(" />").
One smell is the re-use of parser rule tags. This should, as far as my understanding goes, be fine for immediately-defined rules, but certainly not for those that are defined through their tag type, with BOOST_SPIRIT_DEFINE.
Cleanup Exercism
First, a cleanup. This gets past the hurdle of template instantiation depth by commenting out *html_element_ inside tag_block__def. But first let's see what works then:
Live On Coliru
Outputs
What Is The Trouble
As you can deduce from my hunch to comment-out the recursion *html_element_, this is causing problems.
The real reason is that with<> extends the context. This means that each level of recursion adds more data to the context type, causing new template instantiations.
The simplest trick is to move with<> up outside the recursion.
However this highlights the problem that elements can nest, and it's useless when inner tags overwrite the context data for tag_name. So, instead of string we could make it stack<string>, and then amend the actions to match.
See it Live On Coliru
Printing
Closing Thoughts
I'm answering this assuming you are just doing this to learn X3. Otherwise the only recommendation is: do not do this. Use a library.
Not only does your grammar do a pretty poor job of parsing XML, it will utterly fail on HTML in the wild. Closing tags are not a given in HTML ("quirks mode"). Scripts, CDATA, entity references, Unicode, escapes will all f*ck your parser up.
Oh, have you noticed how you mostly broke attribute propagation by introducing some semantic actions? I could show you how to fix it, but I think I'd rather leave it for the moment.
Just use a library.