I want to parse all links (represented by table rows) in the following HTML code using C++:
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=UTF-8"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0"/>
<link rel="stylesheet" href="/_autoindex/assets/css/autoindex.css"/>
<script src="/_autoindex/assets/js/tablesort.js"></script>
<script src="/_autoindex/assets/js/tablesort.number.js"></script>
<title>Index of /mydirectory/subdirectory/</title>
</head>
<body>
<div class="content">
<h1>Index of /mydirectory/subdirectory/</h1>
<div id="table-list">
<table id="table-content">
<thead class="t-header">
<tr>
<th class="colname" aria-sort="ascending">
<a class="name" href="?ND" onclick="return false"">Name</a></th><th class=" colname " data-sort-method=" number "><a href=" ?MA " onclick=" return false"">Last Modified</a>
</th>
<th class="colname" data-sort-method="number"><a href="?SA"onclick="return false"">Size</a></th></tr></thead>
<tr data-sort-method="none "><td><a href="/mydirectory/"><img class="icon " src="/_autoindex/assets/icons/corner-left-up.svg " alt="Up ">Parent Directory</a></td><td></td><td></td></tr>
<tr><td data-sort="first.json "><a href="/mydirectory/subdirectory/first.json "><img class="icon " src="/_autoindex/assets/icons/file.svg " alt="File ">first.json</a></td><td data-sort="1704288747 ">2024-01-03 13:32</td><td data-sort="4096 "> 4k</td></tr>
<tr><td data-sort="second.json "><a href="/mydirectory/subdirectory/second.json "><img class="icon " src="/_autoindex/assets/icons/file.svg " alt="File ">second.json</a></td><td data-sort="1704290309 ">2024-01-03 13:58</td><td data-sort="4096 "> 4k</td></tr>
<tr><td data-sort="third.json "><a href="/mydirectory/subdirectory/third.json "><img class="icon " src="/_autoindex/assets/icons/file.svg " alt="File ">third.json</a></td><td data-sort="1704290300 ">2024-01-03 13:58</td><td data-sort="4096 "> 4k</td></tr>
</table></div>
<address>Proudly Served by LiteSpeed Web Server at example.com Port 443</address></div><script>new Tablesort(document.getElementById("table-content "));</script></body></html>
This is a directory listing from an Apache web server. My expected result is a std::vector<std::string> that contains the (relative) URLs of all three JSON files from the table.
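To make the expected result concrete, here is a small sketch of the vector I would like to get back for the listing above. It assumes the trailing whitespace inside the href attributes is trimmed and the "Parent Directory" row is skipped. The function name expected_links is only a placeholder for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: the result expected for the sample listing,
// assuming href values are trimmed and the parent-directory link is ignored.
std::vector<std::string> expected_links()
{
    return {
        "/mydirectory/subdirectory/first.json",
        "/mydirectory/subdirectory/second.json",
        "/mydirectory/subdirectory/third.json",
    };
}
```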
For the implementation I tried to use Apache xerces-c, but this library does not seem to have full XPath support. Furthermore, xalan-c, which promises full XPath support, is not available in my package manager (vcpkg).
How can I still implement this parsing with xerces-c, similar to how Java's jsoup operates?
std::vector<std::string> parse_all_links(const std::string &website_content)
{
    std::vector<std::string> collected_links;
    try
    {
        XMLPlatformUtils::Initialize();
    }
    catch (const XMLException& exception)
    {
        auto error_message = XMLString::transcode(exception.getMessage());
        logger->error("Failed to initialize XML platform utils: " + std::string(error_message));
        XMLString::release(&error_message);
        return collected_links;
    }
    {
        XercesDOMParser parser;
        parser.setValidationScheme(XercesDOMParser::Val_Never);
        const MemBufInputSource input_source(reinterpret_cast<const XMLByte*>(website_content.data()),
                                             website_content.size(), "dummy");
        parser.parse(input_source);
        // ...
    }
    XMLPlatformUtils::Terminate();
    return collected_links;
}
Any other HTML parsing library is also fine, preferably one with a vcpkg port for ease of use.

Despite being old and unmaintained, Google's gumbo-parser still does the job: