I am trying to use JREPL.bat to match URLs containing a specific term in a txt file (and then write the result back to the file).
This is what I have so far. Unfortunately, it does not return the expected result; the result is always NULL:
JREPL.bat "href=""(\w[^""]+/pdf4v/\w[^""]+)" "" /match /f html.txt /o -
The html.txt looks as follows (in reality the file is much more complex; additional content represented by [...]):
[...]
<ul>
<li><a href="#" id="fav" onclick="return favoritesadd(8094,'fav.png','removefav.png');"><img id="fav8094" src="fav.png" alt="" border="0" /> <span id="fav8094">ADD TO WISHLIST</span></a></li>
<li class="sixcol right"><a href="https://documents.domain.com/content/updates/year18/jv/folder01/pdf/pdf8094.zip?exp=1567791065&hsh=5a49e7d4828603beddbfb058a1535f5e&dl=att&filename=pdf-00008094-16.pdf" class="tcenter"><img src="pdf.png" class="icon" align="left" />16<br /><span class="small">download pdf</span></a></li>
<li class="sixcol"><a href="https://documents.domain.com/content/updates/year18/jv/folder01/pdf4v/pdf4v8094.zip?exp=1567791065&hsh=246a7702296f7db363ecaa1746a8815a&dl=att&filename=pdf-00008094-40.pdf" class="tcenter"><img src="pdf.png" class="icon" align="left" />40<br /><span class="small">download pdf</span></a></li>
<div class="clear"></div>
<li><a href="/details.php?id=8094&num=1&ss=1" onclick="$.open();return false;"><img src="/images/details.png" class="center" />Details</a></li>
</ul>
[...]
The expected outcome is:
https://documents.domain.com/content/updates/year18/jv/folder01/pdf4v/pdf4v8094.zip?exp=1567791065&hsh=246a7702296f7db363ecaa1746a8815a&dl=att&filename=pdf-00008094-40.pdf
Can anyone help? I am not sure why this isn't working.
Thanks in advance for your help!
The following single command line could be used in the batch file under the following preconditions:
- jrepl.bat must be in the directory containing the batch file with this line.
- html.txt must be in the current directory on execution of this batch file.
- html.txt must not contain multiple URLs with /pdf4v/ in one line.
- html.txt must not contain /pdf4v/ outside a URL.
The batch file command line:
FINDSTR supports regular expressions only in a very limited form and always outputs the entire line containing a matched string. So the case-sensitive regular expression search string
href=.*/pdf4v/
finds all lines containing href= followed somewhere later by /pdf4v/.
Those lines are output by FINDSTR to handle STDOUT, which is redirected by the Windows command processor to handle STDIN of JREPL.BAT.
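To make the FINDSTR step easier to follow, here is a minimal Python sketch of the same line filter. It is only an illustration: Python's re module stands in for FINDSTR's regular expression support, the function name filter_lines is made up, and the sample lines are shortened, hypothetical versions of the lines in html.txt.

```python
import re

# Stand-in for the FINDSTR search string: case-sensitive, applied per line.
LINE_FILTER = re.compile(r"href=.*/pdf4v/")


def filter_lines(text):
    """Return every whole line containing href= followed later by /pdf4v/."""
    return [line for line in text.splitlines() if LINE_FILTER.search(line)]


# Shortened, hypothetical stand-ins for the lines in html.txt.
sample = "\n".join([
    '<li><a href="#" id="fav">ADD TO WISHLIST</a></li>',
    '<li><a href="https://documents.domain.com/folder01/pdf/pdf8094.zip?dl=att">16</a></li>',
    '<li><a href="https://documents.domain.com/folder01/pdf4v/pdf4v8094.zip?dl=att">40</a></li>',
])

kept = filter_lines(sample)
for line in kept:
    print(line)
```

Like FINDSTR, the filter returns complete lines, so a second step is still needed to cut the URL out of each line that survives.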
JREPL.BAT runs a much more powerful JScript regular expression replace. It matches the entire line, which at this point definitely contains href= and /pdf4v/, marks the URL containing /pdf4v/ with a capturing group, and replaces the line with just the marked URL.
The search expression
^.*href="([^"]*?/pdf4v/[^"]*)".*$
is written in the batch file with \x22 for each " because cmd.exe interprets a double quote as the beginning or end of an argument string.
There is an even better solution using the JREPL.BAT option /MATCH:
The search expression
[^"]*?/pdf4v/[^"]*
simply matches every string consisting of 0 or more characters that are neither a double quote nor a newline (matched non-greedily), followed by /pdf4v/ and 0 or more characters that are neither a double quote nor a newline. That is very simple and can produce false positives, but it works for the provided example.
Unfortunately, the JScript regular expression engine does not support look-behind or other enhanced features of modern regular expression engines that could limit the search to href values. But some false positives can be avoided using:
Lines not containing
href="[^"]*?/pdf4v/
are filtered out by this include filter before the simple search expression is applied. That is still not perfect, but perhaps good enough for this task.
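The three expressions discussed above can be compared side by side with a small Python sketch. This is an illustration under assumptions, not the JREPL.BAT implementation: Python's re module stands in for the JScript engine (the patterns used here behave the same way in both), the helper names by_capture and by_match are made up, and bad_line is a hypothetical line added only to provoke a false positive.

```python
import re

# Full-line expression: the capturing group marks the href value.
CAPTURE = re.compile(r'^.*href="([^"]*?/pdf4v/[^"]*)".*$')
# Simple expression as used with a match-only search.
SIMPLE = re.compile(r'[^"]*?/pdf4v/[^"]*')
# Include filter: only lines whose href value contains /pdf4v/ pass.
INCLUDE = re.compile(r'href="[^"]*?/pdf4v/')

URL = ("https://documents.domain.com/content/updates/year18/jv/folder01/"
       "pdf4v/pdf4v8094.zip?exp=1567791065"
       "&hsh=246a7702296f7db363ecaa1746a8815a&dl=att"
       "&filename=pdf-00008094-40.pdf")
good_line = '<li class="sixcol"><a href="' + URL + '" class="tcenter">40</a></li>'
# Hypothetical line (not in the original file): /pdf4v/ outside an href value.
bad_line = '<img src="/images/pdf4v/icon.png" class="icon" />'
lines = [good_line, bad_line]


def by_capture(lines):
    """Replace each fully matching line by its captured href value."""
    return [m.group(1) for m in map(CAPTURE.match, lines) if m]


def by_match(lines):
    """Collect every string matched by the simple expression, line by line."""
    found = []
    for line in lines:
        found.extend(SIMPLE.findall(line))
    return found


capture_result = by_capture(lines)
match_result = by_match(lines)
filtered_result = by_match([l for l in lines if INCLUDE.search(l)])

print(capture_result)
print(match_result)
print(filtered_result)
```

With these two lines, the simple expression alone also returns the src value of the hypothetical image, while the capturing-group variant and the variant with the include filter return only the download URL.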