Removing Hashtag using Java WebFilter


I have the following configuration in the urlrewrite.xml:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE urlrewrite PUBLIC "-//tuckey.org//DTD UrlRewrite 4.0//EN" "http://www.tuckey.org/res/dtds/urlrewrite4.0.dtd">
<urlrewrite use-query-string="true">
    <rule>
        <from>^(/event/showEventList)(\.{1})(\bhtm\b|\bhtml\b)(\?{0,1})([a-zA-Z0-9-_=&amp;]{0,}+)(#{0,1})([a-zA-Z0-9-_=&amp;]{0,}+)$</from>
        <to type="redirect" last="true">/events$4$5</to>
    </rule>                 
</urlrewrite>

The regex ^(/event/showEventList)(\.{1})(\bhtm\b|\bhtml\b)(\?{0,1})([a-zA-Z0-9-_=&amp;]{0,}+)(#{0,1})([a-zA-Z0-9-_=&amp;]{0,}+)$ has 7 groups, which are:

  1. (/event/showEventList): matches /event/showEventList
  2. (\.{1}): matches a single dot (.)
  3. (\bhtm\b|\bhtml\b): matches only htm or html
  4. (\?{0,1}): matches a question mark (?), which may occur zero or one time
  5. ([a-zA-Z0-9-_=&amp;]{0,}+): matches the query string, zero or more characters
  6. (#{0,1}): matches a hash sign (#), which may occur zero or one time
  7. ([a-zA-Z0-9-_=&amp;]{0,}+): matches the fragment, zero or more characters
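To check the group breakdown above, here is a minimal sketch that prints what each of the 7 groups captures for the test URL. The class name GroupDump and the helper group(...) are made up for illustration; the pattern is the one from the rule, written as a Java string literal (so & is not XML-escaped).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDump {
    // Same pattern as in the <from> element, as a Java string literal.
    static final Pattern RULE = Pattern.compile(
        "^(/event/showEventList)(\\.{1})(\\bhtm\\b|\\bhtml\\b)(\\?{0,1})"
        + "([a-zA-Z0-9-_=&]{0,}+)(#{0,1})([a-zA-Z0-9-_=&]{0,}+)$");

    // Hypothetical helper: returns the text captured by the given group,
    // or null if the whole input does not match the pattern.
    static String group(String input, int index) {
        Matcher m = RULE.matcher(input);
        return m.matches() ? m.group(index) : null;
    }

    public static void main(String[] args) {
        String input = "/event/showEventList.html?pageNumber=1#key=val";
        Matcher m = RULE.matcher(input);
        if (m.matches()) {
            // Print every capture group with its number.
            for (int i = 1; i <= m.groupCount(); i++) {
                System.out.println(i + ": " + m.group(i));
            }
        }
    }
}
```

For the test URL, group 5 should hold pageNumber=1 and group 7 should hold key=val, so the pattern itself separates query string and fragment as intended when it is given the full URL.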

If I test this configuration with the URL /event/showEventList.html?pageNumber=1#key=val, I expect the redirect target to be /events?pageNumber=1, but I get /events?pageNumber=1#key=val.

I have a code snippet to test it, which is:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlRewriterRegexTest {

    public static void main(String[] args) {
        String input = "/event/showEventList.html?pageNumber=1#key=val";
        String regex = "^(/event/showEventList)(\\.{1})(\\bhtm\\b|\\bhtml\\b)(\\?{0,1})([a-zA-Z0-9-_=&]{0,}+)(#{0,1})([a-zA-Z0-9-_=&]{0,}+)$";
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(input);   
        System.out.println(matcher.replaceFirst("/events$4$5"));
    }
}

It outputs: /events?pageNumber=1.

Any pointers would be very helpful.

There are 3 best solutions below

1
Tapas Bose On BEST ANSWER

I am answering my own question so that it can help anyone who runs into the same problem in the future.

This has nothing to do with the UrlRewriteFilter framework. After enabling its debug log, I saw that the URL it receives before applying the defined rules does not contain the hash (#). Other SO answers and the browser's network traffic confirmed that the browser does not send the URL fragment to the server, so it is not available in the HttpServletRequest. That is why the regular expression has no effect on the fragment.
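To make the split concrete, here is a small sketch using java.net.URI. It only illustrates how a URL breaks into the part that goes into the HTTP request and the fragment that stays on the client; it is not what the browser or UrlRewriteFilter actually runs. The host localhost:8080, the class name FragmentDemo and the helper requestTarget(...) are assumptions for illustration.

```java
import java.net.URI;

class FragmentDemo {
    // Hypothetical helper: rebuilds what ends up in the HTTP request line,
    // i.e. the path plus the optional query string, never the fragment.
    static String requestTarget(String url) {
        URI uri = URI.create(url);
        String query = uri.getRawQuery();
        return uri.getRawPath() + (query == null ? "" : "?" + query);
    }

    public static void main(String[] args) {
        // Host and port are placeholders, not taken from the question.
        String url = "http://localhost:8080/event/showEventList.html?pageNumber=1#key=val";
        System.out.println("sent to server:  " + requestTarget(url));
        System.out.println("kept by browser: " + URI.create(url).getRawFragment());
    }
}
```

So the rule only ever sees /event/showEventList.html?pageNumber=1 and redirects to /events?pageNumber=1. As far as I know, when a redirect's Location has no fragment, the browser carries the original fragment over to the new URL, which would explain the /events?pageNumber=1#key=val in the address bar.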

Since the hash is available in the browser, I was able to solve the problem on the client side with JavaScript and the HTML5 History API:

<script type="text/javascript">
    window.addEventListener('DOMContentLoaded', (event) => {
        const url = new URL(window.location);
        url.hash = '';
        history.replaceState(null, document.title, url);
    });
</script>
1
wwerner On

I'd simplify the expression a bit.

  • Escape slashes, as they are typically used as delimiters for the regex (\/event\/showEventList)
  • Remove superfluous quantifier (\.)
  • Shorten the html string test (htm(l)?) - careful, this messes with your capturing group numbers
  • Remove word boundary checks around html
  • Use ? instead of {0,1}
  • Use * instead of {0,}
  • Remove possessive quantifier (I don't see why you'd need it)
  • Ignore everything after #, you don't seem to need it in your replacement

This gives us ^(\/event\/showEventList)(\.)(htm(l)?)(\??)([a-zA-Z0-9-_=&]+)*#(.+)$ which substitutes your example to /events?pageNumber=1, assuming the replacement is adjusted to /events$5$6 because the group numbers have shifted.

To play around, see https://regexr.com/4otp7
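If you prefer to try it in plain Java, here is a rough sketch of the same expression. The class name SimplifiedRule and the method rewrite(...) are made up, and the replacement /events$5$6 is my reading of the shifted group numbers (group 5 is the question mark, group 6 the query string).

```java
import java.util.regex.Pattern;

class SimplifiedRule {
    // The simplified expression from above; escaping the slashes is
    // harmless in Java, although it is not required there.
    static final Pattern RULE = Pattern.compile(
        "^(\\/event\\/showEventList)(\\.)(htm(l)?)(\\??)([a-zA-Z0-9-_=&]+)*#(.+)$");

    // Hypothetical helper: applies the rule; if the input does not match,
    // replaceFirst returns it unchanged.
    static String rewrite(String input) {
        return RULE.matcher(input).replaceFirst("/events$5$6");
    }

    public static void main(String[] args) {
        System.out.println(rewrite("/event/showEventList.html?pageNumber=1#key=val"));
        // No fragment: the mandatory # in the pattern prevents a match.
        System.out.println(rewrite("/event/showEventList.html?pageNumber=1"));
    }
}
```

Note that the # is mandatory in this pattern, so a URL without a fragment is left untouched.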

2
Venu On

I've simplified the expression; here is a working rule:

<from>^(\/event\/showEventList\.html?)(\?[a-zA-Z0-9-_=&amp;]*)\#.*$</from>
<to type="redirect" last="true">/events$2</to>

This matches the URL and captures everything from the beginning of the query string up to the first occurrence of #.

Explanation:

Group 1 : Match the url /event/showEventList.html OR /event/showEventList.htm

Group 2 : Match the query string (zero or more characters) up to the first occurrence of #

Group 2 is the string to use for the redirect; the # and everything after it are ignored
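The two groups can be checked with a short Java sketch. The class name TwoGroupRule and the method rewrite(...) are only for illustration; in a Java string literal the ampersand is written as a plain &, whereas in urlrewrite.xml it has to be &amp;.

```java
import java.util.regex.Pattern;

class TwoGroupRule {
    // Group 1: the .htm or .html path, group 2: "?" plus the query string.
    static final Pattern RULE = Pattern.compile(
        "^(\\/event\\/showEventList\\.html?)(\\?[a-zA-Z0-9-_=&]*)\\#.*$");

    // Hypothetical helper: keeps only group 2 behind the new path.
    static String rewrite(String input) {
        return RULE.matcher(input).replaceFirst("/events$2");
    }

    public static void main(String[] args) {
        System.out.println(rewrite("/event/showEventList.html?pageNumber=1#key=val"));
        System.out.println(rewrite("/event/showEventList.htm?pageNumber=1#key=val"));
    }
}
```

Both the .html and the .htm variant should come out as /events?pageNumber=1. Because group 2 starts with a mandatory ?, a URL that has a fragment but no query string does not match this pattern.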
