preg_replace for a specific domain name

I was using str_replace to rewrite URLs to PDFs from https://example.com/documents/en/whatever.pdf to https://example.com/documents/es/whatever_SPANISH.pdf.

This is what I was using:

    if($_COOKIE['googtrans'] == "/en/es") { //Check the google translate cookie
            $text = str_replace('/documents/en/', '/documents/es/', $text);
            $text = str_replace('.pdf', '_SPANISH.pdf', $text);
    }

The problem is that if the page links to a PDF on another site (not my own website), for example https://othersite.example.com/whatever.pdf, it becomes https://othersite.example.com/whatever_SPANISH.pdf, which isn't valid on other people's sites. I want to ignore offsite links and only change URLs on my site.

So what I would like to do is look for the string https://example.com/documents/en/whateverfilename.pdf, pull the file name out, and change it to https://example.com/documents/es/whateverfilename_SPANISH.pdf (switching the en to es and also appending _SPANISH to the end of the PDF filename).

How can I do this? I have tried various preg_replace calls but can't get the syntax right.

Best answer, by The fourth bird

You could do the replacement in one go using a regex with two capture groups, whose values are reused in the replacement.

\b(https?://\S*?/documents/)en(/\S*)\.pdf\b

Or, to change only URLs on your own site, match the domain name:

\b(https?://example\.com/documents/)en(/\S*)\.pdf\b

The first pattern matches:

  • \b A word boundary
  • (https?://\S*?/documents/) Capture group 1, match the protocol and then as few non-whitespace characters as possible, up to the first occurrence of /documents/
  • en Match literally
  • (/\S*) Capture group 2, match / followed by zero or more non-whitespace characters
  • \.pdf\b Match .pdf followed by a word boundary
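
To make the two groups concrete, here is a minimal sketch that runs the first pattern through preg_match and prints what each group captures. The ~ delimiters and the sample URL are assumptions chosen for illustration.

```php
<?php
// Same pattern as above, wrapped in ~ delimiters so the slashes need no escaping.
$regex = '~\b(https?://\S*?/documents/)en(/\S*)\.pdf\b~';

// Sample URL, used only for illustration.
$url = 'https://example.com/documents/en/whateverfilename.pdf';

if (preg_match($regex, $url, $m)) {
    echo $m[1], PHP_EOL; // group 1: https://example.com/documents/
    echo $m[2], PHP_EOL; // group 2: /whateverfilename
}
```

Group 2 stops before .pdf because \S* is greedy and then backtracks just far enough for \.pdf\b to match.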

In the replacement, refer to the two capture groups as $1 and $2:

$1es$2_SPANISH.pdf

Example:

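// ~ is used as the delimiter so the slashes in the URL need no escaping
$regex = '~\b(https?://\S*?/documents/)en(/\S*)\.pdf\b~';
$text = 'https://example.com/documents/en/whateverfilename.pdf';

// Single quotes make it clear that $1 and $2 are group references, not PHP variables
$result = preg_replace($regex, '$1es$2_SPANISH.pdf', $text);

echo $result;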

Output

https://example.com/documents/es/whateverfilename_SPANISH.pdf
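
Since the question is about leaving offsite links alone, here is a hedged sketch of the domain-specific pattern applied to text that contains both kinds of link. The function name toSpanishPdfLinks and the sample text are hypothetical, and the ${1} form is simply the braced spelling of $1 that preg_replace also accepts.

```php
<?php
// Hypothetical helper: rewrites only PDF links under https://example.com/documents/en/.
function toSpanishPdfLinks(string $text): string
{
    // The host is hard-coded (with the dot escaped), so other domains never match.
    $regex = '~\b(https?://example\.com/documents/)en(/\S*)\.pdf\b~';

    // preg_replace returns null on error; fall back to the untouched text in that case.
    return preg_replace($regex, '${1}es${2}_SPANISH.pdf', $text) ?? $text;
}

// Sample text for illustration: one on-site link and one offsite link.
$text = "Manual: https://example.com/documents/en/manual.pdf\n"
      . "Offsite: https://othersite.example.com/whatever.pdf";

echo toSpanishPdfLinks($text), PHP_EOL;
// Manual: https://example.com/documents/es/manual_SPANISH.pdf
// Offsite: https://othersite.example.com/whatever.pdf
```

In the original code this call would sit inside the existing cookie check, for example `if (($_COOKIE['googtrans'] ?? '') === '/en/es') { $text = toSpanishPdfLinks($text); }`, where `?? ''` avoids a warning when the cookie is not set.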

If you want to match the same number of forward slashes as in your example, you can use the negated character class [^\s/], which excludes whitespace characters and forward slashes. In this variant the / after en sits outside group 2, so the replacement becomes $1es/$2_SPANISH.pdf:

\b(https?://[^\s/]+/documents/)en/([^\s/]+)\.pdf\b
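
A short sketch of how this stricter pattern behaves, assuming two sample URLs made up for illustration. Note that the replacement string here writes the / after es explicitly, because in this pattern the slash is not part of group 2.

```php
<?php
// Host and file name may contain neither whitespace nor forward slashes.
$regex = '~\b(https?://[^\s/]+/documents/)en/([^\s/]+)\.pdf\b~';

// The slash after "en" is outside group 2, so it is written out in the replacement.
$replacement = '${1}es/${2}_SPANISH.pdf';

// Sample URLs for illustration: a file directly under /en/ and one in a subdirectory.
$urls = [
    'https://example.com/documents/en/whateverfilename.pdf',
    'https://example.com/documents/en/reports/annual.pdf',
];

$results = [];
foreach ($urls as $url) {
    $results[] = preg_replace($regex, $replacement, $url);
}

echo implode(PHP_EOL, $results), PHP_EOL;
// https://example.com/documents/es/whateverfilename_SPANISH.pdf
// https://example.com/documents/en/reports/annual.pdf
```

The second URL is left unchanged because the extra path segment contains a /, which [^\s/]+ refuses to cross. Whether that is desirable depends on how the documents directory is laid out.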