Get the value of the "url" parameter from a URL query string that is &amp;-delimited


I am using PHP 7.4.1.

I am trying to parse an RSS feed from Google.

My links look like the following:

https://www.google.com/url?rct=j&sa=t&url=https://www.timeslive.co.za/sunday-times/news/2020-11-01-hawks-following-former-steinhoff-ceo-markus-joostes-money/&ct=ga&cd=CAIyGjRm
https://www.google.com/url?rct=j&sa=t&url=https://www.politifact.com/factchecks/2020/oct/31/raphael-warnock/fact-checking-raphael-warnocks-claim-georgia-sen-k/&ct=ga&cd=CAIyGm
https://www.google.com/url?rct=j&sa=t&url=https://www.benzinga.com/news/20/10/18156683/last-weeks-notable-insider-buys-ibm-intel-raytheon-and-more&ct=ga&cd=CAIyGmM3Yjk5YjRlYWU
https://www.google.com/url?rct=j&sa=t&url=https://stocksregister.com/2020/10/31/insider-trading-at-avino-silver-gold-mines-ltd-nyseasm-what-did-we-note/&ct=ga&cd=CAIyGmM3Yjk5Y
https://www.google.com/url?rct=j&sa=t&url=https://www.businessinsider.co.za/who-received-an-sms-from-markus-jooste-2020-10&ct=ga&cd=CAIyGmM3Yjk5YjRlYWU3MWY2MDY6Y29tOmVuOlVT&am
https://www.google.com/url?rct=j&sa=t&url=https://stocksregister.com/2020/10/31/insider-trading-at-veritone-inc-nasdaqveri-what-did-we-note/&ct=ga&cd=CAIyGmM3Yjk5YjRlYWU3MWY2M
https://www.google.com/url?rct=j&sa=t&url=https://heavy.com/sports/las-vegas-raiders/jj-watt-stephon-gilmore-trade-targets/&ct=ga&cd=CAIyGmM3Yjk5YjRlYWU3MWY2MDY6Y29tOmVuOlVT&a
https://www.google.com/url?rct=j&sa=t&url=https://stocksregister.com/2020/10/31/insider-trading-at-truecar-inc-nasdaqtrue-what-did-we-note/&ct=ga&cd=CAIyGmM3Yjk5YjRlYWU3MWY2MD
https://www.google.com/url?rct=j&sa=t&url=https://stocksregister.com/2020/10/31/insider-trading-at-veeco-instruments-inc-nasdaqveco-what-did-we-note/&ct=ga&cd=CAIyGmM3Yjk5YjRl
https://www.google.com/url?rct=j&sa=t&url=https://stocksregister.com/2020/10/31/insider-trading-at-21vianet-group-inc-nasdaqvnet-what-did-we-note/&ct=ga&cd=CAIyGmM3Yjk5YjRlYWU

I would like to get the real link from the url= parameter and cut off everything after it, e.g. &ct=ga&cd=CAIyGjRm.

I tried str_replace; however, stripping the end is difficult because it differs from link to link.

Any suggestions on how to get just the link?


There are 2 best solutions below

anubhava On

You may use this regex in preg_match_all:

(?<=url=)https?:\S+?(?=&amp;|$)

RegEx Details:

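  • (?<=url=): Lookbehind that asserts url= immediately precedes the current position
  • https?:\S+?: Lazily match a URL starting with http: or https:, consuming as few non-whitespace characters as possible
  • (?=&amp;|$): Lookahead that asserts &amp; or the end of the string follows the current position (add the m modifier if $ should also match at the end of each line)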

Code:

php > $s = "https://www.google.com/url?rct=j&amp;sa=t&amp;url=https://www.timeslive.co.za/sunday-times/news/2020-11-01-hawks-following-former-steinhoff-ceo-markus-joostes-money/&amp;ct=ga&amp;cd=CAIyGjRm";
php > preg_match_all('~(?<=url=)https?:\S+?(?=&amp;|$)~', $s, $m);
php > print_r($m[0]);
Array
(
    [0] => https://www.timeslive.co.za/sunday-times/news/2020-11-01-hawks-following-former-steinhoff-ceo-markus-joostes-money/
)
mickmackusa On

Regex is appropriate when there isn't a legitimate / native / reliable technique to parse text.

PHP offers native functions to parse urls and query strings.

The following snippet calls several native functions and WILL run slower than regex, BUT it is far, far less likely to break when your external data source reconfigures its query string. For instance, if they add a parameter such as rawurl=, a pattern that looks for url= is prone to matching it as well. Legitimate parsing technique versus regex (on URLs, valid HTML, BBCode, etc.) is an all-too-common debate, but a developer's primary goal should always be data integrity. Only entertain sacrificing data integrity for execution speed if you are processing inordinately huge volumes of data and the speed boost actually provides a valuable benefit for your system or end users. If you find yourself leaning toward the micro-optimized solution without a sound reason, I'd advise you not to drink that Kool-Aid.
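
To make the rawurl= concern concrete, here is a small sketch. The rawurl parameter and the example.com addresses are hypothetical, invented purely for illustration; the pattern is the lookbehind regex from the other answer, and the parsing call mirrors the approach described here.

```php
<?php
// Hypothetical link: a made-up rawurl= parameter sits before the real url= parameter.
$link = 'https://www.google.com/url?rct=j&amp;rawurl=https://example.com/raw&amp;url=https://example.com/real/&amp;ct=ga';

// Regex approach: the lookbehind (?<=url=) is also satisfied by the tail of "rawurl=".
preg_match_all('~(?<=url=)https?:\S+?(?=&amp;|$)~', $link, $m);
$regexMatches = $m[0];
print_r($regexMatches);

// Parsing approach: parameter names are matched exactly, so only "url" is read.
parse_str(htmlspecialchars_decode(parse_url($link, PHP_URL_QUERY)), $parts);
$parsedUrl = $parts['url'];
echo $parsedUrl, PHP_EOL;
```

In this sketch the regex picks up both the rawurl= value and the url= value, while the parsed array yields only the url= value.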

This is one way that a URL can be parsed and the url value extracted.

Code:

$url = 'https://www.google.com/url?rct=j&amp;sa=t&amp;url=https://www.timeslive.co.za/sunday-times/news/2020-11-01-hawks-following-former-steinhoff-ceo-markus-joostes-money/&amp;ct=ga&amp;cd=CAIyGjRm';

parse_str(
    htmlspecialchars_decode(
        parse_url(
            $url,
            PHP_URL_QUERY
        )
    ),
    $parts
);
echo $parts['url'];

Output:

https://www.timeslive.co.za/sunday-times/news/2020-11-01-hawks-following-former-steinhoff-ceo-markus-joostes-money/
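
If you need to process a whole feed, the same calls could be wrapped in a small function. This is only a sketch: extractTargetUrl is a hypothetical name, and the third link is an invented example of a link with no url parameter, included to show the fallback.

```php
<?php
/**
 * Hypothetical helper (not part of the snippet above): returns the value of
 * the "url" query parameter, or null when the link has no such parameter.
 */
function extractTargetUrl(string $link): ?string
{
    // parse_url() gives null when there is no query string and false on a malformed URL.
    $query = parse_url($link, PHP_URL_QUERY);
    if (!is_string($query)) {
        return null;
    }

    // Turn "&amp;" back into "&" so parse_str() sees ordinary delimiters.
    parse_str(htmlspecialchars_decode($query), $parts);

    return isset($parts['url']) && is_string($parts['url']) ? $parts['url'] : null;
}

$links = [
    // Entity-encoded delimiters, as in the snippet above.
    'https://www.google.com/url?rct=j&amp;sa=t&amp;url=https://www.timeslive.co.za/sunday-times/news/2020-11-01-hawks-following-former-steinhoff-ceo-markus-joostes-money/&amp;ct=ga&amp;cd=CAIyGjRm',
    // Plain "&" delimiters, as the links appear in the question.
    'https://www.google.com/url?rct=j&sa=t&url=https://www.politifact.com/factchecks/2020/oct/31/raphael-warnock/fact-checking-raphael-warnocks-claim-georgia-sen-k/&ct=ga&cd=CAIyGm',
    // Invented link without a url parameter.
    'https://www.google.com/search?q=no-url-parameter',
];

$targets = array_map('extractTargetUrl', $links);
print_r($targets);
```

Because htmlspecialchars_decode() leaves a plain & untouched, the helper should handle links with either delimiter style, and it returns null instead of raising a notice when url is absent.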

I super-love regex, but not for every task. Avoiding regex here will make your script more readable, reliable, and easier to maintain.