Google indexing Cloudfront distribution


I have a static site served through CloudFront with an S3 origin and a custom domain via Route 53. All works well, except that Google has indexed the CloudFront distribution URL (d123etc.cloudfront.net) as well as my custom domain, leading to duplicate content issues.

I've tried canonical URLs, but the distribution remains indexed. It has been suggested to serve a different robots.txt depending on which domain is being used, which sounds fine, but there is no .htaccess or web server, so it would be left to a Lambda@Edge function to send the different robots.txt.

The problem is that I can't find how to determine, inside the function, whether a request is coming from my custom domain or from the direct distribution URL. I've tried whitelisting the Origin header, but it is not sent through when using an S3 origin. I've also tried whitelisting the Referer header, but no referrer is sent when accessing the robots.txt file, as it's a direct request.

For the time being, I'm adding a meta noindex client-side using JavaScript on page load (which I realise is too late), and also redirecting client-side to my actual domain in case someone follows the Google-indexed cloudfront.net domain.

Does anyone know how to detect in Lambda@Edge which domain is being used to make the request? Or some other way of blocking Google from indexing the CloudFront URL, leaving it to index only the custom domain?
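For reference, here is a minimal sketch of the per-domain robots.txt idea, assuming the function is attached to the viewer-request trigger. At that stage the Host header should still carry the domain the viewer actually used; on an origin-request trigger with an S3 origin it is typically the bucket's domain instead. `CANONICAL_HOST` and its value are hypothetical placeholders, not something from the setup described above. Bear in mind that a Disallow rule stops crawling but may not by itself remove URLs that are already indexed.

```python
# Assumption: replace with the real custom domain.
CANONICAL_HOST = "www.example.com"

# robots.txt body that asks all crawlers to stay away.
BLOCK_ALL_ROBOTS = "User-agent: *\nDisallow: /\n"


def get_host(request):
    """Return the lowercased Host header of a CloudFront request, or ''."""
    host_headers = request.get("headers", {}).get("host", [])
    return host_headers[0]["value"].lower() if host_headers else ""


def handler(event, context):
    # Lambda@Edge passes the CloudFront request under Records[0].cf.request.
    request = event["Records"][0]["cf"]["request"]
    host = get_host(request)

    # Any host other than the custom domain gets a generated robots.txt.
    if request["uri"] == "/robots.txt" and host != CANONICAL_HOST:
        return {
            "status": "200",
            "statusDescription": "OK",
            "headers": {
                "content-type": [{"key": "Content-Type", "value": "text/plain"}],
            },
            "body": BLOCK_ALL_ROBOTS,
        }

    # Everything else continues to the S3 origin unchanged.
    return request
```

Returning the request object lets CloudFront carry on as normal, so the custom domain keeps serving whatever robots.txt is in the bucket.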

2

There are 2 best solutions below

5
Sydney Y On

So I think the way to do this would be to set up a redirect on your hosted web server. If you check the Host header of the request and it contains cloudfront.net, send a 301 response code along with your custom domain name.

S3 has a UI way to do this:

https://medium.com/tensult/how-to-do-site-redirection-using-aws-522a4002c645

It seems you'll need a second bucket behind the same CloudFront URL but without the custom domain. Then you can set it to redirect all requests to your custom domain.

The browser or bots would then stop trying cloudfront.net because it doesn't return any content; they would automatically (without the user really noticing) be redirected to your custom domain, and all the links would point to your own domain.
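Since there is no web server in this setup, the Host check and 301 described at the top of this answer could also be done in a Lambda@Edge function. The following is only a rough sketch, assuming a viewer-request trigger (where the Host header is the domain the viewer typed); `CANONICAL_HOST` is a made-up placeholder for the custom domain.

```python
# Assumption: replace with the real custom domain.
CANONICAL_HOST = "www.example.com"


def handler(event, context):
    # Lambda@Edge passes the CloudFront request under Records[0].cf.request.
    request = event["Records"][0]["cf"]["request"]
    host_headers = request.get("headers", {}).get("host", [])
    host = host_headers[0]["value"].lower() if host_headers else ""

    # Requests made directly against the distribution domain are redirected.
    if host.endswith(".cloudfront.net"):
        query = request.get("querystring", "")
        location = "https://" + CANONICAL_HOST + request["uri"]
        if query:
            location += "?" + query
        return {
            "status": "301",
            "statusDescription": "Moved Permanently",
            "headers": {
                "location": [{"key": "Location", "value": location}],
            },
        }

    # Requests for the custom domain pass through to the origin.
    return request
```

The path and query string are kept in the Location value, so a crawler following an indexed cloudfront.net link should land on the matching page of the custom domain.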

0
alexis-donoghue On

I had a similar issue recently, although I have a web server instead of S3. It seems to be a very rare quirk of Google's crawler.

The only possible way it might have learned about the CloudFront URL is via the DNS query. I had a CNAME DNS record pointing to the distribution, so it looked like this:

> nslookup site.com
Non-authoritative answer:
site.com  canonical name = dfsdfsdfsdf.cloudfront.net.
Name:   dfsdfsdfsdf.cloudfront.net
Address: xx.xx.xx.xx

So I switched to an ALIAS A record, and now it's like this:

> nslookup site.com
Non-authoritative answer:
Name:   site.com
Address: xx.xx.xx.xx

I'm not sure this is 100% bulletproof, but this was the only lead I had.