Robots.txt disallow by regex

On my website I have a page for the cart, at http://www.example.com/cart, and another for cartoons, at http://www.example.com/cartoons. How should I write my robots.txt file so that only the cart page is ignored?

The cart page does not accept a trailing slash in the URL, so if I use Disallow: /cart, it will block /cartoons too.

I don't know if something like /cart$ is possible, or whether spider bots will parse it correctly. I don't want to add Allow: /cartoons, because there may be other pages with the same prefix.

There are 2 answers below.

Answer 1

You could explicitly allow and disallow both paths. When multiple rules match a URL, the more specific (longer) path takes precedence:

User-agent: *
Disallow: /cart
Allow: /cartoon

More info is available at: https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
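For example, for http://www.example.com/cartoons both rules match, but Allow: /cartoon (8 characters) is longer than Disallow: /cart (5 characters), so the allow rule wins: /cartoons stays crawlable while /cart itself remains blocked.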

Answer 2

In the original robots.txt specification, this is not possible: it supports neither Allow nor any characters with special meaning inside a Disallow value.

But some consumers support additional things. For example, Google gives a special meaning to the $ sign, where it represents the end of the URL path:

Disallow: /cart$

For Google, this will block /cart, but not /cartoon.

Consumers that don’t give this special meaning will interpret $ literally, so they will block /cart$, but not /cart or /cartoon.

So if you use this, you should target specific bots with User-agent lines, as in the sketch below.
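A minimal sketch of that approach (assumptions: only Googlebot is singled out here, and bots that follow only the original specification will ignore the Allow line in the fallback group):

# Googlebot understands $ as end-of-URL
User-agent: Googlebot
Disallow: /cart$

# All other bots: prefix matching only; original-spec
# parsers will ignore the Allow line
User-agent: *
Disallow: /cart
Allow: /cartoon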

Alternative

Maybe you are fine with crawling but just want to prevent indexing? In that case you could use meta-robots (with a noindex value) instead of robots.txt. Supporting bots will still crawl the /cart page (and follow links, unless you also use nofollow), but they won’t index it.

<!-- in the <head> of the /cart page -->
<meta name="robots" content="noindex" />
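Note that for noindex to work, the page must not also be blocked by robots.txt: a bot that is not allowed to crawl /cart will never fetch the page and so will never see the meta tag.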