What is robots.txt?
Website owners use the /robots.txt file to give instructions about their site to web robots; this is called the Robots Exclusion Protocol. The robots.txt file can ask search engines not to crawl and index parts of the site, which helps deter potential SEO harm.
How does Site Stacker handle robots.txt?
Site Stacker automatically creates a robots.txt file for each Site Channel added in the Sites component. By default, the file allows all search engines to crawl and index anything served at the server level, so everything on your domain is crawlable by default.
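The exact contents Site Stacker generates may differ, but a permissive, allow-all robots.txt typically looks like this:
Ex. User-agent: *
Disallow:
An empty Disallow value blocks nothing, so every crawler may crawl the whole site.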
User-agent Directive
The User-agent directive specifies which crawler should obey a given group of rules. The value can be a wildcard (*), which applies the rules to all crawlers, or the name of a specific crawler:
Ex. User-agent: * - meaning ALL web crawlers should obey this group of rules.
User-agent: Googlebot - meaning ONLY Googlebot should obey this group of rules.
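Both kinds of group can appear in the same file, for example to give Googlebot its own rules while every other crawler follows the general group (the paths here are only placeholders):
Ex. User-agent: Googlebot
Disallow: /not-for-google/
User-agent: *
Disallow: /not-for-anyone/
A crawler obeys only the group that matches it most specifically, so Googlebot follows the first group and ignores the wildcard group.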
Disallow Directive
The Disallow directive specifies which paths are off-limits to web crawlers.
Ex. User-agent: *
Disallow: /some-page
This rule blocks all URLs whose path starts with “/some-page”:
http://example.com/some-page
http://example.com/some-page?filter=0
http://example.com/some-page/another-page
http://example.com/some-pages-will-be-blocked
However, it will not block URLs that do not start with “/some-page”, like:
http://example.com/subdir/some-page
NOTE: Whatever comes after the “Disallow:” is treated as a simple string of characters (with the notable exceptions of * and $). This string is compared to the beginning of the path part of the URL (everything from the first slash after the domain to the end of the URL) which is also treated as a simple string. If they match, the URL is blocked.
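Because matching is done against the beginning of the path, a single slash matches the start of every URL:
Ex. Disallow: / - blocks the entire site for the crawlers in that group.
Disallow: - left empty, this blocks nothing and keeps the whole site crawlable.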
Allow Directive
By default, any page not matched by a Disallow rule is allowed. The Allow directive specifies exceptions to a Disallow rule.
This directive is useful if you have a subdirectory that is “Disallowed” but you want to allow a page from that subdirectory to be crawled.
Ex. User-agent: *
Allow: /do-not-show-me/show-me-only
Disallow: /do-not-show-me/
This example will block these URLs:
http://example.com/do-not-show-me/
http://example.com/do-not-show-me/page-one
http://example.com/do-not-show-me/pages
http://example.com/do-not-show-me/?a=z
However, this will not block these URLs:
http://example.com/do-not-show-me/show-me-only/
http://example.com/do-not-show-me/show-me-only-now-you-see-me
http://example.com/do-not-show-me/show-me-only/page-one
http://example.com/do-not-show-me/show-me-only?a=z
Wildcards
Wildcards are special characters that let you block pages whose paths contain unknown or variable parts. Not every crawler supports them, but major crawlers such as Googlebot and Bingbot do.
* (asterisk) - Matches any sequence of characters. In the example below, it stands for any text between the two directories.
Ex. Disallow: /users/*/profile
This will block these URLs:
http://example.com/users/name-1/profile
http://example.com/users/name-2/profile
http://example.com/users/name-3/profile
And so on…
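The asterisk is not limited to directory names; it stands in for any run of characters. For crawlers that support wildcards, a rule like the following (the parameter name is only an illustration) blocks any URL containing “?filter=”:
Ex. Disallow: /*?filter=
This would block http://example.com/some-page?filter=0 but not http://example.com/some-page.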
$ (dollar sign) - Matches the end of the URL, so the rule applies only when the URL ends at that point.
Ex. Disallow: /page-one$
This will block:
http://example.com/page-one
But will not block:
http://example.com/page-one-of-ten
http://example.com/page-one/
http://example.com/page-one?a=z
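The two wildcards can be combined. A common pattern for crawlers that support wildcards is blocking every URL with a certain file extension, for example all PDF files:
Ex. Disallow: /*.pdf$
This blocks URLs such as http://example.com/files/report.pdf, but not http://example.com/files/report.pdf?download=1, because the $ requires the URL to end right after “.pdf”.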
Robots.txt Precedence and Specificity
Web crawlers like Googlebot and Bingbot do not care about the order in which you list crawl (Allow: /your/file/path) and block (Disallow: /your/file/path) rules within the robots file; they apply the most specific matching rule. If a crawler finds no instructions in the robots.txt file, it assumes the whole website may be crawled and indexed.
Note: When adding a rule to robots.txt in Site Stacker, it is best to use only the ‘User-agent’ and ‘Disallow’ directives if you simply do not want a path to be crawled. Adding a bare ‘Allow: /’ after your Disallow rules can cause crawlers to override them.
Ex. User-agent: *
Disallow: /path/page
Disallow: /path/
Allow: /
Depending on how a crawler resolves conflicting rules, the final ‘Allow: /’ can override all of the Disallow rules and open every path to crawling.
It is also best to list the ‘Allow’ exceptions before the ‘Disallow’ rules, because most search engines other than Google and Bing apply group-member directives in the order they appear.
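Putting these recommendations together, a group that hides a path but still exposes one page inside it could look like this (the paths are only placeholders):
Ex. User-agent: *
Allow: /path/show-this-page
Disallow: /path/
There is no bare ‘Allow: /’ line, and the Allow exception comes before the Disallow rule, so order-based and specificity-based crawlers read it the same way.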
Google Custom Search
Google Custom Search is a tool for searching content inside your website. It searches your indexed pages and returns the content associated with your search query. In Site Stacker, content can be made searchable or unsearchable within the site, which lets the publisher hide pages that contain sensitive content.
To set up page searchability:
1. Log in to Site Stacker.
2. Click Site Planner.
3. Add or edit a content item.
4. On the right side, check ‘Searchable’ if you want the page to be available in searches on your website, or uncheck ‘Searchable’ if you do not.
5. Click Save & Close.
Note: The ‘Searchable’ option only controls whether the page can be found when searching inside your website. If your robots.txt does not ‘Disallow’ the page, web crawlers can still crawl it.
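For example, to also keep web crawlers away from a sensitive page, add a matching Disallow rule to robots.txt (the path below is only a placeholder; use the real path of your item):
Ex. User-agent: *
Disallow: /members/private-page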