What is robots.txt?
Website owners use the /robots.txt file to give instructions about their site to web robots; this is called the Robots Exclusion Protocol. The robots.txt file can ask search engines not to crawl and index parts of the site, which helps deter potential SEO harm.
How does Site Stacker handle robots.txt?
Site Stacker automatically creates a robots.txt file for each Site Channel added in the Sites component. By default, the file allows all search engines to crawl and index anything served at the server level, so everything on your domain is crawlable by default.
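The exact contents Site Stacker generates may differ, but a permissive, allow-all robots.txt typically looks like this:
Ex. User-agent: *
Disallow:
An empty Disallow value blocks nothing, so every crawler may crawl the whole site.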
User-agent Directive
The User-agent directive specifies which crawler should obey a given group of rules. The value can be a wildcard (*), which applies the rules to all crawlers, or the name of a specific crawler:
Ex. User-agent: * - meaning ALL web crawlers should obey this group of rules.
User-agent: Googlebot - meaning ONLY Googlebot should obey this group of rules.
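Both kinds of group can appear in the same file, for example to give Googlebot its own rules while every other crawler follows the general group (the paths here are only placeholders):
Ex. User-agent: Googlebot
Disallow: /not-for-google/
User-agent: *
Disallow: /not-for-anyone/
A crawler obeys only the group that matches it most specifically, so Googlebot follows the first group and ignores the wildcard group.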
Disallow Directive
The Disallow directive specifies which paths are off-limits to web crawlers.
Ex. User-agent: *
Disallow: /some-page
This rule blocks all URLs whose path starts with “/some-page”:
http://example.com/some-page
http://example.com/some-page?filter=0
http://example.com/some-page/another-page
http://example.com/some-pages-will-be-blocked
However, it will not block URLs that do not start with “/some-page”, like:
http://example.com/subdir/some-page
NOTE: Whatever comes after the “Disallow:” is treated as a simple string of characters (with the notable exceptions of * and $). This string is compared to the beginning of the path part of the URL (everything from the first slash after the domain to the end of the URL) which is also treated as a simple string. If they match, the URL is blocked.
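Because matching is done against the beginning of the path, a single slash matches the start of every URL:
Ex. Disallow: / - blocks the entire site for the crawlers in that group.
Disallow: - left empty, this blocks nothing and keeps the whole site crawlable.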
Allow Directive
By default, any page not matched by a Disallow rule is allowed. The Allow directive specifies exceptions to a Disallow rule.
This directive is useful if you have a subdirectory that is “Disallowed” but you want to allow a page from that subdirectory to be crawled.
Ex. User-agent: *
Allow: /do-not-show-me/show-me-only
Disallow: /do-not-show-me/
This example will block these URLs:
http://example.com/do-not-show-me/
http://example.com/do-not-show-me/page-one
http://example.com/do-not-show-me/pages
http://example.com/do-not-show-me/?a=z
However, this will not block these URLs:
http://example.com/do-not-show-me/show-me-only/
http://example.com/do-not-show-me/show-me-only-now-you-see-me
http://example.com/do-not-show-me/show-me-only/page-one
http://example.com/do-not-show-me/show-me-only?a=z
Wildcards
Wildcards are special characters that let you block pages whose paths contain unknown or variable parts. Not every crawler supports them, but major crawlers such as Googlebot and Bingbot do.
* (asterisk) - Matches any sequence of characters. In the example below, it stands for any text between the two directories.
Ex. Disallow: /users/*/profile
This will block these URLs:
http://example.com/users/name-1/profile
http://example.com/users/name-2/profile
http://example.com/users/name-3/profile
And so on…
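The asterisk is not limited to directory names; it stands in for any run of characters. For crawlers that support wildcards, a rule like the following (the parameter name is only an illustration) blocks any URL containing “?filter=”:
Ex. Disallow: /*?filter=
This would block http://example.com/some-page?filter=0 but not http://example.com/some-page.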
$ (dollar sign) - Matches the end of the URL, so the rule applies only when the URL ends at that point.
Ex. Disallow: /page-one$
This will block:
http://example.com/page-one
But will not block:
http://example.com/page-one-of-ten
http://example.com/page-one/
http://example.com/page-one?a=z
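The two wildcards can be combined. A common pattern for crawlers that support wildcards is blocking every URL with a certain file extension, for example all PDF files:
Ex. Disallow: /*.pdf$
This blocks URLs such as http://example.com/files/report.pdf, but not http://example.com/files/report.pdf?download=1, because the $ requires the URL to end right after “.pdf”.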
Robots.txt Precedence and Specificity
Web crawlers like Googlebot and Bingbot do not care about the order in which you list crawl (Allow: /your/file/path) and block (Disallow: /your/file/path) rules within the robots file; they apply the most specific matching rule. If a crawler finds no instructions in the robots.txt file, it assumes the whole website may be crawled and indexed.
Note: When adding a rule to robots.txt in Site Stacker, it is best to use only the ‘User-agent’ and ‘Disallow’ directives if you simply do not want a path to be crawled. Adding a bare ‘Allow: /’ after your Disallow rules can cause crawlers to override them.
Ex. User-agent: *
Disallow: /path/page
Disallow: /path/
Allow: /
Depending on how a crawler resolves conflicting rules, the final ‘Allow: /’ can override all of the Disallow rules and open every path to crawling.
It is also best to list the ‘Allow’ exceptions before the ‘Disallow’ rules, because most search engines other than Google and Bing apply group-member directives in the order they appear.
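Putting these recommendations together, a group that hides a path but still exposes one page inside it could look like this (the paths are only placeholders):
Ex. User-agent: *
Allow: /path/show-this-page
Disallow: /path/
There is no bare ‘Allow: /’ line, and the Allow exception comes before the Disallow rule, so order-based and specificity-based crawlers read it the same way.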
Google Custom Search
Google Custom Search is a tool for searching content inside your website. It searches your indexed pages and returns the content associated with your search query. In Site Stacker, content can be made searchable or unsearchable within the site, which lets the publisher hide pages that contain sensitive content.
To set up page searchability:
1. Log in to Site Stacker.
2. Click Site Planner.
3. Add or edit a content item.
4. On the right side, check ‘Searchable’ if you want the page to be available in searches on your website, or uncheck ‘Searchable’ if you do not.
5. Click Save & Close.
Note: The ‘Searchable’ option only controls whether the page can be found when searching inside your website. If your robots.txt does not ‘Disallow’ the page, web crawlers can still crawl it.
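For example, to also keep web crawlers away from a sensitive page, add a matching Disallow rule to robots.txt (the path below is only a placeholder; use the real path of your item):
Ex. User-agent: *
Disallow: /members/private-page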