If you work on a marketing team or build websites, you probably want people to be able to find your site. You also need search engine bots to crawl and index your website’s numerous web pages so that they may be included in search results.
Robots.txt and an XML sitemap are two separate files on the technical side of your website that assist these bots in finding the information they require.
Table Of Contents
- What is a Robots.txt file?
- XML Sitemaps
- XML sitemap reference
- How to update your Robots.txt file to include your XML sitemap
- What happens if you have more than one sitemap?
- Last thoughts
What is a Robots.txt file?
Robots.txt is a plain text file placed in the root directory of your website. Search engine robots read the instructions in this file to learn which pages on your website they can and cannot crawl.
The robots.txt file can also be used to keep particular robots away from the website. A website in development, for instance, might block all robots until it is ready for launch.
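As a rough illustration of how a crawler interprets these rules, the sketch below uses Python's standard-library `urllib.robotparser`. The robots.txt content, the crawler names `ExampleBot` and `OtherBot`, and the URLs are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is blocked from the whole site,
# every other crawler is only blocked from /drafts/.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /drafts/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt lets user_agent crawl url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleBot", "OtherBot"):
        for url in ("https://www.example.com/blog/", "https://www.example.com/drafts/post"):
            print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

Keep in mind that robots.txt is a request to well-behaved crawlers, not an access control mechanism.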
When visiting a website, crawlers usually start by reading the robots.txt file. It is best practice to include a robots.txt file on your website even if you want to let all robots access every page.
The robots.txt file should also list the address of another crucial file, the XML sitemap, which provides details about every page on your website that you want search engines to find.
This article shows how and where to reference the XML sitemap in the robots.txt file. But first, let’s look at what a sitemap is and why it’s crucial.
What does the robots.txt file do?
Search engines index the web by spidering pages, following links from site A to site B to site C, and so on. Before a search engine spiders any page on a domain it hasn’t encountered, it will open that domain’s robots.txt file, which tells the search engine which URLs on that site it’s allowed to crawl.
Search engines typically cache the contents of the robots.txt but will usually refresh it several times a day so that changes will be reflected fairly quickly.
Where should I put my robots.txt file?
The robots.txt file should always be at the root of your domain. So if your domain is www.example.com, it should be found at https://www.example.com/robots.txt.
It’s also very important that your robots.txt file is called robots.txt. The name is case-sensitive, so get that right, or it won’t work.
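To make the location rule concrete, here is a small sketch that derives the robots.txt URL from any page URL on the same host. The function name `robots_txt_url` is hypothetical; the code only relies on the standard-library `urllib.parse` module.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"expected an absolute URL, got {page_url!r}")
    # The path is always the lowercase "/robots.txt" at the root of the host;
    # the original path, query string, and fragment are dropped.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_txt_url("https://www.example.com/blog/some-post?ref=home"))
```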
XML Sitemaps
An XML sitemap is an XML file that lists all the pages on a website that you want robots to discover and access.
For example, you may want search engines to access your blog posts for them to appear in the search results. However, you might not want them to have access to your tag pages since these may not make good landing pages and should, therefore, not be included in the search results.
XML sitemaps can also contain additional information about each URL in the form of metadata. And just like robots.txt, an XML sitemap is a must-have. It’s important to make sure search engine bots can discover all of your pages and help them understand your pages’ importance.
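For readers who have not seen one, the sketch below builds a very small XML sitemap with Python's standard library. The page URLs and dates are invented, the helper name `build_sitemap` is hypothetical, and `lastmod` is used here as one example of the optional per-URL metadata defined by the Sitemaps protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages: list[tuple[str, str]]) -> str:
    """Build sitemap XML from (absolute URL, last-modified YYYY-MM-DD) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod  # optional metadata
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Hypothetical blog posts; tag pages are simply left out of the list.
    print(build_sitemap([
        ("https://www.example.com/blog/first-post", "2024-01-15"),
        ("https://www.example.com/blog/second-post", "2024-02-01"),
    ]))
```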
How do sitemaps & robots.txt relate?
In 2006, Yahoo, Microsoft, and Google joined forces to support a standardized protocol for submitting a website’s pages through XML sitemaps. You had to submit your XML sitemaps through Google Search Console, Bing Webmaster Tools, and Yahoo, while some other search engines, such as DuckDuckGo, relied on results from Yahoo and Bing.
About six months later, in April 2007, they announced support for Sitemaps Autodiscovery, a way to locate XML sitemaps through robots.txt. This meant that it was fine not to submit the sitemap to each search engine individually. They would first find the sitemap location in your website’s robots.txt file.
Most search engines still accept sitemap submissions, and remember that Google and Bing aren’t the only search engines.
As a result, the robots.txt file has become even more important for web admins, because it gives them an easy way to help search engine robots find all the pages on their website.
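The autodiscovery idea can be sketched with Python's standard-library `urllib.robotparser`, whose `site_maps()` method (available since Python 3.8) returns the sitemap URLs listed in a robots.txt file. The robots.txt content and the helper name `discover_sitemaps` are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that allows everything and references one sitemap.
ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: https://www.example.com/sitemap.xml
"""


def discover_sitemaps(robots_txt: str) -> list[str]:
    """Return the sitemap URLs referenced in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


if __name__ == "__main__":
    print(discover_sitemaps(ROBOTS_TXT))
```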
How can I make a robots.txt reference to my sitemap?
Referencing your XML sitemap(s) in your robots.txt file is considered good practice. Google’s sitemap documentation also suggests it.
The ground rules are as follows:
- Refer to your XML sitemap’s absolute URL.
- You can reference multiple XML sitemaps.
- You can reference both regular XML sitemaps and XML sitemap indexes.
- You can reference XML sitemaps hosted on another domain, for example when your domain is example.com and your XML sitemap is on example2.com.
We also suggest submitting your XML sitemaps using Bing Webmaster Tools and Google Search Console.
XML sitemap reference
See the example below for accurate referencing for XML sitemaps:
User-agent: *
Disallow:
Sitemap: https://www.website.com/page.xml
Sitemap: https://www.website.com/post.xml
Sitemap: https://www.website.com/categories.xml
Sitemap: https://www.website.com/users.xml
Incorrect XML sitemap reference
User-agent: *
Disallow:
Sitemap: /post.xml
Accurate XML sitemap reference but disallowed
User-agent: *
Disallow: /
Sitemap: https://www.website.com/pages.xml
Sitemap: https://www.website.com/posts.xml
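The two mistakes in the examples above (a relative sitemap reference and a blanket `Disallow: /`) can be caught with a simple check. The following is only a minimal sketch: the function name `check_robots_txt` is hypothetical, and it deliberately ignores which user-agent group a Disallow line belongs to.

```python
from urllib.parse import urlsplit


def check_robots_txt(robots_txt: str) -> list[str]:
    """Return warnings for relative sitemap URLs and site-wide disallows."""
    warnings = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "sitemap":
            parts = urlsplit(value)
            if parts.scheme not in ("http", "https") or not parts.netloc:
                warnings.append(f"sitemap reference is not an absolute URL: {value}")
        elif field == "disallow" and value == "/":
            warnings.append("'Disallow: /' blocks crawling of the whole site")
    return warnings


if __name__ == "__main__":
    print(check_robots_txt("User-agent: *\nDisallow:\nSitemap: /post.xml\n"))
```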
How to update your Robots.txt file to include your XML sitemap
To add the location of your XML sitemap to your robots.txt file, follow these three easy steps:
- Find the Sitemap URL
If a third-party developer created your website, first check whether they provided an XML sitemap. In most cases, the sitemap’s URL is /sitemap.xml. The XML sitemap for https://website.com, for instance, would be located at https://website.com/sitemap.xml
So enter this URL in your browser, replacing “website.com” with your own domain.
Some websites have several XML sitemaps, which calls for a sitemap of sitemaps (known as a sitemap index). If you use the Yoast SEO plugin with WordPress, a sitemap index is automatically created at /sitemap_index.xml.
You may also be able to find your sitemap through Google search by using search operators, as in the examples below:
site:website.com filetype:xml
or
filetype:xml site:website.com inurl:sitemap
However, this will only be effective if Google has previously crawled and indexed your website.
If you have access to your website’s File Manager, you can also look for the XML sitemap file there.
You can create a sitemap if your website doesn’t have one. Several tools can help, such as the XML Sitemap generator, which is free for up to 500 pages; however, you must manually remove any pages you don’t want included. Alternatively, follow the protocol described at Sitemaps.org.
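As a small aid for this step, the sketch below lists the sitemap URLs worth trying in a browser for a given site. The helper name `candidate_sitemap_urls` is hypothetical, and the two paths are only the ones mentioned in this article; a site may publish its sitemap elsewhere.

```python
from urllib.parse import urljoin

# Paths mentioned in this article; other sites may use different ones.
COMMON_SITEMAP_PATHS = ("/sitemap.xml", "/sitemap_index.xml")


def candidate_sitemap_urls(site_url: str) -> list[str]:
    """Return likely sitemap URLs for the site that serves site_url."""
    # A leading "/" makes urljoin resolve each path against the site root.
    return [urljoin(site_url, path) for path in COMMON_SITEMAP_PATHS]


if __name__ == "__main__":
    for url in candidate_sitemap_urls("https://website.com"):
        print(url)
```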
- Find the Robots.txt File
By adding /robots.txt after your domain, for instance, https://website.com/robots.txt, you can see if your website has a robots.txt file.
You must create a robots.txt file and add it to the root directory of your web server if you don’t already have one. You’ll need access to your web server to complete this. It is often placed in the same location as your website’s main “index.html” file.
Depending on the web server software you use, these files may be located in different places. If you are unfamiliar with these files, you might consider seeking a web developer’s assistance.
Remember to name your robots.txt file in all lowercase. Avoid names such as Robots.TXT or Robots.Txt.
- Add the Location of the Sitemap to Robots.txt File
Now open the robots.txt file located at the site’s root. Once more, you must have access to your web server to do this. So, if you’re unsure where to find and change the robots.txt file for your website, contact a web developer or your hosting provider for assistance.
The Sitemap directive that gives the sitemap location can be placed anywhere in the robots.txt file. Its position doesn’t matter because it is independent of the user-agent line.
To see how this looks on a live site, visit your favorite website and add /robots.txt to the end of the domain, for instance https://website.com/robots.txt.
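If you prefer to script this step, the following sketch adds a Sitemap line to existing robots.txt text and does nothing when the same sitemap is already referenced. The function name `add_sitemap` and the sample content are assumptions; writing the result back to your server is left out.

```python
def add_sitemap(robots_txt: str, sitemap_url: str) -> str:
    """Return robots_txt with a Sitemap line for sitemap_url, added only once."""
    for line in robots_txt.splitlines():
        field, _, value = line.partition(":")
        if field.strip().lower() == "sitemap" and value.strip() == sitemap_url:
            return robots_txt  # already referenced, nothing to do
    if robots_txt and not robots_txt.endswith("\n"):
        robots_txt += "\n"
    # Position does not matter, so the directive is simply appended.
    return robots_txt + f"Sitemap: {sitemap_url}\n"


if __name__ == "__main__":
    current = "User-agent: *\nDisallow:\n"
    print(add_sitemap(current, "https://website.com/sitemap.xml"))
```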
What happens if you have more than one sitemap?
According to Google’s and Bing’s sitemap guidelines, an XML sitemap shouldn’t contain more than 50,000 URLs or exceed 50 MB when uncompressed. If your website is large and has a lot of URLs, you can therefore create multiple sitemap files.
You must then list the locations of all the sitemap files in a sitemap index file. The sitemap index file, a sitemap of sitemaps, has an XML format similar to that of a regular sitemap file.
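A minimal sketch of both ideas follows: splitting a long URL list into chunks that respect the 50,000-URL limit, and building a sitemap index that points at the resulting files. The helper names and the sitemap file URLs are hypothetical, and the 50 MB size limit is not checked here.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000  # URL limit quoted in this article


def split_urls(urls: list, limit: int = MAX_URLS_PER_SITEMAP) -> list[list]:
    """Split urls into consecutive chunks of at most `limit` items."""
    return [urls[i:i + limit] for i in range(0, len(urls), limit)]


def build_sitemap_index(sitemap_urls: list[str]) -> str:
    """Build sitemap index XML listing the given sitemap file URLs."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc in sitemap_urls:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc
    body = ET.tostring(index, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    print(build_sitemap_index([
        "http://website.com/sitemap_pages.xml",
        "http://website.com/sitemap_posts.xml",
    ]))
```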
When you have multiple sitemaps, you can reference the URL of your sitemap index file in your robots.txt file, as in the example below:
- Sitemap: http://website.com/sitemap_index.xml
Alternatively, you can list the URL of every sitemap file individually, as in the examples below:
- Sitemap: http://website.com/sitemap_pages.xml
- Sitemap: http://website.com/sitemap_posts.xml
Last thoughts
You should now understand how to add a sitemap location to a robots.txt file. Doing so has real benefits, so don’t put it off. Use the information above to help you!
I've worked for WooRank and SEOptimer, and on a cool SEO audit tool called SiteGuru.co. I have since built Linkilo and the SEO RANK SERP WordPress theme. I've been in the SEO industry for more than 5 years, learning from the ground up. I've worked on many startups, but also have my own affiliate sites.