I read some related posts on this topic here, but they all seem to give different answers. On top of that, I am terrible at robots.txt.

I just started a Q&A site, which is actually on a subdomain of my other content site. I created some categories, and there are a few questions and answers so far, but after just a few days I already have well over 50 Q&A pages indexed in Google. Many categories are still empty, yet they are indexed too.

So what are the best practices for the robots.txt file? I don't mind the actual questions and answers being indexed, but tags, categories, users and the like probably should not appear in search. Can anyone help?
Q2A version: the latest

1 Answer

Good question.

I think that, at minimum, you should disallow anything that is not content related:


User-agent: *
Disallow: /login
Disallow: /ask
Disallow: /forgot
Disallow: /register
Disallow: /questions?sort
Disallow: /admin
In addition, you should disallow anything that repeats the same content under different URLs. For example, /Activity/12345, /12345 and /Category/12345 all show the same content. How you choose to do this depends on how your site is organized.
I would like to see some more discussion on this topic.
Thank you very much for taking the time to answer. I will get started with these, and we'll see if anyone comes up with more suggestions. :)
It would also be good to add:
Disallow: /message/
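
If you want to sanity-check rules like the ones in this thread before uploading them, here is a minimal sketch using Python's standard urllib.robotparser module. The host qa.example.com, the sample paths and the is_allowed helper are made up for illustration, so swap in your own. Note that this parser does plain prefix matching and does not understand the wildcard extensions some search engines support, so treat the result as a rough check rather than a guarantee of how Google will behave.

```python
from urllib.robotparser import RobotFileParser

# Rules collected from this thread; adjust to match your own site.
RULES = """\
User-agent: *
Disallow: /login
Disallow: /ask
Disallow: /forgot
Disallow: /register
Disallow: /questions?sort
Disallow: /admin
Disallow: /message/
"""


def is_allowed(url, rules=RULES, agent="*"):
    """Return True if `agent` may fetch `url` under the given rules.

    Hypothetical helper for illustration; it parses the rules from a
    string instead of downloading a live robots.txt.
    """
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    base = "https://qa.example.com"  # assumed host, not a real site
    samples = (
        "/login",
        "/admin/general",
        "/message/someuser",
        "/12345/sample-question",
    )
    for path in samples:
        verdict = "allowed" if is_allowed(base + path) else "blocked"
        print(f"{path}: {verdict}")
```

With these rules, the login, admin and message paths should come back blocked, while an ordinary question page stays crawlable.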