shouldn’t we disallow the indexing of folders like wp-admin, wp-content or wp-includes?
I certainly have. Those files are for core WordPress use only. There’s no reason for them to be indexed. In all honesty, you could use robots.txt to ban everything except index.php without seeing any pagerank impact.
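For anyone who wants to try this, here is a minimal sketch of such a robots.txt, checked with Python's standard urllib.robotparser. It assumes WordPress is installed at the site root; the sample paths are made up for illustration. Keep in mind that wp-content also holds themes and uploads, so blocking it would hide images from crawlers as well.

```python
from urllib.robotparser import RobotFileParser

# Sketch of a robots.txt that keeps crawlers out of the core folders
# named in this thread. Assumes WordPress lives at the site root.
CORE_RULES = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
"""

core_parser = RobotFileParser()
core_parser.parse(CORE_RULES.splitlines())

# Hypothetical paths, for illustration only.
sample_paths = [
    "/wp-admin/index.php",
    "/wp-includes/js/quicktags.js",
    "/wp-content/uploads/photo.jpg",
    "/2006/01/example-post/",
]

for path in sample_paths:
    allowed = core_parser.can_fetch("Googlebot", path)
    print(path, "allowed" if allowed else "blocked")
```

This only shows how a crawler that honours robots.txt would read the rules. It says nothing about bots that ignore the file.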
that’s what i thought. but are you sure?
shouldn’t it be then a default for wordpress?
wp-admin and wp-includes: the contents of those are probably identical for all wp blogs. The bandwidth saving is probably minimal compared to the savings someone could make by really doing it properly.
User-Agent: Googlebot
Disallow: /
That’s what you need :p
come on podz, i’m serious
why is it not a default, while nofollow is a default? how come?
and i’m not criticizing anything. i’m just curious
Google make the rules.
Google took nofollow – a perfectly legitimate w3c tag – and used it to try and clean up the mess that they made. No doubt Google will again try to make everyone else pay in some way when their nefarious ways screw things over. No, I’m not a fan.
Either way – would it save bandwidth? Yes, some.
Would the time be better served by optimising other areas of your site ? Yes.
Does G-bot probably ignore it anyway ? Yes
Should WP ship with or advise about robots.txt. No – that’s an end-user decision based upon knowledge.
why is it not a default, while nofollow is a default? how come?
Now you’re talking about two entirely different things here. nofollow tells Googlebot not to give any ranking credit to links in the comments, whereas robots.txt can be altered to give Googlebot very detailed instructions about what to crawl. As for robots.txt’s inclusion with WP, I agree with Podz, “Should WP ship with or advise about robots.txt. No – that’s an end-user decision based upon knowledge.”
Does G-bot probably ignore it anyway ? Yes
Actually, I ban all bots that ignore robots.txt and I’m still indexed by Googlebot several times per day. So, I’m inclined to say that it does not ignore robots.txt (especially since Google recommends robots.txt as a method of Googlebot control).
Bot control is a massive topic, it really is.
Head over to webmasterworld and search for “perfect .htaccess” – it’s a long and complex set of threads and off-shoots.
Another question, re: robots, and in this case google is the search engine in question but the question may apply to other search engines as well.
I post on my blog and the post appears on the index page. I have the options set to show the 10 most recent posts on that page. Eventually, as more posts are added, the post in question gets pushed down the list and rolls over to page 2, page 3, and so on. Normal, fine, no problem.
Now I’m seeing referrals from google that point to my wordpress blog pages, i.e. /page/3/, instead of to the archived post. The person clicks on the google result and doesn’t find the article in question because time has passed and the post which was on page 3 has rolled over to page 4.
My thought is to exclude the index and the /page/ directories, and just let the robots crawl the archives, so a search result will always point to the archived post, and not to the index or /page/ which may or may not have the post on which they are searching.
thoughts? better ideas?
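A rough sketch of the /page/ part of that idea, again checked with Python's urllib.robotparser. It assumes pretty permalinks with paged listings under /page/ at the site root; the post URL is made up. Note that it leaves the front page crawlable: under plain prefix matching, "Disallow: /" would block the whole site, so the index itself cannot be excluded this way.

```python
from urllib.robotparser import RobotFileParser

# Sketch: keep crawlers off the rolling /page/N/ listings while leaving
# individual posts crawlable. Assumes paged listings live under /page/.
PAGED_RULES = """\
User-agent: *
Disallow: /page/
"""

paged_parser = RobotFileParser()
paged_parser.parse(PAGED_RULES.splitlines())

# Hypothetical URLs, for illustration only.
for path in ("/page/3/", "/2006/01/example-post/", "/"):
    allowed = paged_parser.can_fetch("Googlebot", path)
    print(path, "allowed" if allowed else "blocked")
```

Whether search results already pointing at /page/ URLs drop out afterwards is up to the search engine; this only covers what the rules ask for.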