robots.txt


Jad
08-25-2005, 05:17 PM
Hi,
I have some forums being harvested by search engine bots.
What's the best robots.txt configuration to block ALL search engines from those forums only?

Thanks in advance.

BornOnline
08-25-2005, 05:31 PM
Depending on the search engine, it may not even look at your robots.txt.
Validator (http://www.searchengineworld.com/cgi-bin/robotcheck.cgi)

User-agent: *
Disallow: /forum/

The following allows all robots to visit all files: the wildcard "*" matches every robot, and the empty Disallow line blocks nothing.

User-agent: *
Disallow:

This one keeps all robots out.

User-agent: *
Disallow: /

The next one bars all robots from the cgi-bin and images directories:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/

This one bans BadSearch from all files on the server:

User-agent: BadSearch
Disallow: /

This one keeps googlebot from getting at the whatever.htm file (the path must start with a slash):

User-agent: googlebot
Disallow: /whatever.htm
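
If you want to check a rule set before uploading it, here is a minimal sketch using Python's standard urllib.robotparser module. The helper name allowed() and the example.com URLs are made up for illustration, and this only shows how a well-behaved parser reads the rules; a bot that ignores robots.txt is not affected by any of it.

```python
from urllib.robotparser import RobotFileParser


def allowed(rules: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


# Block every robot from the forum directory only.
FORUM_ONLY = """\
User-agent: *
Disallow: /forum/
"""

# Block every robot from the whole site.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

print(allowed(FORUM_ONLY, "googlebot", "http://example.com/forum/index.php"))  # False
print(allowed(FORUM_ONLY, "googlebot", "http://example.com/index.html"))       # True
print(allowed(BLOCK_ALL, "googlebot", "http://example.com/index.html"))        # False
```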

Fred
08-25-2005, 05:44 PM
You can't block all bots... some of them don't respect the standard robots.txt ...
If you have too many bots that don't respect robots.txt, you can use .htaccess to block them... it's pretty easy.

Something like this:


SetEnvIfNoCase User-Agent "^The_super_bot" bad_bot

<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>
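
To see which user-agent strings a pattern like the one above would catch, here is a rough Python sketch of the same match. It assumes Python's re module behaves like Apache's regex engine for a pattern this simple; is_bad_bot() is just an illustrative name, not anything Apache provides.

```python
import re

# Same pattern as the SetEnvIfNoCase line: anchored at the start,
# matched without regard to case.
BAD_BOT = re.compile(r"^The_super_bot", re.IGNORECASE)


def is_bad_bot(user_agent: str) -> bool:
    """Return True if the user-agent string starts with the blocked name."""
    return BAD_BOT.search(user_agent) is not None


print(is_bad_bot("the_SUPER_bot/1.0"))                        # True
print(is_bad_bot("Mozilla/5.0 (compatible; The_super_bot)"))  # False
```

Because of the leading ^, a bot that puts its name later in the string slips through; dropping the ^ would match it anywhere.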

Jad
08-25-2005, 05:55 PM
Thank you, I'll try it and report

Jad
08-25-2005, 05:57 PM
Hey, is there any way to test whether the .htaccess method works, instead of waiting for the next crawl?

Fred
08-25-2005, 06:00 PM
Well... probably by using a browser where you can change your user-agent?
If you use Firefox... there's an extension that can do it, I think...
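
Another way to test, without a browser, is a short script that sends a request with a fake user-agent. This is only a sketch using Python's standard urllib; build_request() and status_for() are illustrative names, and you would pass in your own forum URL.

```python
import urllib.error
import urllib.request


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Build a GET request that announces itself with the given user-agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def status_for(url: str, user_agent: str) -> int:
    """Fetch `url` as `user_agent` and return the HTTP status code."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx responses; the code is what we want to see.
        return err.code
```

If the .htaccess rule is active, status_for(your_url, "The_super_bot/1.0") should come back as 403, while the same call with a normal browser user-agent should still return 200.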

Jad
08-25-2005, 06:04 PM
I'm watching them crawling me now heh

tcpdump -i venet0 port 80

Jad
08-25-2005, 06:37 PM
I'm not sure if I'm being slashdotted or not,
but see this:
17930 23.03% Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com
2744 3.52% Googlebot/2.1 (+http://www.google.com/bot.html)

Fred
08-25-2005, 08:42 PM
I think that if you were slashdotted, you would notice far more than bots ;)

Check the raw logs for a better investigation ...
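
For digging through the raw logs, here is a minimal sketch that counts requests per user-agent. It assumes the Apache "combined" log format, where the user-agent is the last quoted field on each line; the function names are made up for illustration.

```python
import re
from collections import Counter

# In the combined log format the user-agent is the final "quoted" field.
UA_AT_END = re.compile(r'"([^"]*)"\s*$')


def tally_user_agents(lines):
    """Count how many log lines each user-agent string accounts for."""
    counts = Counter()
    for line in lines:
        match = UA_AT_END.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def report(counts):
    """Return (hits, percent of total, user-agent) rows, busiest first."""
    total = sum(counts.values())
    return [(hits, 100.0 * hits / total, agent)
            for agent, hits in counts.most_common()]
```

To run it against a real log, something like `for row in report(tally_user_agents(open("access_log"))): print(row)` would do, where access_log stands in for wherever your server writes its log.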