Robots and Spiders
From DocDroppers
| Author: | StankDawg |
|---|---|
| Date Released: | 10/15/2003 - In 2600 Magazine. |
| Added to DD: | 19:44, 18 Nov 2004 (EST) |
Everyone uses search engines, but did you ever wonder how they choose which
pages to list and which pages to not list? You've all heard stories of
private pages that get listed when they weren't supposed to. What stops
these search engines from digging into your personal information? Well,
without going into a lecture on why you should never store personal
information on a publicly accessible web site, let's talk about how search
engines work.
The World Wide Web was named such because of the cliché that all of the
pages are linked to each other like a spider's web. A search engine starts
by looking at one page and follows all of the links on that page until it
has gathered all of the information into its database. It then follows off-site
links and goes on to do the same thing at all of the sites that are linked
from that original site. This is really no different than a user sitting at
home surfing the web except that it happens at an incredibly high speed. The
program acts as an agent for the search engine. Because it is automated, it
can quickly create and update the engine's database. This automation is akin
to a robot in that it simply does the same repetitious job over and over. In
this case, that job is to build a database of web sites. For these reasons,
the actual program or engine that does the work of crawling across the World
Wide Web is called an "agent", a "spider", or more commonly, a "robot".
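The link-following behavior described above can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine works: the `PAGES` dictionary is a made-up three-page "web" held in memory so that nothing is actually fetched, and the names `LinkExtractor` and `crawl` are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical pages, kept in memory instead of being downloaded.
PAGES = {
    "/index.html": '<a href="/about.html">About</a> <a href="/docs.html">Docs</a>',
    "/about.html": '<a href="/index.html">Home</a>',
    "/docs.html": '<a href="/about.html">About</a> <a href="/secret.html">Oops</a>',
    "/secret.html": "Nothing links out of here.",
}


def crawl(start):
    """Visit every page reachable from start, in breadth-first order."""
    seen = [start]
    queue = deque([start])
    while queue:
        page = queue.popleft()
        parser = LinkExtractor()
        parser.feed(PAGES.get(page, ""))
        for link in parser.links:
            if link not in seen:
                seen.append(link)
                queue.append(link)
    return seen


print(crawl("/index.html"))
```

Note that "/secret.html" ends up in the result simply because one page links to it, which is exactly the behavior the rest of this article worries about.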
"Isn't that a good thing?" Well, it can be. There are many good reasons
for using robots. Obviously, it is very handy to have search engines to
find things on the vast online world. It is even difficult to find
documents on your own site sometimes! Robots are not only for going out and
gathering up data; they can also be personalized and customized for your own
site. One site can easily get into thousands and
thousands of pages, sometimes more. It is very difficult to find and
maintain documents on a site of this size. A robot can do that work for
you. It can report broken links and help you fill in holes or errors on
your sites.
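As a rough sketch of the broken-link report mentioned above, assume a robot has already collected which links appear on which page of your own site. The `SITE` map and the function name `broken_links` are hypothetical, and a real tool would also have to request each page to confirm that it exists.

```python
# Hypothetical site map: each known page and the links found on it.
SITE = {
    "/index.html": ["/about.html", "/news.html"],
    "/about.html": ["/index.html", "/staff.html"],
    "/news.html": ["/index.html"],
}


def broken_links(site):
    """Return (page, link) pairs where the link target is not a known page."""
    return [
        (page, link)
        for page, links in sorted(site.items())
        for link in links
        if link not in site
    ]


print(broken_links(SITE))
```

Here the only pair reported is the link from "/about.html" to "/staff.html", since no such page is in the map.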
"That's great, I want one!" Well, before you go jumping into something,
think it through. There are also many drawbacks to using a spider.
First, you have to write the spider engine efficiently so that it does not
overload your server, and make it smart enough that it does not start
crawling other people's sites and overloading their servers. If everyone
had an agent out there crawling through everyone else's links, the web would
slow to a grinding halt! The most important problem, however, is the one I
mentioned in the opening. A spider will follow links to EVERYTHING that it
sees linked from another page. That means if you have a link to a personal
email, suddenly it isn't personal. Your company's financial documents may
be on there somewhere. Did you have some naughty pictures that only your
husband or wife knew the link to? Can you say "oops"?
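One way a spider might keep itself from wandering onto other people's sites is to resolve each link and compare host names before following it. The sketch below assumes that "same site" means "same host name", which is a simplification; the function name `same_site` and the example.com addresses are made up for illustration.

```python
from urllib.parse import urljoin, urlparse


def same_site(base_url, link):
    """True if link, resolved against base_url, stays on the same host."""
    target = urljoin(base_url, link)
    return urlparse(target).netloc == urlparse(base_url).netloc


BASE = "http://www.example.com/docs/index.html"  # hypothetical start page

print(same_site(BASE, "faq.html"))                 # relative link
print(same_site(BASE, "/images/logo.png"))         # site-rooted link
print(same_site(BASE, "http://www.example.org/"))  # off-site link
```

A polite spider would also pause between requests (for example with `time.sleep`) so that it does not hammer the server it is crawling.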
This raises a big concern over privacy, and rightfully so. Never put
anything on the internet that you don't want people to see. That is a
general word of advice that you should follow regardless of spiders. But
you may have read stories about companies whose internal records are
suddenly found floating around on the internet. Blame hackers? Maybe you
should blame robots and the administrators who do not know how to control
them. All it takes is one site to start the robot and it begins to follow
whatever links it is programmed to follow. Some employees may link to
internal documents. Some databases may allow spiders to query from them.
You never know who may be linking to what, and by not having a well-designed
web site, you may have just taken your top secret project and shared it with
the world.
So you see, there are some good things and some bad things. Luckily, there
are ways that you can control robots and hopefully limit the bad things.
There is a standard called the "robots.txt" exclusion file. It is a simple
ASCII text file that allows you to tell any robot visiting your site what
it can and cannot access. For a sample file, look at this:
www.stankdawg.com/robots.txt
You will notice that there are comments (starting with the "#" sign) and
two other important fields. Proper use of these fields can limit most
search engines and spiders that honor the exclusion file.
The first field is called the "User-agent" string. Each program visiting
your website, human or otherwise, is using a piece of software. For humans,
that software is a web browser such as Mozilla Firebird, Konqueror, or dozens of
others. The name of this agent is sent with every page request. If you
look at raw log files from your web server, you can see who visited your
site, and what agent they used. The majority of them will be Internet
Explorer since most surfers are using the Windows operating system. You can
look at your logs and find some interesting types of clients out there.
Well, since robots are programs too, they also have an agent string. In the
robots.txt file (which must reside in the root directory of your web server
home) you can single out any agent to block it.
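To see which agents are visiting, you can tally the agent strings in your raw logs. The sketch below assumes the widely used "combined" log format, in which the User-agent is the last quoted field on each line; your server may log differently. The sample lines, addresses, and the name `count_agents` are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical log lines in the "combined" format.
LOG_LINES = [
    '192.0.2.1 - - [15/Oct/2003:10:00:00 -0400] "GET / HTTP/1.0" 200 1043 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"',
    '192.0.2.2 - - [15/Oct/2003:10:00:05 -0400] "GET /robots.txt HTTP/1.0" 200 68 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '192.0.2.1 - - [15/Oct/2003:10:00:09 -0400] "GET /about.html HTTP/1.0" 200 512 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"',
]

# Matches the final "..." field at the end of a line.
LAST_QUOTED = re.compile(r'"([^"]*)"\s*$')


def count_agents(lines):
    """Count how many requests each User-agent string made."""
    counts = Counter()
    for line in lines:
        match = LAST_QUOTED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


for agent, hits in count_agents(LOG_LINES).most_common():
    print(hits, agent)
```

Robots often give themselves away by requesting "/robots.txt", as the second sample line does.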
The second field is the actual file or directory that you do not want
accessed. The field name you would use is "Disallow". Both the
"User-agent" and the "Disallow" must be followed by a ":" and then the data
that specifies what you want done. If you want to stop the agent called
"googlebot" from accessing the file called "privatestuff.html" you would
code the following lines:
    # This is a comment above the sample code.
    #
    User-agent: googlebot
    Disallow: /privatestuff.html
    Disallow: /images/mysexypics/
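If you want to check how a well-behaved robot would read these lines, Python's standard library includes a parser for the exclusion file. The sketch below feeds it the sample rules, with the file path written with a leading "/" as the standard expects, and asks whether a given agent may fetch a given URL. The host www.example.com is a placeholder, and "someotherbot" is a made-up agent name.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# This is a comment above the sample code.
#
User-agent: googlebot
Disallow: /privatestuff.html
Disallow: /images/mysexypics/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

SITE = "http://www.example.com"  # hypothetical host

for agent, path in [
    ("googlebot", "/privatestuff.html"),
    ("googlebot", "/images/mysexypics/one.jpg"),
    ("googlebot", "/index.html"),
    ("someotherbot", "/privatestuff.html"),
]:
    print(agent, path, parser.can_fetch(agent, SITE + path))
```

The first two checks come back False and the last two True: the rules only bind the one agent that was named, and only for the listed paths.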
As you can see, the syntax is very simple. What you need to do is think
about which things you want kept hidden from which agents. If you want to
hide several different files or directories, you would use multiple
"Disallow" lines. In the example above, I also block access to the entire
directory called "/images/mysexypics/" which could have been very
embarrassing! Be aware that this only blocks ONE AGENT!
Usually, people do not distinguish one agent from another in practical
application. If something is to be kept hidden, it should be hidden from
all agents, not just "googlebot" as in the example above. One way of doing
this is to use multiple "User-agent" strings. Such a list is never complete,
because there are always new spiders coming out that would not be on your list
unless you constantly update it. The better way to do this is to simply use
the wildcard "*", which tells ALL AGENTS to follow the subsequent "Disallow"
commands. Along the same lines, you can also tell robots to ignore your
entire site by using the "Disallow" string of "/" which will stop the robot
from looking at anything! (Note that the original standard does not allow a
"*" wildcard in the Disallow field; you must specify a path.)
    # This is a global "stop all robots" example
    #
    # Note that comments can be put anywhere on a line,
    # and not just above the fields. They can come after the string.
    #
    User-agent: *     # This string stops ALL robots from going into...
    Disallow: /       # ANY of the directories
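The same standard-library parser can be used to confirm that the wildcard form applies to every agent and that trailing comments are ignored. This is a sketch under the assumption that a robot parses the file the way Python's `urllib.robotparser` does; the helper name `allowed`, the agent names, and the example.com URL are made up.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# This is a global "stop all robots" example
User-agent: *     # This string stops ALL robots from going into...
Disallow: /       # ANY of the directories
"""


def allowed(agent, url):
    """Ask whether agent may fetch url under the rules above."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


for agent in ["googlebot", "someotherbot", "my-homemade-spider"]:
    print(agent, allowed(agent, "http://www.example.com/index.html"))
```

Every agent is refused here, whatever it calls itself.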
An alternative to using the robots.txt file is to use special "meta" tags in
your HTML. Some people may not be able to create a robots.txt file for one
reason or another. Instead, you can add a meta tag to the HTML of every page
that you code. The meta tag name is simply "robots". Its content holds
keywords such as "all", which allows the page to be included in the search
engine, or "none", which stops it from being added. There are other options
as well, but these should suffice for most users.
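To make the meta tag concrete, the sketch below shows two tiny pages carrying the tag and how a cooperating robot might read it. The class `RobotsMeta` and the function `may_index` are hypothetical names. The "noindex" keyword is one of the other options alluded to above and is included here as an assumption about what a robot would check.

```python
from html.parser import HTMLParser


class RobotsMeta(HTMLParser):
    """Records the content of <meta name="robots" ...> if the page has one."""

    def __init__(self):
        super().__init__()
        self.content = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.content = (attrs.get("content") or "").lower()


def may_index(html):
    """A page is indexable unless its robots meta tag says none or noindex."""
    parser = RobotsMeta()
    parser.feed(html)
    if parser.content is None:
        return True  # no tag at all: nothing forbids indexing
    words = {word.strip() for word in parser.content.split(",")}
    return not ({"none", "noindex"} & words)


PRIVATE_PAGE = '<html><head><meta name="robots" content="none"></head></html>'
PUBLIC_PAGE = '<html><head><meta name="robots" content="all"></head></html>'

print(may_index(PRIVATE_PAGE))
print(may_index(PUBLIC_PAGE))
```

As with robots.txt, this only works for robots that bother to look for the tag.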
Now here is the catch... (There is always a catch.) The keyword is "honor"
which I mentioned earlier. While most commercial search engines will
currently honor your robots.txt file, it is not a requirement that they do.
It is an optional standard that is not enforced by any agency. That's
right; it's on the honor system. I am sure that there will come a day when
the search engine competition will become so fierce that the engines will
begin to index all pages regardless of exclusion requests so that they will
gain an "advantage" over other search engines. Also, you have to realize
that anyone can write a spider or a robot! Since it is optional whether or
not they honor your exclusion requests, they may still waltz right through
your site and ignore all of your "do not enter" signs. This is the reason that
I mentioned earlier that you should never, EVER, put really personal,
private, or valuable information in a publicly accessible location. There are
many better ways to keep your files safe than robots.txt anyway.
Finally, you should also realize that just because these files are intended for
robots (or programs) to look at, that doesn't mean that humans cannot look at them
as well. I have found many, MANY backdoors and "hidden" entrances simply by looking
at a site's robots.txt file. You have full permission to poke around my robots.txt
files and maybe you will find some interesting super secret 31337 stuff!
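To illustrate how little effort that takes, here is a minimal sketch that pulls every "Disallow" path out of the text of a robots.txt file. The function name `disallowed_paths` and the sample file are made up; none of the directory names come from a real site.

```python
def disallowed_paths(robots_txt):
    """Return every non-empty Disallow path in the file, in order."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# Hypothetical file; the directory names are invented for illustration.
SAMPLE = """\
User-agent: *
Disallow: /admin/
Disallow: /backup/old-site/   # nothing to see here
Disallow: /images/
"""

print(disallowed_paths(SAMPLE))
```

Anything a site lists this way is effectively an advertisement of where the interesting things are kept.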
Further Reading:
http://www.searchengineworld.com/robots/robots_tutorial.htm
http://www.spiderline.com/help/robots.html
Shoutz: As always... my home-dawgs in the DDP, Zearle, Saitou, people who are
willing to read and learn, whoever invented the new Reese's "big cup", and people
who try to use robots.txt files as a substitute for security.
