DNS for a search engine

Danny Mayer mayer at gis.net
Sat Feb 17 02:07:25 UTC 2001


At 04:35 AM 2/16/01, Eric Billingsley wrote:

>How would you configure a set of DNS servers for a search engine?  There
>are two separate issues:
>
>Sites - Name lookups of sites while crawling the web - somewhere around
>1-20 million names
>
>Clients - for log processing on a daily basis - somewhere around 50
>million IPs
>
>My view of the perfect DNS server for this application:
>
>I wouldn't even try to do DNS for the site here.  This will only be for
>the two tasks above.  I'm picturing a caching name server, but standard
>installations just don't work for this.  I need to tweak the cache so
>that entries don't expire for weeks, rather than obeying standard
>TTLs.  I would want a very short timeout for the initial query, but I would
>want to requeue the address on the crawler and have the DNS server try
>the query again immediately with a very long timeout.  That way, the
>next time I try to crawl the site again, the answer would be in the
>cache and not just time out again.  I also want a system where I can
>restart the daemon and preserve the cache (write it to file).  If
>possible, I would even like to be able to copy the cache itself for some
>processes to use directly (very simple format).  I would then want a
>separate thread that would automatically attempt to update the cache
>when my local TTL expires rather than do it at query time.
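
[What's described above amounts to a resolver cache with a local TTL policy,
a fast-then-patient timeout scheme, and a cache that survives restarts. A
minimal sketch in Python, assuming the dnspython library (its resolve() call
takes a per-query lifetime); the timeouts, pickle file format, and retry
queue are illustrative assumptions, not a tested design:

import pickle
import time

import dns.exception
import dns.resolver   # dnspython, an assumed dependency

LOCAL_TTL = 14 * 24 * 3600   # serve cached answers for weeks, ignoring record TTLs
SHORT_TIMEOUT = 1.0          # first attempt fails fast so the crawler keeps moving
LONG_TIMEOUT = 30.0          # the retry pass is willing to wait

resolver = dns.resolver.Resolver()
cache = {}        # name -> (address, time of lookup)
retry_queue = []  # names whose fast lookup timed out

def lookup(name):
    entry = cache.get(name)
    if entry and time.time() - entry[1] < LOCAL_TTL:
        return entry[0]               # answered from our cache, record TTL ignored
    try:
        answer = resolver.resolve(name, "A", lifetime=SHORT_TIMEOUT)
    except dns.exception.Timeout:
        retry_queue.append(name)      # requeue; a patient retry happens off the hot path
        return None
    addr = answer[0].to_text()
    cache[name] = (addr, time.time())
    return addr

def drain_retry_queue():
    # second pass with a long timeout, so the next crawl of the same
    # site finds the answer cached instead of timing out again
    while retry_queue:
        name = retry_queue.pop()
        try:
            answer = resolver.resolve(name, "A", lifetime=LONG_TIMEOUT)
            cache[name] = (answer[0].to_text(), time.time())
        except dns.exception.DNSException:
            pass                      # unresolvable for now; pick it up next crawl

def save_cache(path="dnscache.pickle"):
    # write the cache to a file so a daemon restart does not lose it;
    # a plain pickled dict is simple enough for other processes to read
    with open(path, "wb") as f:
        pickle.dump(cache, f)
]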

         Why not just keep the IP address in the database once you've looked it up
instead of bothering with DNS multiple times?  What did the spider do?

>Beyond that, any way that I can speed up DNS resolution for IPs/names
>not in the cache would be good to know.  If any of you have ever done
>reverse DNS on a log file with 30M IPs you know that ALL of the
>processing time is spent doing DNS rather than actual work.  I used to
>work at AltaVista and this was the biggest pain I ran into.
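
[A side note on the reverse-lookup bottleneck above: almost all of that time
is spent waiting on the network rather than computing, so keeping many
queries in flight at once helps more than any single-server tuning. A rough
sketch using Python's standard library; the worker count is a guess to tune
against what your local resolver can absorb:

import socket
from concurrent.futures import ThreadPoolExecutor

def reverse_lookup(ip):
    # gethostbyaddr blocks on the network, so threads spend their time
    # overlapping that waiting rather than competing for CPU
    try:
        return ip, socket.gethostbyaddr(ip)[0]
    except OSError:
        return ip, None   # no PTR record, or the lookup timed out

def resolve_log_ips(ips, workers=200):
    # keep a few hundred queries in flight at once
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(reverse_lookup, ips))
]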

         Just store the results of the lookup in a database and then try the database
first.  After a while you'll only have a few IPs that are not in the database. In fact
you should combine the crawler's list with this one into one large database. I doubt
that there's much overlap between the two, but it's the same type of data. Make
sure you include a lookup timestamp so you can make decisions about how
stale the addresses are. Did you ask Paul F this question?  I'm sure he would
have some ideas.
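
[A minimal sketch of that lookup database, assuming SQLite; the schema, table
name, and thirty-day staleness cutoff are illustrative choices, not a
recommendation:

import sqlite3
import time

MAX_AGE = 30 * 24 * 3600   # treat answers older than thirty days as stale

db = sqlite3.connect("lookups.db")
db.execute("""CREATE TABLE IF NOT EXISTS lookups (
                  ip        TEXT PRIMARY KEY,
                  hostname  TEXT,          -- NULL when the lookup failed
                  looked_up REAL NOT NULL  -- unix timestamp, for staleness decisions
              )""")

def cached_hostname(ip):
    row = db.execute("SELECT hostname, looked_up FROM lookups WHERE ip = ?",
                     (ip,)).fetchone()
    if row and time.time() - row[1] < MAX_AGE:
        return row[0]   # fresh enough: no DNS traffic at all
    return None         # missing or stale: caller does a real lookup, then records it

def record_lookup(ip, hostname):
    # INSERT OR REPLACE refreshes the timestamp when an address is looked up again
    db.execute("INSERT OR REPLACE INTO lookups VALUES (?, ?, ?)",
               (ip, hostname, time.time()))
    db.commit()
]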

         Danny


