DNS for a search engine

Glenn Bell glennb at mitre.org
Sat Feb 17 15:01:20 UTC 2001


We had a similar problem with our search engine.  We could not use an
external database, as there are no hooks into the product.  DNS TTLs
would time the records out of the cache too soon for the engine's
taste, dramatically increasing indexing time.

Our solution was to set up an NIS master on the machine, serving only
the hosts table.  Apparently NIS isn't as careful about expiring
records as DNS is.  Not a solution I would want for my general user
population, but for the search engine it increased performance by an
order of magnitude.

The search server is running on Solaris 2.6 (patched).
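
For reference, the moving parts on a stock Solaris box look roughly
like this (a sketch, not our exact files; the hosts map is built from
/etc/hosts on the master):

    # /etc/nsswitch.conf on the search server: consult NIS, skip DNS
    hosts:      nis [NOTFOUND=return] files

    # on the NIS master: initialize it, then (re)build the maps
    ypinit -m
    cd /var/yp && make hosts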

Hope this was helpful.

-Glenn

Danny Mayer wrote:
> 
> At 04:35 AM 2/16/01, Eric Billingsley wrote:
> 
> >How would you configure a set of DNS servers for a search engine?  There
> >are two separate issues:
> >
> >Sites - Name lookups of sites while crawling the web - somewhere around
> >1-20 million names
> >
> >Clients - for log processing on a daily basis - somewhere around 50
> >million IPs
> >
> >My view of the perfect DNS server for this application:
> >
> >I wouldn't even try to do DNS for the site here.  This will only be for
> >the two tasks above.  I'm picturing a caching name server, but standard
> >installations just don't work for this.  I need to tweak the cache so
> >that entries don't expire for weeks, rather than obeying standard
> >TTLs.  I would want a very short timeout for the query, but I would
> >want to requeue the address on the crawler and have the DNS server try
> >the query again immediately with a very long timeout.  That way, the
> >next time I try to crawl the site, the answer would be in the
> >cache and not just time out again.  I also want a system where I can
> >restart the daemon and preserve the cache (write it to a file).  If
> >possible, I would even like to be able to copy the cache itself for
> >some processes to use directly (very simple format).  I would then want
> >a separate thread that would automatically attempt to update the cache
> >when my local TTL expires, rather than doing it at query time.
> 
>          Why not just keep the IP address in the database once you've looked it up
> instead of bothering with DNS multiple times?  What did the spider do?
> 
> >Beyond that, any way that I can speed up DNS resolution for IPs/names
> >not in the cache would be good to know.  If any of you have ever done
> >reverse DNS on a log file with 30M IPs, you know that ALL of the
> >processing time is spent doing DNS rather than actual work.  I used to
> >work at AltaVista, and this was the biggest pain I ran into.
> 
>          Just store the results of the lookup in a database and then try the database
> first.  After a while you'll only have a few IPs that are not in the database.  In fact,
> you should combine the crawler's list with this one into one large database.  I doubt
> that there's much overlap between the two, but it's the same type of data.  Make
> sure you include a lookup timestamp so you can make decisions about how
> stale the addresses are.  Did you ask Paul F this question?  I'm sure he would
> have some ideas.
> 
>          Danny
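
For what it's worth, the long-lived cache Eric describes and the
timestamped database Danny suggests collapse into the same thing.  A
minimal sketch in modern Python (the file name, the one-week TTL, and
the tab-separated format are my own assumptions, not anything from the
thread):

    import socket
    import time

    CACHE_FILE = "dns-cache.txt"   # hypothetical path
    LOCAL_TTL = 7 * 24 * 3600      # keep answers a week, ignoring DNS TTLs

    def load_cache(path=CACHE_FILE):
        """Read the cache back in, so a restart preserves it."""
        cache = {}
        try:
            with open(path) as fh:
                for line in fh:
                    name, addr, stamp = line.rstrip("\n").split("\t")
                    cache[name] = (addr, float(stamp))
        except IOError:
            pass                   # no cache file yet; start empty
        return cache

    def save_cache(cache, path=CACHE_FILE):
        """Write the cache out in a very simple, directly usable format."""
        with open(path, "w") as fh:
            for name, (addr, stamp) in cache.items():
                fh.write("%s\t%s\t%d\n" % (name, addr, stamp))

    def lookup(name, cache):
        """Serve from the cache while the entry is younger than LOCAL_TTL;
        otherwise do a real query and record it with a timestamp."""
        now = time.time()
        hit = cache.get(name)
        if hit and now - hit[1] < LOCAL_TTL:
            return hit[0]
        addr = socket.gethostbyname(name)  # raises socket.gaierror on failure
        cache[name] = (addr, now)
        return addr

The short-timeout/requeue behavior would live in the caller: catch the
failure, park the name on a retry queue, and take a second pass with a
long timeout later.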
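
The reverse-DNS pass over the logs is the same cache plus parallelism.
Deduplicate first, since log lines typically repeat addresses heavily,
and keep many PTR queries in flight so the total wall time isn't one
timeout per address.  Another modern-Python sketch, again with
invented names (resolve_ips, MAX_WORKERS):

    import socket
    from concurrent.futures import ThreadPoolExecutor

    MAX_WORKERS = 64    # tune to what your name server tolerates

    def reverse_one(ip):
        """PTR lookup for one address; None when there's no record."""
        try:
            return ip, socket.gethostbyaddr(ip)[0]
        except (socket.herror, socket.gaierror):
            return ip, None

    def resolve_ips(ips):
        """Map each distinct IP to a hostname (or None), in parallel."""
        with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
            return dict(pool.map(reverse_one, set(ips)))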


