DNS for a search engine

Danny Mayer mayer at gis.net
Sat Feb 17 22:18:33 UTC 2001


         It depends on whether you are writing your own search engine or just
using an available one.  I assumed (perhaps wrongly) that Eric was involved
in writing one.

         You also weren't clear as to whether you meant the crawler or the users
running queries.

         Danny
At 10:01 AM 2/17/01, Glenn Bell wrote:
>We had a similar problem with our search engine.  We could not use an
>external database, as there are no hooks into the product.  DNS TTLs
>would time the records out of the cache too soon for the engine's taste,
>dramatically increasing indexing time.
>
>Our solution was to set up an NIS master on the machine that only
>supported the host table.  Apparently NIS isn't as careful about
>expiring records as DNS is.  Not a solution I would want for my general
>user population, but for the search engine, it increased performance by
>an order of magnitude.
>
>The search server is running on Solaris 2.6 (patched).
>
>Hope this was helpful.
>
>-Glenn
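
         For anyone wanting to try what Glenn describes, the client-side half
of it is essentially just the name-service switch ordering on the search host.
A minimal sketch, assuming a Solaris-style /etc/nsswitch.conf and a box
already bound to the NIS domain carrying the hosts map (illustrative only,
not Glenn's actual config):

    # /etc/nsswitch.conf on the search host (sketch)
    # consult the NIS hosts map before falling back to DNS
    hosts:      files nis dns

Since the NIS map never times its entries out, repeat lookups stay local and
only names the map has never seen go out to DNS.
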
>
>Danny Mayer wrote:
> > 
> > At 04:35 AM 2/16/01, Eric Billingsley wrote:
> > 
> > >How would you configure a set of DNS servers for a search engine?  There
> > >are two separate issues:
> > >
> > >Sites - Name lookups of sites while crawling the web - somewhere around
> > >1-20 million names
> > >
> > >Clients - for log processing on a daily basis - somewhere around 50
> > >million IPs
> > >
> > >My view of the perfect DNS server for this application:
> > >
> > >I wouldn't even try to do DNS for the site here.  This will only be for
> > >the two tasks above.  I'm picturing a caching name server, but standard
> > >installations just don't work for this.  I need to tweak the cache so
> > >that entries don't expire for weeks, rather than obeying standard
> > >TTLs.  I would want a very short timeout for the query, but I would
> > >want to requeue the address on the crawler and have the DNS server try
> > >the query again immediately with a very long timeout.  That way, the
> > >next time I try to crawl the site again, the answer would be in the
> > >cache and not just time out again.  I also want a system where I can
> > >restart the daemon and preserve the cache (write it to file).  If
> > >possible, I would even like to be able to copy the cache itself for some
> > >processes to use directly (very simple format).  I would then want a
> > >separate thread that would automatically attempt to update the cache
> > >when my local TTL expires rather than do it at query time.
> > 
> >          Why not just keep the IP address in the database once you've looked it up
> > instead of bothering with DNS multiple times?  What did the spider do?
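
         To make that concrete, something along these lines is the sort of
cache-it-yourself approach being talked about: the crawler keeps its own
lookup table with its own expiry, saved in a flat file it can reload on
restart or hand to other processes.  A rough, untested Python sketch; the
file name, field layout and one-week expiry are all invented:

    # Crawler-side host cache: name -> (address, time of lookup).
    # Persisted as one "name<TAB>ip<TAB>epoch" line per host so other
    # processes can read it directly.  All names here are hypothetical.
    import socket
    import time

    CACHE_FILE = "host_cache.txt"
    LOCAL_TTL = 7 * 24 * 3600        # keep answers a week, ignoring the records' own TTLs

    def load_cache(path=CACHE_FILE):
        cache = {}
        try:
            with open(path) as f:
                for line in f:
                    name, ip, ts = line.rstrip("\n").split("\t")
                    cache[name] = (ip, float(ts))
        except FileNotFoundError:
            pass
        return cache

    def save_cache(cache, path=CACHE_FILE):
        with open(path, "w") as f:
            for name, (ip, ts) in cache.items():
                f.write("%s\t%s\t%d\n" % (name, ip, ts))

    def resolve(name, cache):
        """Return the cached address if still fresh, otherwise ask DNS once."""
        hit = cache.get(name)
        if hit and time.time() - hit[1] < LOCAL_TTL:
            return hit[0]
        try:
            ip = socket.gethostbyname(name)   # the stub resolver's own timeout/retry applies
        except OSError:
            return None                       # caller requeues the host and retries later
        cache[name] = (ip, time.time())
        return ip

The "refresh in the background" part of the wish list would then just be a
second loop walking the saved file and re-resolving entries whose local
timestamp is about to lapse, instead of doing it at query time.
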
> > 
> > >Beyond that, any way that I can speed up DNS resolution for IPs/names
> > >not in the cache would be good to know.  If any of you have ever done
> > >reverse DNS on a log file with 30M IPs you know that ALL of the
> > >processing time is spent doing DNS rather than actual work.  I used to
> > >work at AltaVista and this was the biggest pain I ran into.
> > 
> >          Just store the results of the lookup in a database and then try the database
> > first.  After a while you'll only have a few IPs that are not in the database. In fact
> > you should combine the crawler's list with this one into one large database. I doubt
> > that there's much overlap between the two, but it's the same type of data. Make
> > sure you include a lookup timestamp so you can make decisions about how
> > stale the addresses are. Did you ask Paul F this question?  I'm sure he would
> > have some ideas.
> > 
> >          Danny
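
         For the log-processing side, the same idea in miniature: look in the
database first, and only go out to DNS for addresses it has never seen or
whose entries have gone stale.  A rough, untested Python sketch; the SQLite
file, table name and thirty-day staleness window are all invented:

    # Reverse-lookup cache for log processing: try the local table first,
    # fall back to a PTR lookup only for unknown or stale IPs.
    # File, table and column names are hypothetical.
    import socket
    import sqlite3
    import time

    STALE_AFTER = 30 * 24 * 3600     # how old an answer can get before re-asking

    def open_db(path="ptr_cache.db"):
        db = sqlite3.connect(path)
        db.execute("CREATE TABLE IF NOT EXISTS ptr_cache "
                   "(ip TEXT PRIMARY KEY, name TEXT, looked_up REAL)")
        return db

    def reverse_lookup(db, ip):
        row = db.execute("SELECT name, looked_up FROM ptr_cache WHERE ip = ?",
                         (ip,)).fetchone()
        if row and time.time() - row[1] < STALE_AFTER:
            return row[0]
        try:
            name = socket.gethostbyaddr(ip)[0]
        except OSError:
            name = ip                # no PTR record; remember that too so we don't re-ask
        db.execute("INSERT OR REPLACE INTO ptr_cache VALUES (?, ?, ?)",
                   (ip, name, time.time()))
        return name

In a real batch run you would commit every few thousand rows; the elapsed time
still goes mostly to the IPs that need a fresh PTR lookup, but each of those
only hurts once, and the lookup timestamp gives you the staleness decision
mentioned above.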


