Broadly speaking, a ‘search engine’ is something that permits a visitor to a web site to seek content in a database hosted on that web site. It is the engine that powers the visitor’s search. In this broadest sense, every web site that lets a consumer search through a database for something is a ‘search engine.’ Several characteristics help to distinguish various types of search engines from each other, including their scope, data sources, methods of indexing, and their target audiences and audience objectives.
Search engine scope: Each search engine has a ‘scope.’ In the case of Google, for example, the scope of the search is all web pages (and a few other things, depending on where you are searching on Google). I’ll refer to Google and its ilk as ‘web search engines.’ In the case of Realtor.com, Zillow, and Trulia, which I’ll call ‘real estate aggregator search engines,’ the scope is real estate listings from participating brokers and MLSs. The scope of a local broker’s IDX site, an ‘IDX search engine,’ is the IDX listings from the MLS(s) where that broker is a participant. The local MLS’s consumer-facing search engine usually has the same scope as the local broker’s IDX site.
Search engine data sources: Every search engine has to have a source for its data. In the case of IDX search engines, the source is the MLS’s IDX program, consisting (usually) of limited fields relating to (usually) active listings of brokers in the MLS, provided neither the broker nor the seller has opted out of IDX.
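As a rough illustration of that filtering, the sketch below builds an IDX feed from a handful of MLS records. The field names (`status`, `broker_opted_out`, `seller_opted_out`, `private_remarks`), the `IDX_FIELDS` list, and the sample listings are all hypothetical; real IDX rules and field lists vary from one MLS to another.

```python
from dataclasses import dataclass

# Hypothetical subset of fields an MLS might expose through IDX.
IDX_FIELDS = ("listing_id", "address", "price", "listing_broker")


@dataclass
class Listing:
    listing_id: str
    address: str
    price: int
    listing_broker: str
    status: str                # e.g. "active", "pending", "sold"
    broker_opted_out: bool     # broker has opted out of IDX
    seller_opted_out: bool     # seller has opted out of IDX
    private_remarks: str = ""  # example of a field NOT shared via IDX


def idx_feed(listings):
    """Return limited-field records for active listings with no opt-out."""
    feed = []
    for listing in listings:
        if listing.status != "active":
            continue
        if listing.broker_opted_out or listing.seller_opted_out:
            continue
        feed.append({name: getattr(listing, name) for name in IDX_FIELDS})
    return feed


mls = [
    Listing("A1", "1 Elm St", 250000, "Broker X", "active", False, False, "call first"),
    Listing("A2", "2 Oak St", 300000, "Broker Y", "active", False, True),
    Listing("A3", "3 Fir St", 275000, "Broker X", "sold", False, False),
]
feed = idx_feed(mls)
```

In this toy data only listing A1 survives: A2 is dropped because the seller opted out, A3 because it is no longer active, and the private remarks never reach the feed.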
A real estate aggregator search engine’s data source is usually (a) listing brokers/salespeople uploading their own listings; (b) MLSs or listing syndicators syndicating listings to aggregators on behalf of listing brokers/salespeople; (c) other aggregators that got listings via one of the first two means, sharing listings under a business relationship between the aggregators; or (d) a combination of these.
An MLS’s consumer-facing search engine DOES NOT have the same data source as broker IDX sites. An MLS’s consumer-facing search engine is NOT an IDX site. (If you doubt me, look up the definition of IDX in your rules.) A broker that takes part in IDX could still opt out of having her listings appear on the MLS’s consumer site (and vice versa). This is an important distinction, but it is frequently misunderstood.
Web search engines get their data differently. Google is not interested in indexing only active listings. Its expressed intent is to index everything! It really cannot wait for folks to submit or upload their content. Instead, it sends out an army of ‘spiders,’ ‘robots,’ and ‘web crawlers’ to gather web pages for indexing. Of course, those animistic terms are just metaphors; there are no robots or spiders. A web crawler is really a set of software applications designed to read web pages, follow the links on them to other web pages, and continue in that way until it has visited and gathered all the content it can reach on the web.
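The read-a-page, follow-its-links loop can be sketched in a few lines. This is a minimal illustration, not how Google's crawlers are built: the `TOY_WEB` dictionary stands in for the real web, the URLs are made up, and `crawl` and `LinkExtractor` are hypothetical names. A real crawler would fetch pages over HTTP, respect each site's crawling rules, and run on many machines at once.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML. A real crawler would fetch over HTTP.
TOY_WEB = {
    "http://a.example/": '<a href="http://b.example/">B</a> <a href="http://c.example/">C</a>',
    "http://b.example/": '<a href="http://a.example/">A</a> <a href="http://d.example/">D</a>',
    "http://c.example/": "<p>No links here.</p>",
    "http://d.example/": '<a href="http://missing.example/">dead link</a>',
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch):
    """Breadth-first crawl from seed; returns {url: html} for pages gathered."""
    gathered = {}
    queue = deque([seed])
    seen = {seed}  # URLs already queued, so no page is visited twice
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be fetched; skip it
            continue
        gathered[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return gathered


pages = crawl("http://a.example/", TOY_WEB.get)
```

Starting from a single page, the crawl reaches all four pages in the toy web by following links alone, and it quietly skips the dead link. The `seen` set is what keeps it from looping forever when pages link back to each other.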
Google’s web crawlers go a step further. They gather complete copies of most of the web pages that they view, copies that Google retains on its servers to facilitate indexing and searching. These are called ‘cache’ copies. When you use Google, you are really searching the cache copies, and Google builds its indexes from them.
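To make the role of cache copies concrete, here is a toy sketch in which a search runs entirely against stored copies rather than live pages. The `CACHE` contents, the `build_index` and `search` helpers, and the simple word-splitting rule are illustrative assumptions, not a description of how Google actually stores or indexes pages.

```python
import re
from collections import defaultdict

# Hypothetical cache: URL -> text of the stored copy.
CACHE = {
    "http://a.example/": "Three bedroom home for sale near the lake",
    "http://b.example/": "Lake cabin rental, two bedroom",
    "http://c.example/": "Downtown condo for sale",
}


def build_index(cache):
    """Map each lowercase word to the set of cached URLs containing it."""
    index = defaultdict(set)
    for url, text in cache.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Return URLs whose cached copy contains every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


index = build_index(CACHE)
hits = search(index, "lake bedroom")
```

Nothing in `search` touches the live pages: if a page changes after it was cached, the results keep reflecting the stored copy until the crawler visits again.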
NEXT: Indexing on search engines