Databases vs. Search Engines: The Spatial Locality Bottleneck

So far, the only solution proposed by database vendors for providing acceptable performance on a large volume of information has been to improve the performance of the underlying hardware. In-memory databases like Oracle TimesTen or IBM solidDB require huge amounts of physical memory. Data warehouse appliances like Teradata or Netezza rely on specialised hardware coprocessors. And most recently, as Stephen Arnold points out in his blog, Oracle itself admits that the acquisition of Sun will allow it to build more powerful “systems” by combining Sun’s high-end hardware with Oracle’s database platform.

At Exalead, we believe that Search-Based Applications, or SBAs, are another (I could say more “sustainable”) solution to this problem. The key to efficiently handling large amounts of data is to make sure that data access has strong “Spatial Locality”. Quoting Wikipedia, achieving spatial locality means that “if a particular memory location is referenced at a particular time, then it is likely that nearby memory locations will be referenced in the near future.” The main problem with relational databases is that they have very poor spatial locality, because the objects they store are spread across a large number of different tables. High-end CRM or ERP solutions typically store their data in as many as 65,000 different tables, each table being stored at a different disk location. Imagine how many different disk locations the system needs to touch just to display information about a customer or a product on a call center agent’s screen, or to produce a complex BI report. Poor spatial locality translates into a huge number of disk accesses, and disk access is the main performance bottleneck for databases today.
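To make the locality argument concrete, here is a minimal sketch (all table, field, and function names are invented for illustration; this is not any vendor's actual schema). Assembling one customer view from a normalized schema requires lookups in several tables, which in a real database can mean several distant disk locations, while a denormalized record answers the same question with a single lookup:

```python
# Hypothetical normalized schema: four "tables" (dicts keyed by row id).
customers = {1: {"name": "Acme Corp"}}
contacts  = {1: {"customer_id": 1, "email": "info@acme.example"}}
orders    = {1: {"customer_id": 1, "total": 1200}}
tickets   = {1: {"customer_id": 1, "status": "open"}}

def customer_view_relational(cid):
    # Four separate lookups -- in a real database, potentially four
    # different disk locations (poor spatial locality).
    return {
        "name":    customers[cid]["name"],
        "email":   next(c["email"] for c in contacts.values()
                        if c["customer_id"] == cid),
        "orders":  [o for o in orders.values() if o["customer_id"] == cid],
        "tickets": [t for t in tickets.values() if t["customer_id"] == cid],
    }

# Denormalized record: everything needed for the screen stored together,
# so one lookup touches one contiguous region.
customer_docs = {
    1: {"name": "Acme Corp", "email": "info@acme.example",
        "orders": [{"total": 1200}], "tickets": [{"status": "open"}]},
}

def customer_view_document(cid):
    return customer_docs[cid]

assert customer_view_relational(1)["name"] == customer_view_document(1)["name"]
```

The two functions return the same information; the difference is how many distinct storage locations must be touched to produce it.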

SBAs are built on a very different data model, centered around the notion of a “business item”. A “business item” is a self-contained object corresponding to a “real-life” entity that is manipulated by the application and understood by the end users. For example, in a CRM application, business items would be the Contacts, Opportunities and Leads that are viewed by the business users. Unlike applications built on a relational data model, a business item-centric storage strategy allows for great spatial locality of data, since the pieces of information required to answer complex, multi-criteria search queries are all part of a single business item type, and hence stored close to each other on disk. The performance gap between this localized approach and the spread-out relational data model grows ever wider as the amount of information applications need to store increases.
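A minimal sketch of the idea, assuming a CRM-style application (field names and the `search` helper are invented, not Exalead's actual API): because each “Contact” business item carries all the fields users search on, a multi-criteria query is answered by matching against one item type rather than joining several tables.

```python
# Hypothetical self-contained "Contact" business items: each record
# denormalizes everything a user might search on (company, region, ...).
contacts = [
    {"name": "Alice Martin", "company": "Acme Corp",
     "open_opportunities": 2, "region": "EMEA"},
    {"name": "Bob Chen", "company": "Globex",
     "open_opportunities": 0, "region": "APAC"},
]

def search(items, **criteria):
    # Match every criterion against fields of a single business item;
    # no cross-table joins are needed.
    return [i for i in items
            if all(i.get(k) == v for k, v in criteria.items())]

results = search(contacts, region="EMEA", company="Acme Corp")
assert [r["name"] for r in results] == ["Alice Martin"]
```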




  • Stéphane


Indeed, defining what to put in business items is key to the application’s performance. However, to make an SBA successful, the business item definition must not be driven by performance optimization considerations, but rather by the functional (“business-related”) aspects of the application: what are the key ‘objects’ in my application? What are the items a user expects to retrieve from a search? In practice, the definition of the business items is often driven by the end users’ information requirements, to make sure that the objects stored in the search engine match the users’ mental representation (for example, if users expect to be able to search for a contract by the name of the sales rep who owns the corresponding account, then the name of the sales rep should be part of the “contract” object).
The same data model can lead to different definitions of business items for different populations of users. For example, consider a logistics application where the company groups multiple objects that have the same destination into a single “shipment”. Depending on the application’s business logic and on the users’ way of manipulating the information, the most efficient business item definition could be the shipment as a whole, with the items it contains treated as searchable properties of the “shipment” object; or, on the contrary, the business items could be the actual objects in the shipment, with the shipment’s properties (origin, destination, etc.) attached to each object.

One of the main advantages of SBAs is the flexibility of the data model. The definition of a business item can change multiple times over the application’s life span without requiring the underlying data structures to be rebuilt or heavily maintained. I would go even further: the definition of business items is meant to change often, because users’ needs and expectations evolve quickly, and the SBA is designed to adapt to these changing needs.
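The shipment example above can be sketched in two data shapes (a hypothetical illustration; field names are invented and neither modeling is endorsed by any product):

```python
# Raw source data for one shipment.
raw = {
    "shipment_id": "S-42", "origin": "Paris", "destination": "Lyon",
    "items": [{"sku": "A1", "qty": 3}, {"sku": "B7", "qty": 1}],
}

# Option 1: the shipment as a whole is the business item; the contained
# items become searchable properties of the "shipment" object.
shipment_item = {
    "type": "shipment", "id": raw["shipment_id"],
    "origin": raw["origin"], "destination": raw["destination"],
    "skus": [i["sku"] for i in raw["items"]],
}

# Option 2: each object in the shipment is a business item, with the
# shipment's properties (origin, destination) denormalized onto it.
object_items = [
    {"type": "object", "sku": i["sku"], "qty": i["qty"],
     "shipment_id": raw["shipment_id"],
     "origin": raw["origin"], "destination": raw["destination"]}
    for i in raw["items"]
]
```

Both options store the same facts; which one is “right” depends on whether users think in shipments or in individual objects, exactly as Stéphane argues.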

  • Stephane & other commenters:

    This SBA approach & discussion is very interesting. I am not a programmer, but have been researching some of these questions and database limitations for integration into a platform we are intending to architect. So fair warning and please forgive my neophyte status in this regard.

    Fundamentally, I have come to understand that the computing scale we are heading towards points to key bottlenecks thrown up by RDBMSs, and perhaps also by HTTP.

    One of the things I am looking at is Neo4j. I am wondering if you guys are familiar with it and whether it is relevant to the “spatial locality” discussion. Here’s an extract about Neo4j: “Neo4j is a graph database. It is an embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs rather than in tables. A graph (mathematical lingo for a network) is a flexible data structure that allows a more agile and rapid style of development.”

    “You can think of Neo4j as a high-performance graph engine with all the features of a mature and robust database. The programmer works with an object-oriented, flexible network structure rather than with strict and static tables — yet enjoys all the benefits of a fully transactional, enterprise-strength database.”

  • Stephane, Greg,

    I was at VLDB 2009 too, and indeed I felt many wanted search-like functions built into their engine. However, their thinking about search was somewhat like search was years ago, not the fast and especially rich search that Exalead and others provide nowadays.

    I know from my own experience that search-based applications are much easier to build and deploy than database-driven applications. The challenge of database offloading, and of using search technology like Exalead’s for wading through millions of records, is always to optimize (not flatten) the data structures. Very few search vendors have a good solution here, and I am glad that Exalead is putting so much attention on this. It is clear that ‘spatial locality’ could be a solution to prevent too much flattening of the data structures, but Stephane has not made it clear how you get to the right definition of a business item. That seems to be key to optimal performance, though. Would be interesting to learn more about this.

  • Greg

    When I was at the VLDB 2009 conference last month, I got the impression that database researchers had “search engine” envy. Many paper topics had to do with accelerating search in databases, with the 10-year Award keynote highlighting column-based store as the great idea to speed up access. There, I heard the ideal is to “get everything into main memory” because every disk access is “mortal” in terms of speed. But unless you flatten the web of relational databases into single “business items” you are still accessing tables all over memory, and you can’t provide the spatial locality that Stephane describes, even with column store.