In their just-released book “In-Memory Data Management”, Prof. Hasso Plattner and Dr. Alexander Zeier compare enterprise applications with web search systems (cf. section 1.1, page 8 – download excerpt). However, I think one should be careful with that comparison and be aware of the major differences, which are not obvious at first glance. As an example, I would like to compare financial reporting – representing the class of enterprise applications – with a classical search engine to illustrate two fundamental differences:
Accuracy
Having already touched on this aspect in the same section, Plattner and Zeier point out that end-user expectations differ significantly between a web search engine and the data reported by an enterprise application. Though end-users in both cases expect predictability (running the same operation twice should yield exactly the same result with the same response time), they have different expectations about what is returned. For example, the hit list of a search engine is never expected to be an exhaustive list of websites: no end-user assumes that a single search engine will find all webpages on the Internet that contain a given search term (though a search engine’s goal is to cover as many of these sites as possible in order to provide the best ranking it can). A balance sheet report, on the contrary, absolutely must consider not just 95% of the postings stored in the ledgers; the applicable GAAP even legally requires that all of these values be aggregated. Leaving out just one single record on the debit side, for example, would prevent the balance sheet from balancing, and any auditor would refuse to sign the audit certificate, as the small sketch below illustrates.
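A toy example (all figures invented for this post) makes that requirement tangible: a double-entry ledger only balances if every posting is included, and dropping even one debit posting breaks the invariant an auditor would check.

```python
# Hypothetical double-entry postings: positive = debit, negative = credit.
ledger = [+500, -500, +120, -120, +75, -75]

def balances(postings):
    """The invariant an auditor would check: debits and credits cancel out."""
    return sum(postings) == 0

print(balances(ledger))      # True:  all postings considered
print(balances(ledger[1:]))  # False: one debit posting left out
```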
This fact has a significant impact on the algorithms that can be used. Whereas it is acceptable for a search engine to evaluate only a statistical subset of all records to provide a very good and fast “first guess”, the value of the cash account in the balance sheet can only be determined accurately after every affected posting in the accounting system has been read at least once. Ultimately, the optimization potential of the latter is far more limited than that of the search engine.
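To make the contrast concrete, here is a minimal sketch (the data layout and function names are my own invention, not from the book): a sampling-based estimate in the style of a search engine’s “first guess” next to the exact full scan that a balance sheet requires.

```python
import random

# Hypothetical ledger: one (account, amount) tuple per posting, amounts in cents.
accounts = ["cash", "receivables", "payables"]
postings = [(random.choice(accounts), random.randint(-10_000, 10_000))
            for _ in range(1_000_000)]

def estimated_balance(postings, account, sample_size=10_000):
    """Search-engine style 'first guess': extrapolate from a random sample."""
    sample = random.sample(postings, sample_size)
    hits = sum(amount for acct, amount in sample if acct == account)
    # Scale the sample total up to the full population.
    return hits * len(postings) / sample_size

def exact_balance(postings, account):
    """Accounting style: every posting must be read at least once."""
    return sum(amount for acct, amount in postings if acct == account)

print(estimated_balance(postings, "cash"))  # fast, but almost never exact
print(exact_balance(postings, "cash"))      # the only auditable figure
```

The estimator answers after touching 1% of the data; the exact figure cannot, by definition, skip a single posting.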
Optimization
Moreover, the type of requests hitting a search engine is quite homogeneous: its primary task is to provide high-performance full-text search over associated attributes. Predicates may make a search query more complex, but typical requests contain only a small number of terms – simply because the human brain is limited in formulating complex logical conditions. Furthermore, it is accepted that a significantly more complex search request will also take somewhat longer to execute. Since the volume of “simple” requests is much higher than that of complex ones, the developer implementing the search algorithm will aim for the first type of request to be answered with the fastest possible response time. The target of optimization is therefore clearly defined.

For a business application, however, the task is not that simple: the analytics conducted by management depend on the same piece of data being reported from different angles. A number of aggregations over the same set of data, from totally different perspectives (in fact: with different algorithms), need to be performed so that the human consumer can compare the results and form a holistic view of the underlying facts (see the sketch below). The response times of all these aggregation requests must be approximately equal – a significant deviation among them is not accepted by the end-user community and is taken as a sign that “the report is not working properly”. These algorithms can therefore only be optimized for large data volumes to a certain degree, which means they must remain sort of “general purpose” (which, of course, is no carte blanche for badly performing algorithms).
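As a rough illustration (the dimensions and record layout below are invented for this post), the same set of postings can be rolled up along several independent dimensions with one generic routine; none of the roll-ups can be specially tuned without the others falling behind, which is exactly the uniform-response-time constraint described above.

```python
from collections import defaultdict

# Hypothetical postings carrying several reporting dimensions.
postings = [
    {"account": "4000", "region": "EMEA", "period": "2011-03", "amount": 120},
    {"account": "4000", "region": "APJ",  "period": "2011-03", "amount":  80},
    {"account": "5100", "region": "EMEA", "period": "2011-04", "amount": -40},
]

def aggregate(postings, dimension):
    """General-purpose roll-up: one full pass over the data, whatever the angle."""
    totals = defaultdict(int)
    for posting in postings:
        totals[posting[dimension]] += posting["amount"]
    return dict(totals)

# End-users expect each of these views to come back in roughly the same time.
for dim in ("account", "region", "period"):
    print(dim, aggregate(postings, dim))
```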
Summary
One can very much hope that business applications will learn a great deal from the performance that search engines exemplify today. However, there are natural boundaries that lie in the nature of the use case; we have just touched on two of them in this post. Leveraging the latency gap between main memory and disk by building in-memory applications is a seminal idea to drive innovation, but one should not forget that business economics makes heavy use of “special conditions” and thus often leads to complex, hardly optimizable algorithms where all data needs to be read before the requested result can be delivered. This places an upper bound on the achievable performance simply by the rules of theoretical computer science. Let’s see how far we can push that border.