Over the past few months, I’ve been asked several times, including on a couple of podcasts, to talk about Omnisearch’s founding story and reflect on the very early days of the product. That sent me back to the early demo videos I made for what was then called Caption. The product was pretty bare-bones: a simple audio/video search API. You could send audio or video URLs for processing, then use the API to run search queries, get the exact timestamps where matches occur, and integrate those results back into your site.
It also got me reflecting on the thought process that nudged us to build Omnisearch in the first place. At that point, we had a vague intuition that search technology was overly text-focused and that better tools were required for the many media types found both on the open web and in companies’ own “walled gardens.”
There were several reasons why this was urgently needed. For instance, the initial spark for Omnisearch came from not being able to find information inside hundreds of hours of training videos at Amazon. The online education segment is huge overall, and it was growing briskly during the COVID pandemic. Meanwhile, podcasts seemed to have taken over the world. And culturally, newer tech-savvy generations have fundamentally different preferences for content consumption, with less of a focus on text (e.g., TikTok).
Overall, the anecdotal evidence seemed to show an increase in the importance of non-textual data and the need to design new search algorithms from the ground up.
The research
The layman’s observations from the previous sections are one thing, but how can they be tested with scientific rigor? And what are the future repercussions of how we consume content on the web?
Well, Professor Anthony Cocciolo of the Pratt Institute took a very successful stab at providing that analysis in his paper “The Rise and Fall of Text on the Web.” The main question he tries to answer is what percentage of all content on the World Wide Web is text, as opposed to images, videos, and other non-textual forms of content, and how this percentage has been changing over the years. The good professor’s methodology is pretty ingenious. I highly recommend reading the original paper, since I believe it’s monumental. Here, I will summarize the gist of it.
The paper begins with selecting a set of relevant sites that can be analyzed throughout the years - for this the Alexa ranking is used. Additionally, only sites that were functional all the way from 2003 to 2014 (the final year the analysis covers) were considered. This yields a lot of the usual suspects, such as Amazon.com, Pepsi.com, NYTimes.com, Slashdot.org, and so forth.
Then, Cocciolo uses the Internet Archive to get historical versions of these sites in three-year increments. This yields a really good real-world sample for observing how content changes over time on the same set of sites.
Last but not least, the final piece of the puzzle is analyzing just how much of the content is actually text. Keep in mind that this isn’t easy, since text can be inside images as well, rather than simply in HTML (as illustrated in Figure 1). To achieve this, the author used a modified version of Project Naptha, which essentially produces bounding boxes for the textual areas in the archived web pages. The percentage of text is then simply calculated as the ratio of the total bounding-box area to the area of the entire page.
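To make that ratio concrete, here is a minimal Python sketch of the calculation. It is not the paper’s code: the Box class, the text_percentage function, and the sample numbers are all hypothetical, and it assumes the bounding boxes don’t overlap and sit fully inside the page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in page pixels (hypothetical structure)."""
    x: float
    y: float
    width: float
    height: float

    @property
    def area(self) -> float:
        return self.width * self.height


def text_percentage(boxes: list[Box], page_width: float, page_height: float) -> float:
    """Return the share of the page covered by text boxes, as a percentage.

    Assumption: boxes do not overlap and lie fully inside the page, so
    their areas can simply be summed.
    """
    page_area = page_width * page_height
    if page_area <= 0:
        raise ValueError("page dimensions must be positive")
    text_area = sum(box.area for box in boxes)
    return 100.0 * text_area / page_area


# Made-up example: two text regions on a 1000 x 800 pixel page.
boxes = [Box(0, 0, 500, 200), Box(0, 300, 400, 250)]
print(text_percentage(boxes, page_width=1000, page_height=800))  # prints 25.0
```

If the detected boxes could overlap, summing areas would double-count the shared region, so a real pipeline would need to compute the area of their union instead.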
What does the paper find? Well, the anecdotal evidence of the declining percentage of text on the web is borne out by the analysis: the percentage of textual content on the analyzed sites declined from a high of 32% in 2005 to about 26% in 2014, the last year of the study, and it seems reasonable to assume it has dropped even more since. This shows a very significant trend in the overall composition of Internet content. Imagine what the numbers would show in 2022. Or 2030!
Traditional search engine perspective
What does this mean for search technology? Traditional search solutions have been designed from the ground up with textual data front and center. This, after all, makes sense: not only is text-based HTML the backbone of the web, but the web’s hyperlink structure also provides a great way to determine sites’ and pages’ relevance relative to others. So it’s not a major surprise that an overwhelming majority of algorithms in information retrieval, from the verbosely named term frequency-inverse document frequency (commonly abbreviated as “tf-idf”) to Google’s core algorithm PageRank (which treats links as de facto scientific citations to help determine relevance), assume textual data and are built on it.
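To see how deeply the text assumption is baked in, here is a minimal Python sketch of tf-idf. It is one common textbook variant (term frequency normalized by document length, natural-log inverse document frequency, no smoothing), not the formula of any particular search engine, and the tf_idf function and sample documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs: list[str]) -> list[dict[str, float]]:
    """Score every term in every document with a basic tf-idf variant.

    tf  = occurrences of the term in the document / document length
    idf = ln(number of documents / number of documents containing the term)
    """
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Made-up corpus: "search" is common, so it scores lower than rarer terms.
docs = ["video search api", "text search engine", "podcast video transcript"]
for doc, doc_scores in zip(docs, tf_idf(docs)):
    print(doc, doc_scores)
```

Note that the input is a list of strings. A video or an image with no accompanying text yields no tokens and therefore no scores at all, which is exactly the blind spot discussed here.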
This holds for consumer-facing search engines that index the open web, such as Google, DuckDuckGo, and Bing, or country-specific variants like Baidu and Naver. It’s also true for their enterprise counterparts like Elastic, Algolia, OpenText, and others that focus on indexing data within companies’ walled gardens. The result, quite simply, is that a large percentage of content on the web doesn’t get properly indexed and thus isn’t properly searchable for the end user.
Nota bene, the results of the research in the previous section actually overstate the percentage of text available to search engines, since they also take into account the text embedded inside images, which is not accessible to search solutions out of the box. With that in mind, it becomes perfectly clear that a different approach is needed if search technology is to move into a new era.
Omnisearch’s role
At Omnisearch, we seek to fundamentally alter the status quo. As we always insist, we’re not a Google-like search engine that crawls and indexes content on the open web. Rather, we focus on site search use cases: indexing companies’ user-facing content and providing a uniquely powerful search experience to their users. The magical thing about Omnisearch is that it makes all site content searchable from a single, unified interface, be it video, images, documents, or text.
There’s still plenty of work to do, but we’re persevering in our mission to make all content as easily and efficiently searchable as text.