Sumit Sharma

Archive for October 2009

How should aggregators work?

There will be a sophisticated analytics engine with adaptive intelligence capabilities that enable self-learning and continuous iteration, in order to better serve the user with relevant data delivery. Such an engine would discover the user’s preferences iteratively and correlate various contextual data to form hypotheses about the user. Based on these hypotheses, it would aggregate data from appropriate providers (chosen for the suitability of each provider’s contextual data characteristics), and eventually be in a position to predict and form insights into what is most relevant to the user. Taking this a step further, the aggregator will be able to suggest new things to the user, bringing the web of unknown data into the known: herein lies the notion of framing serendipity. Going back to a utopian world, an ideal aggregator would function in the fashion depicted in the chart below:

[Figure: How a data/information aggregator should work]
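To make the self-learning idea a little more concrete, here is a minimal sketch in Python of an aggregator that learns from feedback and re-ranks what it delivers. Everything in it is an assumption made for illustration: the `Aggregator` class, the per-topic weights and the `learning_rate` value are hypothetical and are not taken from any real engine, and a real aggregator would draw on far richer contextual data.

```python
from collections import defaultdict


class Aggregator:
    """Toy aggregator: learns topic preferences from feedback and ranks items."""

    def __init__(self, learning_rate=0.5):
        # Hypothetical step size for each piece of feedback.
        self.learning_rate = learning_rate
        # Learned preference weight per topic; unseen topics start at 0.0.
        self.weights = defaultdict(float)

    def score(self, item):
        # Relevance is the sum of the learned weights of the item's topics.
        return sum(self.weights[topic] for topic in item["topics"])

    def recommend(self, items, k=2):
        # Highest-scoring items first; ties keep their original order.
        return sorted(items, key=self.score, reverse=True)[:k]

    def feedback(self, item, liked):
        # Nudge weights up for liked items and down for disliked ones.
        delta = self.learning_rate if liked else -self.learning_rate
        for topic in item["topics"]:
            self.weights[topic] += delta


items = [
    {"title": "Basketball recap", "topics": ["basketball", "sports"]},
    {"title": "Film premiere", "topics": ["entertainment", "india"]},
    {"title": "New phone launch", "topics": ["technology"]},
]

aggregator = Aggregator()
aggregator.feedback(items[2], liked=True)   # the user liked the technology story
aggregator.feedback(items[0], liked=False)  # the user skipped the sports story
print([item["title"] for item in aggregator.recommend(items, k=1)])
```

Each round of feedback changes the weights, so the next call to `recommend` reflects what the aggregator has learned so far. That is the iterative loop described above, in its simplest possible form.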

Another way of portraying the aggregation operation, using a retail example in which a retailer is trying to market a product or service to someone, is depicted below:


Written by Sumit

October 29, 2009 at 6:56 am

The role of Data and Information Aggregators

“We will see two levels of data and information emerging: Knowledge in terms of facts and Knowledge in terms of insights.” – Eric Schmidt, CEO and Chairman, Google

We’ve organized the data, now what?

Regardless of the approach taken to characterize the webs of core underlying raw data, it will inevitably involve additional overlays of metadata: contextual filters that describe, or contextualize, what the web of data consists of. The foundation has been laid, so the next question is how to build on it. The next step is to analyze this semantic web of data to find patterns in our characteristics and even predict behavior, to ultimately give the right information, at the right place, at the right time. This is predominantly the job of data and information aggregators.

What are aggregators?

Aggregators can take the form of engines that aggregate information based on keyword popularity, web cookies, search and browsing history etc. Google and Bing are examples of such aggregators. Other aggregators today use direct, subjective user-generated content such as ratings and rankings, and stream continuous dynamic feeds of information based on rating/ranking popularity. Aggregators such as Digg, Delicious, Fark and StumbleUpon do this primarily for news, articles, media and some products and services. Other aggregators of data are service and product platforms, news sites, blogs and social networking sites. However, in reality, the industry understands aggregators to be those entities that operate on a more macro level across greater sets of data.

All the examples just noted are deliberate aggregators, but there is a large set of entities that, through their functionality, siphon information to their users and are in a sense indirect aggregators too. For example, a productivity widget such as a stock ticker should automatically populate with the data you’d be interested in, or a company’s product could have the details relevant to you highlighted. The key message is that aggregators, in theory, are basically any entities that have content available for consumption.

All these aggregators have the immense responsibility to consume, understand and then develop insights from their users’ data and behavior. Going a level deeper, we can define the nature of aggregators using the notion of pull and push aggregators.

Push aggregators

Push aggregators target a certain audience and are most common today within the news media, e.g. the Huffington Post and the Drudge Report. Push aggregators can also be search engines, commerce platforms, advertising platforms, social networking platforms, knowledge and news platforms and all other utilities’ and services’ platforms: they are all providing information to users at some level or another, and this targeting is based on understanding user requirements and hypothesizing which data and information to provide. For example, someone might prefer to read only about Indian entertainment news, American basketball and global technology news.

Pull aggregators

Pull aggregators are those with very selective consumer bases who self-select. Examples of such aggregators include Aardvark, Hunch and other specific bulletin boards and siloed information services. These aggregators, which base their algorithms on voluntarily provided user data, gather data and information for users looking for very specific information, such as “How can I fix my DVD player”.
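The difference between the two can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function names `push` and `pull`, the `topic` and `keywords` fields and the sample data are all hypothetical, and real aggregators would use far more sophisticated matching than set membership and word overlap.

```python
def push(feed, profile):
    """Push: the aggregator picks items for the user based on a standing profile."""
    return [item for item in feed if item["topic"] in profile]


def pull(knowledge_base, question):
    """Pull: the user asks, and entries sharing a word with the question come back."""
    words = set(question.lower().split())
    return [entry for entry in knowledge_base if words & set(entry["keywords"])]


# Push: the profile is a hypothesis about what this reader wants to see.
profile = {"indian entertainment", "american basketball", "global technology"}
feed = [
    {"title": "Playoff preview", "topic": "american basketball"},
    {"title": "Local election results", "topic": "politics"},
]
print([item["title"] for item in push(feed, profile)])

# Pull: the user volunteers a very specific question.
knowledge_base = [
    {"answer": "Clean the laser lens", "keywords": ["dvd", "player", "repair"]},
    {"answer": "Restart the router", "keywords": ["wifi", "router"]},
]
print([entry["answer"] for entry in pull(knowledge_base, "How can I fix my DVD player")])
```

In the push case the aggregator initiates the delivery and the user’s profile does the filtering. In the pull case the user initiates it and the question does the filtering.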

Unclassified (neither Push nor Pull) aggregators

There are also some aggregators that are neither quite Push nor quite Pull oriented but sit in the middle. They tell us not what we need to know, and not what we’re looking for, but rather what we would be attracted to. An example of such an aggregator is Demand Media, a storehouse of information on all sorts of topics, aggregated across a series of websites. Some topics are so random that we’d never specifically look for them, yet they are attractive enough for us to be interested in.

Today, aggregators and their functions and operating models are extremely distributed, with little or no integration amongst each other – Google and Facebook do not share data.

There are still limitations to today’s aggregator algorithms in that they base their results on very limited data sets that will not accurately cover our real intent, especially since our activity touches data that lies beyond any one of these disparate data sets. In other words, aggregating across siloed sets of data and building solutions on them is a disservice: the more data inputs aggregators have, and the more these aggregators know about you, the richer the benefits to you will be. The goal is to interpret each customer/user as a unique individual and, by analyzing contextual data, apply a filter that enables the aggregator to shape serendipitous connections with knowledge/information, people and other entities, such as recommending physical locations.

In an ideal world, aggregators will be omnipresent entities that integrate the physical and virtual worlds’ data and information to provide an experience in which relevance and return on attention are maximized. The aggregator would have access to an unlimited amount of data and information, across all boundaries within the virtual and physical planes. That is to say, data from Google, Digg, Facebook, Yahoo, YouTube, WebMD and all other sites, as well as user information across all 3 planes of user data, will be available in a standardized and federated semantic format.

In essence, if this is done properly and conditions such as data federation and privacy are dealt with, push and pull aggregators’ data and information will converge to be the same. This will be highly disruptive to the world as we know it today: the media, services and products industries will all need to re-think their operations and strategies because, with their current models, they will provide increasingly irrelevant information to users.

Written by Sumit

October 26, 2009 at 6:54 am

What we should take away from the Music Genome Project

Case Study: the Music Genome Project

The Music Genome Project is an initiative to map out about 400 qualities, or genes, to describe a specific piece of music within each genre. This is along the lines of a provider being able to accurately describe their services, which in turn enables them to figure out what to provide to their customers; this is currently being done through Pandora. The missing dimension in the equation of mapping user preferences to music for Pandora is to capture user data beyond just historical music preferences and to involve time, geography and mind/body data as well.

Unfortunately, we are FAR from this ideal world state of having data overlays describing our lives in all 3 planes. We will not get there until there is a massive federation effort incorporating every single source of data in all 3 planes, and until the question of how the information will be standardized is addressed. Perhaps there will need to be an equivalent of the WWW for mind/body data, which could be along the lines of the HL7 standards, as well as standards for defining physical data. We are already seeing companies make efforts to contextualize data; however, this is happening in silos (right now Yahoo Fire Eagle and Google Latitude have divergent ways of defining geographic metadata).

Note that the web of semantic data will go beyond just virtual entities: it will also include physical entities in this world, with identities depicted in a standardized way that conforms to how entities in the virtual world are depicted.

It should be noted that there is no single right solution to designing this semantic web of data, and these models will probably continue to go through iterations towards a more feasible and efficient method of characterizing users and the data they consume. One powerful solution has been outlined by Marc Davis, former Chief Scientist at Yahoo!, who has designed the Web4 invention framework, in which the context of data is categorized in terms of “who, what, where and when”.

Written by Sumit

October 25, 2009 at 6:53 am

Using metadata for overlaying the web of provider data

The Semantic Web of Provider Data

In my previous post, I talked about what I thought the metadata overlay of consumer data could look like, and in this post I’d like to briefly overview what the metadata overlay of provider data could look like…

The semantic web of provider data can be described in terms of three core foundational overlays: commerce, utility and leisure. These overlays will exist to describe the products and services being offered in a manner that will enable the providers to more accurately target consumers through marketing and sales.

[Figure: Web overlay for provider data]

Commerce context – commerce predominantly spans the products and services industry, including both e-commerce and physical commerce entities. Context could include characteristics and features that define the type of product/service being offered, as well as features that define the operating characteristics of the business entity. Examples of contextual data could include the types of products, hours of operation of a business, location of a business, price points and various other characteristics of the business.

Utility context – utilities are things that we use to improve our productivity or quality of life. These would include collaboration tools, networking, computing infrastructure solutions and other personal productivity applications and widgets.

Leisure context – this includes games, arts, cultural, sports, recreational and social networking sites.
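As a rough illustration of what one record in such an overlay might look like, here is a small Python sketch. The `Overlay` enum, the `ProviderRecord` class, the field names and the sample providers are all hypothetical; no standard schema is implied, and a real overlay would need an agreed vocabulary for the context keys.

```python
from dataclasses import dataclass, field
from enum import Enum


class Overlay(Enum):
    """The three foundational overlays for provider data."""
    COMMERCE = "commerce"
    UTILITY = "utility"
    LEISURE = "leisure"


@dataclass
class ProviderRecord:
    """One provider plus the contextual metadata that describes its offering."""
    name: str
    overlay: Overlay
    context: dict = field(default_factory=dict)


def by_overlay(records, overlay):
    # Select every provider that is described under the given overlay.
    return [record for record in records if record.overlay is overlay]


records = [
    ProviderRecord(
        name="Corner Bookshop",  # made-up example provider
        overlay=Overlay.COMMERCE,
        context={
            "product_types": ["books", "magazines"],
            "hours": "09:00-18:00",
            "location": "San Francisco",
            "price_point": "mid",
        },
    ),
    ProviderRecord(
        name="Team Wiki",  # made-up example provider
        overlay=Overlay.UTILITY,
        context={"category": "collaboration tool"},
    ),
]

print([record.name for record in by_overlay(records, Overlay.COMMERCE)])
```

Keeping the overlay as a fixed set of values while leaving `context` open-ended mirrors the description above: the three overlays are the foundation, and the contextual data within each can grow over time.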

Written by Sumit

October 22, 2009 at 6:48 am

A utopian approach to organizing data

There is no single right solution to organizing the web of data, and all solutions will ultimately converge into a common framework. This article makes an attempt at defining a semantic web of data. Assuming an ideal utopia in which standardization and federation of data face absolutely no obstacles, an iteration of the semantic web of user and provider data is presented in the following section.

The Semantic Web of User Data

Again, assuming a utopian world, for user data to be organized effectively we would need to characterize human beings’ information and all other entities in relation to our activities in a federated manner. Humans can be represented in terms of three core foundation overlays: virtual, physical and mind/body. Here is a summary of what the semantic web of user data would look like after including contextual metadata from all 3 planes:

Physical context – what occurs in the presence of oxygen. This includes, but is certainly not limited to: geographical space, time, macro events, micro events, social context, meteorological factors, micro-geographical space, type of event and so on.

Perhaps the most pertinent example of measuring physical data nowadays is Google Maps, which records your location when you use it through the GPS on modern devices. Taking this further, there is much more scope for recording data on the physical plane, be it which physical shop you visited, the last time you watched a movie, how many times you visit the gym or the last time you took Highway 1 down the Californian coast.

Virtual context – what occurs in the presence of 0’s and 1’s. This includes, but is certainly not limited to: download speed, available bandwidth, latency, type of machine (Mac or PC), type of browser (Safari, IE or Firefox), website information, average time spent online, browsing history, cookie information, sophistication and power of the local machine in terms of RAM, processor and storage, avatar activity, websites you’ve visited, ads you have clicked through and products you have purchased. It can even get as detailed as the subject matter of your emails, thanks to Google.

We are producing a wealth of information in the virtual world, and how much of it gets consumed depends on what privacy settings you’ve knowingly or unknowingly agreed to, what devices and technologies you’re using and how much you use the internet for various activities. The fact is that there is a virtually unlimited amount of information about us, owned by Google, Yahoo, Apple, Nokia, Microsoft, Garmin, Sony PlayStation, Nintendo and so on.

Body/Mind context – what occurs in the psychological and biological realm of our minds and bodies. This could include: vital body signs, neuron activity, mental mood and so on.

How this data gets recorded depends on the sophistication of portable devices and the advancement of interfaces between doctors’ offices, clinics, pharmacies etc. and the virtual world. It could come from watches with blood pressure sensors, from portable devices that measure vital signs and link up to the internet, or from something as simple as an interface between your doctor’s EMR system and a 3rd party medical information aggregator. There is much scope for this and we have barely touched the tip of the iceberg.
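To picture how the three planes could sit side by side in a single user record, here is a minimal Python sketch. The `UserContext` class, its `flatten` method and all of the keys and sample values are hypothetical, chosen only to echo the examples above; a federated format would need standardized names for each of these fields.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class UserContext:
    """Contextual metadata about one user, grouped by plane."""
    physical: dict = field(default_factory=dict)
    virtual: dict = field(default_factory=dict)
    mind_body: dict = field(default_factory=dict)

    def flatten(self):
        """Return one dict with keys prefixed by plane, e.g. 'physical.city'."""
        flat = {}
        for plane, values in asdict(self).items():
            for key, value in values.items():
                flat[f"{plane}.{key}"] = value
        return flat


context = UserContext(
    physical={"city": "San Francisco", "last_gym_visit": "2009-10-20"},
    virtual={"browser": "Firefox", "machine": "Mac"},
    mind_body={"resting_heart_rate": 62},
)

for key, value in context.flatten().items():
    print(key, "=", value)
```

Flattening the planes into one namespace is just one possible design choice. It makes it easy for an aggregator to correlate, say, a physical location with a browsing pattern, which is the kind of cross-plane analysis described above.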

Written by Sumit

October 21, 2009 at 6:46 am

Before thinking relevance…we need to think about organization

Now that we have established that some action needs to be taken, what should that action be and how will it occur? Before answering this question, it is important to level set where we, as a global community, are headed: The not so surprising assumption is that we want to positively change the world we live in through increasing innovation and efficiency in all aspects of our life be it personal, social or business related. To do this, we need to look towards structure and organization because it is in laying this foundation that one opens up many possibilities for innovation.

Users and Providers

Being able to discover and deliver relevant data is a complex notion that will require widespread changes in how data is organized and stored. Being able to decipher what users want and need will entail a sophisticated understanding of the user, as well as of the providers. For this article, the notions of user and provider are being treated very loosely, especially since there is only a very thin gray line demarcating the two parties. In most cases, a “User” can be thought of as us, human individuals consuming some data, information, products and services provided by a “Provider”. Where the boundaries get blurred is when data produced by the users gets consumed. For example, as a customer of a website, we’re consuming data provided by the website; however, we’re also producing data for the website to consume: products we browse, pages we visit, links we click and so on. All this data that is being stored, both user data and provider data, needs to be organized with context so that some sense can be made of it.

Building on the Semantic Web of Data

This organization will manifest as one of the key elements of what is popularly known today as the semantic web of data. That is, a set of overlays to contextualize and describe the baseline web of data described earlier in this document. Being able to contextualize data is the foundation from which computers can better understand user characteristics as well as provider characteristics, and in turn do a better job of providing relevant data and information flows.

The impact this will have on our lives: the right data, at the right place, at the right time

Being able to understand user characteristics as much as possible will improve the user experience, enable higher productivity, heighten innovation and, last but not least, influence the ability to bring serendipitous moments into users’ lives. For this to happen, it is not enough to just be able to understand the user characteristics. Thinking of this metaphorically as a relationship of give and take, it is extremely important for the providers of data to be able to contextualize what is being provided in a similar way. Being able to do this will enable relevant data delivery: the right data, at the right place, at the right time.

Written by Sumit

October 18, 2009 at 6:44 am

My opinion on current efforts to make the web a more relevant place for us…

Moves are being made to rectify this potentially messy situation. For example, incumbent search engine algorithms are continually iterating in an effort to re-evaluate our search intents based on keyword popularity and cookie information about certain online activities. Pandora filters music streams to users based on their historical song preferences, while online retailers try to gauge what we may be interested in buying based on what we’ve browsed before. Many social networking websites and entities such as Digg, Delicious, Aardvark and Hunch try to gauge our interests more directly by explicitly soliciting feedback.

However, the algorithms employed today still yield imperfect results too often, and it can be argued that as the base set of data expands across the internet, these algorithms will only get marginally better, and perhaps even worse, over time. Perhaps the reason is that they are focused on an extremely limited set of data from which to discover intent and learn more about us. This data is limited in a couple of ways:

1. Whatever data is captured is not shared. Data exists in silos for each web company. This means each company has an extremely limited view of our activity, unless of course theirs is the only website we frequent, which is probably not the case.

2. To truly discover intent, and figure out more about us, we will need a much larger data set that reaches beyond just the online world. Data from the physical world, such as location, is already being gathered; however, this is extremely nascent and there is much more scope. The development of sensor networks will go a long way in supporting this. Another realm for data could be data related to our mind/body: such data, if correlated correctly, can provide some extremely disruptive solutions to the various providers of data.
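The effect of the first limitation can be shown with a toy Python example. The three silos, the `topic` field and the helper names `interests` and `merge_silos` are all hypothetical; the point is only that each silo on its own sees a sliver of activity, while the combined view tells a clearer story.

```python
from collections import Counter


def interests(events):
    """Count how often each topic appears in a list of activity events."""
    return Counter(event["topic"] for event in events)


def merge_silos(*silos):
    """Combine the events held by several silos into one list."""
    merged = []
    for silo in silos:
        merged.extend(silo)
    return merged


# Each company only holds the activity that happened on its own site.
search_silo = [{"topic": "travel"}, {"topic": "cooking"}]
social_silo = [{"topic": "cooking"}, {"topic": "music"}]
location_silo = [{"topic": "cooking"}, {"topic": "travel"}]

# Seen alone, the search silo cannot tell travel and cooking apart.
print(interests(search_silo))

# Seen together, cooking clearly stands out as the strongest interest.
print(interests(merge_silos(search_silo, social_silo, location_silo)).most_common(1))
```

Counting topics is of course a crude stand-in for discovering intent, but it shows why a wider data set changes what an algorithm is able to conclude.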

The underlying issue is that context needs to be understood: the context of the user activity that results in data production, as well as the context of user consumption.

Written by Sumit

October 17, 2009 at 6:41 am