Google's Data Asset
Tim O’Reilly has been saying for several years that data is the Intel Inside of web services. I am not sure the analogy is completely accurate. The microprocessor plays a different role in a personal computer than data does in a web service. But it was a catchy line, and it made the central point that data is the heart of a web service. Tim’s recent post quotes Marissa Mayer at Google acknowledging that data is critical to Google’s ability to provide more and more contextually relevant web services. But, the most striking revelation may be that Google is providing 411 services in order to capture phonemes that would enable it to do a better job of searching video (presumably by using voice recognition on the audio track). If this is true, it means that they are providing a fairly complex and expensive service simply to capture a data element that would help them to provide another service. Looked at through this lens, Google is one huge data magnet. All of the services they provide are collecting massive amounts of data.
I do not mean to imply that there is anything wrong with this strategy or with Google’s motives. It is only by collecting this data that they can provide many of the services they offer. Most of the data they collect is not that personal – collecting the inflection in my voice when I make a 411 call (my phonemes) does nothing to harm me – I had no plans for those phonemes anyway. I am happy to contribute them. Many of Google’s other uses of data are equally harmless. Google’s use of my click-through patterns to increase the relevance of future search returns or to adjust the presentation of paid search text ads is fine with me, but there is something here that I think we need to pay attention to.
Data has this really weird quality. In economic terms data has an increasing marginal utility. Anyone who took Econ 101 knows that most physical objects have a decreasing marginal utility. When it is raining my first umbrella keeps me dry, a second may be handy if the first blows out, but a third is unlikely to be used. This is true of shirts, steaks, houses, of almost anything you can think of except data.
Data has the opposite characteristic. Each incremental point of data adds value to the ones you already have. It is easy to see this in the context of an advertising network. If the ad network knows that a user is female, it can show more relevant ads. But if the ad network also knows that user’s age, it can do even better, and data about location, household income, and recent web sites visited all add value to the existing data points, making it possible to show more and more relevant ads. Google’s services all benefit from additional data, albeit in different ways.
So what does all this mean for the market for web services? It means that we all need to begin to think about the degree to which Google’s enormous data asset will allow it to dominate this important sector.
We have, for example, been paying a lot of attention to services that help users discover new information and filter the information they are already trying to consume. The young companies we have looked at in this space all approach the problem differently, but they all depend on amassing data about users’ reactions to information and services in order to improve their ability to anticipate what a user might be interested in seeing. When Google announced their recommendations feature for Google Reader, we had a flurry of discussions about how this would impact the opportunity to provide discovery services. Google’s recommendations feature itself was not that impressive or immediately useful, but just as Microsoft’s entrance into a PC software market (often with an initially inferior product) changed the prospects for a startup, Google’s addition of recommendations to Google Reader is a shot across the bow of anyone in the filtering or discovery business. The source of the threat here is a data differential. Google has so much more data at their fingertips that even if a startup does a much better job leveraging data to deliver recommendations, Google could potentially provide a better value proposition to the end user with an inferior algorithm powered by more data, sourced from a broader range of services.
I have to admit that I do not know yet how dominant Google could be in web services, or if their dominance would dampen innovation and hurt consumers, but my bias as a venture capitalist is to believe that innovation thrives in small businesses and is often muted in large organizations. So, I think it is time that we all began to think about how to promote innovation in a world dominated by Google’s massive data store. Open source and the shift to a web based applications architecture reduced Microsoft’s influence and enabled a new round of innovation on the web.
Google understands the leverage of data. In the one area where they do not have the largest data asset, social networking, they have launched the Open Social initiative to try to make that data accessible to them and to others. It will be interesting to see how Open Social plays out. I am not convinced that it will alter the balance of power in the social networking space. Open source by itself did not have a huge impact on Microsoft. It was open source in combination with a platform shift from the PC to the web that opened up innovation on the web. Microsoft still dominates the PC platform. We need an open data movement, but that may not be enough. We may also need a platform shift. The web seems so much like an end state that it is hard to imagine what that platform shift might look like or when it might happen. I am not going to predict the nature or the timing of this platform shift, but I will point out one thing. The data that drives all of the most valuable web services is contributed by users as they interact with these services. The shift that unlocks another era of innovation will occur when users begin to understand their role in this ecosystem and have the tools at hand to direct what is now an unconscious contribution in a way that ensures continued innovation on their behalf.
December 21, 2007 04:07 PM, By Brad Burnham
Tags: 411 data free google oreilly phonemes
Comments (19)
All this seems to proceed from the idea that out of an abundance of data comes more useful information. As late as the 70's it was still thought that if enough sensors could be distributed throughout the atmosphere, and if a powerful enough computer were used to process all that information, we would be able to predict the weather accurately for months in advance. Wrong.
There's another shift about to take place that will change everything about the nature of files, that will make data mining impossible. It is being driven by the desire for privacy, but it is far above and beyond and infinitely more powerful than encryption. Yet, it is simpler and when it begins, Google won't know what to do.
Posted by d , December 22, 2007 10:17 AM
There is no way data has "increasing marginal utility." Clever idea but simply false. Most of what Google is trying to do is make inferences about specific parameters (e.g. your tastes). That's basically a Bayesian inference problem, which has the property that the first bit of information is a lot more valuable than the next bit. Google figured me out years ago; my search history barely matters anymore.
Now, you are on to something. It just doesn't have anything to do with marginal utility. It's more like a natural monopoly.
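The diminishing-returns property claimed above can be illustrated with a minimal sketch. It assumes the simplest textbook case, estimating a single unknown mean from noisy observations with a normal prior and known noise variance. The function names and the default variances are hypothetical, chosen only for illustration, and say nothing about how Google actually models users.

```python
def posterior_variance(n: int, prior_var: float = 1.0, noise_var: float = 1.0) -> float:
    """Posterior variance of an unknown mean after n noisy observations.

    Standard normal-normal conjugate update: precisions (1 / variance) add.
    prior_var and noise_var are illustrative assumptions.
    """
    return 1.0 / (1.0 / prior_var + n / noise_var)


def variance_reduction(n: int, prior_var: float = 1.0, noise_var: float = 1.0) -> float:
    """How much uncertainty the n-th observation removes (n >= 1)."""
    before = posterior_variance(n - 1, prior_var, noise_var)
    after = posterior_variance(n, prior_var, noise_var)
    return before - after


if __name__ == "__main__":
    # Each extra observation of the same kind removes less uncertainty.
    for n in (1, 2, 10, 100):
        print(n, variance_reduction(n))
```

Under these assumptions the first observation removes half of the prior uncertainty, and every later observation of the same kind removes less than the one before it.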
Posted by Googster , December 23, 2007 02:03 AM
Extending Googster's analysis, the relationship between data and prediction is pretty thoroughly described in mathematics, though in cases where all resources are strictly bounded in some fashion (i.e. reality) the relationship is a lot more complex and has interesting ramifications.
Generally speaking, prediction is computationally intractable in its purest form so we use all manner of approximations to make predictions. While it is true that early bits are worth more than later bits, slight differences in algorithm design and quality can generate significant differences in average predictive quality down the road. (There are a number of interesting theoretical caveats to this, but mentioning them would lead too far afield.) The datasets of most types are often large enough that the differentiator is the algorithm, and the optimal algorithm for the general case is known to be non-computable. It is worth pointing out that a lot of the mathematics behind this is relatively recent, and it is not an area many people are deeply familiar with.
So what does this mean? It means that at any moment, some company could materialize with a substantially better representation algorithm and eat Google's lunch, at least as far as prediction goes. While Google's algorithms are theoretically far from optimal in many regards, few companies have been able to exploit this by making significant improvements in the necessary areas of computer science. The article is correct in that you can't beat Google nibbling around the edges, you have to use computer science they don't have. The good news is that improvements in the areas of computer science required are eminently doable in the sense that we know the current state-of-the-art is substantially improvable for practical systems.
And lest it sound hypothetical, I know of one venture doing a Series A now with the requisite computer science IP to do the job, and possibly another early one that is apparently attacking another one of the computer science avenues to displacing Google at least in theory. Of course, if the data all gets locked up inside Google or whoever, *then* those companies start to gain significant monopoly leverage. On the other hand, Google et al have barely scratched the surface of what can be done -- at least in theory -- with the vast public body of data so that is less of a roadblock than it may sound at first.
Google is in a strong position but it is not quite as invulnerable as it is sometimes portrayed, at least not with regard to prediction and data mining, and they definitely do not have a lock on the people who are generating most of the computer science that will inevitably make Google's current prediction and mining methods obsolete. Still plenty of churn left in that market. At least in theory.
Posted by Andrew , December 24, 2007 03:35 AM
The platform shift you mention is already underway: it's the Mobile Internet, where none of the big players have a major slice of the pie secured yet.
Google's mobile strategy has just started. You can see it in the new Google Maps, and in the potential game changer that Android can be.
Posted by vruz , December 24, 2007 04:39 AM
Given that "prediction" can mean different things in different professional contexts forgive me if I don't see how that relates to Google's public products with embedded data collection to date.
As long as Google can successfully resist agreeing to standards by offering their own unique open platforms to enthusiastic participants while continually creating data gathering systems fueled by enthusiastic users, they've got a great shot at global dominance.
As we continue our flow from pc to web to mobile devices to physical/natural environment to human body, Google can just keep moving with us.
Posted by Clyde Smith , December 24, 2007 08:52 AM
Malcolm Gladwell wrote a great New Yorker piece on this topic:
http://gladwell.com/2007/2007_01_08_a_secrets.html
It is subtitled "Enron, intelligence, and the perils of too much information"
Posted by Nick Davis , December 24, 2007 09:53 AM
In financial markets, there's a spike in returns for knowing the next piece of information that's going to impact the stock that's not yet factored in the price.
(Insiders make better decisions on average, but plenty of insiders have gone down with a sinking ship, and a fast way to go broke is with inside information. There's no value if it's already in the price, and there's no value if it's not going to get priced in before something more important happens)
Similarly, that one piece of information no one else has that puts someone in a specific psychographic, or shopping for a particular product in a particular location, is worth a (big) spike in returns.
If Google can keep their monopoly on that information, they will keep generating superior returns. That's different from saying there are continuously increasing returns.
(Also, while there might be increasing returns to knowing your location, income, and amount spent on CDs and each can be stored as a number, the complexity of acquiring each succeeding number rises exponentially, and in that sense they contain more information in the same number of bits)
Posted by Zippy McSpliff , December 24, 2007 01:11 PM
I 100% agree that data has increasing marginal utility... and I think much of that utility has yet to be tapped.
There's significant inherent value in the metadata buried within the data itself that we are just beginning to see exploited. There are two kinds of meta-data: explicit, which we're starting to do things with, and implicit, which almost no-one has done much with as of yet.
I think much of what will be interesting in the next 2-4 years will be driven by the exploitation of implicit data. Interesting times are ahead!
Posted by fewquid , December 24, 2007 02:47 PM
I believe what we're looking at is that "Data is a product" or "Data's value as an intangible asset is non-linear." The behavior we're seeing around data stores in the market appears to be that of sigmoid curves, such as those associated with product lifecycles. Please see the diagram here: http://sphericalmusings.blogspot.com/2006/11/sigmoid-curve.html
Google is doing a good job in hopping from one data product lifecycle to the next, in a way that (among large enterprises) only 3M, Intel, and Microsoft have done in the past. It requires avoiding the innovator's dilemma by practicing self-cannibalization. Self-cannibalization requires consistently sub-optimizing short term cash flow which has been difficult for Yahoo, et al.
Posted by Scott Rafer , December 24, 2007 11:20 PM
Google seems to be making this bargain: I'll give you free software/services (Search, Docs, 411, etc.) if you give me free data. Why else would they be building massive data centers?
The sniff test seems to say that Google wants to lead the data-driven future, so to them, on a macro level, more data is more valuable.
Posted by Don Jones , December 24, 2007 11:27 PM
Great post! Information is often a key asset in an organization. Back in the old days, the retailers usually ended up with the customer knowledge, and the supplier had to rely on the retailer for this information.
Posted by Consulting , December 27, 2007 07:46 AM
When Google starts incorporating the 23andMe data (http://www.wired.com/medtech/genetics/magazine/15-12/ff_genomics) - then we'll have something to worry about.
Only half joking, as a 23andMe founder is married to Sergey Brin.
Posted by Chris Ceppi , December 27, 2007 08:28 PM
The applicable concept from economics is that of a supermodular production function. Let x and y be two different types of data and let f(x, y) stand for the value that Google can derive from the data.
As a previous commenter pointed out, because of the characteristics of Bayesian inference (which lies at the heart of what Google does) it will be the case that the marginal benefit from an increase in x is declining, i.e. f(x' + e, y) - f(x', y) < f(x + e, y) - f(x, y) for x' > x. Suppose you want to serve targeted ads and you have access to a stream of searches by an individual, then the first observation of the term "bmw" has more benefit in identifying that individual as a potential car buyer (and serving a car ad) than the second observation of a similar term, say "mercedes".
Now suppose that you also have access to a second type of data on the individual, say geography, then a supermodular production function says that having more of that second type of data makes every additional bit of the first data more valuable, i.e. f(x + e, y') - f(x, y') > f(x + e, y) - f(x, y) for y' > y. In the example, the marginal benefit from seeing the second search term ("mercedes") increases if you happen to also know that the individual is located in Greenwich, CT, because now you can serve up a localized car ad (for which the dealer is sure to be willing to pay more).
The fact that Google's production function is almost certainly supermodular poses a big challenge for potential competitors. It means that even if you succeed in gathering more data of one type, the incremental value of that data for you may be a lot less than for Google which can combine it with lots of other types of data.
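The two inequalities above can be checked numerically with a minimal sketch. The value function f(x, y) = log(1 + x) * (1 + y) used here is a hypothetical stand-in, picked only because it satisfies both properties; nothing in this comment specifies Google's actual production function.

```python
import math


def value(x: float, y: float) -> float:
    """Hypothetical production function f(x, y): an assumed form, for illustration only."""
    return math.log1p(x) * (1.0 + y)


def marginal_gain(x: float, y: float, e: float = 1.0) -> float:
    """f(x + e, y) - f(x, y): the value of e more units of data type x."""
    return value(x + e, y) - value(x, y)


def has_diminishing_returns(x_low: float, x_high: float, y: float, e: float = 1.0) -> bool:
    """True if extra x is worth less when you already have more x (y held fixed)."""
    return marginal_gain(x_high, y, e) < marginal_gain(x_low, y, e)


def is_supermodular_at(x: float, y_low: float, y_high: float, e: float = 1.0) -> bool:
    """True if extra x is worth more when you hold more of data type y."""
    return marginal_gain(x, y_high, e) > marginal_gain(x, y_low, e)


if __name__ == "__main__":
    print(has_diminishing_returns(x_low=1, x_high=5, y=2))
    print(is_supermodular_at(x=1, y_low=0, y_high=3))
```

With this assumed form, both checks hold at the same time: each extra unit of x is worth less than the last, yet worth more the more y you already hold.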
Posted by Albert , January 5, 2008 03:45 PM
I analyzed this for my economics class today.
As marketing data increases, total utility increases, not MARGINAL utility. The more data collected, the more total utility. However, as more and more marketing data is collected, each category begins to add less and less value to one's marketing research, and marginal utility decreases.
This follows the first basic assumption of consumer behavior: more is better! The more data collected, the greater the utility.
The fact that marginal utility decreases follows the law of diminishing marginal utility. The more categories of marketing data collected, the less value the information adds.
Posted by Greg K , February 20, 2008 08:55 PM
That is what we are doing with our start-up. Using contextual data, we are able to collect valuable data on the web and index it with our contextual algorithm, giving users a powerful front end for finding their content on the web.
Posted by Halal K , March 23, 2008 11:23 PM
I don't think the marginal utility increases. I just think the "fat head" of the demand curve is going to get fatter and fatter.
Posted by Michael F. Martin , June 4, 2008 01:01 PM



Hi Brad,
Great post. This really resonates with what we are doing for our own start-up.
I believe that Google's data gathering abilities can be matched by interpreting a user's imprints while they use services across the web.
For example, how can the related tags in Delicious or Twitter conversations be correlated to find something meaningful? This would be more than a mashup that brings two services together; it would mean actually correlating the data to do something completely different.
Another example: why can't people crawl, index, ingest and analyse the web's podcasts to get voice accents? The data is already out there on the open web.
I think the data across all these web services is richer and deeper than just what Google has. So, if we take a site-by-site approach and then correlate, maybe Google's data advantage can be nullified.
FriendFeed has taken a site-by-site approach to interpret data from these services and deliver useful information to users.
Why can't other services also utilize the data to make their own service richer?
Nik
Posted by Nik , December 22, 2007 09:01 AM