Leo Polovets of Susa Ventures joins Nick to cover The Value of Data, Part One. We will address questions including:
- As a startup investor, why did you decide to write your series of articles on the “Value of Data?”
- You released this series in three parts. At a very high level, can you give an overview of the three sections?
- The first section has to do with the increasing value of data and how it’s now being used as a competitive advantage. Can you start off by walking us through the historical evolution of data and how hardware and software have become commoditized?
- From your perspective, as this commoditization progresses, the sustainable competitive advantage becomes DATA. You’ve talked about the creation of Data Moats and how data sets become a closely guarded advantage. From your list of 10+ ways in which companies apply data to create more defensible products, can you pick three or four of the more common approaches and cite example companies that are employing these approaches?
- Moving on to the next portion of the series, the focus is on accumulation of data and potential pitfalls… Can we start with the four ideal attributes of data sets that should be front-of-mind during the collection process?
- You go on to outline the collection process and the five major sources of data. What are each of these sources and how do they go about acquisition of data?
- What tips would you give to any startup that has a big-data or data as a competitive advantage focus?
- Finally, wrapping up the section on data accumulation, you cite the pitfalls and caveats. What are the major questions founders should be asking themselves in order to avoid building data sets that have much less tangible value than intended?
- Part 2 of the Interview
- Leo on Twitter
- Leo’s Article- The Value of Data, Part 1: Using Data as a Competitive Advantage
- Leo’s Article- The Value of Data, Part 2: Building Valuable Datasets
- Susa Ventures
Nick: Today I’m meeting with Leo Polovets in San Francisco. Leo is back to talk about his excellent series on “The Value of Data”. Leo, welcome back.
Leo: Thanks, Nick. Happy to be here.
Nick: First things first. As a startup investor, why did you decide to write your series of articles on “The Value of Data”?
Leo: I think there are two main reasons. The main one is that when founders think about competitive advantages, a lot of times they’re looking at things like execution or their experience in an industry. Those are good advantages, but they’re not as durable as some other advantages. And specifically, I think data has been a very good, resilient competitive advantage for a lot of companies, and people tend to underappreciate that sometimes. So I wanted to write a post discussing data: why it’s valuable, how it can help you build your business and, you know, fight off competition in the long run. That’s the primary reason. The secondary reason was that my partners and I believe it so strongly that it’s kind of the core focus of our fund: we like companies with an interesting data play behind them. So that was kind of a more self-serving reason for writing the posts.
Nick: Is it a central necessity for your investments, or is it more of a nice to have that you look for in certain types of companies?
Leo: I think it’s somewhere in between. So it’s not mandatory. We’ve done some investments that don’t quite have a data angle, but those are the minority; they might be like 20 or 25% of our portfolio. For the rest, we would look for that data piece to be there.
Nick: Leo, you released this series in three parts. At a very high level, can you give an overview of the three sections?
Leo: Sure. So the first part was basically the historical reasons why I think data is becoming more and more important, and then also the ways in which data can be useful to a company. The second part was how to tell whether the data you’re collecting can be valuable, and also some tips for how to collect that data and get more value out of it. And then the last part was about business models that basically use data at their core.
Nick: So this first part has to do with the increasing value of data, and how it’s now being used as a competitive advantage. Can you start off by walking us through the historical evolution of data and how hardware and software have become commoditized?
Leo: Sure. So, I think a lot of these are things I observed during my period as a software engineer. I was a software developer from about 2002 or 2003 until about 2012. At my first job, I worked at LinkedIn; I started back in 2003. And when I joined, it was a very different time from now. A seed round was basically a Series A. You had to go raise like $5 million. Then you had to go hire like 10 engineers to build all the software from scratch. And you had to buy servers and put them in a colocation center somewhere. In some ways, a lot of the competitive advantage was that you could raise $5 million to build these things, and someone else had to do that before they could even, you know, go against you. If you fast forward maybe 5 or 6 years, I was at another startup in LA called Factual, and the time was already a little bit different. There was more and more open source software. You still had to build a lot of things yourself, but now a lot of core infrastructure was already pre-built and available, pre-packaged. You had things like Hadoop, you had a lot of open source databases, there was Solr for search. And also I think that was around the time when Amazon came out with their cloud. So now instead of buying your own servers, assembling them, maintaining them, you could just kind of have a turnkey solution. And so that made the hardware and software side less of a competitive advantage. And I think if you fast forward to now, that’s only become more dramatic. So now, not only do you have this core infrastructure like databases that you can use just off the shelf, there’s also a SaaS service for everything, from payroll to machine learning recommendations to whatever you want. And so I think over time, being great at building servers or building software, you know, those are still valuable things.
But they’re not as defensible as they once were, in my opinion. And as companies reach bigger and bigger scales over time, like now you have Twitter and Facebook with like half a billion users, a billion users, they’re collecting so much data. And that data has a lot of value in it in terms of understanding users, building features, building recommendations. So now I think the real competitive advantage belongs to the companies that collect this data and know what to do with it. Even now, if you build a better social network than Facebook, first, it’s hard to get your friends to switch. But even if they switch, Facebook has done such a good job of knowing how to suggest posts to you and how to get you engaged that basically anyone who’s just starting out is almost dead in the water, because they don’t have all that historical data to work from.
Nick: It’s crazy to me how much information a lot of these social platforms have about individuals, not just self disclosed information on profiles like here’s my age and gender, but every tweet or every post, and then every connection that you follow or are followed by, I’m sure informs this sort of bigger data engine that understands people sometimes even better than maybe they understand themselves.
Leo: Absolutely. And I think sometimes it’s even more subtle. Sometimes you explicitly do things, like you “like” some band, so Facebook knows you like that band. But sometimes it’s actually very subtle: they show you five news stories and you click on the third one but not the other four. So now they know that whatever the third one’s about is kind of interesting to you and the other four are not. And if you do that a thousand times, they have a really good sense of what kind of things you like and don’t like.
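As a rough illustration of the click signal Leo describes, the sketch below tallies which topics a user clicks on out of the stories shown. The function name, the topic labels, and the log are hypothetical, and a real feed ranking system would be far more involved; this only shows how repeated impressions turn into a preference estimate.

```python
from collections import Counter


def topic_click_rates(impressions):
    """Estimate interest per topic from (topic, clicked) pairs.

    Each pair represents one story shown to the user and whether
    the user clicked it. Returns clicks / impressions per topic.
    """
    shown = Counter()
    clicked = Counter()
    for topic, was_clicked in impressions:
        shown[topic] += 1
        if was_clicked:
            clicked[topic] += 1
    return {topic: clicked[topic] / shown[topic] for topic in shown}


# Hypothetical log: five stories shown, only the third one clicked.
log = [
    ("politics", False),
    ("sports", False),
    ("music", True),
    ("finance", False),
    ("travel", False),
]
rates = topic_click_rates(log)
```

With a single round of five stories the estimate is crude, but appending a thousand such rounds to the log sharpens the per-topic rates, which is the effect described above.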
Nick: It’s like Pandora a little bit, huh?
Leo: Yeah exactly.
Nick: So, from your perspective, as this commoditization progresses, the sustainable competitive advantage becomes data. You’ve talked about the creation of Data Moats and how data sets become a closely guarded advantage. From your list of 10+ ways in which companies apply data to create more defensible products, can you pick three or four of the more common approaches and cite example companies that are employing these approaches?
Leo: Sure. So, I think one area is better recommendations. You mentioned Pandora, and I think Pandora does this at their core. Netflix also does a great job of this: they have a lot of content, but I think a lot of their value is actually understanding what you like, showing you more content that you like, and keeping you engaged. All the streaming services like Spotify and Pandora have done really well because you can basically press play and just listen for hours, and they know what you like. I think if they made you create a playlist or made you buy specific songs, rather than helping you discover new songs, they’d get a lot less engagement. So recommendations is one area. I think another area is improved efficiency, and this can manifest itself in different ways. It might be knowing how much inventory to keep, if you’re Amazon. If you’re a company like Uber, it might be more about logistics: understanding how many cars you need at this time of day and, given these requests, where you should have the cars. And again, this is an interesting advantage, because you could almost imagine that if some company wanted to compete against Uber, even if they hired the exact same number of cars and drivers, they’d still struggle to know the most efficient way to dispatch those cars and drivers. You could imagine that a few hours into the day, all the drivers are on one side of the city and the other side’s underserved, or something like that. And I guess a third area would just be general predictions and modeling. A lot of companies related to finance use this: for pricing insurance, trying to predict fraud, trying to do credit scoring. On the credit scoring side, we have a portfolio company called LendUp that basically makes a better version of payday loans, where they’re using data to make better credit decisions.
They’re getting people better rates because they have a lot of historical data to learn from and to really understand what you can and cannot borrow and pay back. On the fraud side, there’s a company called Sift Science that’s doing fraud detection as a service. They collect transactional data from a lot of merchants, and because they get this view across a lot of merchants, they are much more effective at fighting fraud than if you just looked at the data for a single merchant. And for insurance, for example, there’s Climate Corp, which collects sensor data, basically weather data at like an acre level instead of a zip code level. Because they know exactly what the weather is like and how that affects crops, they can give farmers much better-priced insurance, because they can estimate the climate risk much more accurately.
Nick: So do you find a lot of companies are sharing data? Are there privacy considerations? Or is this data that belongs to them as long as they keep the anonymity of the consumers then they’re free to wield it how they choose?
Leo: I think it’s a mix. For consumer products, privacy is important, so usually it’s harder to share personal data. But looking at data in the aggregate, let’s say that 14-year-olds like this band but not that band, is generally seen as okay. And I think also, you know, personalizing something to you is okay, but using your personal data for someone else is less okay. On the B2B side, it varies a little bit more. It tends to come down to what you can get into the terms of service that your customers would accept, given what they’re paying you. And a lot of that also comes down to the fact that aggregate data is usually easier to get access to than specific data. Especially if you’re catering to several companies in a vertical, typically a company doesn’t want you using its data to help a competitor. So it tends to be a little bit trickier.
Nick: So, moving on to part two in the series, the focus is on accumulation of data and potential pitfalls. For this part can we start with the four ideal attributes of data sets that should be front-of-mind during the collection process?
Leo: Sure. So there are four attributes that I think make a data set very valuable. The first one is that it’s hard to build. I think this one’s pretty self-explanatory: if it’s a data set that’s really easy to get, then the fact that you have it is not a big deal, because someone else could get it pretty easily. The second attribute is that the data is actually clean and accurate and up to date. And each of these is important. For example, on being up to date: knowing the businesses in your town is great for, let’s say, a local, location-based app. But if your business listings are two years old, maybe a lot of those businesses have closed and new ones have opened. Or maybe you have an address for a place but they moved, so your data is not accurate. You really want to make sure your data is high quality and as up to date as possible. The third attribute is that the data is useful. And again, I think this is pretty self-explanatory, but some data sets are just more useful than others. Amazon, for example, has a lot of user purchase data, and that’s great for building recommendations and understanding what kind of products they should introduce. They also have some data sets that are probably not as useful, like, let’s say, shoe sizes for people. That’s great, but there aren’t a lot of products you could build on top of that. I don’t think you could build a startup that sells shoe size data to other startups, whereas you could obviously sell purchase history to other startups and everybody would love to pay for that. And finally, the fourth attribute, kind of a bonus attribute, is that the really valuable data sets get even more valuable as they get bigger. You can imagine Yelp reviews, for example: having one review for each place in the US is great, having ten is better, and having a thousand just makes the data set much more valuable.
And so if you have a data set where people, or maybe machines or APIs, are contributing to it over time, and that makes it more valuable, I think that further augments what you can do with your data and how much you can monetize it.
Nick: During like a very early stage startup evaluation, are you looking for a founder to be able to articulate how they’re going to collect, the maintenance strategy, the critical mass of data that’s required to get these valuable insights or is it often too early with the types of companies that you are evaluating?
Leo: I think it’s often too early in terms of implementation. But I do think somebody can think through a lot of these things before they actually happen. So they might not actually be collecting the data yet, but they should have some good ideas for collecting it. And once they have it, they should have some good ideas of what they can do with it. I think sometimes people will hand-wave one or the other. They’ll say something like, we’ll collect this data and then, you know, I’m sure it will be useful, we’ll figure it out. Which is a lot like saying we’ll figure out monetization later. Sometimes that works, but a lot of times it’s better to think about it earlier. And on the flip side, sometimes people think, oh, this data is really valuable, and here’s what I would do if I had it. But they don’t actually think about how they would collect it, or how they would make sure customers don’t mind them using it for recommendation systems, things like that.
Nick: You go on to outline the collection process and the five major sources of data. What are each of the sources and how do they go about acquisition of data?
Leo: So there are five main categories. The first one is direct collection. This might be just asking users in a form, like what do you think of this, who is your favorite band, or maybe it’s having them click a Like. But it could be surveys, or it could be collecting data from hardware sensors. Basically just collecting data directly. Another approach is to use crowdsourcing, which is what companies like Yelp do. Glassdoor also does it for salaries and interview questions. With that model, sometimes it’s very explicit: you have to share your data to see other people’s data, which is what Glassdoor does. Sometimes it’s more like Yelp, where there are a lot of people reading the reviews and a core set of people writing reviews, and they get some kind of incentive or personal satisfaction for doing it. Basically the database gets filled by the users rather than the company, so to speak. Another approach, closely tied to this, is paid crowdsourcing. Sometimes it’s hard to get people to contribute data for free, so you pay them. This could be with oDesk jobs, it could be using Mechanical Turk, it could be an outsourced team that you have. Basically paying people to fill out valuable data sets for you. Another approach, and this one’s probably my favorite, is to use something called data exhaust, which is basically data collected during the normal usage of a product. So like I said with the example of Facebook seeing which of five news links you click on: rather than just marking that news link as an extra click, they can also note which keywords attracted you and which keywords in the other articles weren’t interesting. And a lot of startups can do that. For example, maybe you’re building an accounting tool. It’s more of an administrative tool, just a process-streamlining product. But as you collect data, you could see what kind of things people buy together.
Maybe you can get partnerships in the industry: people that buy X want to buy Y, so you partner with somebody that sells Y. So there are a lot of interesting plays you can make with the data you might be collecting just from monitoring users as they use your site or your product or your mobile app. And finally, I think one other approach to collecting data is to look at existing data sets and combine them. A lot of times, the more data sets you can combine, the more valuable the result is. For example, maybe one data set has a list of the music people like. Another data set has their incomes. Those two things separately are interesting, but they might be even more interesting together, because maybe if you know somebody is wealthy, you recommend concerts, but if they’re not as wealthy, you recommend a streaming service. And so one approach to building a valuable data set is to take five or ten or fifty other existing data sets and then, through APIs and algorithms and data processing, combine them into one big data set that’s more useful.
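A minimal sketch of the music-plus-income example, assuming two hypothetical data sets keyed by the same user ID. The names, values, and income threshold are invented for illustration; real data sets rarely share a clean key, so matching records is usually the hard part.

```python
# Two hypothetical data sets that share a user ID.
music_tastes = {"u1": "jazz", "u2": "indie rock", "u3": "jazz"}
incomes = {"u1": 180_000, "u2": 42_000}

HIGH_INCOME = 100_000  # assumed cutoff, purely illustrative


def combine(tastes, incomes):
    """Join the two data sets, keeping only users present in both."""
    return {
        user_id: {"genre": tastes[user_id], "income": incomes[user_id]}
        for user_id in tastes.keys() & incomes.keys()
    }


def recommend(record):
    """Pick an offer that neither data set could support on its own."""
    if record["income"] >= HIGH_INCOME:
        return f"{record['genre']} concert tickets"
    return f"{record['genre']} streaming playlist"


combined = combine(music_tastes, incomes)
recommendations = {uid: recommend(rec) for uid, rec in sorted(combined.items())}
```

User u3 drops out because there is no income record to join against, which hints at why coverage matters when combining sources.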
Nick: Do you think at all about the bias that can happen, for instance, with Yelp or some sort of data collection method where the users are self-selecting to write reviews? I’ve heard a common critique: people that are really angry with their service at a restaurant, or people that are really excited about their experience at a theme park, are the ones that go on and write the reviews, and then you miss sort of the middle of that normal distribution. And then the other part of my question has to do with people’s awareness that their data is being collected, and how that may influence their decisions when they’re posting information.
Leo: Yeah, I mean, I think those are definitely strong, legitimate concerns, especially on the review side. The last few times I looked for an apartment, I looked on a site called Apartment Ratings. And I think people that review apartments usually do it because they had a terrible experience and, you know, maybe they want to get back at the landlord in some way. So those reviews tend to be skewed very, very negative. Almost any place you look at, whether it’s good or bad, sounds like something from a horror movie. So I think trying to get the data to be as unbiased as possible is a challenge, and sometimes there are creative solutions to that. For example, Yelp basically asks people to leave reviews whenever they want. OpenTable takes a different approach, where after you go to a restaurant they email you and ask you to review it. And I think that’s potentially a little bit less biased, because whether you enjoyed the experience or not, you’ll get asked to leave a review. So I think those reviews might be a little bit more unbiased. To your other point, it’s true that people behave differently when they think they’re being watched. That’s one of the values of data exhaust, that model of building interesting data sets from the things people are doing rather than asking them explicitly to tell you what they’re thinking. So I do think that’s an issue. Also, some sites are more anonymous than others, and anonymity can help a little bit. If the bands you like are going on your public profile, you might curate them differently than if all they do is inform what playlist suggestions you get, or if you don’t have a profile, so nobody can see what you like.
Nick: Any tips that you would give to early stage startups that have a potential big-data or data as a competitive advantage focus?
Leo: Yes, so I think one tip is to collect as much data as possible, as early as possible. This is one of those things you learn the hard way: you can’t go back in time a year from now and collect the data if you didn’t do it now. So I think that’s the most valuable tip. You don’t have to analyze it, you don’t have to build a recommendation engine or a machine learning pipeline, just collect the data. You can hire somebody to analyze it and build things on top of it in the future. In a similar vein, I think collecting raw data is very important. Sometimes you have a lot of data, so you might take shortcuts: maybe instead of storing each person’s rating, you just store the average rating for a book. And as with not collecting data, once you no longer store the raw data and you just store something derived, like an average, you can never go back and undo that and get the raw data later on. Take Amazon right now: one of the things they do is show you how many 1, 2, 3, 4, and 5 star ratings there are. If in their early days they had just kept track of averages, they wouldn’t have that data. And once they realized it was valuable, it would be too late, and they would have to start collecting it from scratch. So I think whenever possible, store things in the rawest form. Another good tip is, along with collecting data yourself, to try to combine it with other data sets. There have actually been some really interesting studies in the machine learning space which basically show that more data from different sources often beats better algorithms. You can have a good data set and hire ten of the best machine learning researchers in the world, and they’ll give you a pretty good recommendation model.
But you actually might get a better model from getting one decent researcher and giving them twenty different data sets they can work with and combine. And I think there’s kind of an analogy here: a really smart person with a small bookshelf might be less effective than an average person with Google or a huge library. So it makes sense to get as many data sources as you can, whether it’s data you collect, data you bought from somebody, or data you scraped off a website. I think combining your data with other data sets is really valuable.
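The point about storing raw data rather than a derived average can be sketched with a few lines of Python. The ratings below are made up: with the raw values you can compute both the average and a star histogram, but from a stored average alone the histogram cannot be reconstructed.

```python
from collections import Counter


def average(ratings):
    """Mean star rating; the derived value a shortcut would store."""
    return sum(ratings) / len(ratings)


def star_histogram(ratings):
    """Count of 1- to 5-star ratings; needs the raw ratings."""
    counts = Counter(ratings)
    return {stars: counts.get(stars, 0) for stars in range(1, 6)}


# Two hypothetical books with very different raw ratings.
steady_book = [3, 3, 3, 3]
divisive_book = [1, 5, 1, 5]

# Both collapse to the same average, so storing only the average
# would make the two books indistinguishable.
same_average = average(steady_book) == average(divisive_book)

steady_hist = star_histogram(steady_book)
divisive_hist = star_histogram(divisive_book)
```

Keeping the raw list leaves every later summary available, while keeping only the average closes off the histogram for good.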
Nick: I think you’ve talked about that in the past with some of your previous jobs. You assumed when you first took the job that the algorithm was going to be the key, and then you later found that the power was in the data?
Leo: Yeah. So I worked on payment fraud detection at Google; they had a product kind of like PayPal. There were a few people on the team working on algorithms and then a bunch of people on the team trying to just get more data into the system. I came from an algorithms background, so I really thought that would be the key to doing a great job with fraud prediction. And it turned out that the data side won unanimously. Getting more data points, like how long has this user been a Gmail user, or when was the last time they changed their address in Google Shopping or something. The more pieces of data we could get into the system, the better the predictions were. And the actual model improvements were much more minute than the improvements that came from more data.
Nick: I wish Pandora would stop giving me advertisements in Spanish. Like, I don’t speak Spanish. So I don’t know what data they have on me, but something’s off there. Leo, as you’ve talked about all these different sources, it’s making me think about Groupon, and the ecosystem they’ve built in Chicago with Belly and some of these other startups, where they’ve got so much data access to all these SMBs that are traditionally poor at using data to make large insights about things. And now that they’ve got so much data on all these different SMBs, they’re making strategic moves to become embedded even more so at the SMB, onsite, on location, for various on-demand services and management of utilization in restaurants and things of that nature.
Leo: Yeah, I think that’s interesting. I don’t know if Groupon had that plan from the beginning or if they sort of stumbled into it. But in general, almost like a thesis of mine would be that if you have a lot of data, it can open a lot of doors. And if you don’t take advantage of data, a lot of doors will be closed to you. So I think in this case, collecting all of this data just from helping SMBs do marketing and use, you know, flash promotions has put Groupon in a great position to do a lot of other things.
Nick: So, Leo, finally wrapping up part two on data accumulation, you cite the pitfalls and caveats. What are the major questions founders should be asking themselves in order to avoid building data sets that have much less tangible value than intended?
Leo: I think the highest-order question is to really think about whether the data is actually useful to your problem. Take the earlier example, when I mentioned shoe sizes: maybe it’s not that important, so it’s not worth struggling and building some kind of engineering infrastructure in order to do more with that data, because maybe that data is just not that useful. On the other hand, user purchase histories are useful, so that is worth thinking about. But I think it’s important to consider: even if you could get and build this data set and it’s perfectly clean and up to date and wonderful, would it actually be useful to you as a business? The second thing to think about is whether your approach to getting this data is the best approach, or if there are other approaches that are easier or more accurate or something else. As an example, I think there are a lot of companies that really want to get their hands on credit card purchase data, maybe to understand how well certain publicly traded companies are doing, maybe to help suggest promotions to consumers, or other use cases like that. I’ve seen some companies try to, for example, partner with credit card processors. On the one hand, those guys have great data; on the other hand, they’re large, they’re protective of their data, and they’re hard to work with. So partnering with them is a very challenging process. On the flip side, there are easier ways to get that data. Mint.com, for example, shows that sometimes people just give it to you. Or alternatively, maybe if you can get access to somebody’s email inbox in exchange for sending them promotions, you can just scan their inbox for receipts and see where they’ve spent their money. The point is that for a lot of data sets, there are many ways to collect them. So if you’re going to go all in on one, really think about whether that’s the right one.
And finally, I think it’s useful to think also about whether the data you’re collecting is clean and accurate and up to date and all those other good attributes. Sometimes, especially if you’re doing crowdsourcing, the hardest part is not getting the data but making sure it’s actually clean and good. Factual, the last company I worked at, was in the location space, and some of the competitors used crowdsourced location data, where people would say, oh, there’s a Starbucks at this latitude-longitude coordinate. What was interesting was that the database, on the one hand, was very comprehensive, especially for the popular places, because basically any place that was popular, people would put into the system. On the flip side, because it was crowdsourced, you’d get a lot of noise and cruft in there, and people would have things like “I checked in at your mom’s house last night” or “I checked into my third grade English class.” Having that stuff in your database is not that useful, and in fact it probably decreases the value more than it increases it. But I think those are some of the top things to watch out for.
Nick: Unless you’re in that greeting card business right?