Big Data Bootstrapping Beware

I suppose this is a dumb observation, but the one thing I learned in building ZenThousand is that bootstrapping a big data startup can be expensive. Obviously, it’s due to all of the data you have to deal with before having a single user.

First, there’s the problem of collecting the data. In the case of ZenThousand, I am looking for social network profiles of programmers. Although sites like Github, LinkedIn, and others have collected a treasure trove of personal details on engineers, it’s not like they are just going to let you walk in and take it!

In LinkedIn’s case, over half of their revenue is from their recruiting features. Essentially, the information they have on you is worth nearly a billion dollars, which is why use of their people search API has strict licensing restrictions. Github and other social sites don’t let you do simple searches for users either; you have to use whatever tools they give you in their API to sniff info out.

Collecting your own data can be very expensive. Data-intensive services such as mapping require massive effort. This is why companies sitting on large datasets are so valuable. People scoff at Foursquare’s valuation, but while the app might not have great user numbers, the location database they’ve built is of immense value.

Secondly, there’s the cost to store and process all of this data. With most startups, the amount of data you store is directly proportional to the number of users you have. Scaling issues become a so-called “good problem to have,” as they usually mean your app has a lot of traction. If these are paying users, even better: your data costs are totally covered.

With a big data startup, you have massive amounts of data to store and process with no users. This gets costly really fast. In my case, Google App Engine service fees quickly became prohibitively expensive. My future strategy involves moving off of GAE and on to either Google Compute Engine or a physical box. I know of at least one big data startup that migrated out of the cloud to a colocation facility for both cost and performance reasons.

This doesn’t mean big data isn’t possible without a large investment. It’s just that two of the first big problems you need to solve are how to cost-effectively collect and analyze lots of data before you have any revenue.
