Big Data Bootstrapping Beware

I suppose this is a dumb observation, but the one thing I learned in building ZenThousand is that bootstrapping a big data startup can be expensive. Obviously, it’s due to all of that data you have to deal with before having a single user.

First, there’s the problem of collecting the data. In the case of ZenThousand, I am looking for social network profiles of programmers. Although sites like Github, LinkedIn, and others have collected a treasure trove of personal details on engineers, it’s not like they are just going to let you walk in and take it!

In LinkedIn’s case, over half of their revenue is from their recruiting features. Essentially, the information they have on you is worth nearly a billion dollars. Which is why use of their people search API has strict licensing restrictions. Github and other social sites don’t let you do simple searches for users either–you have to use what tools they give you in their API to sniff info out.

Collecting your own data can be very expensive. Data-intensive services such as mapping require massive effort. This is why companies sitting on large datasets are so valuable. People scoff at Foursquare’s valuation, but while the app might not have great user numbers, the location database they’ve built is of immense value.

Secondly, there’s the cost to store and process all of this data. With most startups, the amount of data you store is directly proportional to the number of users you have. Scaling issues become a so-called “good problem to have” as it usually means your app has a lot of traction. If these are paying users, even better: your data costs are totally covered.

With a big data startup, you have massive amounts of data to store and process with no users. This gets costly really fast. In my case, Google App Engine service fees quickly became prohibitively expensive. My future strategy involves moving off of GAE and on to either Google Compute Engine or a physical box. I know of at least one big data startup that migrated out of the cloud to a colocation facility for both cost and performance reasons.

This doesn’t mean big data isn’t possible without a large investment. It’s just that two of the first big problems you need to solve are how to cost-effectively collect and analyze lots of data before you have any revenue.

Social Recruiting: Google App Engine Experiments 3

My latest experiment is a “social hiring” tool built on top of Google App Engine: zenthousand.

We don’t have a tech talent crisis in America, we have a hiring crisis. Between clueless recruiters and the oblivious companies that rely on them, it’s no wonder nobody can hire effectively.

The thinking exhibited by a lot of employers is that if you actually respond to a job post or are actively looking for a position, you must be a total loser. The race is on to hire people who don’t want a job. So, I decided to build a site that pulled in data from Github, Meetup, Stackoverflow, Behance, and other social networks developers gather on to allow people to search for so-called passive candidates. There are countless startups doing the exact same thing under the banner of social recruiting.

This didn’t deter me because last time I tried to build a startup it was something nobody else had done before. It turns out, doing something that’s been done half a dozen times is preferable–at least to investors. Plus, it really didn’t seem that hard to make. How is it that so many companies have raised rounds to do something seemingly so trivial? I wanted to find out.

It took about 3 weeks, but the end result is zenthousand. It’s rough, buggy, and features a useless and sparsely populated database. Still, it gets the point across. Sign up for an account and start searching for candidates based on location and skill set. Bookmark the ones you like so you can browse and annotate them later. Oh, and the paid options use Stripe for billing, but just use the free demo account to test it out. I don’t expect anyone to actually pay for this in its current state.

Building a service like this is like being a detective. Think about where developers hang out and what footprints they may leave. Then write an algorithm using publicly available APIs to extract this information and organize it in a searchable way.
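To make the "organize it in a searchable way" step a bit more concrete, here is a minimal sketch in plain Python 3. It assumes the crawlers have already produced one dict per developer; the `username` and `skills` field names and the sample profiles are made up for illustration and are not ZenThousand's actual schema.

```python
from collections import defaultdict


def build_skill_index(profiles):
    """Map each normalized skill name to the set of usernames listing it."""
    index = defaultdict(set)
    for profile in profiles:
        for skill in profile.get("skills", []):
            index[skill.strip().lower()].add(profile["username"])
    return index


def find_candidates(index, *skills):
    """Return the usernames that match every requested skill, sorted."""
    matches = [index.get(skill.strip().lower(), set()) for skill in skills]
    if not matches:
        return []
    return sorted(set.intersection(*matches))


# Hypothetical crawl output; real records would come from the various APIs.
profiles = [
    {"username": "alice", "skills": ["Python", "Go"]},
    {"username": "bob", "skills": ["python", "JavaScript"]},
    {"username": "carol", "skills": ["Go"]},
]
skill_index = build_skill_index(profiles)
print(find_candidates(skill_index, "python"))
print(find_candidates(skill_index, "Python", "go"))
```

An inverted index like this is only one way to do it. On App Engine the same idea would more likely live in the datastore or the search API than in memory.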

The location search isn’t very accurate–try using no location for more results. I’m using Google’s beta search API which seems to have some bugs in it. Also, not a lot of Github profiles have location attached to them, so I have to spend a bit more time trying to sniff out where candidates are from. I have yet to run the Meetup and Stackoverflow crawlers either, so most profiles only have a Github profile attached.

Most social recruiting startups claim to use big data analysis to detect how skilled a candidate is in a specific technology. This is absolute nonsense. No algorithm can overcome Dunning-Kruger. There is no substitute for an interview facilitated by someone with expertise in the area you are hiring for. That’s not to say big data algorithms aren’t useful for filtering candidates by what’s relevant to the user.

Zenthousand was a 3-week crash course in Google App Engine with Python, social network APIs, and finessing an interface with Twitter Bootstrap. As much as I’m infatuated with GAE as of late, I think a big data service like this is too expensive to run on it. If I take this project further, I’ll most likely rewrite the GAE-specific parts, migrate to MongoDB, and move the whole thing to Google Compute Engine or a colo.

App Engine Geospatial Datastore Search: A New Way!

One of my pet peeves with Google App Engine is its horrible support for geospatial indexes. Although you can store a GeoPt in the Datastore, you can’t really query it. You can use various hacks such as Geomodel, but they end up being slow and potentially expensive.

Last year Google released the beta API for Google App Engine search. This lets you search documents for text, HTML, numbers, dates, and location. However, it searches documents instead of datastore entities.

If documents are separate from your datastore, how do you use the new search API to do geospatial queries on your database? Simply store the location for each entity inside a document instead. To do this, make a document with a GeoField with the location and a string that contains the id of the associated datastore entity’s key (this code is based on Google’s own location example):

from google.appengine.api import search

# Store the entity's string-encoded key next to its location so each
# search hit can be mapped back to the datastore.
key = str(entity.key())
geopoint = search.GeoPoint(lat, lng)
document = search.Document(
    fields=[search.TextField(name='key', value=key),
            search.GeoField(name='loc', value=geopoint)])
search.Index(name=_INDEX_NAME).put(document)

Note that you have to store the key’s id as a string since you can’t store a long in a document.

Now, when you perform a geolocation search like this:

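import logging

index = search.Index(name=_INDEX_NAME)
# Match documents whose 'loc' field lies within 1000 meters of (lat, lng).
query = 'distance(loc, geopoint(%s, %s)) < 1000' % (lat, lng)

try:
    results = index.search(query)
    for doc in results:
        # field() returns a Field object; .value holds the stored key string.
        logging.info('Document! %s', doc.field('key').value)
except search.Error:
    logging.exception('Error!')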

You can grab the key field from the document and query the datastore for it to get the rest of your data.

There are several problems with this method. First, it doesn’t work on the local dev_appserver. Currently, GeoField searches only work on the appspot production servers. Also, because the API is in beta, you are restricted to the free quotas, which don’t allow for very many operations. Finally, when Google reveals pricing changes, it can have disastrous results. It’s very risky to build an app using this method when you have no idea how much it costs.

At least it works! It’s still a mystery to me why they can’t add this feature to the datastore itself.
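Since GeoField searches don’t run on the local dev server, a rough stand-in for local testing is to filter points yourself with the haversine formula. The sketch below is plain Python 3 and only approximates what the search API’s distance() does. It assumes distances are in meters and a spherical Earth, and the `within_radius` helper and sample points are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters (spherical approximation)


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two lat/lng points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within_radius(points, lat, lng, radius_m=1000):
    """Return the keys of (key, lat, lng) tuples closer than radius_m meters."""
    return [key for key, plat, plng in points
            if haversine_m(lat, lng, plat, plng) < radius_m]


# Hypothetical sample data: two points in San Francisco, one in New York.
points = [
    ("a", 37.7749, -122.4194),
    ("b", 37.7793, -122.4193),
    ("c", 40.7128, -74.0060),
]
print(within_radius(points, 37.7749, -122.4194))
```

This scans every point, so it is only useful for small test datasets and is not a replacement for a real geospatial index.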

Donut Vision: Google App Engine Experiments 2

Some (well, very few) of you may remember my previous post on Google App Engine. Developing a GAE app using JSP was a trip down memory lane, using a technology that has seemingly been left unchanged since 2001.

I recently began a project that involves using Vine and Twitter to sort through video clips. I decided to build on Google App Engine again. This time I’m using Python. My initial hacking has resulted in Donut Vision–a search portal for donut videos on Vine. Hey, don’t laugh. These guys are trying to build an actual business off of the same type of sites–presumably with cokehead money.

Using Python (GAE’s original language) has been an absolute pleasure. On GAE, it really does seem much faster than using Java. GAE’s built-in webapp2 framework and Django templates make building sites and APIs a breeze. I swear not having to type brackets has given me some kind of minor productivity boost–or not. But placebo is a real thing.

My general “get off my lawn” nitpicks with Python are mostly due to it being a weird hybrid: dynamically typed, yet strongly typed. This gives PyDev in Eclipse a problem performing autocomplete since it really doesn’t know what type you’re referring to in most cases. PyDev and Eclipse is a decent combination due to the convenience of deploying to GAE within the IDE. I’d switch to something else with better autocomplete support, though.

As for the details of how this works, it’s really pretty simple. There’s no Vine API yet, so I simply use the Twitter API to search for Vines with relevant hashtags and pull the URLs out of them. Originally I was using Vine’s new embed code to display videos, but I eventually resorted to grabbing the URL of the MP4 file in the S3 bucket it’s stored in to have more control over the video when playing it with video-js. I expect Vine to shut down this method since I’m just running up their AWS bill with no benefit to them–not even a link back to the Vine app. Hey, if Vine provides a proper API, I’d use it.
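The "pull the URLs out of them" step might look roughly like the sketch below, written in plain Python 3. It assumes the tweets have already been fetched and parsed into dicts with the `entities.urls[].expanded_url` layout that the Twitter API uses for links. The sample payload and the `extract_vine_urls` name are made up, and this is not Donut Vision’s actual code.

```python
from urllib.parse import urlparse


def extract_vine_urls(tweets):
    """Return the unique vine.co links found in tweet dicts, in first-seen order."""
    seen = set()
    vine_urls = []
    for tweet in tweets:
        for link in tweet.get("entities", {}).get("urls", []):
            # Tweets carry t.co short links; expanded_url holds the real target.
            expanded = link.get("expanded_url") or ""
            host = (urlparse(expanded).hostname or "").lower()
            if host in ("vine.co", "www.vine.co") and expanded not in seen:
                seen.add(expanded)
                vine_urls.append(expanded)
    return vine_urls


# Hypothetical sample payload, trimmed to the fields used above.
sample_tweets = [
    {"text": "#donuts", "entities": {"urls": [{"expanded_url": "https://vine.co/v/abc123"}]}},
    {"text": "no video", "entities": {"urls": [{"expanded_url": "http://example.com/x"}]}},
    {"text": "repost", "entities": {"urls": [{"expanded_url": "https://vine.co/v/abc123"}]}},
]
print(extract_vine_urls(sample_tweets))
```

Deduplicating matters here because popular Vines get retweeted and reposted a lot, so the same link shows up in many search results.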

Oh also, in my earlier post I stated that Google App Engine is not available in China. This is only partially true. The default appspot domain is indeed blocked in China. Yet, when putting my custom domain, donuts.pw, through GreatFirewallOfChina.org I get nothing but green status. Yes, I’m boldly sparking a democratic revolution one French Cruller at a time. So, if you want to serve Chinese customers via GAE, just map a custom domain to it.

I’m seriously considering using Google App Engine as a backend for a new game. The only problem is cost estimation. I have constant paranoia of real-world usage patterns running up my bill. Especially with improperly indexed datastores, you can rack up charges pretty fast. Still, simply writing an app and uploading it to Google’s cloud is significantly easier than fiddling with Amazon Web Services and Beanstalk. If you haven’t checked it out since the early days, GAE is worth another look.

Oh, also the latest version of GAE has sockets support. It’s still experimental, but this may lead to GAE being suitable for real-time applications such as multiplayer game servers.