Tuesday, April 15, 2008

Google App Engine: How AppDrop Does and Does Not Affect Lock-in

The cloud computing blogging world is abuzz with the news that Chris Anderson has built an interesting experiment: GAE-compliant hosting on Amazon EC2. I have been meaning to try this myself, but with a day job that is intensely busy right now, I haven't had a chance to get to it.

What did Chris do? In short, he got a copy of the GAE SDK running on Amazon virtual machines with some modification. There are limitations:
  • It will not scale (none of the Google "secret sauce").
  • It does not support email at this time (though the source code is available for anyone who wants to add it).
  • It could go down at any time, either due to an EC2 outage (not very likely), or because whoever is paying for this doesn't want to foot the bill anymore (much more likely).
The implications of this are big, however. A couple of days ago, in an email exchange with Simon Wardley, I did a little analysis of what it would take to do a scalable version of this in open source. I don't have the depth of knowledge necessary to identify all of the relevant projects, but here's what I came up with:
I think you are indeed correct, though I think a lot of the attention has been on what we discussed earlier: what is GAE *not* capable of doing (right now), and with regards to what it *is* capable of doing, how do you take advantage of Google's amazing infrastructure.

That being said, I think you are on to something potentially big: through the open source SDK, there is an opening for any other hosting company or "OS for the data center" company to provide a "GAE-to-go" solution. All of the SDK would have to be supported--and there is a lot that is specific to Google right now--and the solution would need to be at least comparably scalable, etc.

The good news is that, as you run down the SDK, there are many simple solutions possible, or even existing open source projects available:

The Python Runtime - This is just a Python interpreter with a series of rules imposed, running on an autonomic scalable infrastructure. The interpreter can be harvested from the SDK (with some modification, I am sure), and the scalable infrastructure can be damn near anything with a policy-based management component.

The Datastore API - Google's BigTable data store is based on Map/Reduce. I am not exactly sure how well this maps to Hadoop, but that is where I would start. Regardless, I would expect wrapping and/or extending Hadoop to support GQL is the minimum involved.

The Users API - This may indeed be the biggest problem area, but as I read the API, they have hidden complexities such as generating login/logout URLs and accessing account data. Assuming I read the docs correctly, other than "nickname" and "email", Google assumes nothing about what defines a user. Thus, Google's accounts do not have to be used for the APIs to be functional. However, will the community expect a shared identity store, or at least the ability to choose to use Google's accounts?
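
To make the point about a pluggable identity store concrete, here is a minimal sketch, not the SDK's implementation. The method names follow the Users API as I read the docs (create_login_url, create_logout_url, get_current_user, plus nickname and email on the user object). Everything else is an assumption made for illustration: the LocalUsers class, the in-memory session dictionary, the /_login and /_logout paths, and passing a session token to get_current_user explicitly.

```python
from urllib.parse import quote


class User:
    """Carries only what the API appears to assume: an email and a nickname."""

    def __init__(self, email):
        self._email = email

    def email(self):
        return self._email

    def nickname(self):
        # Assumption: use the local part of the address as the nickname.
        return self._email.split("@", 1)[0]


class LocalUsers:
    """Hypothetical stand-in for the Users API, backed by any identity store."""

    def __init__(self, sessions):
        # sessions maps a session token to an email address (hypothetical store).
        self._sessions = sessions

    def create_login_url(self, dest_url):
        # Hypothetical login path; the destination is URL-encoded.
        return "/_login?continue=" + quote(dest_url, safe="")

    def create_logout_url(self, dest_url):
        # Hypothetical logout path; the destination is URL-encoded.
        return "/_logout?continue=" + quote(dest_url, safe="")

    def get_current_user(self, session_token):
        # Returns None when nobody is signed in, otherwise a User.
        email = self._sessions.get(session_token)
        return User(email) if email else None
```

Swapping the dictionary for a shared identity service, or for Google's accounts, would change only the constructor argument, which is the reason the question of what the community expects matters more than the code.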

The URL Fetch API - This was implemented for both security and scalability reasons. Not sure what exists to map it to, but I think it is a function of the identity infrastructure, and how you scale the Python Runtime. In other words, you'd need to map these functions to the appropriate mechanisms in each of the other infrastructure elements.

The Mail API - I would assume there is something close out there, but if not you would need to wrap a scalable email system with the Python classes defined in the API. Doesn't seem overly hard.
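
As a rough illustration of that wrapping, here is a sketch that puts a send_mail call in front of a pluggable transport. The send_mail name and its sender/to/subject/body arguments follow the Mail API as I read the docs; the MailService class, the build_message helper, and the smtp_transport factory are hypothetical names, and a plain SMTP server stands in for whatever scalable mail system would really sit behind it.

```python
from email.message import EmailMessage
import smtplib


def build_message(sender, to, subject, body):
    """Builds a standard-library message from the four basic fields."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = to
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


class MailService:
    """Hypothetical wrapper: the API surface stays fixed, the transport varies."""

    def __init__(self, transport):
        # transport is any callable that accepts an EmailMessage.
        self._transport = transport

    def send_mail(self, sender, to, subject, body):
        self._transport(build_message(sender, to, subject, body))


def smtp_transport(host="localhost", port=25):
    """Returns a transport that delivers through an SMTP server (assumed reachable)."""
    def deliver(msg):
        with smtplib.SMTP(host, port) as smtp:
            smtp.send_message(msg)
    return deliver
```

In use, MailService(smtp_transport()) would deliver through a local SMTP server, while a test could pass a list's append method to capture messages without sending anything.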

Finally, given the fact that the source code for the "faker" dev environment is open source, there is a lot of basic sample code for many of these "wrapper" APIs. The trick is to find developers that know how to do this at high scale--perhaps request participation at highscalability.org?

Now, the potential "gotchas" here are that you are working within the same limitations that Google has set for itself, can only "officially" extend the API when Google adds something or agrees to implement your requirements (they own the "open source community"), and you would need to test in a real-world high scale environment, which could be expensive (though perhaps, ironically, Amazon could be of some use here).

By the way, none of this solves VMware's portability issues, Amazon EC2's portability issues or even Cassatt's and/or our competitors' portability issues. It simply provides a portable web application environment that uses a "sandbox" approach for application execution. Again, a start (and an exciting one), but only a piece of the overall puzzle. Frankly, I think an Amazon portability story would be much more generally interesting to enterprise IT. But, that's just me.

I would think that Chris or others that see the possibilities will get on this. Hell, they are already probably "on this".

By the way, as part of a post questioning Google's lock-in issues, Tim O'Reilly at O'Reilly Radar made an interesting observation about why, even if the code is portable, an AppEngine application today is still "locked-in" to Google's site. To set up his argument, he quotes venture capitalist Brad Feld:
At *this* moment in time, it would be difficult to move apps off of AppEngine. Doing that in EC2 is trivial. This, to me, is the biggest issue, as I believe it could make startups less-interesting from an acquisition perspective by anyone other than Google. This will most likely change as people develop compatibility layers. However, Google has yet to provide any information about how to migrate data from their datastore the best I can tell. If you have a substantial amount of data, you can't just write code to dump it because they will only let any request run for a short period before they terminate it.
Tim then goes on to say:
This last point is really very serious. I've been warning for some time that the first phase of Web 2.0 is the acquisition of critical mass via network effects, but that once companies achieve that critical mass, they will be tempted to consolidate their position, leading ultimately to a replay of the personal computer industry's sad decline from an open, energetic marketplace to a controlled economy.
What remains to be seen is how Google's plans to allow for larger data transfers will affect this (see Phil Wainewright's post covering the business-readiness of GAE). If they allow unlimited fast transfer--not necessarily for free, but at a reasonable price--they will establish themselves as a truly open platform, competing on their amazing infrastructure and technology innovation. Now, combined with a compatible open source platform, that would be game changing.

Comments:

J Chris A said...

James,

I don't see why Google needs to offer anything special for data export. It would be a relatively simple task to write a data exporter that converted all an application's data to JSON and returned it to the http requester. You'd have to do some pagination to keep the requests under the time-limit, but there'd really be nothing to it.
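
The paginated export described in this comment could look roughly like the sketch below. It is an illustration only: a plain Python list stands in for the datastore, and export_page, export_all, the cursor parameter, and the JSON field names are all hypothetical. The idea is that each request returns one small page plus the position to resume from, so no single request runs long enough to hit the time limit.

```python
import json


def export_page(records, cursor=0, page_size=100):
    """Server side: returns one page of records as JSON, plus where to resume."""
    chunk = records[cursor:cursor + page_size]
    next_cursor = cursor + len(chunk)
    has_more = next_cursor < len(records)
    return json.dumps({
        "entities": chunk,
        # None tells the client that the export is complete.
        "next_cursor": next_cursor if has_more else None,
    })


def export_all(fetch_page):
    """Client side: requests pages until the server reports no more data."""
    cursor = 0
    entities = []
    while cursor is not None:
        page = json.loads(fetch_page(cursor))
        entities.extend(page["entities"])
        cursor = page["next_cursor"]
    return entities
```

Here fetch_page would be an HTTP request against the application in practice; in the sketch it can be any callable that takes a cursor and returns the JSON string.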

Anonymous said...

From the appengine blog http://googleappengine.blogspot.com/2008/04/were-up-and-running.html

..some of the general areas we're focusing on right now..

"Support for offline processing. Right now Google App Engine is great for web apps that do all of their processing in response to user requests, but what about apps that need to perform scheduled tasks or larger-scale data migration? We'd like to support those apps too."

..and ..

"Support for large files. Google App Engine currently imposes a limit of 1MB on all requests, both inbound and outbound. We're interested in providing efficient support for much larger uploads and downloads"

Tom

swardley said...

James,

Good post, I agree with almost everything you've said. That said, it would be fairly trivial to create a data exporter as Chris states. I think Tim is wrong on this particular point.

However, the "critical mass via network effects" type of lock-in that Tim points to is very real; this is particularly a problem with proprietary clouds.

As for the ceilings, this was an approach we followed with Zimki to begin with. Why? Because the pricing models and security methods are particularly sensitive to activity and the ceilings gave us a chance to manage this.

As for game changing - it happened as soon as they let the open SDK out and people started talking about and building equivalent environments.

On that note, well done Chris.
