Sunday, June 22, 2008

"Follow the Law Computing" on Google Groups: Cloud Computing

Not long after my post outlining my theory of an unexplored economic concern with moving compute loads in a cloud computing environment, a discussion popped up on the Google Groups Cloud Computing group. The thread, which started out covering BI issues in the cloud, turned to the question of moving data to the computing versus moving the computing to the data. It is a priceless thread, and one that showed me I am not the only one thinking about the technology of migrating workloads in the cloud.

The first message that popped out at me was one by Chuck Wegrzyn, apparently of Twisted Storage:
"How does the "cloud" protect data going from the owner to the computing service without being compromised (read that as sniffed)? Will a computing service in country A have the right to impose restrictions on data from another country (even if the results of the computing don't affect the citizens of country A)? An so on. "
He goes on to say, in a separate message:
"While I think trans-national data movement will be an area that requires governance of some kind I think that companies can get around the problem in other ways. I think it just requires looking at the problem in a different way.

I'd think the approach is to keep the data still and move the computing to it. The idea is to see the thousands of machines it takes to hold the petabytes worth of data as the compute cloud. What needs to move to it is the programs that can process the data. I've been working on this approach for the last 3 years (Twisted Storage). "
Bingo! This is what I think is going to start happening as well. Move compute loads to where the legal and regulatory environment is most favorable, and leave the (highly contentious) data where it is.
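To make the idea concrete, here is a minimal sketch of "move the computation to the data." The site names, datasets, and jurisdictions are hypothetical; the point is the control flow: the (small) program travels to the site that already holds the data, and only the (small) result comes back.

```python
# Toy illustration of moving computation to the data: each site holds some
# datasets locally, and a job is dispatched to the site that already has
# the data, so the raw data never crosses a border.

SITES = {
    "eu-west": {"datasets": {"customer_logs"}, "jurisdiction": "EU"},
    "us-east": {"datasets": {"clickstream"},   "jurisdiction": "US"},
}

def dispatch(job, dataset):
    """Run `job` at whichever site holds `dataset`; return (site, result)."""
    for name, site in SITES.items():
        if dataset in site["datasets"]:
            # A real system would ship the code to the remote site; here we
            # just invoke it locally to show where execution would happen.
            return name, job(dataset)
    raise LookupError(f"no site holds {dataset}")

site, result = dispatch(lambda d: f"summary({d})", "customer_logs")
```

The data stays put in its favorable (or at least settled) legal environment; only program and summary move.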

Khaz Sapenov even has a name for this pattern:
"This is valid approach, that I personally called "Plumber Pattern", when application, encapsulated in some kind of container (e.g. virtual machine image) is marshalled to secure data islands to iteratively do its unique work (say, do a matches on some criterium in Interpol, FBI, CIA, MI5 and other databases, all distributed across continents). Due to utterly confidential nature of these types of data, it is impossible to move them to public storage (at least this time). Above-mentioned case might be extrapolated to some lines of business as well with reduced privacy/security requirements. "
I have no idea where the term "plumber" comes into this, but it somehow seems to work. More importantly, Khaz gives an excellent use case for a compute problem where the data cannot move for legal and national security reasons, but an authorized (or unauthorized--gulp) software stack could move from data center to data center to compute an aggregate report.
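Khaz's pattern can be sketched in a few lines. The island names and records below are invented for illustration: a query visits each secure data island in turn, and only a running aggregate is marshalled onward; the raw records never leave their home sites.

```python
# Hypothetical sketch of the "Plumber Pattern": the computation travels
# from one secure data island to the next, carrying only an aggregate.

ISLANDS = {
    "agency_a": [{"name": "alice"}, {"name": "bob"}],
    "agency_b": [{"name": "bob"},   {"name": "carol"}],
}

def visit_islands(predicate):
    """Count records matching `predicate` across all islands."""
    matches = 0
    for island, records in ISLANDS.items():
        # Imagine this loop body executing *inside* each island's data
        # center, with only the running count moving between sites.
        matches += sum(1 for record in records if predicate(record))
    return matches

total = visit_islands(lambda r: r["name"] == "bob")
```

The aggregate report Khaz describes is exactly `total` here: a cross-database answer computed without any database's contents ever being exported.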

Marc Evans even points out that we already have some open source compute algorithms that can serve as a starting point to address these problems:
"In my experiences(sic), there are cases where having the data / computation as close to the customer edge as possible is what is required for an acceptable user experience. In other cases, the relationship of the user / data / computation is not important. Most often, there is a mix of both. One of the ideas behind Hadoop as I understand it is to bring the computation to the data location, while also providing for the data to be in several locations. The scheduler is critical to making good use of data locality. So yes, I believe that what you are looking for does exist within Hadoop at a minimum, though I also believe that there is alot of room to evolve the techniques that it uses. "
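Hadoop's real scheduler is far more involved, but the locality preference Marc describes reduces to a simple rule, sketched here with hypothetical node and block names: run a task on a node that already stores the block it reads, and fall back to a remote read only when no such node is free.

```python
# Toy locality-aware scheduler in the spirit of Hadoop's: prefer a node
# that holds the task's input block locally; otherwise accept a remote read.

def schedule(task_block, block_locations, free_nodes):
    """Return (node, locality) for a task that reads `task_block`."""
    local = [n for n in block_locations.get(task_block, []) if n in free_nodes]
    if local:
        return local[0], "node-local"
    # No free node holds the block; any free node will do, at the cost
    # of pulling the block across the network.
    return next(iter(free_nodes)), "remote"

locations = {"block-7": ["node-1", "node-3"]}
node, locality = schedule("block-7", locations, free_nodes={"node-3", "node-9"})
```

Since node-1 is busy, the task lands on node-3, which also stores block-7, so the computation moves to the data rather than the reverse.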
Jim Peters then asks a simple, but loaded question:
"Even if the cloud providers come up with excellent answers to the security and reliability questions, who's going to trust them? Credit card numbers are one thing, but cloud data is something else entirely. "
At this point, Ray Nugent adds what I think is the quintessential economic consideration:
"Security is really a business issue. Each layer of security should cost no more than the data is worth. So the concept of "secure enough" becomes important. What security is appropriate for a given type of data and is it more or less secure in the cloud than in the corp DC? Is data inherently "less secure" by virtue of being in the cloud than, say, an employees laptop or flash dongle or "on the wire"? I don't think corporate data centers are a secure as you're suggesting they are..."
"Secure enough" is, I think, where it's at. Perhaps a new term is needed: "Avoid the Risk Computing"?

Anyway, the discussion goes on from there, and I suggest you read the thread yourself. This is a key topic for cloud computing, and I think there is a good chance that one or more of the biggest technology companies of the early to mid 21st century will be hatched from discussions like these.

(This group, by the way, is absolutely awesome, and each thread is packed with intelligent and insightful messages. If you care about cloud computing, you need to join.)