Cloudsecurity.org Interviews Guido van Rossum: Google App Engine, Python and Security


In this interview, cloudsecurity.org talks to Guido van Rossum about Python, Google App Engine and security.

Guido is the creator of the Python programming language and, more recently, a member of the Google App Engine team.  His involvement with the App Engine project came pretty late - the code “was almost ready for release” when he got involved.  The security architect of App Engine was primarily the project lead, Kevin Gibbs, supported by the rest of the App Engine crew and the Google Security Team.

The Interview

cloudsecurity.org: What security principles did you follow for App Engine?

GvR: While I can’t share any specifics on what we’re doing to secure App Engine, I can say that the main principle we’ve followed could be called “defense in depth”. We’re not relying exclusively on a secure interpreter, or any other single security layer, to protect our users.

cloudsecurity.org: Please provide some examples of how those principles played out in terms of the current implementation?

GvR: Sorry, we don’t divulge such information.

cloudsecurity.org: What criteria did you apply to Python module selection?

GvR: We first looked for modules that were useful and straightforward to audit. If a module was large or complex, we’d only audit it (fixing things we found) if it was deemed essential or at least useful for a large number of users; otherwise we’d exclude it.

cloudsecurity.org: What do you see as the security risks inherent in exposing an interpreter runtime in a shared environment?

GvR: I presume you’re asking about risks to users, like providing accidental access to data belonging to another app. We’ve taken extensive measures to isolate different apps from each other. For example, each app runs in a separate process, and the datastore prevents an app from accessing data belonging to other apps.

cloudsecurity.org: I recently attended a talk by Justin Ferguson (a Seattle-based security consultant) at EUSecWest in London.  He gave a great talk exploring security vulnerabilities in language interpreters and specifically highlighted some security weaknesses in Python App Engine.  What are your thoughts on his research and specifically the Python issues he highlighted?  When do you anticipate they will get fixed?

GvR: We’ve anticipated all of the possibilities raised in Justin’s talk, and took measures to protect our users. Justin highlighted weaknesses in Python, but not in App Engine. Furthermore, our security model does not rely solely upon protections within the Python interpreter; there are additional protections that these external analyses have missed.

cloudsecurity.org: How do you contain an attacker who exploits bugs in App Engine from exploiting the underlying OS and potentially interfering with other users’ processes or attacking backend systems?

GvR: You are correct that there are strong measures in place, but I’m not at liberty to discuss details.

cloudsecurity.org: Python was the first language to get the App Engine treatment. What language is next, and what are some of the language-specific security challenges the team has had to deal with?

GvR: Although I can’t comment on what language is next, we are working on this, and have gotten a lot of great feedback from our developers. As far as language-specific security challenges, they stemmed mostly from the complexity of the Python interpreter. We spent a lot of time auditing this, and did a great deal more than just identifying buffer overflows.  I can also add that Google is actively researching the security of interpreted languages.  Google engineers routinely contribute security fixes to open source projects, including but not limited to Python.

cloudsecurity.org: How does the team decide when ‘enough is enough’ in terms of hardening the interpreter?

GvR: That’s not really how we approach it. We realize that security is an ongoing effort, and try to stay ahead of threats through continuous monitoring and testing.

cloudsecurity.org: Some commentators have suggested that perhaps the difficulty of auditing the implementation led to some modules being more heavily restricted than perhaps necessary.  What are your thoughts on that and what plans, if any, are there to bring back code objects/functions that were eliminated in the initial release?  (with the benefit of hindsight).

GvR: The only thing we are likely to put back is the _ast module, which was not audited based upon an underestimation of its usefulness (see my answer to question #3 above).  We will also put back some dummy functions and other objects whose absence currently prevents some popular frameworks from being loaded without modifications. For example, some harmless functionality in the imp module will come back. We’re also looking into making urllib2 work (to some extent), though that’s not really a security issue but merely a matter of API adjustment.

cloudsecurity.org: It is reported that Google encourages small groups to go off and create.  How involved were the Google security team with App Engine in terms of design and implementation review/testing?  Given the dynamics, is it possible to have a meaningful security process that shadows the development process?

GvR: The Google Security team is involved in everything we do. They have been extremely helpful.

cloudsecurity.org: How can people report security weaknesses they discover in App Engine?  What commitment does Google give in terms of dealing with vulnerability reports?

GvR: There is a standard process for submitting security issues. See http://www.google.com/corporate/security.html. Google moves very fast to protect its users when a verifiable security vulnerability is reported.

cloudsecurity.org: One concern is the potential misuse of App Engine to exploit security vulnerabilities in visitors’ browsers.  This is not a new problem per se; shared hosting providers know all about this.  But with Google and other Cloud providers, the scalability potential is much higher.  What are your thoughts on this and what pro-active steps is Google taking to detect and terminate evil apps?

GvR: This is high on our list of concerns. We deal with this through a combination of restrictions on what you can do (e.g. certain HTTP headers and ports are off-limits) and, again, monitoring.

cloudsecurity.org: Beyond App Engine, what role do you think Python will play in the Cloud both now and in the future?

GvR: Sorry, I’m not prone to philosophizing about the future.

cloudsecurity.org: Trust is often cited as a barrier to enterprise adoption of Cloud Computing.  What role do you personally think Google can play in building that trust?

GvR: I think trust is built up over a long period of experience. Our actions in terms of being open to our users will be the most important factor in establishing trust. Of course, Google’s reputation also helps: everybody understands that Google doesn’t want its name associated with a bad product.

cloudsecurity.org: Looking at the Cloud Computing landscape beyond Google, what are your thoughts on the current state of Cloud Computing and Security?

GvR: It’s obvious that Cloud Computing is only just taking off. The next few years will be very exciting.

cloudsecurity.org: Lastly, what are some of your favourite App Engine apps?

GvR: There are too many to enumerate. If you insist on a highlight, well, I like Rietveld (http://codereview.appspot.com), a tool for collaborative code review which I (largely) wrote myself. It is open source and includes some essential components from Mondrian, a similar internal tool which I created before I joined the App Engine team.

Thanks

My thanks to Guido for his time and sharing his views.

A Question of Integrity: To MD5 or Not to MD5

Cloud Storage offers pay per drink off-site storage. Data to be saved is shuffled from the customer to the Cloud Storage Provider over the network. This all works wonderfully most of the time: what you upload is what you get back later. But what happens when the gremlins strike and what you send is not what is received?

This happened recently to some Amazon S3 customers. There were complaints in the AWS forums about ‘S3 Corruption’. The first post in the forum was recorded at Jun 22, 2008 5:05 PM PDT (although in subsequent posts some people reported emailing Amazon prior to this):

we are having some serious S3 issues.

all data we store on S3 has gone through the same code path for months. starting a couple days ago a small percentage of the objects we are retrieving are not checksumming to the correct values. we hash and store objects by checksum and rehash the objects when we retrieve to ensure there is no data corruption. all the objects we’re having issues with were uploaded at approximately the same time period a few days ago.

we’ve stored 10’s of millions of objects in S3 and never encountered such problems. please let me know ASAP if you have any idea what could be going on here. thanks.

Amazon responded 6 minutes later (!) and started investigating. To troubleshoot they asked customers to email aws@amazon.com with the ‘Bucket-Name and few keys that you believe are having issues’.

Others weighed in reporting similar problems. Amazon provided status updates and on Monday Jun 23rd at 6:10pm PDT, provided the following explanation:

We’ve isolated this issue to a single load balancer that was brought into service at 10:55pm PDT on Friday, 6/20. It was taken out of service at 11am PDT Sunday, 6/22. While it was in service it handled a small fraction of Amazon S3’s total requests in the US. Intermittently, under load, it was corrupting single bytes in the byte stream. When the requests reached Amazon S3, if the Content-MD5 header was specified, Amazon S3 returned an error indicating the object did not match the MD5 supplied. When no MD5 is specified, we are unable to determine if transmission errors occurred, and Amazon S3 must assume that the object has been correctly transmitted. Based on our investigation with both internal and external customers, the small amount of traffic received by this particular load balancer, and the intermittent nature of the above issue on this one load balancer, this appears to have impacted a very small portion of PUTs during this time frame.
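To make the mechanism in Amazon's explanation concrete, here is a minimal Python sketch. A Content-MD5 value is the base64 encoding of the raw 16-byte MD5 digest of the body (RFC 1864), and a receiver that recomputes it over the bytes that actually arrived will notice a single corrupted byte. The function names and the simulated corruption are illustrative assumptions, not Amazon's code.

```python
import base64
import hashlib


def content_md5(data: bytes) -> str:
    """Return a Content-MD5 header value: base64 of the raw MD5 digest."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


def receiver_accepts(body: bytes, claimed_md5: str) -> bool:
    """Hypothetical receiver-side check: recompute the digest and compare."""
    return content_md5(body) == claimed_md5


def flip_one_byte(data: bytes, index: int) -> bytes:
    """Simulate the fault described above: corrupt a single byte in the stream."""
    corrupted = bytearray(data)
    corrupted[index] ^= 0xFF
    return bytes(corrupted)


payload = b"quarterly-backup-contents"
header = content_md5(payload)        # computed by the sender before upload
damaged = flip_one_byte(payload, 3)  # what a faulty hop might deliver

print(receiver_accepts(payload, header))  # intact body passes the check
print(receiver_accepts(damaged, header))  # corrupted body is rejected
```

Without the header there is nothing to compare against, which is exactly the case where the receiver has to assume the transfer was clean.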

What are some of the takeaways?

  • If you are directly using the AWS S3 API, make sure to calculate and send MD5 checksums along with the actual data. Check status return codes - an HTTP 400 error code means ‘something went wrong’ - and respond appropriately.
  • If you are relying on 3rd party tools to access S3, be sure to check with your software vendor that they are following the advice from Amazon to use MD5. If they are not then your data can get silently corrupted…
  • Downloads, aka HTTP GETs, can also be affected. The thread in the forum continues and questions are asked as to whether the corruption caused by the loadbalancer was affecting both incoming and outgoing traffic. The conclusion was yes. If you are hosting media on S3, and the browser is using partial GET requests (to download in chunks) then the corruption will not be automatically detectable.
  • If your business relies on Cloud Storage, are you prepared to wait 36 hours for a resolution? This isn’t a swipe at Amazon, this is true for any provider. Check your SLAs, check the trouble ticket resolution times, ask about availability of experts for troubleshooting etc.
  • Cloud Providers will increasingly need to instrument their services such that they can ‘early detect’ negative operational events. In this case, Amazon has stated plans to use better logging and analysis to automate detection of unusual error patterns (i.e. anomaly detection).
  • This incident - caused by a malfunctioning Amazon load balancer - did not make it onto the AWS status page at http://status.aws.amazon.com/. Taking Amazon at face value, this incident only affected a small number of transfers, relative to the total number of S3 transfers. But this raises the question: what level of outage or service problem needs to happen before Amazon will flag the issue on their status page? On a sidenote, based on the timestamps, 31 hours passed between the load balancer being taken out of service and Amazon providing the explanation on the forum.
  • When Amazon update their S3 API documentation, it would be useful to have entries in the S3 API index for ‘checksum’, ‘MD5’, ‘integrity’ and ‘corruption’.
  • Stepping back, will customers hold Cloud Service Providers to a higher standard than their own internal IT teams?
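
The customer who first spotted the problem did so because they store objects by checksum and rehash on retrieval. A rough Python sketch of that pattern follows; the class and exception names are hypothetical, and a plain dict stands in for the remote storage service.

```python
import hashlib


class CorruptObjectError(Exception):
    """Raised when retrieved bytes do not hash to the key they were stored under."""


class ChecksumStore:
    """Toy content-addressed store; a dict stands in for the remote service."""

    def __init__(self):
        self._backend = {}

    def put(self, data: bytes) -> str:
        # The object's key is the hex MD5 of its contents.
        key = hashlib.md5(data).hexdigest()
        self._backend[key] = data
        return key

    def get(self, key: str) -> bytes:
        data = self._backend[key]
        # Rehash on every read so silent corruption surfaces as an error.
        if hashlib.md5(data).hexdigest() != key:
            raise CorruptObjectError(key)
        return data


store = ChecksumStore()
key = store.put(b"invoice-2008-06")
store._backend[key] = b"invoice-2008-07"  # simulate corruption at rest or in transit
try:
    store.get(key)
except CorruptObjectError as err:
    print("corruption detected for key", err)
```

Note that this check needs the whole object: a hash of the full contents cannot validate a byte range fetched with a partial GET.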

I’m sure there are more takeaways I didn’t cover. What say you?

###

Kudos for the heads-up on the S3 issue goes to my friend and colleague Jason Harper - network supremo and crypto-head. Thanks Jason!

Cloud Computing and Security For The Masses: Interview on NPR


Cloud Computing is starting to escape the technical and business press.

The proof?

I was invited to talk about Cloud Computing and Security on NPR “Morning Edition”.

NPR - National Public Radio - is a US based, non-commercial radio station covering news, talk and current affairs. British readers may find it similar to BBC Radio 4.

Every Monday, the “Morning Edition” has a technology theme. The Cloud Computing segment was high level and aimed primarily at a non-tech audience. I always find it hard to answer the question ‘what is Cloud Computing?’ as there are so many different definitions. Regardless, it was a great chance to talk about an exciting technology and highlight the need for a real security conversation between the providers and people interested in IT security - the primary reason why I created cloudsecurity.org.

The show boasts a very impressive audience - around 13 million! I’ve never before had the opportunity to confuse that many people in one shot ;-).

If you would like to listen (it’s short - 3.5 mins), click here.

I’d like to publicly thank Nina at NPR for reaching out and extend a warm ‘Welcome’ to any NPR listeners who have dropped by. Feel free to leave a message below or email me if you have any comments or questions.

Your Turn At The Bar Again? Security Costs in a Pay Per Drink Cloud


With in-house IT, you pay your upfront capital costs and maintenance fees and you get whatever compute power you paid for. If you over-specify, you have excess computer power or disk - you are wasting money.  If you under-specify, you may be forced to raid your ‘rainy day’ budget and order new hardware.

A primary selling point of Cloud Computing is the ‘pay by the drink’ billing model - you only pay for the CPU cycles and storage you use - that’s it.

If you run any IT security tools at all, Cloud Computing may impact the way you calculate your IT security budgets.

Assessing The Cost of Runtime Security

Security costs can be overt or hidden:

  • budget items spread across infrastructure, security, compliance, midrange.
  • the runtime security costs of security tools that execute on the systems.

How many organisations know their runtime security compute costs?  My guess is not many.  Under the traditional IT billing model, you mostly don’t need to figure this stuff out. As long as your security tools don’t chew up the CPU unnecessarily or fill the disk, everyone is happy.

The performance of security products varies greatly.  On the negative side, poor design or implementation are problems only the vendor can address.   Site specific issues arise through all kinds of madness - customers failing to “read the label” and provision properly, insufficiently trained people making poor configuration choices or simply relying on the default settings in a very non-default environment!

The negative side effects of in-line security tools hit home as system load increases.  Access checks, logging and other ‘in-line’ security operations may perform fine under normal load but fail to scale as load increases past a certain threshold.  This can lead to CPU spikes or poor disk access patterns.

Switch Off Or Pay Up?

To bring this closer to home, let’s explore how the impact of security tools plays out today under traditional IT and tomorrow, under Cloud Computing.  Let’s eavesdrop on a fictitious conversation between Oscar the ORACLE DBA and Simon the Security Dude.

Oscar: Hey Simon, your Security Agents are killing system performance again. Anna in accounts called up to say they can’t do the Quarterly close, the jobs are getting killed before they finish.

Simon: Hi Oscar, I understand but we can’t just disable all the security!

Oscar: Well, we need to do something if we are going to finish posting our numbers this quarter. Are you volunteering to explain to our CEO why we didn’t?

Simon: Hmm. Let me check the agent logs, perhaps there is a problem.

Oscar: I already checked them, no errors reported.

Simon: Hmm. I’ll log a call with the Premium International Support Service.

Oscar: You did that last time and the support guy stuck to the party line that the security agent takes 5-10% of CPU. We know those numbers are wrong from our benchmarking - sometimes it takes 20% of CPU and always a lot more during quarter close.

Simon: Hmm. Are there any other processes running on the system we can disable for a while?

Oscar: Nope - we’re running as tight a ship as we can here. I’ve already told Steve from sourcing he is going to have to wait for his reports.

Simon: Hmm. Bugger. OK, I’ll disable the agents - but you must tell me as soon as the quarter close completes so I can start them up again.

Oscar: Thanks - will do.

A classic conversation under the ‘old regime’. Simon is forced into an operational security decision due to an under-specified system or an over indulgent security agent. His only option in this scenario is to disable the poorly scaling security tool. He can’t just scream “Need more power!” and additional CPUs appear.

Now let’s see how this plays out with Cloud Computing, where the change in paradigm removes the compute limits and links your on-the-spot risk decisions directly to your costs and security tool efficiencies:

Simon the Security Dude receives an auto-generated email from the Cloud Provider:

A virtual CPU was auto-inserted on virtual machine image FINANCE1 at 10:30am as Runtime Security Compute usage exceeded the agreed threshold in the SLA.   Please note, you have now reached your soft credit limit - please click the link below to authorize an increase. You currently have 4USD left in your account.

So what does Simon do now? He already tapped into his security compute budget five times this week and he’s running low. The silver lining is that at least he gets to make the decision now - he isn’t forced to ‘switch off security’. If he has the cash, he can attempt to buy his way out of the problem. The obvious negative is “death by a thousand costs” - he’s running out of budget.

The root cause of the problem is that prior to moving to the Cloud, Simon didn’t have a handle on how much runtime security was *really* costing. He didn’t know (a) his runtime security costs or (b) how much of that cost was unnecessary - caused by security tool inefficiency.  He wasn’t the one paying, so most of the time he didn’t have to care. Even if he had found a way to calculate his costs, he’d still have to figure out how performance differences of Cloud Computing would skew his numbers.

And therein lies the rub: if you don’t know what your security runtime costs are today - and where the waste is - how will you cope “tomorrow” when it’s always your turn to pay for drinks at the Cloud Bar?
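
As a back-of-the-envelope illustration of the two unknowns above, (a) the runtime security cost and (b) the wasted part of it, here is a small Python sketch. The CPU shares come from Oscar's complaint (a vendor claim of up to 10% against a measured 20%); the CPU-hours and the price per CPU-hour are made-up figures, and the linear pricing model is an assumption, not any provider's actual tariff.

```python
def runtime_security_cost(cpu_hours: float, security_share: float,
                          price_per_cpu_hour: float) -> float:
    """Cost of the CPU time consumed by security tooling under pay-per-use."""
    return cpu_hours * security_share * price_per_cpu_hour


def excess_cost(cpu_hours: float, measured_share: float, expected_share: float,
                price_per_cpu_hour: float) -> float:
    """Portion of the cost beyond what the vendor's figures would predict."""
    extra_share = max(measured_share - expected_share, 0.0)
    return runtime_security_cost(cpu_hours, extra_share, price_per_cpu_hour)


# Hypothetical inputs: one server-month of CPU at an invented price.
CPU_HOURS = 720.0
PRICE = 0.10            # USD per CPU-hour, illustrative only
VENDOR_SHARE = 0.10     # upper end of the vendor's "5-10% of CPU"
MEASURED_SHARE = 0.20   # what Oscar's benchmarking showed

total = runtime_security_cost(CPU_HOURS, MEASURED_SHARE, PRICE)
waste = excess_cost(CPU_HOURS, MEASURED_SHARE, VENDOR_SHARE, PRICE)
print(f"runtime security cost: {total:.2f} USD, of which unexplained: {waste:.2f} USD")
```

The point is less the numbers than the inputs: measuring the security share of CPU is the step most organisations skip under traditional billing.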

12 Signs that Your Company is Already in the Cloud


What are the telltale signs that your company is already Computing in the Cloud?

Is it when the CIO makes a big announcement at the monthly IT meeting?

Is it when the IT newsletter drops a reference to pilot testing of some ‘web based’ software?

Or, is it when the secretary whips out the boss’s Corporate Credit Card and signs up to a Cloud Service?

Here are 12 indicators that your company is *already* part of the Cloud:

  1. Your internal helpdesk reports fewer password resets.
  2. Finance contacts you to confirm all the DVD readers are disabled - they are puzzled by the number of recurring credit card charges for Amazon (are the secretaries spreading out their orders for “Lost” DVDs again?).
  3. You are asked to authorise a network change ticket that modifies the LAN routing policy.  All traffic will be sent directly to the Internet proxy (for performance reasons).  From the accompanying diagram, the data center appears to have been cut and pasted on the wrong side of the firewall (idiots!).
  4. You walk into the Data Center and it feels cooler than usual.
  5. When the builders next door accidentally saw through the company Internet connection, people complain there must be a DoS attack going on as they can’t get to their files.
  6. During physical inspections, you notice unexplained gaps in server cabinets.
  7. Login failures go down, in fact login “attempts” in general go down but the company car park is full.
  8. As you walk through the office, you notice all the “Security Awareness” posters have been replaced with pictures of Jeff Bezos (!)
  9. You are asked to authorise a visit from the local environment group. Fearing protesters, you are surprised to learn that your company has won a prize for reducing its Carbon Footprint.
  10. Your Intrusion Prevention System is preventing the call center from uploading contracts stored as GIF files.
  11. You detect the presence of ‘malware’ in the form of unexplained ‘Machine Images’ on IT’s desktops.
  12. You stop finding Windows passwords under keyboards, instead you find random hex digits next to the words ‘Access Key’ and ‘Secret Key’. You sigh, but at least they are setting difficult to guess passwords now!

If you are charged with IT security in your company, you may want to start checking your web proxy logs for telltale signs that people are talking to the Cloud…or just talk to finance.
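
For the proxy log check, something as simple as the following Python sketch is a starting point. The log line format, the sample entries and the short list of domain suffixes are all assumptions for illustration; real proxy logs differ by product, and encrypted traffic is often logged without a full URL.

```python
import re
from collections import Counter

# Illustrative suffixes only; extend with the providers relevant to you.
CLOUD_SUFFIXES = ("amazonaws.com", "appspot.com")

# Pull the hostname out of the first URL found on a log line.
HOST_RE = re.compile(r"https?://([^/:\s]+)")


def cloud_hits(log_lines, suffixes=CLOUD_SUFFIXES):
    """Count requests per cloud suffix across the given proxy log lines."""
    hits = Counter()
    for line in log_lines:
        match = HOST_RE.search(line)
        if not match:
            continue
        host = match.group(1).lower()
        for suffix in suffixes:
            # Match the domain itself or any subdomain, not mere substrings.
            if host == suffix or host.endswith("." + suffix):
                hits[suffix] += 1
                break
    return hits


# Made-up log lines in an assumed "client method url status" layout.
sample = [
    "10.0.0.5 GET http://s3.amazonaws.com/finance-bucket/q2.xls 200",
    "10.0.0.7 GET http://example.org/index.html 200",
    "10.0.0.9 PUT https://myapp.appspot.com/upload 200",
]
print(cloud_hits(sample))
```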

Cloud Stacks: Please Mind The Gap


Security gaps creep in when people think other people are ‘taking care of it’.

When a security practitioner assesses a complex system, they’ll look at the ‘hand offs’ between different players within the system.  In fact, if they’ve been in the game for a while, they’ll apply laser sharp focus to where the responsibilities of one party end and those of another begin.  In other words, they’ll be searching for the security gaps, the security ‘no-man’s land’.   This is a dark place where - as a good friend of mine puts it - “the bad stuff” gets in and the “good stuff” doesn’t flow.

If you’ve ever performed a security review of an outsourced IT system, you’ll know exactly what I mean.

In the context of Cloud Computing then, who takes responsibility for what?

As a customer of the Cloud, you or your company may strike an agreement with a company perched atop the Cloud.  They provide you with Software as a Service (SaaS) or some other form of high level, end-user service.  Your service agreement and/or contract will define what you can expect from them and what they expect from you.

However, to deliver the service to you, they rely on other Cloud providers further down the stack.  In fact, at any level in the Cloud Stack, it could be multiple players providing the service *they* rely on; e.g. Cloud Storage, Cloud Compute, Cloud Security (?). 

These providers in turn depend upon other service providers at the next layer down in the Cloud and so on.

See where I’m going with this?

This is a new game I’m going to call “Join the Security Dots in Cloud Land”.

And even then it isn’t as simple as I’ve presented it.

To end this post I’m going to ask a question to readers of this blog that provide a service on top of the Cloud (I have logs, I know you’re out there ;-):

What *security* arrangements do you have in place with Cloud Service Providers you rely on to deliver your service?  What are you doing to build “trust in depth” in the Cloud?

To clarify, I’m not asking you to spill your secret sauce on the Cloud Security altar - rather I want to hear what you are doing for your customers to build assurance (and I don’t mean ‘fluffy’ clouds ;-).

Personally, I think this will be one of the keys to selling Cloud Services to Enterprise customers.

Please reply in the comments below or email me.