Notes from “Scaling on AWS for the First 10 Million Users”

Earlier tonight, I had the pleasure of attending a presentation from Chris Munns of Amazon at the offices of First Round Capital about scaling your software on AWS past the first 10 million users. I already had some experience with AWS, but I learned quite a few new things about how to leverage AWS, so I decided to write up my notes in a blog post for future reference, and as a service to other members of the Philadelphia tech community.

Without further preamble, here are my notes from the presentation:

“Auto scaling is a tool and a destination”
AWS regions are connected via the public Internet, but availability zones in the same region are connected via Amazon-owned fiber. So AZ to AZ communication is entirely across Amazon’s infrastructure.
More capacity is added to AWS’s infrastructure each day than powered all of amazon.com in 2003. That’s a lot of servers.
DynamoDB is based on the Dynamo whitepaper that Amazon published in 2007.
DynamoDB is multi-AZ.
SQL is a well-established technology with clear patterns for scaling (explained further down), and no matter what you think, your data is not special.
“Everything fails all the time”
- Build with that in mind
- “You cannot have greater than 4 nines in a single data center”
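To put the "nines" in perspective, here is a rough sketch of the arithmetic. It assumes that failures in different data centers are independent, which is an idealization, and the function names are my own rather than anything from the talk.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignoring leap years


def availability(nines: int) -> float:
    """Availability as a fraction, e.g. 4 nines -> 0.9999."""
    return 1 - 10 ** -nines


def downtime_minutes_per_year(avail: float) -> float:
    """Expected minutes of downtime per year at the given availability."""
    return (1 - avail) * MINUTES_PER_YEAR


def combined_availability(avail: float, copies: int) -> float:
    """Availability of redundant sites, ASSUMING independent failures.

    The system is only down when every copy is down at the same time.
    """
    return 1 - (1 - avail) ** copies


# Four nines still allows roughly 52.6 minutes of downtime per year.
print(round(downtime_minutes_per_year(availability(4)), 1))
```

Under the independence assumption, two sites at 99% each combine to about 99.99%, which is the argument for spreading across availability zones rather than hardening a single one.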
Elastic Load Balancers are Availability Zone independent
- “session stickiness” can route repeat requests from the same person to the same EC2 instance
S3 and CloudFront should be used for static assets from your website.
- This falls under “outsource your infrastructure” — use Amazon’s CDN instead of implementing your own
ElastiCache is better for caching than a database (less load) or local files (fewer invalidation issues)
- It is hosted Memcached
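A common way to use a cache like this is the "cache-aside" pattern: check the cache first, and only hit the database on a miss. The sketch below is my own illustration, not something shown in the talk. `TTLCache` is a hypothetical in-memory stand-in for a Memcached-style service, and `get_profile` and the 300-second TTL are made up for the example.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory stand-in for a Memcached-style cache (hypothetical helper)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired entries count as misses
            return None
        return value

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        self._data[key] = (self._clock() + ttl_seconds, value)


def get_profile(user_id: str, cache: TTLCache,
                load_from_db: Callable[[str], str]) -> str:
    """Cache-aside read: try the cache, fall back to the database, then populate."""
    key = f"profile:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = load_from_db(user_id)  # the expensive call we want to avoid repeating
    cache.set(key, value, ttl_seconds=300)
    return value
```

Because the cache lives outside the web servers, every instance behind the load balancer sees the same entries, which is where the "fewer invalidation issues" point comes from.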
S3 has 11 9s of durability
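For a sense of what eleven nines means in practice, here is a back-of-the-envelope sketch. It assumes the figure is an annual, per-object durability and that object losses are independent. Both are my assumptions for the arithmetic, and the function names are hypothetical.

```python
def expected_annual_losses(objects: int, nines: int = 11) -> float:
    """Expected number of objects lost per year at the given durability."""
    annual_loss_probability = 10.0 ** -nines  # per object, per year (assumed)
    return objects * annual_loss_probability


def years_per_lost_object(objects: int, nines: int = 11) -> float:
    """Average number of years between losing a single object."""
    return 1 / expected_annual_losses(objects, nines)


# Storing ten million objects works out to about one lost object
# every 10,000 years on average.
print(round(years_per_lost_object(10_000_000)))
```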
DynamoDB scales really well. Individual customers have done 500,000+ reads/writes with no issue.
Auto scaling lets you match capacity to load
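As a rough illustration of "matching capacity to load", the core calculation can be sketched as below. This is not how AWS Auto Scaling is implemented. The per-instance capacity and the minimum and maximum bounds are hypothetical numbers you would measure or choose for your own app.

```python
import math


def desired_instances(requests_per_second: float,
                      capacity_per_instance: float,
                      min_instances: int = 2,
                      max_instances: int = 20) -> int:
    """Instances needed to serve the load, clamped to the group's bounds."""
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    needed = math.ceil(requests_per_second / capacity_per_instance)
    return max(min_instances, min(max_instances, needed))


print(desired_instances(950, capacity_per_instance=100))  # 10
```

The lower bound keeps some redundancy during quiet periods, and the upper bound caps the bill if load spikes unexpectedly.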
SOA all the things!
- Loose coupling sets you free!
Use SQS as buffers so that load is distributed across consuming machines
- Email, queuing, transcoding, search, databases, monitoring, metrics, logging
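The buffering idea can be sketched without any AWS dependency: producers drop work into a queue, and however many consumers are running pull from it at their own pace. Here Python's standard `queue.Queue` stands in for SQS, and `run_workers` is a hypothetical helper of mine, not an SQS API.

```python
import queue
import threading
from typing import Dict, List


def run_workers(jobs: List[str], worker_count: int = 3) -> Dict[str, List[str]]:
    """Push jobs into a buffer and let several workers drain it.

    Returns a mapping of worker name to the jobs that worker handled.
    """
    buffer: "queue.Queue[str]" = queue.Queue()
    handled: Dict[str, List[str]] = {
        f"worker-{i}": [] for i in range(worker_count)
    }

    def consume(name: str) -> None:
        while True:
            try:
                job = buffer.get_nowait()
            except queue.Empty:
                return  # buffer drained, so this worker exits
            handled[name].append(job)  # stand-in for the real work
            buffer.task_done()

    for job in jobs:
        buffer.put(job)

    threads = [threading.Thread(target=consume, args=(name,)) for name in handled]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return handled
```

Each job is handled by exactly one worker, and adding workers requires no change on the producing side, which is what makes a queue a natural fit for an auto scaling consumer tier.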
Outsource your infrastructure
- If you find yourself re-inventing the wheel, Just Say No.
A photo upload website could use S3 for ingest, SNS to write to different SQS queues for resizing, auto scaling EC2 instances to consume the SQS queues and write output to S3 buckets, which are then served up through CloudFront.
- If all photo sizing is in the same queue, quick jobs get stuck behind slow ones: 100 thumbnails can be created in the time it takes to create 1 web image. This issue would eventually leak back to the users. Plus, multiple queues will give you even more parallelism.
- Reduced Redundancy Storage can be used for thumbnails. If a thumbnail is lost, an SNS message can be sent so that it is regenerated from the original.
- Amazon Simple Workflow (SWF) can be used to deal with timeouts in conversions.
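The fan-out step in the photo pipeline above can be sketched in a few lines. This is an in-memory illustration only: `Topic` is a hypothetical stand-in for an SNS topic, the deques stand in for SQS queues, and the size names and key layout are made up.

```python
from collections import deque
from typing import Deque, Dict, List


class Topic:
    """In-memory stand-in for an SNS topic (hypothetical)."""

    def __init__(self) -> None:
        self._subscribers: List[Deque[str]] = []

    def subscribe(self, subscriber: Deque[str]) -> None:
        self._subscribers.append(subscriber)

    def publish(self, message: str) -> None:
        # Every subscribed queue gets its own copy of the message.
        for subscriber in self._subscribers:
            subscriber.append(message)


uploads = Topic()
# One queue per output size, so slow sizes cannot hold up fast ones.
queues: Dict[str, Deque[str]] = {"thumbnail": deque(), "web": deque()}
for size_queue in queues.values():
    uploads.subscribe(size_queue)


def resize_next(size: str) -> str:
    """Take one message from the queue for `size` and return the output key."""
    source_key = queues[size].popleft()
    return f"{size}/{source_key}"  # the real worker would resize and upload here


uploads.publish("photos/cat.jpg")  # a new upload lands in both queues
```

Each size has its own backlog, so the pool of workers for each queue can be scaled on that queue's depth alone.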
Retooling your app to use AWS
- Most apps start off as being monolithic
- Think about workflows of data in both your application and infrastructure
- Serve content smartly with S3/CloudFront
- Move state off tiers that auto scale (ephemeral storage goes away on shutdown)
Scaling past 1 million users:
- Database write issues are going to happen. How do you mitigate them?
- Federation: split up databases by function or purpose.
- Sharding: split up databases by data (different users on different databases)
  - Sharding has no practical limit on scalability
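To make the difference between federation and sharding concrete, here is a minimal routing sketch. All database names are hypothetical, and hashing the user ID modulo the shard count is just one simple scheme that I picked for illustration. The talk did not prescribe a method.

```python
import hashlib
from typing import Dict, List

# Federation: each functional area gets its own database (hypothetical names).
DATABASE_FOR_FUNCTION: Dict[str, str] = {
    "users": "users-db",
    "orders": "orders-db",
    "search": "search-db",
}

# Sharding: one functional area is split across databases by key.
USER_SHARDS: List[str] = ["users-db-0", "users-db-1", "users-db-2", "users-db-3"]


def shard_for_user(user_id: str, shards: List[str] = USER_SHARDS) -> str:
    """Pick a shard deterministically from a hash of the user ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


print(shard_for_user("alice") == shard_for_user("alice"))  # True: routing is stable
```

One caveat with the modulo approach is that changing the number of shards remaps most users, so real deployments often use a lookup table or consistent hashing instead.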
Pay attention to monitoring and metrics
- If you can’t build it, outsource it!
Not having proper metrics is like flying a plane with an eye patch on in a thunderstorm, and someone tells you an engine is on fire. Which engine is it?
AWS marketplace sells instances with logging and metrics running so you can get an instance up in minutes
As your site grows, managing your infrastructure will take up an ever-increasing share of your time. Outsource it!
App management solutions: Elastic Beanstalk, OpsWorks (high level), CloudFormation, EC2 (low level)
Host based configuration management (Chef and Puppet)
- Can’t scale easily without this
“Split your tiers into individual services, then split those services into individual services. You’ll get to auto scaling eventually”
“If you’re wasting time arguing about which language scales better, you’re just wasting time.”
- Facebook was built in PHP
“Don’t copy the technology stack of another company. Their problems are usually not yours.”
Scaling past 10 million users:
- Iterating on top of the patterns here will get you up and over 100 million users
- Spend even more time fine-tuning your app
- Go from multi-AZ to multi-region (but don’t go there until you get to multi-AZ first!)
- You MAY need to build custom solutions
- Be prepared to perform deep analysis of your whole stack (if you shave off even a few milliseconds, you’ll see considerable savings at this scale)
For further reading: http://aws.amazon.com/architecture/