Earlier tonight, I had the pleasure of attending a presentation from Chris Munns of Amazon at the offices of First Round Capital about scaling your software on AWS past the first 10 million users. I already had some experience with AWS, but I learned quite a few new things about how to leverage AWS, so I decided to write up my notes in a blog post for future reference, and as a service to other members of the Philadelphia tech community.
Without further preamble, here are my notes from the presentation:
"Auto scaling is a tool and a destination"
AWS regions are connected via the public Internet, but availability zones in the same region are connected via Amazon-owned fiber. So AZ to AZ communication is entirely across Amazon's infrastructure.
More capacity is added to AWS's infrastructure each day than powered all of amazon.com in 2003. That's a lot of servers.
DynamoDB is based on the Dynamo whitepaper that Amazon published back in 2007.
DynamoDB is multi-AZ.
SQL is a well-established technology with clear patterns for scalability (explained further down), and no matter what you think, your data is not special.
"Everything fails all the time"
Build with that in mind
"You cannot have greater than 4 nines in a single data center"
Elastic Load Balancers are Availability Zone independent
"session stickiness" can route repeat requests from the same person to the same EC2 instance
S3 and CloudFront should be used for static assets from your website.
This falls under "outsource your infrastructure": use Amazon's CDN instead of implementing your own.
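In practice, serving static assets this way is just an upload with the right headers. A minimal sketch with boto3; the bucket name is hypothetical, and it assumes a CloudFront distribution already points at the bucket:

```python
import boto3

s3 = boto3.client("s3")

# Upload one CSS file to the bucket behind the CloudFront distribution.
# A long Cache-Control lets edge locations serve it without re-fetching.
s3.upload_file(
    "dist/app.css",              # local build artifact
    "example-static-assets",     # hypothetical bucket name
    "assets/app.css",
    ExtraArgs={
        "ContentType": "text/css",
        "CacheControl": "public, max-age=31536000",  # one year
    },
)
```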
ElastiCache is better than a database for caching (less load) or local files (fewer invalidation issues)
It is hosted Memcached
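The usual pattern here is cache-aside: check Memcached first, and only hit the database on a miss. A minimal sketch using the pymemcache client; the endpoint hostname and the database lookup are hypothetical:

```python
from pymemcache.client.base import Client

# Hypothetical ElastiCache Memcached endpoint.
cache = Client(("my-cluster.abc123.cfg.use1.cache.amazonaws.com", 11211))

def query_database(user_id):
    # Stand-in for a real SQL query.
    return f"user-record-{user_id}"

def get_user(user_id):
    """Cache-aside read: try the cache, fall back to the database on a miss."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached.decode("utf-8")
    value = query_database(user_id)
    cache.set(key, value, expire=300)  # expire after 5 minutes
    return value
```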
S3 has 11 9s of durability
DynamoDB scales really well. Individual customers have done 500,000+ reads/writes per second with no issue.
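From the application's side, that scale doesn't change your code; reads and writes stay simple key-value calls. A minimal sketch with boto3, assuming a hypothetical table named `photos` with a `pk` partition key:

```python
import boto3

table = boto3.resource("dynamodb").Table("photos")  # hypothetical table

# Writes and reads are the same API calls whether you do ten per second
# or hundreds of thousands; throughput is a provisioning setting,
# not a sharding exercise.
table.put_item(Item={"pk": "photo#123", "status": "resized"})
item = table.get_item(Key={"pk": "photo#123"}).get("Item")
print(item)
```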
Auto scaling lets you match capacity to load
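The talk predates them, but today the most direct way to express "match capacity to load" is a target tracking scaling policy. A minimal sketch with boto3, assuming a hypothetical Auto Scaling group named `web-asg` already exists:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Keep the group's average CPU near 50% by adding or removing
# instances automatically as load changes.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",   # hypothetical group name
    PolicyName="cpu-target-50",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)
```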
"Engineer of the Year"
SOA all the things!
Loose coupling sets you free!
Use SQS queues as buffers so that load is distributed across consuming machines
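The buffering pattern looks like this in practice: producers drop work on a queue, and any number of consumers pull from it at whatever rate they can sustain. A minimal sketch with boto3; the queue name and worker function are hypothetical:

```python
import boto3

sqs = boto3.client("sqs")
queue_url = sqs.create_queue(QueueName="resize-jobs")["QueueUrl"]  # hypothetical queue

def process(body):
    # Stand-in for real work, e.g. resizing the photo named in the message.
    print("processing", body)

# Producer: enqueue work instead of calling a worker directly.
sqs.send_message(QueueUrl=queue_url, MessageBody="photo#123")

# Consumer: every worker instance runs this same loop, so load
# spreads across however many machines are currently polling.
messages = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=20,  # long polling avoids hammering the API
).get("Messages", [])

for msg in messages:
    process(msg["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```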
If you find yourself re-inventing the wheel, Just Say No.
A photo upload website could use S3 for ingest, SNS to write to different SQS queues for each resizing task, and auto scaling EC2 instances to consume the SQS queues and write output to S3 buckets, which are then served up through CloudFront.
If all photo sizes go through the same queue, you could have issues where 100 thumbnails are produced in the time it takes to create 1 full-size web image, so quick jobs get stuck behind slow ones. That delay would eventually leak back to the users. Plus, multiple queues will give you even more parallelism.
Reduced Redundancy Storage can be used for thumbnails; if S3 loses one, it sends an SNS message so the original can be reprocessed and republished.
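To make the pipeline concrete, here's a minimal sketch of one worker: it drains the thumbnail queue, resizes with Pillow, and writes the result to S3 with Reduced Redundancy Storage. All names are hypothetical, and the message body is assumed to be a plain JSON job descriptor (a real SNS-to-SQS message wraps it in an envelope):

```python
import io
import json

import boto3
from PIL import Image  # Pillow does the actual resizing

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/thumbnail-jobs"  # hypothetical
OUTPUT_BUCKET = "example-photo-thumbnails"  # hypothetical

def handle_job(body):
    """Fetch the original from S3, make a thumbnail, store it cheaply."""
    job = json.loads(body)  # assumed shape: {"bucket": "...", "key": "..."}
    original = s3.get_object(Bucket=job["bucket"], Key=job["key"])
    image = Image.open(io.BytesIO(original["Body"].read()))
    image.thumbnail((128, 128))   # this queue only produces thumbnails
    image = image.convert("RGB")  # JPEG can't carry an alpha channel
    out = io.BytesIO()
    image.save(out, format="JPEG")
    out.seek(0)
    # RRS is fine here: if S3 ever loses the thumbnail, the SNS
    # notification lets us regenerate it from the original.
    s3.put_object(Bucket=OUTPUT_BUCKET, Key=job["key"], Body=out,
                  StorageClass="REDUCED_REDUNDANCY")

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                               MaxNumberOfMessages=10, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        handle_job(msg["Body"])
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```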