I just need to get some basics off my chest here. This is by no means a full list, but it’s the most basic one I can think of to start with, and it’s basic because I’m surprised by some of the slop I’ve seen in production environments.
1. Highly available server clusters – this is different from a load-balancing cluster; if confused, see here.
2. Disaster recovery
-> this means daily, weekly, and monthly backups, plus off-site and tertiary backups, as well as a plan to get those backups imported and running in production as fast as possible. Backups should be consistency-checked when they are created.
3. Security
-> perimeter on the network: databases VLAN’d off from the web/app servers, firewalls, ACLs, etc.
-> system level: strong passwords on OS and database accounts (no blank passwords – that *should* be obvious but you’d be surprised what I’ve run into), file permissions, encryption of sensitive database information.
4. Monitoring: monitor everything possible. Log files, disk partitions, service ports, service details (traffic for a service, memory used, tuning parameters such as query cache usage, etc.), CPU/RAM usage, logged-in users, and, most importantly, make sure you are alerted about the services you monitor. If you’re not getting called when something has passed a threshold, you need to pay more attention to the infrastructure.
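To make the threshold idea in the monitoring item concrete, here is a minimal sketch of a disk partition check in Python. It is not a replacement for a real monitoring system. The 90% threshold, the `Alert` class, and the `check_partitions` function are my own illustrative assumptions; only `shutil.disk_usage` is a real standard library call.

```python
import shutil
from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical record for a partition that crossed its threshold."""
    path: str
    used_percent: float
    threshold: float


def used_percent(total: int, used: int) -> float:
    """Return used space as a percentage of total space."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


def check_partitions(paths, threshold=90.0, usage=shutil.disk_usage):
    """Return an Alert for every path whose usage is at or above threshold.

    `usage` defaults to shutil.disk_usage and can be swapped for a fake
    when testing. The 90% default is an assumption, not a recommendation.
    """
    alerts = []
    for path in paths:
        stats = usage(path)  # named tuple with total, used, free (bytes)
        pct = used_percent(stats.total, stats.used)
        if pct >= threshold:
            alerts.append(Alert(path, pct, threshold))
    return alerts


if __name__ == "__main__":
    # Assumed mount point; list the partitions that matter on your hosts.
    for alert in check_partitions(["/"]):
        print(f"ALERT: {alert.path} is {alert.used_percent:.1f}% full "
              f"(threshold {alert.threshold:.0f}%)")
```

Printing is only a stand-in here. The point of the item above is that a crossed threshold has to reach a person, so in practice the alert would go out through whatever paging or notification system you already run.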