This is the one resource that probably has the most room for variance. Several factors can help determine the optimal disk size (a rough sizing sketch follows the list below):
- Anticipated size of a single copy of the dataset
- Replication Factor (RF)
- Operational throughput requirements
- Cost of cloud volumes (usually billed per provisioned GB)
- Compaction strategy used on the larger tables
- Whether the size of the dataset will be static or grow over time
- Whether the application team has an archival strategy
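To make the first few of these factors concrete, here is a rough back-of-the-envelope sketch in Python. The 50% compaction headroom (a common rule of thumb for size-tiered compaction) and the growth multiplier are illustrative assumptions, not recommendations; real sizing should also weigh operational throughput, archival plans, and volume cost.

```python
# Rough per-node disk estimate for a Cassandra cluster.
# All figures below are illustrative assumptions, not recommendations.

def disk_per_node_gb(raw_dataset_gb, replication_factor, nodes,
                     compaction_headroom=0.5, growth_factor=1.0):
    """Estimate how much disk each node should be provisioned with.

    compaction_headroom: fraction of disk kept free for compaction
        (size-tiered compaction can temporarily need roughly 50% free space).
    growth_factor: multiplier for anticipated dataset growth.
    """
    total_data_gb = raw_dataset_gb * replication_factor * growth_factor
    data_per_node_gb = total_data_gb / nodes
    # Provision so the data only fills (1 - headroom) of the volume.
    return data_per_node_gb / (1 - compaction_headroom)

# Example: a 200 GB dataset, RF=3, 6 nodes, expecting the data to double.
print(round(disk_per_node_gb(200, 3, 6, growth_factor=2.0)))  # ~400 GB per node
```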
I have built production Cassandra instances with as much as 1 TB of disk per node, and as little as 40 GB. Typically, nodes that store larger amounts of data also need more compute resources available to them.
Let's walk through a little exercise here.
Assume that we need to build a cluster for an application ...