Data is processed at several stages on the write path, starting with the immediate logging of a write and ending with the data being written to disk:
1. Logging data in the commit log
2. Writing data to the memtable
3. Flushing data from the memtable
4. Storing data on disk in SSTables
Note: The timestamp for all writes is in UTC (Coordinated Universal Time).
Logging writes and memtable storage
- When a write occurs, the database stores the data in a memory structure called a memtable.
- To provide configurable durability, the database also appends writes to the commit log on disk.
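The two stages above can be sketched as a toy model. This is an illustration only, not the database's actual code: the `ToyNode` class, the JSON-lines log format, and the file name are all assumptions made for the example.

```python
import json
import os
import tempfile
import time


class ToyNode:
    """Toy model of the first two write-path stages (hypothetical, not Cassandra code)."""

    def __init__(self, commit_log_path):
        self.commit_log_path = commit_log_path
        self.memtable = {}  # partition key -> {column: (value, timestamp)}

    def write(self, key, column, value):
        # Microseconds since the Unix epoch, which is defined in UTC.
        timestamp = time.time_ns() // 1000
        record = {"key": key, "column": column, "value": value, "ts": timestamp}
        # Stage 1: append the write to the commit log on disk for durability.
        with open(self.commit_log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(record) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # Stage 2: apply the write to the in-memory memtable.
        self.memtable.setdefault(key, {})[column] = (value, timestamp)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        node = ToyNode(os.path.join(tmp, "commitlog.jsonl"))
        node.write("cyclist-1", "amount", 25)
        print(node.memtable["cyclist-1"]["amount"][0])
```

The sketch calls `fsync` on every write for simplicity; the "configurable durability" mentioned above means a real deployment can trade that strictness for throughput.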
Flushing data from the memtable
When flushing a memtable, the database writes the data to disk in the memtable's sorted order. It also creates a partition index on disk that maps each token to the location of its data on disk.
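A minimal sketch of a flush, assuming a memtable held as a plain dictionary. The `token_for` function uses CRC32 purely as a stand-in for the partitioner's hash, and the one-JSON-row-per-line file layout is hypothetical; neither reflects the real on-disk format.

```python
import json
import os
import tempfile
import zlib


def token_for(key):
    # Stand-in for the partitioner's hash function (an assumption for this sketch).
    return zlib.crc32(key.encode("utf-8"))


def flush_memtable(memtable, data_path):
    """Write rows to data_path in token order; return a token -> byte offset index."""
    partition_index = {}
    with open(data_path, "wb") as data_file:
        for key in sorted(memtable, key=token_for):
            # Record where this partition starts before writing it.
            partition_index[token_for(key)] = data_file.tell()
            row = json.dumps({"key": key, "columns": memtable[key]})
            data_file.write(row.encode("utf-8") + b"\n")
    return partition_index


def read_partition(data_path, partition_index, key):
    """Find one partition by seeking to the offset recorded in the index."""
    offset = partition_index.get(token_for(key))
    if offset is None:
        return None
    with open(data_path, "rb") as data_file:
        data_file.seek(offset)
        return json.loads(data_file.readline())["columns"]
```

Because the index stores byte offsets, a lookup seeks straight to the partition instead of scanning the data file. The toy ignores hash collisions between keys.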
You can manually flush a table using nodetool flush, or flush all memtables on a node using nodetool drain, which also stops the node from listening for connections from clients and other nodes. To reduce the commit log replay time, DataStax recommends flushing the memtables before you restart the nodes.
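If you script restarts, the manual flush can be wrapped in a small helper. This is a sketch that assumes nodetool is on the PATH of the node being restarted; the helper names are hypothetical, and only the `nodetool flush [keyspace [table ...]]` invocation itself comes from the tool.

```python
import subprocess


def nodetool_flush_args(keyspace=None, *tables):
    """Build the argument list for `nodetool flush [keyspace [table ...]]`."""
    if tables and keyspace is None:
        raise ValueError("a keyspace is required when tables are given")
    args = ["nodetool", "flush"]
    if keyspace is not None:
        args.append(keyspace)
    args.extend(tables)
    return args


def flush_before_restart(keyspace=None, *tables):
    # Runs nodetool against the local node; raises CalledProcessError on failure.
    subprocess.run(nodetool_flush_args(keyspace, *tables), check=True)
```

For example, `flush_before_restart("cycling", "cyclist_expenses")` would flush a single table, while calling it with no arguments flushes every table on the node.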
Purging commit log segments
The database uses the commit log to rebuild memtables during startup to recover after a crash. The database purges commit log segments only after all the data in a segment has been flushed to disk from the memtable.
If the commit log directory reaches the maximum size (commitlog_total_space_in_mb), the database flushes the tables that still have unflushed data in the oldest segments to disk and then purges those segments.
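The purge rule can be modeled with a toy commit log in which each segment tracks the tables that still have unflushed writes in it. The `ToyCommitLog` class is hypothetical, and `max_segments` is a simplified stand-in for the commitlog_total_space_in_mb size limit.

```python
class ToyCommitLog:
    """Toy model of segment purging (hypothetical, not Cassandra code)."""

    def __init__(self, max_segments):
        if max_segments < 1:
            raise ValueError("max_segments must be at least 1")
        self.max_segments = max_segments
        # Oldest first; each segment is the set of tables with unflushed writes in it.
        self.segments = [set()]

    def append(self, table):
        # A write always goes to the active (newest) segment.
        self.segments[-1].add(table)

    def _purge_clean_segments(self):
        # A segment is purged only when none of its data is unflushed.
        # The active segment is always kept.
        active = self.segments[-1]
        self.segments = [seg for seg in self.segments[:-1] if seg] + [active]

    def mark_flushed(self, table):
        """Call after the memtable for `table` has been written to disk."""
        for seg in self.segments:
            seg.discard(table)
        self._purge_clean_segments()

    def roll_segment(self):
        """Start a new active segment; return the tables force-flushed to free space."""
        self.segments.append(set())
        self._purge_clean_segments()
        forced = []
        while len(self.segments) > self.max_segments:
            # Over the limit: flush every table the oldest segment depends on,
            # which leaves that segment clean so it can be purged.
            for table in sorted(self.segments[0]):
                forced.append(table)
                self.mark_flushed(table)
        return forced
```

In this model a segment disappears only through `mark_flushed`, which mirrors the rule that a segment is purged only after all of its data has been flushed.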
Storing data on disk in SSTables
Memtables and SSTables are maintained per table. The commit log is shared among tables. SSTables are immutable: they are not written to again after the memtable is flushed. Consequently, a partition is typically stored across multiple SSTable files.
SSTable names and versions: SSTables are files stored on disk. The data files are stored in a data directory that varies with installation.
Example: For an SSTable with version aa and format bti:
- cycling is the keyspace name, which distinguishes the keyspace for streaming or bulk loading data.
- cyclist_expenses is the table name, which is followed by a dash and a hexadecimal string (-e4f31e122bc511e8891b23da85222d3d), the unique identifier of the table.
- aa is the SSTable version.
- bti is the format.
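The table directory name from the example can be split back into its two parts. This is a small sketch: the function name is hypothetical, and it assumes the identifier is always the 32 hexadecimal digits after the last dash, as in the example.

```python
import uuid


def split_table_directory(dirname):
    """Split '<table>-<32 hex digits>' into (table name, table id as a UUID)."""
    table, sep, hex_id = dirname.rpartition("-")
    if not sep or not table:
        raise ValueError(f"not a table directory name: {dirname!r}")
    if len(hex_id) != 32:
        raise ValueError(f"expected 32 hex digits, got {hex_id!r}")
    # uuid.UUID raises ValueError if the string is not valid hexadecimal.
    return table, uuid.UUID(hex_id)


if __name__ == "__main__":
    name, table_id = split_table_directory(
        "cyclist_expenses-e4f31e122bc511e8891b23da85222d3d"
    )
    print(name, table_id)
```

Splitting on the last dash means a table name that itself contains a dash would still parse, since the identifier is always the final component.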
Each SSTable comprises multiple components stored in separate files.
For more about SSTable structure, see: http://besttechreads.com/interview-questions/interview-question-what-are-the-files-stored-in-each-sstable-how-to-find-the-sstable/
Source: "How is data written?" from DataStax