The Flow of Information: How Transformers Work
Attention Mechanisms
Multi-Head Attention
The Feed-Forward Network
Layer Normalization and Residual Connections
Decoder-Only Dominance
Grouped Query Attention (GQA)
Continued Innovation: Mistral and Mixtral
Sliding Window Attention
Sparse Mixture of Experts
Gemma: Google’s Open Approach
Tokenizers and Embeddings
Tokenization
Byte Pair Encoding (BPE)
Embeddings
Other Considerations
Instruction Following
Reasoning Depth
Context Windows and the ‘Lost in the Middle’ Problem
Summary