A new deep dive shows how to take matrix multiplication from Gflop/s to Tflop/s in native Swift on Apple Silicon. The project achieves high-performance neural network training without external frameworks by using the CPU, SIMD, and AMX units directly.