SILO: Beyond Decoder-Only Next-Token Prediction
Abstract: This talk presents two distinct approaches that expand the potential of Transformer architectures beyond traditional decoder-only, causal-attention models trained for next-token prediction. In the first half, we will examine looped Transformers with an adaptive iteration mechanism, demonstrating that these models can learn highly length-generalizable solutions to algorithmic tasks. The …
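To make the looped-Transformer idea concrete, here is a minimal sketch of the general pattern: a single Transformer block whose weights are shared across iterations, with an adaptive rule deciding how many loops to run. All names (LoopedTransformer, halt_head, max_loops) and the ACT-style cumulative-halting rule are illustrative assumptions, not necessarily the exact mechanism presented in the talk.

```python
# A minimal sketch of a looped Transformer with an adaptive iteration
# mechanism. The halting rule below (accumulate a per-token halt
# probability until it crosses a threshold) is an assumption inspired
# by adaptive computation time, not the talk's confirmed method.
import torch
import torch.nn as nn


class LoopedTransformer(nn.Module):
    def __init__(self, d_model=128, n_heads=4, max_loops=16, halt_threshold=0.99):
        super().__init__()
        # One shared block is applied repeatedly: depth via weight tying.
        self.block = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        # A learned scalar per token estimates when to stop looping.
        self.halt_head = nn.Linear(d_model, 1)
        self.max_loops = max_loops
        self.halt_threshold = halt_threshold

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        halted = torch.zeros(x.shape[:2], dtype=torch.bool, device=x.device)
        cum_halt = torch.zeros(x.shape[:2], device=x.device)
        for _ in range(self.max_loops):
            x = self.block(x)  # same weights at every iteration
            p_halt = torch.sigmoid(self.halt_head(x)).squeeze(-1)
            cum_halt = cum_halt + p_halt * (~halted)
            halted = halted | (cum_halt >= self.halt_threshold)
            if halted.all():  # adaptive depth: stop once all tokens halt
                break
        return x


# Usage: longer or harder inputs can consume more iterations, which is
# one intuition for why looped models length-generalize on algorithmic tasks.
model = LoopedTransformer()
out = model(torch.randn(2, 32, 128))  # (batch=2, seq_len=32, d_model=128)
```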