Systems | Information | Learning | Optimization
 

SILO: Beyond Decoder-Only Next Token Prediction

Abstract: This talk presents two distinct approaches that expand the potential of Transformer architectures beyond the traditional decoder-only, causal-attention models for next-token prediction. In the first half, we will examine looped Transformers with an adaptive iteration mechanism, demonstrating that these models can learn highly length-generalizable solutions for algorithmic tasks. The …