This talk presents two distinct approaches that expand the potential of Transformer architectures beyond the traditional decoder-only, causal-attention models used for next-token prediction. In the first half, we will examine looped Transformers with an adaptive iteration mechanism, demonstrating that these models can learn highly length-generalizable solutions to algorithmic tasks. In the second half, we will introduce Encoder-only Next Token Prediction (ENTP), which challenges the necessity of causal attention in next-token prediction by showing that encoder-only architectures can outperform decoder-only models on various tasks. Through analysis and experimental validation, our results suggest substantial untapped potential for outperforming current large language models, opening pathways to more capable Transformer designs. The first half will cover https://sites.google.com/wisc.edu/looped-transformers-for-lengen/home, and the second half will cover https://sites.google.com/wisc.edu/entp/home. This talk is based on joint work with Ying Fan, Yilun Du, Kannan Ramchandran, Ethan Ewer, Daewon Chae, Thomas Zeng, and Jinkyu Kim.
Bio:
Kangwook Lee is an Assistant Professor in the Electrical and Computer Engineering Department and the Computer Sciences Department (by courtesy) at the University of Wisconsin-Madison. Previously, he was a Research Assistant Professor at the Information and Electronics Research Institute of KAIST, and before that a postdoctoral scholar at the same institute. He received his PhD in 2016 from the Electrical Engineering and Computer Sciences Department at UC Berkeley. He is the recipient of the NSF CAREER Award, the IEEE Joint Communications Society/Information Theory Society Paper Award, the Amazon Research Award, and the KSEA Young Investigator Grant Award.