Papers | Julie Kallini

2026

ICML
Fast Byte Latent Transformer

Julie Kallini, Artidoro Pagnoni, Tomasz Limisiewicz, Gargi Ghosh, Luke Zettlemoyer, Christopher Potts, Xiaochuang Han, and Srinivasan Iyer

In The Forty-Third International Conference on Machine Learning, Jul 2026

Recent byte-level language models (LMs) match the performance of token-level models without relying on subword vocabularies, yet their utility is limited by slow, byte-by-byte autoregressive generation. We address this bottleneck in the Byte Latent Transformer (BLT) through new training and generation techniques. First, we introduce BLT Diffusion (BLT-D), a new model and our fastest BLT variant, trained with an auxiliary block-wise diffusion objective alongside the standard next-byte prediction loss. This enables an inference procedure that generates multiple bytes in parallel per decoding step, substantially reducing the number of forward passes required to generate a sequence. Second, we propose two extensions inspired by speculative decoding that trade some of this speed for higher generation quality: BLT Self-speculation (BLT-S), in which BLT’s local decoder continues generating past its normal patch boundaries to draft bytes, which are then verified with a single full-model forward pass; and BLT Diffusion+Verification (BLT-DV), which augments BLT-D with an autoregressive verification step after diffusion-based generation. All methods may achieve an estimated memory-bandwidth cost over 50% lower than BLT on generation tasks. Each approach offers its own unique advantages, together removing key barriers to the practical use of byte-level LMs.
@inproceedings{kallini2026fastblt,
  title = {Fast Byte Latent Transformer},
  author = {Kallini, Julie and Pagnoni, Artidoro and Limisiewicz, Tomasz and Ghosh, Gargi and Zettlemoyer, Luke and Potts, Christopher and Han, Xiaochuang and Iyer, Srinivasan},
  year = {2026},
  month = jul,
  booktitle = {The Forty-Third International Conference on Machine Learning},
}

ACL
Phonemes to the Rescue: Multilingual Tokenization Based on International Phonetic Alphabet

Milan Miletić, Julie Kallini, and Ekaterina Shutova

In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics, Jul 2026

Multilingual language models often exhibit performance disparities across languages that can arise as early as the tokenization stage. Widely-used subword tokenization approaches favor high-resource languages, and tokenizer-free methods still yield longer sequences for scripts with a higher bytes-per-character ratio. To address these shortcomings, we propose to use the International Phonetic Alphabet (IPA) as a language-agnostic input representation for multilingual tokenizers. IPA provides a compact symbol inventory, greater cross-lingual character overlap, and a more balanced byte-per-character distribution across languages. We train matched pairs of text vs. IPA subword tokenizers across 24 languages and 14 scripts and demonstrate that IPA tokenizers consistently improve tokenization quality, especially for non-Latin scripts, and generalize more effectively to unseen languages and scripts.
@inproceedings{miletic2026phonemes,
  title = {Phonemes to the Rescue: Multilingual Tokenization Based on International Phonetic Alphabet},
  author = {Miletić, Milan and Kallini, Julie and Shutova, Ekaterina},
  year = {2026},
  month = jul,
  booktitle = {Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics},
  address = {San Diego, California, United States},
}

BBS
Language models as tools for investigating the distinction between possible and impossible natural languages

Julie Kallini and Christopher Potts

Behavioral and Brain Sciences, Jul 2026

We argue that language models (LMs) have strong potential as investigative tools for probing the distinction between possible and impossible natural languages and thus uncovering the inductive biases that support human language learning. We outline a phased research program in which LM architectures are iteratively refined to better discriminate between possible and impossible languages, supporting linking hypotheses to human cognition.
@article{Kallini_Potts_2026,
  title = {Language models as tools for investigating the distinction between possible and impossible natural languages},
  volume = {49},
  doi = {10.1017/S0140525X26104737},
  journal = {Behavioral and Brain Sciences},
  author = {Kallini, Julie and Potts, Christopher},
  year = {2026},
  pages = {e211},
}

2025

EMNLP
False Friends Are Not Foes: Investigating Vocabulary Overlap in Multilingual Language Models

Julie Kallini, Dan Jurafsky, Christopher Potts, and Martijn Bartelds

In Findings of the Association for Computational Linguistics: EMNLP 2025, Nov 2025

Subword tokenizers trained on multilingual corpora naturally produce overlapping tokens across languages. Does token overlap facilitate cross-lingual transfer or instead introduce interference between languages? Prior work offers mixed evidence, partly due to varied setups and confounders, such as token frequency or subword segmentation granularity. To address this question, we devise a controlled experiment where we train bilingual autoregressive models on multiple language pairs under systematically varied vocabulary overlap settings. Crucially, we explore a new dimension to understanding how overlap affects transfer: the semantic similarity of tokens shared across languages. We first analyze our models’ hidden representations and find that overlap *of any kind* creates embedding spaces that capture cross-lingual semantic relationships, while this effect is much weaker in models with disjoint vocabularies. On XNLI and XQuAD, we find that models with overlap outperform models with disjoint vocabularies, and that transfer performance generally improves as overlap increases. Overall, our findings highlight the advantages of token overlap in multilingual models and show that substantial shared vocabulary remains a beneficial design choice for multilingual tokenizers.
@inproceedings{kallini-etal-2025-false,
  title = {False {F}riends Are Not Foes: Investigating Vocabulary Overlap in Multilingual Language Models},
  author = {Kallini, Julie and Jurafsky, Dan and Potts, Christopher and Bartelds, Martijn},
  editor = {Christodoulopoulos, Christos and Chakraborty, Tanmoy and Rose, Carolyn and Peng, Violet},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2025},
  month = nov,
  year = {2025},
  address = {Suzhou, China},
  publisher = {Association for Computational Linguistics},
  pages = {21138--21154},
  isbn = {979-8-89176-335-7},
}

ICLR
MrT5: Dynamic Token Merging for Efficient Byte-level Language Models

Julie Kallini, Shikhar Murty, Christopher D. Manning, Christopher Potts, and Róbert Csordás

In The Thirteenth International Conference on Learning Representations, Apr 2025

Models that rely on subword tokenization have significant drawbacks, such as sensitivity to character-level noise like spelling errors and inconsistent compression rates across different languages and scripts. While character- or byte-level models like ByT5 attempt to address these concerns, they have not gained widespread adoption – processing raw byte streams without tokenization results in significantly longer sequence lengths, making training and inference inefficient. This work introduces MrT5 (MergeT5), a more efficient variant of ByT5 that integrates a token deletion mechanism in its encoder to dynamically shorten the input sequence length. After processing through a fixed number of encoder layers, a learnt delete gate determines which tokens are to be removed and which are to be retained for subsequent layers. MrT5 effectively “merges” critical information from deleted tokens into a more compact sequence, leveraging contextual information from the remaining tokens. In continued pre-training experiments, we find that MrT5 can achieve significant gains in inference runtime with minimal effect on performance. When trained on English text, MrT5 demonstrates the capability to transfer its deletion feature zero-shot across several languages, with significant additional improvements following multilingual training. Furthermore, MrT5 shows comparable accuracy to ByT5 on downstream evaluations such as XNLI and character-level tasks while reducing sequence lengths by up to 80%. Our approach presents a solution to the practical limitations of existing byte-level models.
@inproceedings{kallini2025mrt,
  title = {MrT5: Dynamic Token Merging for Efficient Byte-level Language Models},
  author = {Kallini, Julie and Murty, Shikhar and Manning, Christopher D. and Potts, Christopher and Csordás, Róbert},
  month = apr,
  year = {2025},
  booktitle = {The Thirteenth International Conference on Learning Representations},
  url = {https://openreview.net/forum?id=VYWBMq1L7H},
}

2024

ACL
Mission: Impossible Language Models

Julie Kallini, Isabel Papadimitriou, Richard Futrell, Kyle Mahowald, and Christopher Potts

In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Aug 2024

(Best Paper Award)

Chomsky and others have very directly claimed that large language models (LLMs) are equally capable of learning languages that are possible and impossible for humans to learn. However, there is very little published experimental evidence to support such a claim. Here, we develop a set of synthetic impossible languages of differing complexity, each designed by systematically altering English data with unnatural word orders and grammar rules. These languages lie on an impossibility continuum: at one end are languages that are inherently impossible, such as random and irreversible shuffles of English words, and on the other, languages that may not be intuitively impossible but are often considered so in linguistics, particularly those with rules based on counting word positions. We report on a wide range of evaluations to assess the capacity of GPT-2 small models to learn these uncontroversially impossible languages, and crucially, we perform these assessments at various stages throughout training to compare the learning process for each language. Our core finding is that GPT-2 struggles to learn impossible languages when compared to English as a control, challenging the core claim. More importantly, we hope our approach opens up a productive line of inquiry in which different LLM architectures are tested on a variety of impossible languages in an effort to learn more about how LLMs can be used as tools for these cognitive and typological investigations.
@inproceedings{kallini-etal-2024-mission,
  title = {Mission: Impossible Language Models},
  author = {Kallini, Julie and Papadimitriou, Isabel and Futrell, Richard and Mahowald, Kyle and Potts, Christopher},
  editor = {Ku, Lun-Wei and Martins, Andre and Srikumar, Vivek},
  booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month = aug,
  year = {2024},
  address = {Bangkok, Thailand},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2024.acl-long.787},
  pages = {14691--14714},
}

2023

GWC
What to Make of make? Sense Distinctions for Light Verbs

Julie Kallini and Christiane Fellbaum

In Global WordNet Conference 2023, Jan 2023

Verbs like make, have, and get present challenges for applications requiring automatic word sense discrimination. These verbs are both highly frequent and polysemous, with semantically “full” readings, as in make dinner, and “light” readings, as in make a request. Lexical resources like WordNet encode dozens of senses, making discrimination difficult and inviting proposals for reducing the number of entries or grouping them into coarser-grained supersenses. We propose a data-driven, linguistically-based approach to establishing a motivated sense inventory, focusing on make to establish a proof of concept. From several large, syntactically annotated corpora, we extract nouns that are complements of the verb make, and group them into clusters based on their Word2Vec semantic vectors. We manually inspect, for each cluster, the words with vectors closest to the centroid as well as a random sample of words within the cluster. The results show that the clusters reflect an intuitively plausible sense discrimination of make. As an evaluation, we test whether words within a given cluster cooccur in coordination phrases, such as apples and oranges, as prior work has shown that such conjoined nouns are semantically related. Conversely, noun complements from different clusters are less likely to be conjoined. Thus, coordination provides a similarity metric independent of the contextual embeddings used for clustering. Our results pave the way for a WordNet sense inventory that, while not inconsistent with the present one, would reduce it significantly and hold promise for improved automatic word sense discrimination.
@inproceedings{GWC2023,
  title = {What to Make of \emph{make}? Sense Distinctions for Light Verbs},
  author = {Kallini, Julie and Fellbaum, Christiane},
  booktitle = {Global WordNet Conference 2023},
  month = jan,
  year = {2023},
  address = {Donostia-San Sebastian, Spain},
  publisher = {Global WordNet Association},
}

2022

TSD
Computational Approaches for Understanding Semantic Constraints on Two-termed Coordination Structures

Julie Kallini and Christiane Fellbaum

In Proceedings of the 25th International Conference on Text, Speech and Dialogue, Sep 2022

Coordination is a linguistic phenomenon where two or more terms or phrases, called conjuncts, are conjoined by a coordinating conjunction, such as and, or, or but. Well-formed coordination structures seem to require that the conjuncts are semantically similar or related. In this paper, we utilize English corpus data to examine the semantic constraints on syntactically like coordinations, which link constituents with the same lexical or syntactic categories. We examine the extent to which these semantic constraints depend on the type of conjunction or on the lexical or syntactic category of the conjuncts. We employ two distinct, independent metrics to measure the semantic similarity of conjuncts: WordNet relations and semantic word embeddings. Our results indicate that both measures of similarity have varying distributions depending on the particular conjunction and the conjuncts’ lexical or syntactic categories.
@inproceedings{TSD2022,
  author = {Kallini, Julie and Fellbaum, Christiane},
  editor = {Sojka, Petr and Hor{\'a}k, Ale{\v{s}} and Kope{\v{c}}ek, Ivan and Pala, Karel},
  title = {Computational Approaches for Understanding Semantic Constraints on Two-termed Coordination Structures},
  booktitle = {Proceedings of the 25th International Conference on Text, Speech and Dialogue},
  month = sep,
  year = {2022},
  publisher = {Springer International Publishing},
  address = {Cham},
  pages = {64--76},
  isbn = {978-3-031-16270-1}
}

2021

EMNLP
A Corpus-based Syntactic Analysis of Two-termed Unlike Coordination

Julie Kallini and Christiane Fellbaum

In Findings of the Association for Computational Linguistics: EMNLP 2021, Nov 2021

Coordination is a phenomenon of language that conjoins two or more terms or phrases using a coordinating conjunction. Although coordination has been explored extensively in the linguistics literature, the rules and constraints that govern its structure are still largely elusive and widely debated amongst linguists. This paper presents a study of two-termed unlike coordinations in particular, where the two conjuncts of the coordination phrase form valid constituents but have distinct categories. We conducted a syntactic analysis of the phrasal categories that can be conjoined in such unlike coordinations through a computational corpus-based approach, utilizing the Corpus of Contemporary American English (COCA) as the main data source, as well as the Penn Treebank (PTB). The results show that the two conjuncts within unlike coordinations display different properties based on their position, supporting an antisymmetric view of the structure of coordination. This research provides new data and perspectives through the use of statistical techniques that can help shape future theories and models of coordination.
@inproceedings{kallini-fellbaum-2021-corpus-based,
  title = {A Corpus-based Syntactic Analysis of Two-termed Unlike Coordination},
  author = {Kallini, Julie and Fellbaum, Christiane},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2021},
  month = nov,
  year = {2021},
  address = {Punta Cana, Dominican Republic},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2021.findings-emnlp.335},
  doi = {10.18653/v1/2021.findings-emnlp.335},
  pages = {3998--4008},
}