Publications | Yilei Tu

2025

ACL 2025
Findings

Blessing of Multilinguality: A Systematic Analysis of Multilingual In-Context Learning

Yilei Tu, Andrew Xue, and Freda Shi

Findings of the Association for Computational Linguistics: ACL 2025, 2025


While multilingual large language models generally perform adequately, and sometimes even rival English performance on high-resource languages (HRLs), they often significantly underperform on low-resource languages (LRLs). Among several prompting strategies aimed at bridging the gap, multilingual in-context learning (ICL) has been particularly effective when demonstrations in the target languages are unavailable. However, a systematic understanding of when and why it works well is lacking. In this work, we systematically analyze multilingual ICL, using demonstrations in HRLs to enhance cross-lingual transfer. We show that demonstrations in mixed HRLs consistently outperform English-only ones across the board, particularly for tasks written in LRLs. Surprisingly, our ablation study shows that the presence of irrelevant non-English sentences in the prompt yields measurable gains, suggesting the effectiveness of multilingual exposure itself. Our results highlight the potential of strategically leveraging multilingual resources to bridge the performance gap for underrepresented languages.
ICML 2025

Do Vision-Language Models Really Understand Visual Language?

Yifan Hou, Buse Giledereli, Yilei Tu, and Mrinmaya Sachan

Proceedings of the 42nd International Conference on Machine Learning, 2025


Visual language is a system of communication that conveys information through symbols, shapes, and spatial arrangements. Diagrams are a typical example of a visual language depicting complex concepts and their relationships in the form of an image. The symbolic nature of diagrams presents significant challenges for building models capable of understanding them. Yet, recent studies seem to suggest that Large Vision-Language Models (LVLMs) can even tackle complex reasoning tasks involving diagrams. In this paper, we investigate this phenomenon by developing a comprehensive test suite to evaluate the diagram comprehension capability of LVLMs. Our test suite uses a variety of questions focused on concept entities and their relationships over a set of synthetic as well as real diagrams across several domains to evaluate the recognition and reasoning abilities of models. Our evaluation of three LVLMs (GPT-4V, GPT-4o, and Gemini) shows that while these models can accurately identify and reason about entities, their ability to understand relationships is notably limited. Further testing reveals that the decent performance on diagram understanding largely stems from leveraging their background knowledge as shortcuts to identify and reason about the relational information. Thus, we conclude that LVLMs have a limited capability for genuine diagram understanding, and their impressive performance in diagram reasoning is an illusion emanating from other confounding factors, such as the background knowledge in the models.
Manuscript

Knowledge-Enhanced Academic Recommender: Harnessing Large Language Models and Knowledge Graphs

Noah Mamié*, Yilei Tu*, Prakhar Bhandari, Susie Xi Rao, and Peter Egger

2025

2023

Manuscript

Complex Probes are Favored: A Revisit of Probe Complexity

Yilei Tu, Jiaoda Li, and Ryan Cotterell

2023


Probes are widely used to discern linguistic information embedded in pretrained representations. Simple probes are commonly preferred, on the argument that complex probes, endowed with high learning capabilities, blur the distinction between high performance that is due to inherent linguistic knowledge in the representations and high performance that is due to the probe’s learning of the task. This argument builds upon an implicit assumption: a probe with higher complexity can learn any task better. We investigate the performance of probes with varying complexities on three different types of representation in a practical setting with a gradient-descent-based optimization algorithm and a training set of fixed size. Contrary to common belief, we find that more complex probes perform worse, especially when linguistic information is not present in the representations. Yet on pretrained representations that are believed to contain linguistic information, probe performance is robust and much less affected by complexity. Our results challenge the common wisdom of favoring low-complexity probes.
IJCNLP 2023

SAINE: Scientific Annotation and Inference Engine of Scientific Research

Susie Xi Rao, Yilei Tu, and Peter H. Egger

IJCNLP-AACL 2023: System Demonstrations, Nov 2023


We present SAINE, a Scientific Annotation and Inference ENgine based on a set of standard open-source software, such as Label Studio and MLflow. We show that our annotation engine can benefit the further development of a more accurate classification. Based on our previous work on hierarchical discipline classifications, we demonstrate its application using SAINE in understanding the space for scholarly publications. The user study of our annotation results shows that user input collected with the help of our system can help us better understand the classification process. We believe that our work will help to foster greater transparency and a better understanding of scientific research. Our annotation and inference engine can further support downstream meta-science projects. We welcome collaboration and feedback from the scientific community on these projects. The demonstration video can be accessed from this https URL. A live demo website is available at this https URL upon free registration.
IEEE TASE

Reinforcement-Learning-Informed Prescriptive Analytics for Air Traffic Flow Management

Yuan Wang, Weilin Cai, Yilei Tu, and Jianfeng Mao

IEEE Transactions on Automation Science and Engineering, Nov 2023


Air Traffic Flow Management (ATFM) is a complex sequential decision-making problem that involves dynamically matching flights with sectors under changing environmental conditions. Finding an optimal solution for ATFM is challenging due to its dynamic nature and operational constraints. Reinforcement learning is a well-suited approach for sequential decision-making problems. However, ATFM poses three potential challenges: 1) large state space, 2) combinatorial action space, and 3) variational feasible action set, resulting from numerous agents with tightly-coupled constraints. These challenges can hinder the effectiveness of direct application of reinforcement learning methods. Prescriptive analytics can readily handle hard constraints via a mathematical optimization model, but it is computationally intractable for online sequential decision-making problems under changing environments. To address these challenges, we propose a novel framework, Reinforcement-Learning-Informed Prescriptive Analytics (RLIPA), in which an “informing” scheme is devised to integrate reinforcement learning and prescriptive analytics and leverage their strengths in predicting future reward and coping with hard constraints, respectively. RLIPA is a general framework that can be adapted to other problems beyond ATFM that typically involve many agents with tightly-coupled hard constraints. We demonstrate the usage and performance of RLIPA using numerical results and a real case study in comparison to two baseline approaches.
Note to Practitioners: To improve Air Traffic Flow Management (ATFM) and reduce flight congestion, we propose a new method called reinforcement-learning-informed prescriptive analytics (RLIPA). RLIPA is a general framework that facilitates online sequential decision-making problems with multiple agents coupled with hard constraints. The approach consists of two stages: first, estimating future potential rewards for each agent via reinforcement learning, and second, passing these potential rewards to the subsequent prescriptive analytics stage, which uses them to construct and solve the downstream optimization problem dealing with hard coupling constraints among agents. Numerical experiments demonstrate the efficiency and effectiveness of RLIPA in the application of ATFM. In most cases, RLIPA can offer more than 10x improvement in computational efficiency while maintaining or improving the level of optimality. The framework of RLIPA can be further extended to problems such as order dispatch in ride-hailing systems and food delivery.
TRC

Prediction of estimated time of arrival for multi-airport systems via “Bubble” mechanism

Lechen Wang, Jianfeng Mao, Lishuai Li, Xuechun Li, and Yilei Tu

Transportation Research Part C: Emerging Technologies, Sep 2023


Predicting Estimated Time of Arrival (ETA) for a Multi-Airport System (MAS) is much more challenging than for a single airport system because of the complex air route structure, dense air traffic volume and vagaries of traffic conditions in an MAS. In this work, we propose a novel “Bubble” mechanism to accurately predict medium-term ETA for an MAS, in which the prediction of travel time of an origin–destination (OD) pair is decomposed into two stages, termed the out-MAS and in-MAS stages. For the out-MAS stage, Auto-Regressive Integrated Moving Average (ARIMA) is used to predict the travel time of a flight to reach the MAS boundary. For the in-MAS stage, we construct new spatio-temporal features based on clustering analysis of trajectory patterns facilitated by a novel data-driven hybrid polar sampling method. A sequence-to-sequence prediction model, Multi-variate Stacked Fully connected Bidirectional Long Short-Term Memory, is further developed to achieve multi-step-ahead predictions of in-MAS travel time for each trajectory pattern using the spatio-temporal features as input. Finally, the medium-term ETA prediction for an MAS is achieved by integrating the out-MAS and in-MAS predictions with the help of trajectory pattern prediction via random forest. A case study of predicting medium-term ETA for a typical MAS in China, the Guangdong–Hong Kong–Macao Greater Bay Area, is conducted to demonstrate the usage and promising performance of the proposed method in comparison to several commonly used end-to-end learning methods.