Longxu Dou 窦隆绪

I am a Research Scientist at Sea AI Lab, working on Natural Language Processing, particularly multilingual LLM pre-training (Sailor/Sailor2).

I earned my Ph.D. and Bachelor's degrees in Computer Science from Harbin Institute of Technology, advised by Professor Wanxiang Che. I worked as a research intern at Microsoft Research Asia with Dr. Jian-Guang Lou and at the National University of Singapore with Professor Min-Yen Kan.

Internship positions are available onsite (in Mainland China, Hong Kong, and Singapore) or remotely. I’m always open to discussions and collaborations. Feel free to reach out via email with your background and interests!

Email  /  Google Scholar  /  LinkedIn  /  Github  /  Twitter

Recent Research Projects (# indicates mentorship, * indicates equal contribution)
Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs
Longxu Dou*, Qian Liu*, Fan Zhou*, Changyu Chen*, Zili Wang, Ziqi Jin, Zichen Liu, Tongyao Zhu, Cunxiao Du, Penghui Yang, Haonan Wang, Jiaheng Liu, Yongchi Zhao, Xiachong Feng, Xin Mao, Man Tsung Yeung, Kunat Pipatanakul, Fajri Koto, Min Si Thu, Hynek Kydlíček, Zeyi Liu, Qunshu Lin, Sittipong Sripaisarnmongkol, Kridtaphad Sae-Khow, Nirattisai Thongchim, Taechawat Konkaew, Narong Borijindargoon, Anh Dao, Matichon Maneegard, Phakphum Artkaew, Zheng-Xin Yong, Quan Nguyen, Wannaphong Phatthiyaphaibun, Hoang H. Tran, Mike Zhang, Shiqi Chen, Tianyu Pang, Chao Du, Xinyi Wan, Wei Lu, Min Lin
Report, 2025

Sailor2 is a community-driven project delivering state-of-the-art multilingual language models at three scales: 1B, 8B, and 20B parameters. Released under the Apache 2.0 license, these models specialize in South-East Asian (SEA) languages, making advanced models more accessible across the region. Building on Qwen2.5, Sailor2 is continually pre-trained on over 500B high-quality tokens to support 15 languages: English, Chinese, Burmese, Cebuano, Ilocano, Indonesian, Javanese, Khmer, Lao, Malay, Sundanese, Tagalog, Thai, Vietnamese, and Waray.

Sailor2-20B-Chat achieves a nearly 50% win rate against GPT-4o-0806 on SeaWildBench, showcasing GPT-4o-level performance in local chat scenarios across South-East Asian languages.
Sailor: Open Language Models for South-East Asia
Longxu Dou*, Qian Liu*, Guangtao Zeng, Jia Guo, Jiahui Zhou, Xin Mao, Ziqi Jin, Wei Lu, Min Lin
EMNLP Demo, 2024

Sailor is a family of open language models ranging from 0.5B to 14B parameters, tailored for South-East Asian (SEA) languages. These models are continually pre-trained from Qwen1.5, a strong base model for multilingual use cases, on an additional 200B to 400B tokens, primarily covering English, Chinese, Vietnamese, Thai, Indonesian, Malay, and Lao. The training leverages several techniques, including BPE dropout to improve model robustness, aggressive data cleaning and deduplication, and small proxy models to optimize the data mixture.
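As a rough illustration of the BPE-dropout idea (randomly skipping merges at encoding time so the same word can be segmented in different ways, which makes the model more robust to tokenization), here is a minimal, self-contained Python sketch; the merge table and word are made up for the example and are not Sailor's actual tokenizer.

import random

# Toy merge table in priority order; a real tokenizer learns tens of thousands of merges.
MERGES = [("h", "e"), ("l", "l"), ("he", "ll"), ("hell", "o")]

def bpe_encode(word, dropout=0.0, rng=random):
    """Greedy BPE encoding where each merge is skipped with probability `dropout`."""
    tokens = list(word)
    for a, b in MERGES:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b and rng.random() >= dropout:
                tokens[i:i + 2] = [a + b]  # apply the merge in place
            else:
                i += 1
    return tokens

print(bpe_encode("hello"))               # deterministic: ['hello']
print(bpe_encode("hello", dropout=0.5))  # stochastic, e.g. ['he', 'l', 'l', 'o']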

Over 150K downloads since March 2024.
SailCraft: Data Toolkit for Sailor Language Models
Longxu Dou, Qian Liu
Tool, 2024

The full data processing scripts used in developing our Sailor models. The repo provides an end-to-end data processing pipeline for LLM training. With this codebase, you can clean your own dataset and: (1) get filtered data counts after each processing stage; (2) easily configure language-specific cleaning rules (we support Arabic, Bengali, Catalan, Spanish, Basque, French, Hindi, Portuguese, and Urdu, and optimize for English, Indonesian, Vietnamese, Chinese, Thai, Lao, and Malay); (3) inspect what data was removed at each processing stage.
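As a sketch of what such a pipeline looks like, the snippet below chains two illustrative stages (a rule-based filter and exact deduplication) and prints per-stage counts; the stage names and rules are hypothetical and far simpler than SailCraft's actual ones.

import hashlib
import re

def rule_filter(docs, min_chars=200):
    # Keep documents that are long enough and contain at least some word characters.
    return [d for d in docs if len(d) >= min_chars and re.search(r"\w", d)]

def exact_dedup(docs):
    # Drop exact duplicates by hashing document contents.
    seen, kept = set(), []
    for d in docs:
        h = hashlib.md5(d.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(d)
    return kept

def run_pipeline(docs):
    stages = [("rule_filter", rule_filter), ("exact_dedup", exact_dedup)]
    for name, stage in stages:
        before = len(docs)
        docs = stage(docs)
        # Report how many documents survive each stage, mirroring point (1) above.
        print(f"{name}: kept {len(docs)} of {before}")
    return docs

clean = run_pipeline(["lorem ipsum " * 30] * 3)  # toy corpus: three identical long documents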

RegMix: Data Mixture as Regression for Language Model Pre-training
Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, Min Lin
ICLR (Spotlight), 2025

The data mixture for large language model pre-training significantly impacts performance, yet how to determine an effective mixture remains unclear. We propose RegMix, which automatically identifies a high-performing data mixture by formulating mixture selection as a regression task. RegMix trains many small models on diverse data mixtures, uses regression to predict the performance of unseen mixtures, and applies the best predicted mixture to train a large-scale model with orders of magnitude more compute.
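The core loop can be sketched in a few lines. The snippet below uses synthetic proxy-run results and a plain linear regressor purely for illustration; the actual RegMix setup differs in scale and regressor choice.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_domains, n_proxy_runs = 5, 64

# Each proxy run pairs a random mixture over data domains (weights sum to 1)
# with the validation loss of a small model trained on that mixture.
# Here the losses are synthetic stand-ins for real proxy-model results.
mixtures = rng.dirichlet(np.ones(n_domains), size=n_proxy_runs)
losses = mixtures @ rng.uniform(2.0, 4.0, size=n_domains) + rng.normal(0, 0.01, n_proxy_runs)

# Fit a regressor from mixture weights to loss, then search many candidate
# mixtures and keep the one with the lowest predicted loss for the large-scale run.
reg = LinearRegression().fit(mixtures, losses)
candidates = rng.dirichlet(np.ones(n_domains), size=100_000)
best_mixture = candidates[np.argmin(reg.predict(candidates))]
print("predicted-best mixture:", np.round(best_mixture, 3))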

Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies
Chaofan Tao, Qian Liu#, Longxu Dou#, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, Ngai Wong
NeurIPS, 2024

Research on scaling large language models (LLMs) has primarily focused on model parameters and training data size, overlooking the role of vocabulary size. We investigate how vocabulary size impacts LLM scaling laws by training models ranging from 33M to 3B parameters on up to 500B characters with various vocabulary configurations. We propose three complementary approaches for predicting the compute-optimal vocabulary size: IsoFLOPs analysis, derivative estimation, and parametric fit of the loss function. Our approaches converge on the conclusion that the optimal vocabulary size depends on the compute budget, with larger models requiring larger vocabularies.
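As a sketch of the IsoFLOPs-style analysis: at a fixed compute budget, fit loss as a smooth function of vocabulary size and take the minimizer as the compute-optimal vocabulary. The numbers below are synthetic placeholders, not results from the paper.

import numpy as np

# Hypothetical (vocabulary size, loss) measurements taken at one fixed FLOPs budget.
vocab_sizes = np.array([16_000, 32_000, 48_000, 64_000, 96_000, 128_000])
losses = np.array([2.92, 2.85, 2.83, 2.84, 2.88, 2.94])

# Fit a quadratic in log(vocab size) and solve for its minimum analytically;
# repeating this across FLOPs budgets traces out how the optimum grows with compute.
a, b, c = np.polyfit(np.log(vocab_sizes), losses, deg=2)
optimal_vocab = np.exp(-b / (2 * a))
print(f"estimated compute-optimal vocabulary size: {optimal_vocab:,.0f}")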


Design and source code from Jon Barron.