Equitable Tokenization for Southeast Asian LLMs
This research examines the challenges of multilingual tokenisation, with a focus on Southeast Asian languages, and aims to establish equitable tokenisation strategies. It reviews related work, identifies gaps in existing fairness metrics, and analyses tokenisation methods. Preliminary experiments highlight key challenges and inform the proposed design strategies. The project sets out achievable goals, including open-source tools, equitable frameworks, and improved fairness metrics. Future perspectives emphasise broader...