The Learning Curve, Part 3: Mastering AI Data for Optimal Performance

Samsung is at the forefront of developing cutting-edge mobile AI experiences. We are visiting Samsung Research centres around the world to explore how Galaxy AI helps users unlock more of their potential. With support for 16 languages, Galaxy AI makes it easier for people to expand their language capabilities, even without an internet connection, through on-device translation in features such as Live Translate, Interpreter, Note Assist and Browsing Assist. During our recent visit to Jordan, we looked at the intricacies of building an AI model for Arabic, a language known for its diverse range of dialects. This time, we head to Vietnam to look at how data is prepared for training AI models.

Can you explain the difference between a ghost, a grave and a mother in Vietnamese? The three words are “ma,” “mả” and “má” respectively, and the only way to tell them apart is by their tone. Although Vietnamese is spoken by 97 million people worldwide, it has received surprisingly little attention. The example highlights the challenges AI models face when learning a language: they lack the ability to grasp the context, emotions and intentions of a conversation, which makes the task complex.

Samsung R&D Institute Vietnam (SRV) used highly accurate data to improve its AI model’s ability to identify even the most nuanced variations in the language.

The accuracy of automatic speech recognition (ASR), neural machine translation (NMT), and text-to-speech (TTS) is directly influenced by the quality of data. These processes are crucial for Galaxy AI features like Live Translate, Interpreter, Chat Assist, and Browsing Assist, as they effectively overcome language barriers.

A Storm of Challenges

“Vietnamese is a language that is both intricate and varied, with a wide range of expressions that can be difficult to fully grasp,” explains Ngô Hồng Thái, the NMT lead at SRV. Adding Vietnamese as one of Galaxy AI’s 16 supported languages was a demanding undertaking.

“Creating an AI model for Vietnamese was quite a challenge,” he remarks, as he goes on to describe the obstacles encountered during the development process.

Like many other languages, Vietnamese is tonal, and its six distinct tones are an integral part of its linguistic structure. As the example above shows, slight variations in vocalisation can significantly change the meaning of a word, so a careful and thorough approach was required.
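In written Vietnamese, the tone is carried by a diacritic, which makes the “ma” example easy to inspect in code. The following is a minimal sketch, not SRV’s method: the function name `split_tone` and the table `TONE_MARKS` are hypothetical, and the approach relies only on standard Unicode decomposition, where five of the six tones appear as combining marks and the level tone has no mark.

```python
import unicodedata

# Combining marks used for the five marked Vietnamese tones.
# The sixth tone (ngang, "level") is written with no mark at all.
TONE_MARKS = {
    "\u0300": "huyền",  # combining grave accent
    "\u0301": "sắc",    # combining acute accent
    "\u0303": "ngã",    # combining tilde
    "\u0309": "hỏi",    # combining hook above
    "\u0323": "nặng",   # combining dot below
}


def split_tone(syllable):
    """Return (syllable without its tone mark, tone name).

    Illustrative helper: only tone marks are removed, so vowel-quality
    marks such as the circumflex in "â" are preserved.
    """
    decomposed = unicodedata.normalize("NFD", syllable)
    tone = "ngang"
    kept = []
    for ch in decomposed:
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)
    return unicodedata.normalize("NFC", "".join(kept)), tone


for word, gloss in [("ma", "ghost"), ("mả", "grave"), ("má", "mother")]:
    base, tone = split_tone(word)
    print(f"{word} ({gloss}): base={base}, tone={tone}")
```

All three words reduce to the same base syllable and differ only in the tone label, which is why a dataset that loses or mistypes a single diacritic changes the word entirely.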

“When words that sound similar are analysed, one word is made up of multiple short segments, or ‘frame sets’,” explains Bui Ngoc Tung, the ASR lead at SRV. The AI model is able to distinguish between short audio frames that last around 20 milliseconds. It can identify which words correspond to a specific sequence of frames. It is crucial to dedicate significant effort to the initial phases of the AI learning process.
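To make the idea of 20-millisecond frames concrete, here is a minimal sketch of slicing audio into fixed-length frames. The 16 kHz sample rate, the function name `split_into_frames` and the use of non-overlapping frames are assumptions for illustration; the article does not describe SRV’s actual front end, and production systems often use overlapping windows.

```python
def split_into_frames(samples, sample_rate_hz=16000, frame_ms=20):
    """Split audio samples into consecutive, non-overlapping frames.

    sample_rate_hz is an assumed value; a trailing partial frame is dropped.
    """
    frame_len = sample_rate_hz * frame_ms // 1000  # samples per frame
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    n_frames = len(samples) // frame_len
    return [
        samples[i * frame_len:(i + 1) * frame_len]
        for i in range(n_frames)
    ]


# One second of silence at the assumed 16 kHz sample rate.
one_second = [0.0] * 16000
frames = split_into_frames(one_second)
print(len(frames), len(frames[0]))  # 50 frames of 320 samples each
```

At these assumed settings, a single spoken syllable lasting a few hundred milliseconds spans roughly ten or more frames, and the model has to map that sequence of frames to the right word.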

In addition, Vietnamese frequently includes homophones and homonyms. People can usually rely on the surrounding context and nonverbal cues in a conversation to tell apart homophones or homonyms with distinct meanings. AI models, however, must be trained to recognise and distinguish between tones and words that are alike.

“This task is quite complex,” Thái explains. Recognising the subtle nuances of the Vietnamese language depends not only on the quantity of data but also on its accuracy.

Thorough Preparation

The process of refining the data involves three steps. First, the audio and text used for training the AI model are thoroughly reviewed and corrected. Next, the dataset undergoes random inspections to check its overall quality. Finally, the dataset is normalised and cleaned so that it is ready for training.
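The three steps can be sketched as a small text-only pipeline. This is an illustration under stated assumptions, not SRV’s tooling: the record layout (`id`, `text`), the function names and the specific cleaning rules (Unicode NFC normalisation, whitespace collapsing, dropping empty transcripts) are all hypothetical choices.

```python
import random
import re
import unicodedata


def review_and_correct(records, corrections):
    """Step 1: apply reviewer-supplied fixes (hypothetical id -> text map)."""
    return [
        {**r, "text": corrections.get(r["id"], r["text"])}
        for r in records
    ]


def sample_for_inspection(records, k, seed=0):
    """Step 2: draw a random sample for a manual quality check."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


def normalise_and_clean(records):
    """Step 3: normalise Unicode and whitespace, drop empty transcripts."""
    cleaned = []
    for r in records:
        text = unicodedata.normalize("NFC", r["text"])
        text = re.sub(r"\s+", " ", text).strip()
        if text:
            cleaned.append({**r, "text": text})
    return cleaned


records = [
    {"id": 1, "text": "  xin   chào "},
    {"id": 2, "text": "cam on"},  # diacritics missing in the script
    {"id": 3, "text": "   "},     # empty transcript
]
corrections = {2: "cảm ơn"}

reviewed = review_and_correct(records, corrections)
spot_check = sample_for_inspection(reviewed, k=2)
dataset = normalise_and_clean(reviewed)
print([r["text"] for r in dataset])
```

Normalising to a single Unicode form matters for Vietnamese in particular, because the same accented letter can be encoded either as one precomposed character or as a base letter plus combining marks.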

“We thoroughly performed a series of tests to check the accuracy of our dataset,” says Nguyen Manh Duy, TTS lead at SRV who oversees database creation. “We faced a number of unexpected problems including misspelled words in scripts and background noise or incorrect pronunciation during audio recordings. We spent significant time refining and improving our training data.”

In addition to the unique linguistic challenges of Vietnamese, far less data is publicly available for it than for more widely spoken languages. “This is another reason why the data refinement stage is so important,” he adds. “Since we had limited sources, every piece of data had to be fully reliable. There was no margin for error.”

Moreover, the AI model for Vietnamese must consider both tonal and regional differences. To improve the AI model’s accuracy, the team collected vast amounts of data with Vietnam’s northern, central and southern accents — resulting in an enormous amount of information to refine and verify.

Continued Improvement

Developers at SRV completed the project after months of hard work, and Vietnamese became one of the first languages to be supported by Galaxy AI. Despite this success, the team is ceaselessly working to improve the Vietnamese Galaxy AI experience.

“We’re continuing to enhance the AI model by incorporating user feedback about the relevance of words and phrases in Galaxy AI,” says Tran Tuan Minh, leader of the AI language development project at SRV. “We have just taken our first steps into a more open world — and we have so much more to explore together.”

In the next episode of The Learning Curve, we will head to China to dig into how AI models are trained and fine-tuned.


To learn more about Galaxy AI, visit: https://www.samsung.com/my/galaxy-ai
