Now that just about every phone and smart device has adopted, or at least experimented with, voice control, conversational AI is fast becoming a new frontier. Rather than handling a single request and returning an answer or action, conversational AI aims for a real-time, interactive exchange that can span many questions, answers, and comments. While the core components of conversational AI, such as BERT and RoBERTa for language modeling, are similar to those used for one-off request handling, the concept comes with additional demands on training speed, inference latency, and model size. Today, Nvidia released three open-source technologies designed to address those problems.
Faster BERT training
Although in many cases a pre-trained language model can be applied to new tasks with only light tuning, optimal performance in a particular context often requires retraining. Nvidia has shown that it can now train BERT (Google's reference language model) in under an hour on a DGX SuperPOD made up of 92 DGX-2H servers holding 1,472 Tesla V100-SXM3-32GB GPUs, with 10 Mellanox Infiniband adapters per node. Don't ask me to estimate the hourly rental rate for one of those. But since a model like this generally takes days to train even on high-end GPU clusters, the speedup will make a real difference for companies that can afford the cost.
Faster language model inference
For natural conversation, the industry benchmark is a 10 ms response time. Understanding the query and producing a suggested answer is only one part of the pipeline, so it needs to take well under 10 ms. By optimizing BERT with TensorRT 5.1, Nvidia has gotten inference down to 2.2 ms on an Nvidia T4. The nice thing is that a T4 is within reach of just about any serious project: I use one in Google's compute cloud for my text-generation system, and a virtual server with 4 vCPUs and a T4 rents for just over $1 an hour.
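To put those figures together, here is a back-of-envelope sketch of throughput and cost, using the numbers quoted above (2.2 ms per inference, roughly $1/hour for the VM) and assuming one query at a time with no batching; real deployments batch requests and would do considerably better.

```python
# Rough throughput/cost estimate for BERT inference on a rented T4.
# Assumptions (from the text): ~2.2 ms per query, ~$1/hour for the VM,
# strictly sequential single-stream inference.

LATENCY_S = 0.0022       # 2.2 ms per inference
PRICE_PER_HOUR = 1.0     # assumed VM price in USD

queries_per_hour = 3600 / LATENCY_S
cost_per_million = PRICE_PER_HOUR / queries_per_hour * 1_000_000

print(f"{queries_per_hour:,.0f} queries/hour")
print(f"${cost_per_million:.2f} per million queries")
```

Even under these pessimistic assumptions, that works out to over 1.6 million queries an hour, or well under a dollar per million queries.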
Support for even larger models
One of the Achilles' heels of neural networks is the requirement that all of a model's parameters (including a huge number of weights) sit in memory at once. That limits the complexity of the model that can be trained on a GPU to what fits in its RAM. In my case, for example, my desktop's Nvidia GTX 1080 can only train models that fit within its 8 GB. I can train larger models on my CPU, which has access to more RAM, but it takes far longer. The full GPT-2 language model has 1.5 billion parameters, for example, and Nvidia's expanded version has 8.3 billion.
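The arithmetic behind that limit is simple. A minimal sketch, assuming fp32 storage (4 bytes per parameter) and counting the weights alone; training additionally needs gradients and optimizer state, typically several times more memory on top:

```python
# Why parameter counts cap model size on a single GPU:
# memory for weights = number of parameters * bytes per parameter.

BYTES_PER_PARAM = 4  # fp32; half precision would halve this

def weight_memory_gb(n_params):
    """Memory (in GB) needed just to hold the weights."""
    return n_params * BYTES_PER_PARAM / 1e9

print(weight_memory_gb(1.5e9))  # GPT-2's 1.5B parameters: 6 GB of weights
print(weight_memory_gb(8.3e9))  # the 8.3B-parameter model: ~33 GB
```

So the full GPT-2 already brushes up against an 8 GB card before gradients are even allocated, and the 8.3-billion-parameter model is far beyond any single consumer GPU.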
Nvidia, however, has devised a way for multiple GPUs to work on the language-modeling task in parallel, and as with today's other announcements, it has open-sourced the code that makes this happen. I'd be curious whether the technique is specific to language models or whether it can be applied to multi-GPU training of other kinds of neural networks as well.
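The general idea behind this kind of model parallelism can be illustrated with a toy sketch: split a layer's weight matrix column-wise across devices, let each device compute its slice of the output, and concatenate the results. This is a pure-Python stand-in, not Nvidia's actual implementation; a real system shards across GPUs and overlaps communication with compute.

```python
# Toy illustration of tensor (model) parallelism for one linear layer:
# Y = X @ W is computed as [X @ W_1 | X @ W_2 | ...], where each W_p is a
# column shard that would live on a separate GPU.

def matmul(x, w):
    """Multiply matrix x (m*k) by w (k*n) using plain nested lists."""
    m, k, n = len(x), len(w), len(w[0])
    return [[sum(x[i][t] * w[t][j] for t in range(k)) for j in range(n)]
            for i in range(m)]

def split_columns(w, parts):
    """Split weight matrix w column-wise into `parts` equal shards."""
    step = len(w[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]

def parallel_linear(x, w, parts=2):
    """Compute x @ w shard by shard, then concatenate the partial outputs."""
    shards = split_columns(w, parts)
    outs = [matmul(x, s) for s in shards]  # each would run on its own GPU
    return [sum((o[i] for o in outs), []) for i in range(len(x))]
```

Because each shard only needs its own slice of the weights in memory, a model too large for one device can be spread over several, which is exactly the property that matters for billion-parameter language models.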
Along with releasing these developments and their code on GitHub, Nvidia announced partnerships with Microsoft to improve Bing search results, with Clinc on voice agents, with Passage AI on chatbots, and with RecordSure on conversational analytics.