
Small is the new big: the rise of small language models

Sunita Tiwary
Jul 22, 2024

In the dynamic realm of artificial intelligence (AI) and machine learning, a compelling shift is taking center stage: the ascent of small language models (SLMs). The tech world is smitten with the race to build and use large, complex models boasting billions and trillions of parameters, and consumers have become unwitting accomplices in the obsession with “large”. However, recent trends indicate a growing interest in smaller, more efficient models. This article delves into the reasons behind this shift, its implications, and what it means for the future of AI.

Before we dive into SLMs: how did the wave of large language models grow?

In the not-so-distant past, natural language processing (NLP) was deemed too intricate and nuanced for modern AI. Then, in November 2022, OpenAI introduced ChatGPT, and within a mere week it garnered more than a million users. Suddenly, AI, once confined to research and academic circles, became accessible to the masses. For example, my nine-year-old daughter effortlessly began using ChatGPT for school research tasks, while my mother-in-law, in her late sixties, whose tech acquaintance was limited to WhatsApp and Facebook, now enthusiastically shares the latest news about AI and her budding interest in GenAI during our tea-time conversations.

The launch of ChatGPT marked the onset of the very loud and very public (and costly) GenAI revolution, effectively democratizing AI. This is evident in the integration of AI copilots into various products, the exponential growth of large language models (LLMs), and the rise of numerous startups in this space. The landscape of technology and our world will never be the same.

To comprehend the magnitude of this shift, let's consider the parameters of AI models. The number of parameters is a core measure of an AI model's scale and complexity. GPT-2 had 1.5 billion parameters; then OpenAI released GPT-3, with a whopping 175 billion parameters. At the time, it was the largest neural network ever created, more than a hundred times larger than its predecessor just a year earlier. Now we see trillion-parameter LLMs.

Deciphering SLMs

While the definition of an SLM remains contextual, some research identifies them as models encompassing approximately 10 billion parameters or less. SLMs are lightweight neural networks that can process natural language with fewer parameters and computational resources than LLMs. Unlike LLMs (which are generalized models), SLMs are usually purpose-driven and tailored to address specific tasks, applications, or use cases.

Recent studies demonstrate that SLMs can be fine-tuned to achieve comparable or even superior performance compared to their larger counterparts in specific tasks.

For example, phi-3-mini is a 3.8 billion parameter SLM trained on 3.3 trillion tokens whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (phi-3-mini achieves 69 percent on MMLU and 8.38 on MT-bench). Another example is phi-3-vision, a 4.2 billion parameter model based on phi-3-mini with strong reasoning capabilities for image and text prompts. Similarly, phi-2 matches or outperforms models up to 25 times larger on complex benchmarks. Another such model is Orca 2, which was built for research purposes. TinyLlama, launched in late 2023, has just 1 billion parameters, and was followed by Apple's OpenELM for edge devices, launched in April 2024.

Why does it matter?

SLMs bring many benefits, notably swift training and faster inference. Beyond efficiency, these models contribute to a more sustainable footprint, with reduced carbon and water usage. In addition, SLMs strike a harmonious balance between performance and resource efficiency. Training SLMs is much more cost-effective thanks to the reduced number of parameters, and offloading the processing workload to edge devices further decreases infrastructure and operating costs.


1. Efficiency and sustainability

It is crucial to acknowledge that LLMs demand substantial computational resources and energy. Their complex architectures and vast parameter counts necessitate significant processing power, which contributes to environmental and sustainability concerns.

In contrast, SLMs significantly reduce computational and power consumption through several key factors:

  • Reduced computational load: Small models have fewer parameters and require less computation during inference, leading to lower power consumption (a back-of-envelope sketch follows this list)
  • Shorter processing time: The reduced model size decreases the time required to process inputs, thus consuming less energy per task
  • Lower memory usage: Smaller models need less memory, which reduces the power needed for memory access and management, a significant factor in energy consumption. Efficient use of memory further minimizes the energy needed to store and retrieve parameters and intermediate calculations
  • Thermal management: Lower computational requirements generate less heat, reducing the need for power-hungry cooling systems. Furthermore, reduced thermal stress increases the longevity of hardware components, indirectly reducing the energy and resources needed to replace and maintain them.
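
To make the first bullet concrete, here is a rough back-of-envelope comparison in Python. It leans on the common approximation of roughly 2 FLOPs per parameter per generated token for decoder-only transformers, and the parameter counts are illustrative stand-ins for a phi-3-mini-class SLM and a GPT-3-class LLM; real costs vary with architecture, batching, and hardware.

    # Back-of-envelope inference compute for a small vs. a large model.
    # Assumes ~2 FLOPs per parameter per generated token (a common rule of
    # thumb for the forward pass of decoder-only transformers).
    FLOPS_PER_PARAM_PER_TOKEN = 2

    def flops_per_token(num_params: float) -> float:
        """Approximate floating-point operations to generate one token."""
        return FLOPS_PER_PARAM_PER_TOKEN * num_params

    slm_params = 3.8e9   # a phi-3-mini-sized model
    llm_params = 175e9   # a GPT-3-sized model

    ratio = flops_per_token(llm_params) / flops_per_token(slm_params)
    print(f"SLM: ~{flops_per_token(slm_params):.1e} FLOPs/token")
    print(f"LLM: ~{flops_per_token(llm_params):.1e} FLOPs/token")
    print(f"The large model needs ~{ratio:.0f}x more compute per token")  # ~46x

To a first order, that compute gap translates directly into the energy gap described above.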

SLMs are increasingly popular due to their efficiency. They require fewer computational resources and less storage than LLMs, making them a more practical solution for applications that demand real-time processing or deployment on edge devices with limited resources. By reducing model size and complexity, developers can achieve faster inference times, lower latency, and improved performance, making small models the preferred choice for resource-constrained environments such as mobile phones, personal computers, and connected devices. For example, phi-3 is highly capable of running locally on a cell phone: it can be quantized to four bits so that it occupies only ~1.8GB of memory. The quantized model, tested on an iPhone 14 with the A16 Bionic chip, ran natively on-device and fully offline, achieving more than 12 tokens per second (tokens per second being the rate at which a model processes words, subwords, or characters during inference).
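
The ~1.8GB figure is easy to sanity-check: weight storage is roughly the parameter count times bits per parameter. A minimal sketch, ignoring activations, the KV cache, and runtime overhead (so real memory usage is somewhat higher):

    # Approximate weight-storage footprint of a model at a given precision.
    def model_size_gb(num_params: float, bits_per_param: int) -> float:
        """Bytes needed for the weights alone, expressed in gigabytes."""
        return num_params * bits_per_param / 8 / 1e9

    params = 3.8e9  # phi-3-mini

    print(f"fp16 : ~{model_size_gb(params, 16):.1f} GB")  # ~7.6 GB
    print(f"4-bit: ~{model_size_gb(params, 4):.1f} GB")   # ~1.9 GB (~1.8 GiB),
                                                          # in line with the figure above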

According to the Tirias Research GenAI Forecast and TCO Model, if 20 percent of GenAI processing workload could be offloaded from data centers by 2028 using on-device and hybrid processing, the cost of data center infrastructure and operations for GenAI processing would decline by $15 billion, roughly 20 percent of the $76 billion-plus those costs are projected to reach by 2028. This would also reduce overall data center power requirements for GenAI applications by 800 megawatts.

2. Economic viability

Developing and maintaining LLMs comes with steep costs, demanding significant investments in computational resources, energy usage, and specialized skills. In contrast, SLMs present a more budget-friendly solution. Their streamlined design means they are more efficient to train and require less data and hardware, leading to more economical computing costs. SLMs also often employ optimized algorithms and architectures designed for efficiency; techniques like pruning (removing unnecessary parameters) and quantization (using lower-precision arithmetic) make them even more economically viable.
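
As an illustration of those two techniques, here is a minimal PyTorch sketch applied to a toy model. This shows the general idea, not how any particular SLM was built; production pipelines (structured pruning, quantization-aware training, 4-bit weight formats) are considerably more involved.

    import torch
    import torch.nn as nn
    import torch.nn.utils.prune as prune

    # A toy stand-in for a network we want to shrink.
    model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 128))

    # Pruning: zero out the 30% of weights with the smallest L1 magnitude.
    for module in model:
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=0.3)
            prune.remove(module, "weight")  # make the pruning permanent

    # Quantization: convert Linear layers to int8 for cheaper inference.
    quantized = torch.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )

    x = torch.randn(1, 512)
    print(quantized(x).shape)  # same interface, smaller and cheaper to run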

3. Scalability and accessibility

Smaller models are inherently more scalable and accessible than their larger counterparts. By reducing model size and complexity, developers can deploy AI applications across various devices and platforms, including smartphones, IoT devices, and embedded systems. This democratizes AI, encourages wider adoption, and accelerates innovation, unlocking new opportunities across many industries and use cases.

4. Ethical and regulatory dimensions

Ethical and regulatory considerations also contribute to the shift towards SLMs. As AI technologies become increasingly pervasive, data privacy, security, and bias concerns become more pronounced. Embracing small models allows organizations to reduce data exposure, address privacy challenges, and reinforce transparency and accountability. When trained on specific, high-quality datasets, smaller models significantly reduce the risk of data exposure. They require less training data compared to their larger counterparts, which lowers the risk of memorizing, overfitting, and inadvertently revealing sensitive information within the training set. With fewer parameters, these models have simpler architectures, minimizing potential pathways for data leakage. Furthermore, smaller models are easier to interpret, validate, and regulate, facilitating compliance with emerging regulatory frameworks and ethical guidelines.

Limitations of SLMs

While SLMs have great benefits, there are challenges and limitations too. Due to their smaller size, these models lack the capacity to store much "factual knowledge". This can lead to hallucination, factual inaccuracies, amplification of biases, inappropriate content generation, and safety issues. However, these risks can be mitigated through carefully curated training data, targeted post-training, and improvements driven by red-teaming insights. Models can also be augmented with a search engine for factual knowledge, as in the sketch below.
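
One common pattern for that augmentation is retrieval: fetch relevant text first and let the model answer from it rather than from its limited parametric memory. A minimal sketch of the idea follows; web_search and slm_generate are hypothetical placeholders, not real APIs, and would need to be wired to an actual search backend and model runtime.

    # Sketch of grounding an SLM in retrieved text instead of parametric "facts".
    def web_search(query: str, top_k: int = 3) -> list[str]:
        """Hypothetical placeholder: return top_k relevant text snippets."""
        raise NotImplementedError("connect a real search or retrieval backend")

    def slm_generate(prompt: str) -> str:
        """Hypothetical placeholder: run the small language model on the prompt."""
        raise NotImplementedError("connect a real SLM inference call")

    def answer_with_retrieval(question: str) -> str:
        # Retrieve supporting facts first, so the model reasons over fresh,
        # verifiable text rather than whatever it memorized during training.
        snippets = web_search(question)
        context = "\n".join(f"- {s}" for s in snippets)
        prompt = (
            "Answer the question using only the context below.\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
        )
        return slm_generate(prompt)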

Conclusion

The transition to SLMs represents a significant trend in the AI field. While LLMs excel due to their vast size, intensive training, and advanced NLP capabilities, SLMs offer targeted efficiency, cost-effectiveness, and scalability. By adopting these models, organizations can unlock new opportunities, speed up innovation, and create value across various sectors.

The future of generative AI is also moving towards the edge, enabled by small, efficient language models. These models transform everyday technology with natural, generative interfaces, encompassing everything from personal devices and home automation to industrial machinery and intelligent cities.

SLMs are essential to enable AI at the edge. According to IBM, Huawei, and Grand View Research, the edge AI market is valued at $21 billion and is expected to grow at a CAGR of 21 percent. Companies like Google, Samsung, and Microsoft are advancing generative AI for PCs, mobile, and connected devices. Apple is joining this effort with OpenELM, a family of open-source language models designed to run entirely on a single device without connecting to cloud servers. Optimized for on-device use, these models can handle AI tasks independently, marking a new era in mobile AI innovation, as noted by Alphasense.

Finally, it’s not a matter of choosing one over the other. LLMs are generalists with extensive training in massive data and have extensive knowledge across various subjects. They have the ability to perform complex interactions like chatbots, content summarization, and information retrieval and have vast applicability, however, are expensive and have a high operational cost. SLMs on the other hand are specialized, domain-specific powerful, and less computationally intensive but struggle with complex context and hallucinations if not used on their specific use case and context. The choice between SLM and LLM is dependent on the need and availability of resources, nevertheless, SLM is surely a game changer in the AI era. but struggles with complex context and hallucinations if not used on their specific use case and context. The choice between SLM and LLM is dependent on the need and availability of resources, nevertheless, SLM is surely a game changer in the AI era.

Author

Sunita Tiwary

Senior Director – Global Tech & Digital
Sunita Tiwary is the GenAI priority leader for the Tech & Digital industry at Capgemini, a thought leader who brings a strategic perspective to GenAI along with deep industry knowledge. She has close to 20 years of diverse experience across strategic partnerships, business development, presales, and delivery. In her previous role at Microsoft, she led one of the strategic partnerships and co-created solutions to accelerate market growth in the India SMB segment. She is an engineer with technical certifications across Data & AI, cloud, and CRM. In addition, she has a strong commitment to promoting diversity and inclusion and championed key initiatives during her tenure at Microsoft.

Fabio Fusco

Data & AI for Connected Products Centre of Excellence Director, Hybrid Intelligence, Capgemini Engineering
Fabio brings over 20 years of experience blending cutting-edge technologies, data analytics, artificial intelligence, and deep domain expertise to tackle complex challenges in R&D and engineering for diverse clients, with a continuously forward-thinking approach.