Why Copilot will only sort of run locally on AI PCs for now

Comment Microsoft’s definition of what does and doesn’t constitute an AI PC is taking shape. With the latest version of Windows, a dedicated Copilot key, and an NPU capable of at least 40 trillion operations per second, you’ll soon be able to run Microsoft Copilot locally, ish, on your machine.

Redmond’s requirements for its AI model on Windows were made official by Intel — one of the strongest cheerleaders of the AI PC category — during the chip giant’s AI Summit in Taipei this week.

Running a large language model (LLM) locally has some intrinsic benefits. End users should have lower latency and therefore improved response times, since queries don’t need to be sent to and from a remote datacenter, plus more privacy, in theory. For Microsoft, meanwhile, shifting more of the AI workload onto customer devices frees up its own resources for other tasks, such as helping train the next OpenAI model or offering it as a cloud API.

Microsoft hopes to run its Copilot LLM entirely on the NPUs, or neural processing units, in people’s Windows AI PCs eventually, judging by comments apparently made by Intel execs at the summit. We can imagine the x86 goliath pushing that line to convince everyone that its silicon is powerful enough to run Redmond’s stuff at home or in the office.

While the idea of untethering Copilot from Azure’s umbilical might be attractive to some, not everyone seems to be a fan of Clippy incarnate and at least some amount of processing will almost certainly be done in the cloud for the foreseeable future.

Intel executives have said as much: Faster hardware will enable more “elements” of Copilot to run locally. In other words, you’re still going to be reliant on a network connection for at least some of the functionality, and the rest the AI PC will handle itself.

The reason shouldn’t come as much of a surprise. AI PCs have finite resources, and the model powering Copilot, OpenAI’s GPT-4, is enormous. We don’t know exactly how big the version Microsoft uses is, but estimates put the full GPT-4 model at around 1.7 trillion parameters. Even quantized down to INT4, the model would need about 900GB of memory.
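
The arithmetic behind that figure is straightforward. Here's a rough sketch using the 1.7-trillion-parameter estimate, which is an outside guess rather than a Microsoft-confirmed number:

    # Back-of-the-envelope memory needed just to hold the weights, ignoring
    # activations and KV cache. The 1.7T parameter count is an outside estimate.
    def weights_gb(params: float, bits_per_param: int) -> float:
        return params * bits_per_param / 8 / 1e9

    print(weights_gb(1.7e12, 4))   # ~850 GB at INT4, hence "about 900GB" with overhead
    print(weights_gb(1.7e12, 16))  # ~3,400 GB at FP16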

How we think it’s gonna work

GPT-4 is a so-called mixture-of-experts model. In a nutshell, this means it’s actually assembled from a number of smaller, specialized pre-trained models to which queries are routed. Because there are multiple models optimized for text generation, summarization, code creation, and so on, inference performance improves: the entire model doesn’t need to run to complete a task.
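
For illustration only, here is a toy sketch of that routing idea in Python. GPT-4's actual architecture, expert count, and routing scheme are not public, so every size and name below is made up:

    import numpy as np

    # Toy mixture-of-experts forward pass: a router scores the experts and only
    # the top-k of them run, so most of the weights stay idle for any given query.
    rng = np.random.default_rng(0)
    d_model, n_experts, top_k = 64, 8, 2

    router_w = rng.normal(size=(d_model, n_experts))
    experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

    def moe_forward(x):
        scores = x @ router_w                 # one score per expert
        chosen = np.argsort(scores)[-top_k:]  # indices of the top-k experts
        gate = np.exp(scores[chosen])
        gate /= gate.sum()                    # softmax over the chosen experts only
        # Only the chosen experts do any compute; the others are never touched.
        return sum(g * (experts[i] @ x) for g, i in zip(gate, chosen))

    print(moe_forward(rng.normal(size=d_model)).shape)  # (64,)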

Intel’s use of the term “elements” to describe running Copilot features locally suggests that some of these experts could be substituted for smaller, nimbler models capable of running on laptop hardware. As we’ve explored previously, existing personal hardware is more than capable of running smaller AI models from the likes of Mistral or Meta.

Coincidentally, Microsoft recently pumped €15 million ($16.3 million) into French mini-model builder Mistral AI, with plans to make its work available to Azure customers. At just 7 billion parameters, Mistral-7B is certainly small enough to fit comfortably into an AI PC’s memory, requiring in the neighborhood of 4GB of memory when using 4-bit quantization.
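
As a rough illustration of how modest the footprint is, here is what loading such a model locally can look like with llama-cpp-python. The file name is hypothetical, and Microsoft hasn't said which runtime, model, or quantization an on-device Copilot would actually use:

    # Sketch of loading a 4-bit quantized 7B model with llama-cpp-python.
    from llama_cpp import Llama

    llm = Llama(
        model_path="mistral-7b-instruct-q4_k_m.gguf",  # hypothetical file, ~4GB on disk
        n_ctx=4096,      # context window
        n_threads=8,     # CPU threads; GPU/NPU offload depends on the build
    )

    out = llm("Explain NPUs in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])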

And that’s for a general purpose model. Conceivably, you could get by with even smaller models tuned for source code generation that are only loaded into memory when the application, say Visual Studio Code, is launched and an active GitHub Copilot subscription is detected. Remember, Copilot is more than just a chatbot; it’s a suite of AI features that are getting baked into Microsoft’s OS and software library.

Redmond hasn’t said just how much memory its AI PC spec calls for, but, in our experience with local LLMs, 16GB of speedy DDR5 should be adequate.

Whatever route Microsoft ends up taking, the combination of local and remote models could lead to some interesting behavior. We don’t know yet under what circumstances these local models will take over, but Microsoft corporate veep of Windows Devices Pavan Davuluri has suggested the mix may be dynamic.

“We wanna be able to load shift between the cloud and the client to provide the best of computing across both those worlds,” he said on stage during AMD’s Advancing AI event in December. “It brings together the benefits of local compute, things like enhanced privacy and responsiveness and latency with the power of the cloud, high performance models, large data sets, cross platform inferencing.”

As such, we can see a couple of scenarios in which Microsoft may use local AI. The first is to offload work from its own servers and improve response times. As hardware improves, more Copilot features could be pushed out of the cloud and onto user devices.

The second would be to use it as a fallback in the event of network disruptions. You can imagine your AI PC just getting dumber, rather than stopping entirely, when cut off from the net.
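
Purely as a thought experiment, that fallback logic could be as simple as the sketch below. None of these names are Microsoft APIs; the helpers are placeholders for a big hosted model and a small on-device one:

    # Speculative sketch of "load shifting": try the cloud model first,
    # fall back to a smaller local model when the network isn't reachable.
    import socket

    def cloud_reachable(host="example.com", port=443, timeout=2.0):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def query_cloud_model(prompt):
        return "[big hosted model] " + prompt       # placeholder

    def query_local_model(prompt):
        return "[small on-device model] " + prompt  # placeholder

    def answer(prompt):
        if cloud_reachable():
            return query_cloud_model(prompt)
        return query_local_model(prompt)  # dumber, but it still works offline

    print(answer("Draft a two-line status update."))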

Hardware constraints

Before you get too excited about split-brained AI PCs drafting off-grid manifestos, note that there currently aren’t any machines out there that meet the hardware requirements, and it’s not for lack of a Copilot key.

The issue is that NPUs are still relatively new in x86 silicon, and what does exist isn’t nearly powerful enough. AMD was among the first to add an NPU to its mobile processors back in early 2023 with the launch of its Ryzen 7040 series chips.

That lineup received a clock bump in December during the House of Zen’s Advancing AI event. AMD also brought its NPUs to the desktop with the launch of its 8000G APUs at CES in January this year.

Intel rolled out its dedicated AI accelerator blocks with the launch of its Meteor Lake microprocessor parts in late December. These Core Ultra chips feature an NPU derived from Intel’s Movidius vision processing unit (VPU), which Intel demoed running a variety of workloads during its Innovation event last year.

Unfortunately, these chips are only capable of 10 to 16 trillion (typically INT4) operations per second, far below Microsoft’s 40 TOPS spec. That means most of the so-called AI PCs on the market won’t meet the requirement, at least not without leaning on the GPU to make up the difference.

Both Intel and AMD have more capable chips coming with Lunar Lake and Strix Point silicon respectively. However, in the near term, it looks like Qualcomm is going to have the market cornered.

Notebooks sporting Qualcomm’s Snapdragon X Elite mobile processors are due out sometime in mid-2024 and will feature an NPU capable of 45 TOPS. Combined with an Adreno GPU capable of 4.6 teraFLOPS of FP32 performance, Qualcomm says the part will be able to run AI models up to 13 billion parameters entirely on device and generate 30 tokens a second when running smaller 7-billion-parameter LLMs.

As PCs with higher performance NPUs and larger memory stores arrive, and small models grow more capable, we suspect Microsoft will begin offloading more functionality to local devices – once the hardware can handle it. ®
