AI Desktop Tools - Training and Inferencing

Should an LLM inside a Linux Docker container on my Windows PC be considered remote for Warp?

It seems to use OpenAI and sends all requests directly to OpenAI, so it's currently remote only.
It'd be nice if it could use a local LLM, such as Mistral's model, in the future, at least as an option.
Docker Hub is the world’s largest repository for container images, with an extensive collection of AI/ML development-focused images, including leading frameworks and tools such as PyTorch, TensorFlow, LangChain, Hugging Face, and Ollama. With more than 100 million pulls of AI/ML-related images, Docker Hub’s significance to the developer community is self-evident. It not only simplifies the development of AI/ML applications but also democratizes innovation, making AI technologies accessible to developers across the globe.

NVIDIA’s Docker Hub library offers a suite of container images that harness the power of accelerated computing, supplementing NVIDIA’s API catalog. Docker Hub’s vast audience — which includes approximately 27 million monthly active IPs, showcasing an impressive 47% year-over-year growth — can use these container images to enhance AI performance.
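As a concrete example, pulling one of these NVIDIA images and checking GPU access from inside the container takes two commands. The tag below is illustrative (check the catalog for current ones), and the `--gpus` flag requires the NVIDIA Container Toolkit to be installed on the host:

```shell
# Pull an NVIDIA-published PyTorch image and verify CUDA is visible inside it.
docker pull nvcr.io/nvidia/pytorch:24.03-py3
docker run --rm --gpus all nvcr.io/nvidia/pytorch:24.03-py3 \
  python -c "import torch; print(torch.cuda.is_available())"
```

If the last command prints `False`, the container started but could not see the GPU, which usually points at a missing or misconfigured container toolkit rather than the image itself.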

Docker Desktop on Windows and Mac helps deliver a smooth experience for NVIDIA AI Workbench developers on local and remote machines.

NVIDIA AI Workbench is an easy-to-use toolkit (free) that allows developers to create, test, and customize AI and machine learning models on their PC or workstation and scale them to the data center or public cloud. It simplifies interactive development workflows while automating technical tasks that halt beginners and derail experts. AI Workbench makes workstation setup and configuration fast and easy. Example projects are also included to help developers get started even faster with their own data and use cases.

Docker engineering teams are collaborating with NVIDIA to improve the user experience with NVIDIA GPU-accelerated platforms through recent improvements to the AI Workbench installation on WSL2.

Check out how NVIDIA AI Workbench can be used locally to tune a generative image model to produce more accurate prompted results.

LangChain is pretty neat for using a local LLM to do local things. As I slowly suck less at this stuff, I'm now experimenting with using LangChain to let my local Mistral LLM take actions within my local Home Assistant environment: asking the AI to turn lights, fans, or TVs off and on, to adjust the thermostat temperature, or to ask where our Tesla is (I finally got that working, although it's still a little wonky while the car is in motion).
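The plumbing behind a setup like that is straightforward even without LangChain's abstractions: the model emits a structured tool call, and local code routes it to Home Assistant's REST service API. Below is a minimal stdlib sketch of just that dispatch step, assuming the model returns JSON. The tool names and host are hypothetical; `/api/services/<domain>/<service>` is Home Assistant's actual REST endpoint shape, and the HTTP client is injectable here so nothing real gets called:

```python
import json

# Hypothetical tool registry mapping model-facing tool names to
# Home Assistant (domain, service) pairs. A real setup would POST to
# http://<ha-host>:8123/api/services/<domain>/<service> with a bearer token.
TOOLS = {
    "light_turn_on":   ("light", "turn_on"),
    "light_turn_off":  ("light", "turn_off"),
    "set_temperature": ("climate", "set_temperature"),
}

def call_home_assistant(domain, service, payload, post=None):
    """Dispatch one service call; `post` is injectable for testing."""
    url = f"http://homeassistant.local:8123/api/services/{domain}/{service}"
    if post is None:
        raise RuntimeError("no HTTP client wired up in this sketch")
    return post(url, payload)

def dispatch(model_output, post=None):
    """Take the model's tool-call JSON and route it to Home Assistant."""
    call = json.loads(model_output)
    domain, service = TOOLS[call["tool"]]
    return call_home_assistant(domain, service, call.get("args", {}), post)
```

In practice LangChain (or any agent framework) handles the loop of showing the model the available tools and parsing its reply; this is only the last hop, which is useful to keep separate so you can test it without a running Home Assistant instance.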

Being able to host a voice AI like Alexa or Google Assistant without any internet connection requirement is pretty swanky.
NVIDIA AI Workbench, a toolkit for AI and ML developers, is now generally available as a free download. It features automation that removes roadblocks for novice developers and makes experts more productive.

Developers can experience a fast and reliable GPU environment setup and the freedom to work, manage, and collaborate across heterogeneous platforms regardless of skill level. Enterprise support is also available for customers who purchase a license for NVIDIA AI Enterprise.


Key AI Workbench features include:

  • Fast installation, setup, and configuration for GPU-based development environments.
  • Pre-built, ready-to-go generative AI and ML example projects based on the latest models.
  • Deployment of generative AI models with cloud endpoints from the NVIDIA API catalog or locally with NVIDIA NIM microservices.
  • An intuitive UX plus command line interface (CLI).
  • Easy reproducibility and portability across development environments.
  • Automation for Git and container-based developer environments.
  • Version control and management for containers and Git repositories.
  • Integrations with GitHub, GitLab, and the NVIDIA NGC catalog.
  • Transparent handling of credentials, secrets, and file system changes.
Since its Beta release, AI Workbench also has several new key features:

  • Visual Studio (VS) Code support: Directly integrated with VS Code to orchestrate containerized projects on GPU environments.
  • Choice of base images: Users can choose their own container image as the project base image when creating projects. The container image must use image labels that follow the base image specifications.
  • Improved package management: Users can manage and add packages directly to containers through the Workbench user interface.
  • Installation improvements: Users have an easier install path on Windows and macOS. There is also improved support for the Docker container runtime.
March 27, 2024
Editor’s note: This post is part of the AI Decoded series, which demystifies AI by making the technology more accessible, and which showcases new hardware, software, tools and accelerations for RTX PC users.

Now, the TensorRT extension for the popular Stable Diffusion WebUI by Automatic1111 is adding support for ControlNets, tools that give users more control to refine generative outputs by adding other images as guidance.

Plus, the TensorRT extension for Stable Diffusion WebUI boosts performance by up to 2x — significantly streamlining Stable Diffusion workflows.

With the extension’s latest update, TensorRT optimizations extend to ControlNets — a set of AI models that help guide a diffusion model’s output by adding extra conditions. With TensorRT, ControlNets are 40% faster.

Users can guide aspects of the output to match an input image, which gives them more control over the final image. They can also use multiple ControlNets together for even greater control. A ControlNet can be a depth map, edge map, normal map or keypoint detection model, among others.

If you're an early adopter of ChatRTX, you should probably update to the latest March 2024 build. The UI contained a couple of 'Medium' and 'High' severity security vulnerabilities. According to the security bulletin, the more dangerous of the two (given an 8.2 rating) lets potential attackers gain access to system files. This exploit could lead to an "escalation of privileges, information disclosure, and data tampering."

The second security vulnerability, rated 6.5, doesn't sound much better. The exploit allows attackers to run "malicious scripts in users' browsers," which can cause denial of service, information disclosure, and even code execution.

The good news is that the latest version of ChatRTX with the new security updates is available to download via NVIDIA. NVIDIA credits those who pointed out these exploits in its update notes, and there's no evidence of the exploits being used to date.

Google Making Major Changes in AI Operations to Pull in Cash from Gemini

April 4, 2024
Google has also put a price on using its Gemini API and cut off most of its free access to its APIs. The message is clear: the party is over for developers looking for AI freebies, and Google wants to make money off AI tools such as Gemini.

Google had provided developers free access to the APIs for both its older and newer LLMs, an attempt to woo developers into adopting its AI products.
Google is attracting developers via its cloud service and AI Studio service. For now, developers can get free API keys on Google’s website, which provides access to Google’s LLMs through a chatbot interface. Developers and users have until now enjoyed free access to Google’s LLMs, but that is also ending.

This week, Google threw a double whammy that effectively shuts down free access to its APIs via AI Studio.
Google also announced this week that it is restricting API access to its Google Gemini model in a bid to turn free users into paid customers. Free access to Gemini allowed many companies to offer chatbots based on the LLM for free, but Google’s changes will likely mean many of those chatbots will shut down.

“Pay-as-you-go pricing for the Gemini API will be introduced,” Google said in an email on Monday to developers.

The free plan includes two requests per minute, 32,000 tokens per minute, and a maximum of 50 requests per day. However, one drawback is that Google will use chatbot responses to improve its products, which purportedly include its LLMs.
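Those free-tier caps (two requests per minute, 50 per day) are easy to trip in a loop, so client code typically throttles itself before calling the API. A minimal sliding-window limiter sketch using only the standard library — the limits are the ones quoted above, and the clock is injectable so the behavior can be tested without waiting:

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Client-side throttle for the quoted Gemini free-tier caps:
    2 requests per minute and 50 requests per day."""

    def __init__(self, per_minute=2, per_day=50, clock=time.monotonic):
        self.per_minute, self.per_day = per_minute, per_day
        self.clock = clock
        self.minute, self.day = deque(), deque()  # request timestamps

    def allow(self):
        now = self.clock()
        # Drop timestamps that have aged out of each window.
        while self.minute and now - self.minute[0] >= 60:
            self.minute.popleft()
        while self.day and now - self.day[0] >= 86400:
            self.day.popleft()
        if len(self.minute) >= self.per_minute or len(self.day) >= self.per_day:
            return False  # caller should back off before sending the request
        self.minute.append(now)
        self.day.append(now)
        return True
```

Calling `allow()` before each API request and sleeping when it returns False keeps a client comfortably inside the published quota without relying on server-side 429 responses.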
The hundreds of billions being spent on data centers to run AI is a gamble, as the companies do not have proven AI revenue models. As use of LLMs grows, small revenue streams through offerings like APIs could help cover the cost of building the hardware and data centers.

Bloomberg recently reported that Amazon was spending $150 billion over 15 years to establish new data centers.

OpenAI and Microsoft plan to spend $100 billion on a supercomputer called Stargate, according to The Information.

For customers unwilling to pay, Google has released the Gemma large language models, around which customers can build their own AI applications. Other open-source models, such as Mistral, are also gaining in popularity.

Customers are leaning toward open-source LLMs as the cost of AI grows. These models can be downloaded and run on custom hardware tuned for the application, but most customers can't afford that hardware, which in most cases means NVIDIA GPUs. AI hardware is also not easily available off the shelf.

One positive is that Chat with RTX currently provides access to two open-source AI models, Mistral and Llama. Access to more free models is planned in the future.
Local AI tools offer increased data security, allow for more customization, and can save money that would otherwise be spent on monthly software subscriptions. Today, I’d like to present a selection of tools that can be run on your own hardware, emphasizing options with a relatively low entry barrier due to low or no cost and ease of use. I hope you’ll discover something that has a niche in your workflow!

ChatRTX: NVIDIA's AI chatbot integrates new language models to run locally on your PC

NVIDIA is beefing up its experimental chatbot ChatRTX by grafting new artificial intelligence models onto it. As a result, the chatbot's arsenal expands and its capabilities evolve considerably.

Introduced last February as "Chat with RTX," ChatRTX was initially just a demo app. In concrete terms, the application creates a local chatbot server that can be accessed from your browser. This allows you to feed the AI with your documents and even YouTube videos, turning your machine into a powerful search tool capable of summarizing your content and answering your questions.
Initially able to leverage the Mistral and Llama 2 models, ChatRTX can now count on new ones: ChatGLM3 (a bilingual English and Chinese language model), OpenAI's CLIP (which links images and text, enabling search and description of your photos) and Google's Gemma. Gemma was designed in collaboration with NVIDIA and works wonderfully on solidly equipped PCs.

ChatRTX centralizes all these new models and simplifies running them locally. The interface is rather intuitive, and you can switch between modules according to your needs: analyzing your photos, online videos, or document summaries, for example.
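The "local server plus browser" pattern ChatRTX uses is worth seeing in miniature. The sketch below is not NVIDIA's code, just a hypothetical stdlib echo server standing in for the LLM backend, to show the shape of the loop: the browser POSTs a prompt as JSON, and the local process answers:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class ChatHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        prompt = json.loads(self.rfile.read(length))["prompt"]
        # A real app would hand `prompt` to the local LLM here; we just echo.
        reply = json.dumps({"reply": f"You asked: {prompt}"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

    def log_message(self, *args):
        pass  # keep the console quiet

def serve(port=0):
    """Start the server on localhost (port 0 = pick a free port)."""
    server = HTTPServer(("127.0.0.1", port), ChatHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Everything sensitive stays on the loopback interface, which is the whole appeal: the browser is only a UI, and nothing leaves the machine.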
One downside of using CUDA on Windows (including WSL2) is that Windows locks a certain amount of memory, so you can't use the full VRAM.
I haven't used CUDA on WSL2 yet, but hopefully, if required, they will add additional WSL config-file parameters for memory management. The WSL team has a beta version that extends memory management with some Windows-related memory parameters.

# Settings apply across all Linux distros running on WSL 2
[wsl2]

# Limits VM memory to use no more than 4 GB, this can be set as whole numbers using GB or MB
memory=4GB

# Sets amount of swap storage space to 8GB, default is 25% of available RAM
swap=8GB

# Sets the VM to use two virtual processors
processors=2

# Specify a custom Linux kernel to use with your installed distros. The default kernel used can be found at
# kernel=C:\\temp\\myCustomKernel

# Sets additional kernel parameters, in this case enabling older Linux base images such as Centos 6
# kernelCommandLine = vsyscall=emulate

# Sets swapfile path location, default is %USERPROFILE%\AppData\Local\Temp\swap.vhdx
# swapfile=C:\\temp\\wsl-swap.vhdx

# Disable page reporting so WSL retains all allocated memory claimed from Windows and releases none back when free
# pageReporting=false

# Turn on default connection to bind WSL 2 localhost to Windows localhost
# localhostforwarding=true

# Disables nested virtualization
# nestedVirtualization=false

# Turns on output console showing contents of dmesg when opening a WSL 2 distro for debugging
# debugConsole=true

# Enable experimental features (WSL 2.0+), e.g. gradual memory reclaim
# [experimental]
# autoMemoryReclaim=gradual
Although single-core CPU speed does affect performance when executing GPU inference with llama.cpp, the impact is relatively small. It appears that almost any relatively modern CPU will not restrict performance in any significant way, and the performance of these smaller models is such that the user experience should not be affected. We may explore whether this holds true for other inference libraries, such as exllamav2, in a future article.

In general, this means that if you are using smaller LLM models that fit within a typical consumer-class GPU, you don’t have to worry about what base platform and CPU you are using. Except in a few isolated instances, it should largely be inconsequential and will have a minimal impact on how fast your system is able to run the LLM. However, this test serves as a good reminder that no single component of a system exists in a vacuum and will be affected in some way by the other components in the system as a whole.