AI Desktop Tools - Training and Inferencing

pharma

Thread dedicated to new AI tools becoming available for running training and inferencing algorithms against data residing on your desktop.
You can easily create a custom LLM model for data stored on a local hard drive and query the trained data for relevant information without using the cloud.

Seems to have some benefit for local desktop generative AI queries without using the cloud, e.g., training on thousands of photos and text notes sitting on your hard drive and asking questions like how many photos contain seascapes or dogs, or more detailed questions like which pictures show a temple in Nepal and include my sister (or some other criteria). While currently available only for RTX GPUs with Tensor Cores, it will also be released later as an open-source project.
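For a sense of what that kind of local photo query could look like (independent of any particular NVIDIA tool), here is a minimal sketch using the sentence-transformers CLIP wrapper; the photo folder, file pattern, and query text are placeholder assumptions:

from pathlib import Path
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# CLIP-style model that embeds images and text into the same vector space;
# it downloads once and then runs locally.
model = SentenceTransformer("clip-ViT-B-32")

# Embed every photo in a local folder (placeholder path).
photo_paths = list(Path("~/Pictures").expanduser().glob("*.jpg"))
photo_embeddings = model.encode([Image.open(p) for p in photo_paths], convert_to_tensor=True)

# Embed a natural-language query and rank photos by cosine similarity.
query = "a temple in Nepal"
query_embedding = model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_embedding, photo_embeddings)[0]
for path, score in sorted(zip(photo_paths, scores.tolist()), key=lambda x: -x[1])[:5]:
    print(f"{score:.3f}  {path.name}")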

Chat With RTX is a demo app that lets you personalize a GPT large language model (LLM) connected to your own content—docs, notes, videos, or other data. Leveraging retrieval-augmented generation (RAG), TensorRT-LLM, and RTX acceleration, you can query a custom chatbot to quickly get contextually relevant answers. And because it all runs locally on your Windows RTX PC or workstation, you’ll get fast and secure results.
...

Chat with RTX allows AI enthusiasts to easily connect PC LLMs to their own data using a popular technique known as retrieval-augmented generation (RAG). The demo, accelerated by TensorRT-LLM, enables users to quickly interact with their notes, documents and other content. It will also be available as an open-source reference project, so developers can easily implement the same capabilities in their own applications.
 
I've finally been poking my pinky toe into some localized LLM stuff, specifically around a future replacement for the Alexa integration in my home automation system. Rather than pinning my entire automation setup on Amazon's ecosystem, I instead link specific (fully user-defined) Home Assistant entities to an Alexa routine, so someone can ask Alexa to toggle various lights in the house. The challenge is, you must use very specific verbiage to trigger the event, and if you don't know the verbiage then Alexa isn't going to figure it out for you. Also, even with specifically defined action words, sometimes Alexa will either get confused or "ask to make sure," which often defeats the purpose. Example: "turn on the kitchen lights" vs. "turn on the kitchen table lights" often results in "Multiple routines share a similar name: turn on the kitchen lights, and turn on the kitchen table lights. Which one did you want?" And if you don't repeat the entire sentence, the voice interpretation doesn't work.

So I'm teaching myself some LLM hosting things, and geez there's a lot to learn :)
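For anyone curious what a home-grown replacement might look like, here is a rough, untested sketch of the general idea: ask a locally hosted LLM to map a spoken request onto one of a fixed list of Home Assistant entities, then call Home Assistant's REST API. The llm_complete() helper, the token, and the entity names are placeholders for your own setup:

import requests

HA_URL = "http://homeassistant.local:8123"        # your Home Assistant instance
HA_TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"         # created under your HA user profile
ENTITIES = ["light.kitchen", "light.kitchen_table", "light.living_room"]

def llm_complete(prompt: str) -> str:
    # Placeholder: send the prompt to whatever local LLM server you run
    # and return its text response.
    raise NotImplementedError

def handle_request(spoken_text: str) -> None:
    prompt = (
        "Pick exactly one entity_id from this list that best matches the request, "
        f"and answer with only that entity_id.\nEntities: {ENTITIES}\n"
        f"Request: {spoken_text}"
    )
    entity_id = llm_complete(prompt).strip()
    if entity_id not in ENTITIES:
        print(f"Could not match '{spoken_text}' to a known entity.")
        return
    # Home Assistant REST API: POST /api/services/<domain>/<service>
    requests.post(
        f"{HA_URL}/api/services/light/turn_on",
        headers={"Authorization": f"Bearer {HA_TOKEN}"},
        json={"entity_id": entity_id},
        timeout=10,
    )

# handle_request("hey, get the lights over the kitchen table")

Because the model only has to choose from a known list, it can tolerate loose phrasing far better than exact-match routine names.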
 
I believe LLMs will transform how we use computers. It's quite possible that in the future we'll go back to using some sort of "command line" interface, but this time the computer accepts human language as commands. We have something like this today, such as Siri or "OK Google," but I imagine it'll become the main way we interact with computers instead of covering just a few particular actions. For this to happen, the LLM has to run locally.

To be more specific, if you want to play Palworld, it could be as simple as typing "launch palworld" in a text box, instead of opening Steam, finding the game, and clicking "Play." It's not hard to imagine it handling simple tasks like "change screen scaling to 120%" or "switch to dark mode." However, recent LLMs suggest it will be possible to handle more complex tasks such as "find all photos with dogs taken within the last 3 days in my library" or "open my BoA bank account web page," etc.

Of course, there will be security implications, so at least in earlier versions it'll just show something on your screen and you'll still have to click on something to proceed, so the GUI will still be there.
 
What local LLM solution do you use that can interact with the OS? I've only tried Faraday and it seems to be containerized inside the app itself.
 
What local LLM solution do you use that can interact with the OS? I've only tried Faraday and it seems to be containerized inside the app itself.

I am not aware of any existing today, but it certainly can be done. For example, you can ask an LLM to generate a PowerShell script for you today; you just need a tool to actually run the script. However, directly running a PowerShell script generated by an LLM is probably quite risky, so a better solution is to generate some kind of UI for you to confirm the actions you actually want to perform. I believe this is something Microsoft should do with their "Copilot" product in Windows 11.
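As a sketch of that confirm-before-running idea (assuming you already have a way to get a script string back from a local model), the wrapper could be as simple as:

import subprocess

def run_generated_powershell(script: str) -> None:
    # Show the proposed script and require an explicit confirmation first.
    print("The model proposes running the following PowerShell script:\n")
    print(script)
    answer = input("\nRun this script? [y/N] ").strip().lower()
    if answer != "y":
        print("Cancelled; nothing was executed.")
        return
    # -NoProfile keeps the run predictable; the script still runs with your user rights.
    subprocess.run(["powershell", "-NoProfile", "-Command", script], check=False)

# Example with a hypothetical model suggestion:
# run_generated_powershell("Get-ChildItem ~/Documents -Filter *.pdf | Measure-Object")

A real tool would want something stronger than a y/N prompt (sandboxing, allow-lists), but the shape of the solution is the same.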
 
what local LLM solution you use that can interact with the OS? i've only tried faraday and it seems to be containerized inside the app itself
I think that's the next step, and you will see developers using this (or something similar) to extend their current apps (paid, shareware, freeware) to facilitate localized AI processing at the OS level. Most of those apps currently interact with the OS, but without the AI aspect. Anyone who has the talent and a good idea should be able to take advantage of this next wave of AI-aware local applications.

Nvidia's GitHub repository is currently available for developers to incorporate into their apps.
Developers can use that reference to develop and deploy their own RAG-based applications for RTX, accelerated by TensorRT-LLM.
I would expect revisions to account for manufacturer variations of tensor cores, e.g., Intel XMX cores. (Not sure if AMD will have Matrix cores on consumer GPUs.)
 
The new MLPerf Client benchmark will rate how well chips and systems perform with generative AI on desktops, laptops and workstations, starting with Meta's Llama 2 LLM.
...
In anticipation of a wave of laptops and desktops with special AI abilities, the computer industry's leading organization for rating the performance of neural networks this week announced the formation of a new effort to benchmark AI on personal computers.

The initial benchmarks will focus on measuring how fast computers make predictions, commonly known as "inference," as opposed to "training," which is when a neural net is first developed. Most client computing devices do not possess the requisite power to perform the training of neural networks, so inference is a logical place to start.

The new benchmark is timely as PC executives are pinning their hopes for a revival of the moribund PC market on local processing of AI as an adjunct to the cloud-based processing that currently dominates AI. PC sales fell 15% last year in unit terms, the worst year in PC history, according to research firm Gartner.
 
AI Coding Assistants ... pretty easy to see how AI assistants may be applied to different corporate departments and how AI-oriented corporations might become more widespread.

AI coding assistants, or code LLMs, have emerged as one domain to help accomplish this. By 2025, 80% of the product development lifecycle will make use of generative AI code generation, with developers acting as validators and orchestrators of backend and frontend components and integrations. You can tune an LLM for code tasks, streamline workflows for developers, and lower the barrier for novice coders. Code LLMs not only generate code, but can also fill in missing code, add documentation, and provide tips for solving hard problems.
...
Programmers spend a lot of time looking up workarounds for common problems, or skimming online forums for faster ways to write code. The key idea behind AI code assistants is to put the information that programmers need right next to the code they are writing. The tool tracks the code and comments in the file a programmer is working on, as well as other files that it links to or that have been edited in the same project. It sends all this text to an LLM as a prompt. It then predicts what the programmer is trying to do and suggests code to accomplish it.
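A toy version of that context-assembly step might look like the following; the size budget and the .py-only filter are simplifying assumptions, not how any particular assistant actually does it:

from pathlib import Path

MAX_CONTEXT_CHARS = 8_000  # crude stand-in for a real token budget

def build_prompt(current_file: Path, project_dir: Path) -> str:
    # Start with the file being edited, then add other project files,
    # most recently modified first, until the budget runs out.
    parts = [f"# Current file: {current_file.name}\n{current_file.read_text()}"]
    others = sorted(project_dir.glob("*.py"), key=lambda p: p.stat().st_mtime, reverse=True)
    for path in others:
        if path == current_file:
            continue
        snippet = f"\n# Related file: {path.name}\n{path.read_text()}"
        if sum(len(p) for p in parts) + len(snippet) > MAX_CONTEXT_CHARS:
            break
        parts.append(snippet)
    parts.append("\n# Continue the current file:")
    return "".join(parts)

The assembled string is what gets sent to the LLM; the model's completion comes back as the suggestion shown in the editor.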

For example, you could issue a generic prompt such as:

Write a function that computes the square root.

# Use Newton's method,
# where x_(n+1) = 1/2 * (x_n + (y/x_n))
# y = number to find square root of
# x_0 = first guess
# epsilon = how close is close enough?
# Implement this in a function called newton_sqrt that has three parameters
# and returns one value.
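One plausible completion an assistant could return for that prompt (illustrative only, not the output of any specific tool):

def newton_sqrt(y, x_0, epsilon):
    # Newton's method: repeat x_(n+1) = 1/2 * (x_n + y/x_n) until close enough.
    x_n = x_0
    while abs(x_n * x_n - y) > epsilon:
        x_n = 0.5 * (x_n + y / x_n)
    return x_n

print(newton_sqrt(2.0, 1.0, 1e-10))  # ~1.4142135623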
Specifying the programming language and output will yield better results. For example:

Write Python code to compute the square root and print the result.

# To find the square root of a number in Python, you can use the math library and its sqrt function:

from math import sqrt

number = float(input('Enter a number: '))
square_root = sqrt(number)
print(f'The square root of {number} is approximately {square_root:.2f}.')
 
GitHub's copilot is quite close to an AI coding assistant. It'd be nice to have something like that running locally though.
 
A RAG pipeline embeds multimodal data -- such as documents, images, and video -- into a database connected to a LLM. RAG lets users chat with their data!
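Stripped of the frameworks, the retrieval half of such a pipeline can fit in a few lines. The sketch below assumes sentence-transformers for local text embeddings; generate() is a placeholder for whatever local (e.g., TensorRT-LLM) or hosted endpoint you use, and the two documents are made-up examples:

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "The RTX 4080 has 16 GB of GDDR6X memory.",
    "TensorRT-LLM accelerates LLM inference on NVIDIA GPUs.",
]
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list:
    # Cosine similarity search; vectors are already normalized.
    q = embedder.encode(question, normalize_embeddings=True)
    scores = doc_vectors @ q
    return [documents[i] for i in np.argsort(-scores)[:k]]

def generate(prompt: str) -> str:
    raise NotImplementedError("Send the prompt to your local or hosted LLM here.")

question = "What speeds up inference on RTX cards?"
context = "\n".join(retrieve(question))
answer = generate(f"Answer using only this context:\n{context}\n\nQuestion: {question}")

The NVIDIA examples add the missing pieces: a real vector database, document loaders for multimodal data, and production-grade serving.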

Developer RAG Examples
The developer RAG examples run on a single VM. They demonstrate how to combine NVIDIA GPU acceleration with popular LLM programming frameworks using NVIDIA's open source connectors. The examples are easy to deploy via Docker Compose.

Examples support local and remote inference endpoints. If you have a GPU, you can run inference locally via TensorRT-LLM. If you don't have a GPU, you can run inference and embedding remotely via NVIDIA AI Foundations endpoints.


Enterprise RAG Examples
The enterprise RAG examples run as microservices distributed across multiple VMs and GPUs. They show how RAG pipelines can be orchestrated with Kubernetes and deployed with Helm.

Enterprise RAG examples include a Kubernetes operator for LLM lifecycle management. It is compatible with the NVIDIA GPU operator that automates GPU discovery and lifecycle management in a Kubernetes cluster.

Enterprise RAG examples also support local and remote inference via TensorRT-LLM and NVIDIA AI Foundations endpoints.


Also available at the GitHub site:


Tools
Example tools and tutorials to enhance LLM development and productivity when using NVIDIA RAG pipelines.

Open Source Integrations
These are open source connectors for NVIDIA-hosted and self-hosted API endpoints.
These open source connectors are maintained and tested by NVIDIA engineers.

NVIDIA support
In each example README we indicate the level of support provided.

Feedback / Contributions
We're posting these examples on GitHub to support the NVIDIA LLM community and facilitate feedback.
We invite contributions via GitHub Issues or pull requests!
 
NVIDIA launched an LLM tool that runs locally on your PC to answer questions about your own content.

Chat With RTX is a demo app that lets you personalize a GPT large language model (LLM) connected to your own content—docs, notes, videos, or other data. Leveraging retrieval-augmented generation (RAG), TensorRT-LLM, and RTX acceleration, you can query a custom chatbot to quickly get contextually relevant answers. And because it all runs locally on your Windows RTX PC or workstation, you’ll get fast and secure results.

 
Time to give it a try. I've been slowly assembling an LLM rig out of some cheap server parts and, well, it's not easy to get started from bare metal to K8S cluster to pytorch your own 7B LLM, to then distill and prune it down to size, and get it loaded up with the things I care about.

Wonder if it supports something like LangChain so I can have it perform actions and not just "talk to me."
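Frameworks like LangChain formalize this, but the underlying pattern is simple enough to sketch without any of them: expose a few named tools, ask the model to pick one and supply arguments as JSON, then dispatch it yourself. Everything below (the tools, the run_llm() placeholder) is illustrative:

import json

def toggle_light(room: str) -> str:
    return f"(pretend) toggled the {room} light"

def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()[:500]

TOOLS = {"toggle_light": toggle_light, "read_file": read_file}

def run_llm(prompt: str) -> str:
    # Placeholder: call your local model here; it should return JSON text.
    raise NotImplementedError

def act(request: str) -> str:
    prompt = (
        f"Available tools: {list(TOOLS)}. "
        'Reply with JSON like {"tool": "...", "args": {...}} for this request:\n'
        + request
    )
    call = json.loads(run_llm(prompt))
    tool = TOOLS.get(call.get("tool"))
    if tool is None:
        return "Model asked for an unknown tool; refusing."
    return tool(**call.get("args", {}))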
 
A bit more information ... Excited to see what additional functionality and features they can bring to models running locally.


"Chat with RTX was first announced in January. The app uses Retrieval Augmented Generation (RAG), in conjunction with TensorRT-LLM, which users can customise by granting access to specified files and folders. Running locally, Chat with RTX can pull information from text, PDF, doc and XML files, so you can swiftly have the app retrieve data for you. There is also a YouTube transcript function, so you can paste in a YouTube playlist and have the app generate a local transcript. You can then ask the bot questions about the content to sift through what you need.

I've had access to Chat with RTX for a few days already. The initial download is substantial, coming in at over 30GB, and the final install size can be as high as 100GB, depending on the AI models you have installed. To start off with, Chat with RTX will have access to two AI models, Mistral and Llama. The former is a model created by ex-Meta and Google employees, while the latter is a model created and released by Meta.

Installing the app takes some time. On my system with a Ryzen 9 5900X and RTX 4080, the install took around 20 minutes, with the LLM install taking the longest amount of time. The install can take as much as one hour depending on your internet connection and hardware. After the installer wraps up though, loading up the app takes very little time, and it works impressively quickly considering it is running locally.

As the app runs offline, you don't run the risk of exposing your sensitive data online, and you have greater control over what the AI has access to and can pull from. In its base form, Chat with RTX only has access to the RAG folder it comes with, so it can answer some basic questions regarding Nvidia RTX products and features, but won't be able to go beyond that. With that in mind, you'll want to point the app to your own dataset folder.
...
Chat with RTX also does not currently have the ability to remember context, so you won't be able to ask follow-up questions based on your original query. Whether or not this functionality will change over time remains to be seen, but running machine learning and massive data sets locally isn't exactly feasible for the majority of users."

Chat with RTX can be downloaded directly from the Nvidia website. You will need an RTX 3000 or RTX 4000 series graphics card to run the application.
 
Yeah, I actually ran up to my attic to grab an older 1TB SATA Sammy 850 so I can use it as the new LLM play area for my Windows rig. I'm REALLY curious how Chat with RTX performs on cards with single-digit gigabytes of memory... Edit: the fine print says it needs a card with 8GB+ memory.
 
I was thinking outside the box, and this tool could actually be pretty useful if Nvidia continues to expand its functionality and feature set.
The list of potential uses would be enormous: running this on home utility bills, home security log files/videos, auto/home maintenance records, home weather logs, bank/investment statements, computer log files, grocery shopping lists, etc. ... basically any information you have access to and can save on your hard drive.

Small businesses would also benefit from the mountain of information they store everyday.
 
Two additional models from Google to be added to the Chat with RTX AI tool.
NVIDIA, in collaboration with Google, today launched optimizations across all NVIDIA AI platforms for Gemma — Google’s state-of-the-art new lightweight 2 billion– and 7 billion-parameter open language models that can be run anywhere, reducing costs and speeding innovative work for domain-specific use cases.
...
Adding support for Gemma soon is Chat with RTX, an NVIDIA tech demo that uses retrieval-augmented generation and TensorRT-LLM software to give users generative AI capabilities on their local, RTX-powered Windows PCs.
 
A friend told me about a terminal app called 'Warp' which added AI-enhanced functions called Warp AI. It takes natural-language commands, such as "expose docker port in container," and outputs the commands you need to run.
However, it seems to handle AI requests remotely, not on a local LLM.
 