Playbooks Research Jun 07, 2026

Chat With Your Documents Locally (RAG) With AnythingLLM

Drag in a PDF, ask a question, get an answer with source citations, all on your own machine. Here is the exact RAG setup we demo in client meetings, built on AnythingLLM and Ollama.


“Can we chat with our documents?” is the question we hear most in client AI meetings. Done from scratch, the answer involves embeddings, a vector database, a retriever, a chunking strategy, and a chat UI to tie it all together. That is a lot of moving parts for what people imagine as a simple thing.

AnythingLLM bundles all of it into one app. Drag in a PDF, ask a question, and get an answer with clickable source citations, all running against a model on your own machine. The technique is called RAG (retrieval-augmented generation), but you do not need to know the acronym to use it. This is the setup we reach for most often when someone wants to see “chat with our data” work, fully private, in a single session. Here is how to build it.
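To make the acronym concrete, here is a minimal sketch of the retrieve-then-generate loop that AnythingLLM automates. It is a toy, not AnythingLLM's implementation: shared-word counting stands in for real embeddings, and the helpers `retrieve` and `build_prompt` are hypothetical names that only assemble the text a model would receive.

```python
import re

def tokenize(text: str) -> set[str]:
    # Lowercase word set; a crude stand-in for an embedding model.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question: str, passages: list[str], k: int = 2) -> list[str]:
    # Toy retriever: rank passages by how many question words they share.
    q = tokenize(question)
    ranked = sorted(passages, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, context: list[str]) -> str:
    # The "augmented" part: retrieved text is placed in front of the question.
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{joined}\n\n"
        f"Question: {question}"
    )

passages = [
    "The refund window is 30 days from delivery.",
    "Our office is closed on public holidays.",
    "Refunds are issued to the original payment method.",
]
question = "How long is the refund window?"
print(build_prompt(question, retrieve(question, passages)))
```

Word overlap misses that "refund" and "refunds" mean the same thing; real embedding models capture that kind of similarity, which is why RAG systems use them.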

What you will end up with

  • Ollama serving a local model on your machine.
  • AnythingLLM, the desktop app, pointed at that model.
  • A workspace you can drag documents into and ask questions of, with source citations and nothing leaving your disk.

No API keys. No uploading sensitive documents to someone else’s server. No subscription.

Before you start

You need a Mac, Windows, or Linux machine. AnythingLLM itself is light (around 4GB of RAM for the app), but it leans on a local LLM to do the actual answering, and that model wants headroom. Plan for 16GB of RAM total so a 7B model has room to run alongside the app. Apple Silicon Macs are noticeably faster than Intel Macs here.
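A rough way to see why 16GB is the comfortable target: model weights take about parameters × bits-per-weight ÷ 8 bytes. This back-of-the-envelope sketch assumes roughly 4 bits per weight, typical of the quantized models Ollama downloads by default (check the quantization listed for your model), and it ignores the context cache and runtime overhead, so treat the result as a floor rather than a budget.

```python
def weights_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    # Bytes = parameters x bits per weight / 8, reported in decimal GB (1e9 bytes).
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Assumption: ~4-bit quantization. Weights only; no context cache or overhead.
print(f"7B @ 4-bit:  {weights_gb(7):.1f} GB")      # 3.5 GB
print(f"7B @ 16-bit: {weights_gb(7, 16):.1f} GB")  # 14.0 GB
```

On a 16GB machine, about 3.5GB of weights plus the app's own ~4GB leaves room for the operating system and the context cache; an unquantized 16-bit 7B model (about 14GB) would not fit alongside them.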

The one prerequisite that matters: you need a local LLM running before AnythingLLM is useful. We use Ollama for this, and step 1 covers it. If you already have Ollama installed with a model pulled, skip ahead to step 2.

Step 1: Get Ollama running with a model

Ollama is the engine that serves the model AnythingLLM will talk to.

# macOS
brew install ollama

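On Windows, winget install Ollama.Ollama does the same job, or download the installer from https://ollama.com/download. On Linux, run curl -fsSL https://ollama.com/install.sh | sh. The Windows installer and the Linux script leave Ollama running quietly as a background service; with Homebrew on macOS, start it yourself with brew services start ollama (or run ollama serve in a terminal).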

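Now pull a model. llama3.2 is a small 3B model that downloads quickly and is fine for a first test:

ollama pull llama3.2

A 7B or larger model gives noticeably better answers for document work, so once everything works, consider pulling one, for example ollama pull llama3.1 (8B).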

Confirm it landed:

ollama list

You should see your model in the list. Ollama listens on http://localhost:11434 by default, which is where AnythingLLM will find it.
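If you want to confirm from a script that AnythingLLM will find your model, Ollama's HTTP API lists local models at GET /api/tags. A minimal sketch using only the Python standard library; the `has_model` helper and its rule that a bare name means the `:latest` tag are our own assumptions, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address

def model_names(tags_json: dict) -> list[str]:
    # /api/tags returns {"models": [{"name": "llama3.2:latest", ...}, ...]}
    return [m["name"] for m in tags_json.get("models", [])]

def has_model(tags_json: dict, model: str) -> bool:
    # Assumption: a bare name like "llama3.2" refers to the ":latest" tag.
    wanted = model if ":" in model else f"{model}:latest"
    return wanted in model_names(tags_json)

def fetch_tags(base_url: str = OLLAMA_URL) -> dict:
    # Requires Ollama to be running locally.
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)

if __name__ == "__main__":
    try:
        print("llama3.2 available:", has_model(fetch_tags(), "llama3.2"))
    except OSError as err:  # URLError is a subclass of OSError
        print("Ollama not reachable:", err)
```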

Step 2: Install AnythingLLM

  1. Download the desktop app from https://anythingllm.com/desktop.
  2. On a Mac, open the .dmg and drag AnythingLLM to Applications. On Windows, run the installer. On Linux, make the AppImage executable and run it.
  3. Launch it.

The desktop app is the right choice for a single person. There is also a Docker version for small-team rollouts, but for one user on one machine, the desktop app removes all the friction.

Step 3: Walk through the setup wizard

On first launch, AnythingLLM runs a short setup wizard. Three choices matter:

  1. LLM provider. Choose Ollama. If Ollama is running, the app usually auto-detects it. Confirm the model you pulled in step 1 is selected.
  2. Embedding provider. Leave it on the built-in default. It is local and fine for getting started.
  3. Vector database. Leave it on the built-in default (LanceDB). Also fine.

The defaults for embedding and vector storage are the right call until you have a specific reason to change them. The only choice you actually need to get right is pointing the LLM provider at Ollama.
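If you later do have a reason to change the embedding provider (the default embedder can be slow on CPU), one option is to serve embeddings from Ollama as well, since AnythingLLM lists Ollama as an embedding provider. Before switching, you can check that an embedding model responds. A minimal sketch against Ollama's POST /api/embed endpoint, assuming you have pulled an embedding model such as nomic-embed-text (ollama pull nomic-embed-text); the helper names are hypothetical, and whether this is faster than the default depends on your hardware.

```python
import json
import urllib.request

def embed_request(texts: list[str], model: str = "nomic-embed-text") -> dict:
    # Request body for /api/embed; "input" accepts a string or a list of strings.
    return {"model": model, "input": texts}

def parse_embeddings(response: dict) -> list[list[float]]:
    # The response holds one vector per input under "embeddings".
    return response["embeddings"]

def embed(texts: list[str], base_url: str = "http://localhost:11434") -> list[list[float]]:
    # Passing data= makes urllib send a POST request.
    req = urllib.request.Request(
        f"{base_url}/api/embed",
        data=json.dumps(embed_request(texts)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return parse_embeddings(json.load(resp))

if __name__ == "__main__":
    try:
        vectors = embed(["first chunk", "second chunk"])
        print(len(vectors), "vectors of", len(vectors[0]), "dimensions")
    except OSError as err:  # URLError and HTTPError both subclass OSError
        print("Ollama not reachable or model missing:", err)
```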

Step 4: Create a workspace and add a document

A workspace is a folder for related documents and the chats about them. Keep one workspace per project or client, not one giant pile.

  1. Create a workspace and give it a name.
  2. Open it.
  3. Click the upload icon and drag in a document: PDF, .docx, .txt, or .md all work. You can also paste a URL.
  4. Select the uploaded file, click Move to Workspace, then Save and Embed. Wait for embedding to finish: under a minute for a typical document, a few minutes for a 100-page PDF.

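Embedding is the step where AnythingLLM splits the document into chunks and turns each chunk into a vector, a list of numbers that captures its meaning, so passages can be found by similarity to your question. It only happens once per document.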
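Turning a document into something searchable starts with chunking: the text is cut into overlapping pieces, and each piece is embedded separately. A minimal sketch of fixed-size word chunking with overlap; `chunk_words` is a hypothetical helper, and the sizes are illustrative, not AnythingLLM's actual settings.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of `size` words; each shares `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks

text = " ".join(f"w{i}" for i in range(10))
for c in chunk_words(text, size=4, overlap=1):
    print(c)
```

The overlap makes it more likely that a sentence straddling a chunk boundary appears whole in at least one chunk, at the cost of storing some text twice.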

Prove it works

Ask a question that can only be answered from the document you just added:

According to the document I uploaded, what are the main points in section 2,
and what does it recommend?

Two things tell you it is working:

  • The answer is specific to your document, not a generic response a model would give without it.
  • The answer includes source citations. Click one and AnythingLLM shows you the original passage it drew from.
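Citations work because every chunk keeps a pointer back to where it came from. A minimal sketch of the idea, assuming a hypothetical `Chunk` record with a file name and page number: retrieved chunks are numbered in the prompt, so an answer can say [1] or [2] and the interface can map that number back to the original passage. This illustrates the technique; it is not AnythingLLM's internal format.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str  # file the chunk came from
    page: int    # page number within that file

def format_context(chunks: list[Chunk]) -> str:
    # Number each chunk so the model can cite it as [1], [2], ...
    return "\n".join(f"[{i}] {c.text}" for i, c in enumerate(chunks, start=1))

def resolve_citation(chunks: list[Chunk], n: int) -> str:
    # Map a citation number back to its original location.
    c = chunks[n - 1]
    return f"{c.source}, p. {c.page}"

chunks = [
    Chunk("Section 2 recommends quarterly audits.", "policy.pdf", 4),
    Chunk("Audits are run by an external firm.", "policy.pdf", 5),
]
print(format_context(chunks))
print(resolve_citation(chunks, 2))  # policy.pdf, p. 5
```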

The citations are the part that builds trust. Plenty of people have watched a chatbot invent a confident, wrong answer. Watching AnythingLLM point at the exact passage it used is a different experience, and it is the moment the whole thing clicks. And the point worth saying out loud: your documents never left this laptop.

For a stronger test, add a second related document and ask a question that spans both:

What does the first document say about X, and how does that compare to the
second document's position on X?

A good answer will cite passages from both.
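Whether a cross-document answer can cite both files comes down to retrieval: the top-ranked chunks have to include passages from each document. A minimal sketch of cosine-similarity top-k search; the three-dimensional vectors are made up for illustration (real embedding models produce hundreds of dimensions), and `top_k` is a hypothetical helper, not an AnythingLLM function.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product divided by the product of vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top_k(query: list[float], index: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    # Return the ids of the k chunks most similar to the query vector.
    scored = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
    return [chunk_id for chunk_id, _ in scored[:k]]

# Made-up vectors: two chunks from doc1, one from doc2.
index = [
    ("doc1#chunk3", [0.9, 0.1, 0.0]),
    ("doc1#chunk7", [0.1, 0.9, 0.0]),
    ("doc2#chunk2", [0.8, 0.2, 0.1]),
]
query = [1.0, 0.1, 0.0]
print(top_k(query, index))  # ['doc1#chunk3', 'doc2#chunk2']
```

Here the two best matches come from different documents, so the model sees evidence from both and can cite both.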

Trade-offs and gotchas

AnythingLLM is the cleanest path to local document chat, but it is worth knowing where the edges are.

  • First ingest takes time. A 100-page PDF can take a few minutes to embed. For a quick demo, use small documents or do the embedding before anyone is watching.
  • The default embedding model is local but slow on CPU. If speed becomes a problem, switch to a faster embedding provider in settings.
  • Citations are good, not perfect. Sometimes a citation points at a nearby chunk rather than the exact passage. Worth watching for, especially in a demo.
  • Default chunking is good enough for most work, not state of the art. For complex documents with tables, formulas, or multi-column layouts, results improve noticeably if you first convert the PDF to clean markdown with Marker, then ingest that.
  • Storage grows with embeddings. A few hundred megabytes per large workspace is normal. Keep an eye on disk usage over time.
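Where the disk space goes: each chunk is stored as a vector of 32-bit floats plus its text. A back-of-the-envelope sketch, assuming 384-dimensional vectors (the size produced by all-MiniLM-L6-v2, a common small local embedder; check which model your install uses), about 1,000 bytes of text per chunk, and no index overhead. All three numbers are assumptions for illustration.

```python
def storage_mb(num_chunks: int, dims: int = 384, avg_text_bytes: int = 1000) -> float:
    # Each vector takes dims x 4 bytes (float32); each chunk also stores its text.
    vector_bytes = num_chunks * dims * 4
    text_bytes = num_chunks * avg_text_bytes
    return (vector_bytes + text_bytes) / 1e6

# 50,000 chunks x (1,536 vector bytes + 1,000 text bytes)
print(f"{storage_mb(50_000):.1f} MB")  # 126.8 MB
```

Larger embedding models (1,024 dimensions and up) multiply the vector share by two to three times, which is how a large workspace reaches a few hundred megabytes.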

Our verdict, in short: this is the most-used demo tool in our client meetings, because “chat with our docs” is the number one ask and AnythingLLM is the cleanest way to show it working, fully local, in a single sitting. The desktop app removes all friction for one person; the Docker version handles small teams. The defaults are good, and you only reach past them when a document is genuinely complex.

Where to go next

The model is the ceiling on answer quality. If responses feel thin, the most direct upgrade is a stronger local model in Ollama; the rest of the setup stays exactly as it is.

From here, the natural extensions reuse what you have built. Feed in transcripts so an interview archive becomes searchable (our WhisperX Playbook covers producing those). For dense, table-heavy PDFs, add Marker to the front of the pipeline for cleaner ingestion. Both build on the workspace and the model you already have running.

You now have a private document assistant: drag in a file, ask a question, get a cited answer, with nothing leaving your machine. Curious about these things. You should be too.

Harness your curiosity.

— Stridenote · № 008