PDFs are designed for printing, not for machines to read. Pull text out of one with a naive tool like pdftotext or PyPDF2 and you get the wreckage: columns mashed together, tables destroyed, equations garbled. Anyone who has tried to feed a research paper into an AI tool has met this wall.
Marker is the fix for the “in” part of garbage in, garbage out. It uses AI-driven layout detection to convert PDFs into clean markdown: headings stay headings, tables become markdown tables, equations become LaTeX, code blocks stay together. The result is text that downstream work (summaries, RAG, analysis) can actually use. In the studio it is a quiet workhorse. We rarely think about it, but it is the first step every time we ingest research PDFs or feed background reading into a model. Here is how to set it up.
What you will end up with
- Marker installed in its own Python environment.
- One command that turns a PDF into clean markdown, with images extracted alongside it.
- A folder workflow for converting many documents at once, all on your own machine.
No upload, no cloud document service, no per-page fee for the default pipeline.
Before you start
You need a Mac, Windows, or Linux machine with at least 8GB of RAM. 16GB is more comfortable. A GPU is optional: it makes conversion faster, but CPU works, and Apple Silicon uses CPU/MPS automatically with no extra setup.
You also need Python 3 and a terminal. Marker is command-line and library first; there is no desktop app.
One honest expectation to set now: processing time. On a Mac CPU, a 50-page document can take five minutes or more. With a GPU it is much faster. This is batch work, not something you run live in front of someone. Convert ahead of time.
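To make the planning concrete, here is a back-of-the-envelope estimator. The 6 seconds per page is our own assumption, derived from the "50 pages in five minutes" figure above. Since that figure is a floor ("or more"), treat the result as a lower bound rather than a benchmark.

```python
def estimate_minutes(total_pages: int, seconds_per_page: float = 6.0) -> float:
    """Rough CPU conversion time in minutes for a batch of PDFs.

    seconds_per_page defaults to 6.0, an assumption taken from
    "50 pages in about five minutes" (300 s / 50 pages).
    """
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    return total_pages * seconds_per_page / 60.0


# Example: a hypothetical batch of twelve 50-page papers is 600 pages.
batch_pages = 12 * 50
print(f"{batch_pages} pages: about {estimate_minutes(batch_pages):.0f} minutes on CPU")
```

An hour for a modest reading list is why we treat conversion as an overnight or background job.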
Step 1: Install Marker in a virtual environment
Install Marker into its own Python environment so its dependencies stay isolated.
# create and activate a virtual environment
python3 -m venv marker-env
source marker-env/bin/activate
# install
pip install marker-pdf
On Apple Silicon, that is all you need; Marker picks up CPU/MPS on its own.
If you have an NVIDIA GPU on Windows or Linux and want the speed, install the CUDA build of PyTorch first, then Marker:
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install marker-pdf
On Windows, create and activate the venv with python -m venv marker-env and marker-env\Scripts\activate before those two commands.
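If you want to confirm which accelerator PyTorch can see inside the new environment, a small check like the one below can help. It is a sketch: it reports what PyTorch detects, and we are assuming Marker will use the same device. The function name detect_device is ours, not part of Marker.

```python
def detect_device() -> str:
    """Return 'cuda', 'mps', 'cpu', or 'torch-missing' for this environment."""
    try:
        import torch  # installed as a dependency of marker-pdf
    except ImportError:
        return "torch-missing"
    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds have no MPS backend, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(detect_device())
```

Run it with the virtual environment activated. If it prints cpu on a machine with an NVIDIA card, the CUDA build of PyTorch probably did not install.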
Step 2: Convert your first PDF
The single-file command is marker_single. Point it at a document and let it run.
marker_single document.pdf
When it finishes, the output lands in a folder next to the PDF:
document/
├── document.md # clean markdown
├── document_meta.json # metadata (page count, languages, processing time)
└── *.png # extracted images, referenced from the markdown
The .md file is the prize. The _meta.json is useful for confirming what happened, including how long it took.
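For a quick sanity check on an output folder, something like the following can summarize what landed there. It is a minimal sketch that assumes the layout shown above (one .md file, one _meta.json, PNG images). It deliberately reports only the top-level keys of the metadata file, because we are not assuming any particular field names. summarize_output is a hypothetical helper, not a Marker function.

```python
import json
from pathlib import Path


def summarize_output(folder) -> dict:
    """Summarize a Marker output folder, assuming the layout described above."""
    folder = Path(folder)
    md_files = sorted(folder.glob("*.md"))
    meta_files = sorted(folder.glob("*_meta.json"))
    images = sorted(folder.glob("*.png"))

    # Report only which metadata keys exist; their names are not assumed.
    meta_keys = []
    if meta_files:
        meta = json.loads(meta_files[0].read_text(encoding="utf-8"))
        if isinstance(meta, dict):
            meta_keys = sorted(meta)

    markdown = md_files[0].read_text(encoding="utf-8") if md_files else ""
    return {
        "markdown_chars": len(markdown),
        "image_count": len(images),
        "meta_keys": meta_keys,
        # Images on disk whose file name never appears in the markdown text.
        "unreferenced_images": [p.name for p in images if p.name not in markdown],
    }


# Usage, after converting document.pdf:
#   print(summarize_output("document"))
```

An empty markdown_chars or a long unreferenced_images list is a hint that the conversion deserves a closer look.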
Step 3: Convert a folder, and reach for the options when you need them
When you have a stack of PDFs, point Marker at the whole directory:
marker pdfs_folder/ output_folder/
A few flags worth knowing:
# choose an output format (markdown is default; also JSON, HTML)
marker_single document.pdf --output_format json
# handle a multi-language document
marker_single document.pdf --languages en,fr
# verbose output when something looks wrong
marker_single document.pdf --debug
# LLM-enhanced cleanup for tough cases (adds API cost per page)
marker_single document.pdf --use_llm --llm_model claude-3-7-sonnet-20250219
The default pipeline is fully local and free. The optional --use_llm mode calls a cloud model to clean up difficult cases, fuzzy scans and the like, and it does add cost per page. Leave it off for batch work; turn it on selectively for the documents that need it.
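One way to apply "off for batch work, on selectively" is a small wrapper script. The sketch below plans one marker_single call per PDF, skips files that already have output, and adds --use_llm only for file names you list as hard cases. It assumes the command name, the flag, and the output layout exactly as described in this article; these may differ between Marker versions. All function names are ours.

```python
import subprocess
from pathlib import Path
from typing import Iterable


def already_converted(pdf: Path) -> bool:
    """True if <stem>/<stem>.md exists next to the PDF (layout assumed from above)."""
    return (pdf.parent / pdf.stem / f"{pdf.stem}.md").is_file()


def build_command(pdf: Path, use_llm: bool = False) -> list:
    """Build the marker_single command line for one PDF."""
    cmd = ["marker_single", str(pdf)]
    if use_llm:
        cmd.append("--use_llm")  # paid, per-page API calls
    return cmd


def plan_batch(folder, hard_cases: Iterable[str] = ()) -> list:
    """Return one command per PDF that still needs converting."""
    hard = set(hard_cases)
    return [
        build_command(pdf, use_llm=pdf.name in hard)
        for pdf in sorted(Path(folder).glob("*.pdf"))
        if not already_converted(pdf)
    ]


def run_batch(plan: list) -> None:
    """Run the planned commands one at a time, stopping on the first failure."""
    for cmd in plan:
        subprocess.run(cmd, check=True)


# Usage (file name is a made-up example):
#   run_batch(plan_batch("pdfs_folder", hard_cases={"fuzzy_scan.pdf"}))
```

Separating the plan from the run lets you print the commands first and see exactly which documents will incur API cost.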
Prove it works
Pick a deliberately messy PDF: a research paper, a financial report, anything with multiple columns, tables, and headings. That is where naive extractors fall apart, so it is the honest test.
Run marker_single on it, wait for it to finish, then open the resulting .md file and check three things:
- Headings survived. Section titles are markdown headings, not run-together body text.
- Tables are tables. A table in the PDF came out as a markdown table you can read, not a scrambled block of numbers.
- Structure holds. Multi-column layout reads in the right order, and any equations came through as LaTeX.
Compare that to what pdftotext would have given you on the same file. The jump in usable structure is the whole point, and it is the moment that makes the tool worth installing.
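The three checks can also be roughed out in code, which is handy once you are converting more files than you can eyeball. This is a heuristic sketch, not a validator: it counts markdown headings, table delimiter rows, and dollar-delimited math, and it assumes equations are wrapped in $ or $$. Dollar amounts in a financial report may register as inline math.

```python
import re


def is_table_rule(line: str) -> bool:
    """True for a markdown table delimiter row such as |---|:---:|."""
    s = line.strip()
    return "|" in s and "-" in s and set(s) <= set("|-: ")


def structure_report(markdown: str) -> dict:
    """Count the structural markers a good conversion should contain."""
    lines = markdown.splitlines()
    return {
        "headings": sum(1 for l in lines if re.match(r"#{1,6}\s+\S", l)),
        "tables": sum(1 for l in lines if is_table_rule(l)),
        # Display math: $$ ... $$, possibly spanning several lines.
        "display_math": len(re.findall(r"\$\$.+?\$\$", markdown, flags=re.S)),
        # Inline math: $ ... $ on one line, not part of a $$ pair.
        "inline_math": len(re.findall(r"(?<!\$)\$[^$\n]+\$(?!\$)", markdown)),
    }


sample = (
    "# Results\n\nSome text with $x^2$ inline.\n\n"
    "| a | b |\n|---|---|\n| 1 | 2 |\n\n"
    "$$\nE = mc^2\n$$\n"
)
print(structure_report(sample))
```

Run the same function on pdftotext output for the same paper and the counts will typically sit at or near zero, which is the comparison in numbers.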
Trade-offs and gotchas
Marker is strong and dependable, with a few sharp edges worth knowing.
- Processing time is real on CPU. A 50-page document can take five minutes or more without a GPU. Plan to convert ahead of time, not on demand.
- GPL v3 license. Personal and internal use is fine. If you ship Marker inside a product, you are bound by GPL terms; read them first.
- Equations come out as LaTeX. Great for academic work, but you need a LaTeX renderer downstream to display them.
- Scanned PDFs vary more. Marker handles both native and scanned PDFs, but scan-only documents lean harder on the OCR layer, and quality is less consistent there.
- Image extraction multiplies files. A long PDF full of figures can produce dozens of PNGs alongside the markdown. Expect the clutter.
- --use_llm adds cost. It is excellent for difficult documents, but it makes API calls per page. Use it selectively, not by default.
Our verdict, in short: Marker is a quiet workhorse and the missing pre-processing step for AI on PDFs. The quality jump in RAG output, compared to relying on a tool’s built-in PDF extraction, is dramatic. For documentary research it turns “wrestling with the PDF” into “reading the markdown,” and the time saved compounds across a project. The author, Vik Paruchuri, also maintains Surya (the OCR engine Marker uses internally), and the project sees regular updates, which is the kind of context that earns trust.
Where to go next
Clean markdown is an input, not an end. The natural next step is to feed it somewhere.
Point a local model at the output with Ollama to summarize, extract quotes, or answer questions over a document. Or use Marker as the pre-processing stage for a documents tool like AnythingLLM: cleaner input means noticeably better retrieval, which is why we run Marker first rather than relying on a tool’s built-in extraction. We cover the documents-and-RAG side in a separate Playbook.
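Part of why clean headings matter downstream is that they give you natural chunk boundaries. As an illustration only, here is one simple way to split Marker's markdown into heading-titled sections before handing them to a model or a retrieval index. Tools like AnythingLLM do their own chunking, so this shows the idea rather than how any of them works. split_by_headings is a hypothetical helper.

```python
import re

HEADING = re.compile(r"(#{1,6})\s+(.*\S)")
FENCE = "`" * 3  # code-fence marker, built this way to keep the example tidy


def split_by_headings(markdown: str) -> list:
    """Split markdown into (title, body) pairs, one per heading.

    Text before the first heading gets an empty title. Lines inside
    fenced code blocks are never treated as headings.
    """
    sections = []
    title, body = "", []
    in_fence = False

    def flush():
        text = "\n".join(body).strip()
        if title or text:
            sections.append((title, text))

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            title, body = match.group(2), []
        else:
            body.append(line)
    flush()
    return sections


# Usage:
#   for title, text in split_by_headings(open("document/document.md").read()):
#       print(title, len(text))
```

Each section can then be summarized or embedded on its own, with the title kept alongside it as context.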
You now have a local pipeline that turns printer-shaped PDFs into machine-readable markdown, on your own machine, with no upload required. Curious about these things. You should be too.
Harness your curiosity.
— Stridenote · № 016