
Ollama

“Only Apple can do this”
Variously attributed to Tim Cook

Apple introduced Apple Intelligence at WWDC 2024. After waiting almost a year for Apple to, in Craig Federighi’s words, “get it right”, its promise of “AI for the rest of us” feels just as distant as ever.

While we wait for Apple Intelligence to arrive on our devices, something remarkable is already running on our Macs. Think of it as a locavore approach to artificial intelligence: homegrown, sustainable, and available year-round.

This week on NSHipster, we’ll look at how you can use Ollama to run LLMs locally on your Mac — both as an end-user and as a developer.


What is Ollama?

Ollama is the easiest way to run large language models on your Mac. You can think of it as “Docker for LLMs” - a way to pull, run, and manage AI models as easily as containers.

Download Ollama with Homebrew or directly from their website. Then pull and run llama3.2 (2GB).

$ brew install --cask ollama
$ ollama run llama3.2
>>> Tell me a joke about Swift programming.
What's an Apple developer's favorite drink?
The Kool-Aid.

Under the hood, Ollama is powered by llama.cpp. But where llama.cpp provides the engine, Ollama gives you a vehicle you’d actually want to drive — handling all the complexity of model management, optimization, and inference.

Similar to how Dockerfiles define container images, Ollama uses Modelfiles to configure model behavior:

FROM mistral:latest
PARAMETER temperature 0.7
TEMPLATE """
You are a helpful assistant.

User: {{ .Prompt }}
Assistant: """

Ollama uses the Open Container Initiative (OCI) standard to distribute models. Each model is split into layers and described by a manifest, the same approach used by Docker containers:

{"mediaType":"application/vnd.oci.image.manifest.v1+json","config":{"mediaType":"application/vnd.ollama.image.config.v1+json","digest":"sha256:..."},"layers":[{"mediaType":"application/vnd.ollama.image.layer.v1+json","digest":"sha256:...","size":4019248935}]}

Overall, Ollama’s approach is thoughtful and well-engineered. And best of all, it just works.

What’s the big deal about running models locally?

Jevons paradox states that, as something becomes more efficient, we tend to use more of it, not less.

Having AI on your own device changes how you use it. When inference is essentially free, you start applying it to problems you would never pay per-token rates to solve.

While frontier models like GPT-4 and Claude are undeniably miraculous, there’s something to be said for the small miracle of running open models locally.

  • Privacy: Your data never leaves your device. Essential for working with sensitive information.
  • Cost: Run 24/7 without usage meters ticking. No more rationing prompts like ’90s cell phone minutes. Just a fixed, up-front cost for unlimited inference.
  • Latency: No network round-trips means faster responses. Your /M\d Mac((Book( Pro| Air)?)|Mini|Studio)/ can easily generate dozens of tokens per second. (Try to keep up!)
  • Control: No black-box RLHF or censorship. The AI works for you, not the other way around.
  • Reliability: No outages or API quota limits. 100% uptime for your exocortex. Like having Wikipedia on a thumb drive.

Building macOS Apps with Ollama

Ollama also exposes an HTTP API on port 11434 (leetspeak for llama 🦙). This makes it easy to integrate with any programming language or tool.
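For example, here's a minimal sketch of calling that API directly from Swift with URLSession, assuming the Ollama app is running locally and llama3.2 has already been pulled:

import Foundation

// Minimal sketch: call Ollama's REST API directly.
// Assumes Ollama is running on its default port with llama3.2 pulled.
struct GenerateRequest: Encodable {
    let model: String
    let prompt: String
    let stream: Bool
}

struct GenerateResponse: Decodable {
    let response: String
}

func generate(prompt: String) async throws -> String {
    var request = URLRequest(url: URL(string: "http://localhost:11434/api/generate")!)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONEncoder().encode(
        GenerateRequest(model: "llama3.2", prompt: prompt, stream: false)
    )

    let (data, _) = try await URLSession.shared.data(for: request)
    return try JSONDecoder().decode(GenerateResponse.self, from: data).response
}

Of course, you don't have to hand-roll these requests yourself.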

To that end, we’ve created the Ollama Swift package to help developers integrate Ollama into their apps.

Text Completions

The simplest way to use a language model is to generate text from a prompt:

import Ollama

let client = Client.default

let response = try await client.generate(
    model: "llama3.2",
    prompt: "Tell me a joke about Swift programming.",
    options: ["temperature": 0.7]
)
print(response.response)
// How many Apple engineers does it take to document an API?
// None - that's what WWDC videos are for.

Chat Completions

For more structured interactions, you can use the chat API to maintain a conversation with multiple messages and different roles:

let initialResponse = try await client.chat(
    model: "llama3.2",
    messages: [
        .system("You are a helpful assistant."),
        .user("What city is Apple located in?")
    ]
)
print(initialResponse.message.content)
// Apple's headquarters, known as the Apple Park campus, is located in Cupertino, California.
// The company was originally founded in Los Altos, California, and later moved to Cupertino in 1997.

let followUp = try await client.chat(
    model: "llama3.2",
    messages: [
        .system("You are a helpful assistant."),
        .user("What city is Apple located in?"),
        .assistant(initialResponse.message.content),
        .user("Please summarize in a single word")
    ]
)
print(followUp.message.content)
// Cupertino

Generating text embeddings

Embeddings convert text into high-dimensional vectors that capture semantic meaning. These vectors can be used to find similar content or perform semantic search.

For example, if you wanted to find documents similar to a user’s query:

let documents: [String] = … // the corpus to search

// Convert text into vectors we can compare for similarity
let embeddings = try await client.embeddings(
    model: "nomic-embed-text",
    texts: documents
)

/// Finds relevant documents
func findRelevantDocuments(
    for query: String,
    threshold: Float = 0.7, // cutoff for matching, tunable
    limit: Int = 5
) async throws -> [String] {
    // Get an embedding for the query, using the same model as the documents
    let queryEmbedding = try await client.embeddings(
        model: "nomic-embed-text",
        texts: [query]
    )[0]

    // See: https://en.wikipedia.org/wiki/Cosine_similarity
    func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
        let dotProduct = zip(a, b).map(*).reduce(0, +)
        let magnitude = { (v: [Float]) -> Float in sqrt(v.map { $0 * $0 }.reduce(0, +)) }
        return dotProduct / (magnitude(a) * magnitude(b))
    }

    // Find documents above the similarity threshold
    let rankedDocuments = zip(embeddings, documents)
        .map { embedding, document in
            (similarity: cosineSimilarity(embedding, queryEmbedding), document: document)
        }
        .filter { $0.similarity >= threshold }
        .sorted { $0.similarity > $1.similarity }
        .prefix(limit)

    return rankedDocuments.map(\.document)
}

Building a RAG System

Embeddings really shine when combined with text generation in a RAG (Retrieval Augmented Generation) workflow. Instead of asking the model to generate information from its training data, we can ground its responses in our own documents by:

  1. Converting documents into embeddings
  2. Finding relevant documents based on the query
  3. Using those documents as context for generation

Here’s a simple example:

letquery="What were AAPL's earnings in Q3 2024?"letrelevantDocs=tryawaitfindRelevantDocuments(query:query)letcontext="""
        Use the following documents to answer the question. 
        If the answer isn't contained in the documents, say so.
        Documents:
        \(relevantDocs.joined(separator:"\n---\n"))
        Question: \(query)"""letresponse=tryawaitclient.generate(model:"llama3.2",prompt:context)

To summarize: Different models have different capabilities.

  • Models like llama3.2 and deepseek-r1 generate text.
    • Some text models have “base” or “instruct” variants, suitable for fine-tuning or chat completion, respectively.
    • Some text models are tuned to support tool use, which lets them perform more complex tasks and interact with the outside world.
  • Models like llama3.2-vision can take images along with text as inputs.

  • Models like nomic-embed-text create numerical vectors that capture semantic meaning.

With Ollama, you get unlimited access to a wealth of these and many more open-source language models.


So, what can you build with all of this?
Here’s just one example:

Nominate.app

Nominate is a macOS app that uses Ollama to intelligently rename PDF files based on their contents.

Like many of us striving for a paperless lifestyle, you might find yourself scanning documents only to end up with cryptically-named PDFs like Scan2025-02-03_123456.pdf. Nominate solves this by combining AI with traditional NLP techniques to automatically generate descriptive filenames based on document contents.


The app leverages several technologies we’ve discussed:

  • Ollama’s API for content analysis via the ollama-swift package
  • Apple’s PDFKit for OCR
  • The Natural Language framework for text processing
  • Foundation’s DateFormatter for parsing dates
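Here's a rough sketch of the core idea (not Nominate's actual implementation), using PDFKit to extract a document's text and the ollama-swift client from earlier to propose a filename:

import PDFKit
import Ollama

// Rough sketch, not Nominate's actual implementation:
// extract text from a PDF and ask a local model to suggest a descriptive filename.
func suggestFilename(for url: URL) async throws -> String? {
    guard let document = PDFDocument(url: url),
          let text = document.string?.prefix(2000) // the first page or two is usually enough
    else { return nil }

    let response = try await Client.default.generate(
        model: "llama3.2",
        prompt: """
        Suggest a short, descriptive filename (without extension) for a document \
        that begins with the following text. Respond with the filename only.

        \(text)
        """
    )
    return response.response.trimmingCharacters(in: .whitespacesAndNewlines)
}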

Looking Ahead

“The future is already here – it’s just not evenly distributed yet.”
William Gibson

Think about the timelines:

  • Apple Intelligence was announced last year.
  • Swift came out 10 years ago.
  • SwiftUI 6 years ago.

If you wait for Apple to deliver on its promises, you’re going to miss out on the most important technological shift in a generation.

The future is here today. You don’t have to wait. With Ollama, you can start building the next generation of AI-powered apps right now.


Model Context Protocol (MCP)


Language Server Protocol (LSP) revolutionized how programming languages integrate with developer tools. Model Context Protocol (MCP) aims to do the same for a new generation of AI tools.

But before we lay our scene of two protocols (both alike in dignity), let’s take a moment to motivate the problem a bit more.

Tool Use & Agents

Today’s frontier models are extraordinary. But they’re limited in two key ways:

  1. They don’t know anything outside their training data
  2. They can’t do anything outside of predicting the next token

To their credit, LLMs (kinda, sorta) understand these limitations, and have been trained to ask for what they need. We (humans) have settled on a practice called tool use that lets AI do just that.

Here’s an example:

User
Write a limerick about today's weather in Portland, Oregon.
Client, to Assistant →
Also, there's this "Get Weather" tool available if you need it. It takes a set of coordinates and returns the current temperature and conditions.
Assistant
<thinking>To write an accurate poem, I should get the weather. From my training, I know Portland's coordinates to be 45.5155° N, 122.6789° W</thinking>
Tool call: {tool: "Get Weather", latitude: 45.5155, longitude: -122.6789}
← Client, to User
Is it alright to make that tool call?
User
👍
Client, to Assistant →
Tool result: {temperature: 12, conditions: 'cloudy', humidity: 85%}
Assistant
Here's your limerick about Portland's weather:
There once was a city called Portland
Where clouds made the skies quite importland
At twelve degrees C
It's as gray as can be
With a dampness that's truly absorbland!

Pretty clever, right?
But then again, so what?

On its face, this is a very slow, expensive way to do something better-served by a quick Google search — or sticking your head out a window, for that matter.

But here’s the thing:
Tool use gives rise to agency (✨𝓮𝓶𝓮𝓻𝓰𝓮𝓷𝓬𝓮✨)

As the saying goes, Quantity has a quality all its own. Give a language model a dozen… a hundred… a thousand tools — hell, give it a tool for making more tools. How confident are you that you could find a problem that couldn’t be solved by such a system?

We’re only just starting to see what’s possible.


But back to more practical matters:
Let’s talk about where we are today with Model Context Protocol.


The New M × N Problem

We’ve written previously about Language Server Protocol, and the M × N problem. LSP’s challenge was connecting M editors with N programming languages. MCP faces a similar challenge: connecting M clients with N resources. Without MCP, each AI application must implement custom integrations for every data source it wants to access.

This creates the same kind of fragmented ecosystem that plagued development tools before LSP:

  • Some AI applications offer deep integration with specific data sources but limited support for others
  • Different applications implement similar integrations in incompatible ways
  • Improvements to one integration rarely benefit the broader ecosystem

Like LSP, MCP transforms this M × N problem into an M + N problem through standardization. Rather than each AI application implementing custom integrations, it only needs to support the MCP standard. In doing so, it gains access to all MCP-compatible data sources and tools 🌈

How Model Context Protocol Works

MCP follows a client-server architecture similar to LSP:

  • The client is typically an AI application or development environment
    For example, Claude Desktop, Zed, and Cursor.
  • The server is a program that provides access to data and/or tools

Requests and responses are encoded according to the JSON-RPC 2.0 specification. Communication between client and server happens over Stdio (stdin/stdout) or HTTP with Server-Sent Events transports.

Like LSP, MCP has clients and servers negotiate a set of capabilities. When a client connects to a server, it sends an initialize message, with information about what protocol version it supports. The server responds in kind.
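For example, the handshake looks roughly like this (the capability objects are abbreviated here; see the MCP specification for their full shape):

{
  "jsonrpc": "2.0",
  "id": 0,
  "method": "initialize",
  "params": {
    "protocolVersion": "2024-11-05",
    "capabilities": {},
    "clientInfo": {"name": "ExampleClient", "version": "1.0.0"}
  }
}

And the server's reply:

{
  "jsonrpc": "2.0",
  "id": 0,
  "result": {
    "protocolVersion": "2024-11-05",
    "capabilities": {"tools": {}},
    "serverInfo": {"name": "ExampleServer", "version": "1.0.0"}
  }
}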

From there, the client can ask the server about what features it has. MCP describes three different kinds of features that a server can provide:

  • Prompts: Templates that shape how language models respond. They’re the difference between getting generic text and precise, useful results. A good prompt is like a well-designed API - it defines the contract between you and the model.
  • Resources: Reference materials that ground models in reality. By providing structured data alongside your query, you transform a model from a creative writer into an informed assistant with domain-specific knowledge. (Think: databases, file systems, documents)
  • Tools: Functions that extend what models can do. They allow AI to calculate, retrieve information, or interact with external systems when simple text generation isn’t enough. Tools bridge the gap between language understanding and practical capability.

Our previous example handwaved the existence of a “Get Weather” tool. MCP gives our client a standard way to consult various connected services.

To get a list of available tools, the client sends a tools/list request to the MCP server:

{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}

In our example, the server would respond:

{"jsonrpc":"2.0","id":1,"result":{"tools":[{"name":"get_weather","description":"Returns current weather conditions for the specified coordinates.","inputSchema":{"type":"object","properties":{"latitude":{"type":"number"},"longitude":{"type":"number"}},"required":["latitude","longitude"]}}]}}

The client can share this list of tools with the language model in a system prompt or a user message. When the model responds wanting to invoke the get_weather tool, the client asks the user to confirm tool use. If the human-in-the-loop says 🆗, the client sends a tools/call request:

{"jsonrpc":"2.0","id":2,"method":"tools/call","params":{"name":"get_weather","arguments":{"latitude":45.5155,"longitude":-122.6789}}}

In response, the server sends:

{"jsonrpc":"2.0","id":2,"content":[{"type":"text","text":"{\"temperature\": 12, \"conditions\": \"cloudy\", \"humidity\": 85}""annotations":{"audience":["assistant"]}}]}

The client then passes that result to the AI assistant, the assistant generates a response with this information, and the client passes that along to the user.


That’s pretty much all there is to it. There are plenty of details to get bogged down with. But that’s what LLMs are for. Now is the time for vibes coding.
MCP is punk rock.

How do I start?

MCP is an emerging standard from Anthropic. So it’s no surprise that Claude Desktop is most capable of showing off what it can do.

Once you have Claude Desktop installed, you can peruse the myriad example servers available.
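To hook a server up, you list it in Claude Desktop's configuration file (claude_desktop_config.json, found on macOS under ~/Library/Application Support/Claude). The server name and command below are placeholders for whatever you're running:

{
  "mcpServers": {
    "example-server": {
      "command": "/path/to/example-server",
      "args": []
    }
  }
}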

Or, if you want to skip straight to la crème de la crème, then have a taste of what we’ve been cooking up with MCP lately:

iMCP

Fun fact! The word “paradise” derives from an old word for “walled garden”.

Ironic how Apple has a way of making your digital life a living hell sometimes.

For many of us who exist in Apple’s walled garden, we’re often frustrated by the product design and software quality that gets between us and our data. Spotlight search is stuck in the ‘00s. Apple Intelligence didn’t live up to the hype. Siri seems doomed to suck forever.

That was our motivation for building iMCP.

iMCP is a macOS app for connecting your digital life with AI. It works with Claude Desktop and a growing list of clients that support MCP. It gives MCP clients access to your calendars, contacts, even messages — no small feat!

Download it today and get a taste of some real Apple intelligence.

mcp-swift-sdk

In the process of building iMCP, we built a Swift SDK for Model Context Protocol servers and clients.

If you’re inspired to build your own MCP app and like working in Swift more than Python or TypeScript, definitely give this a try!

hype

If, however, you have accepted Python into your heart as I have, then I’d recommend checking out another project I’ve been working on: hype.

My goal with hype is to eliminate every barrier between writing Python code and calling it in a way that’s useful. Add the @hype.up decorator to a function to instantly generate an HTTP API, a CLI, a GUI, or an MCP.

# example.py
import hype
from pydantic import Field

@hype.up
def divide(
    x: int,
    y: int = Field(gt=0),
) -> int:
    """
    Divides one number by another.

    :param x: The numerator
    :param y: The denominator
    :return: The quotient
    """
    return x // y

Start up an MCP server with the hype command:

$ hype mcp example.py

emcee

But really, the best code is code you don’t have to write. If you already have a web application with an OpenAPI specification, you can use another tool we built — emcee — to instantly spin up an MCP server in front of it.

We think emcee is a convenient way to connect to services that don’t have an existing MCP server implementation — especially for services you’re building yourself. Got a web app with an OpenAPI spec? You might be surprised how far you can get without a dashboard or client library.


In case it’s not abundantly clear, we here at NSHipster dot com are pretty bought into the promise of Model Context Protocol. And we’re excited to see where everything goes in the coming months.

If you’re building in this space, I’d love to hear from you✌️
