
vs - ollama 0.1.26

I think we may have what could be the most significant release for Ollama in a long time, which you can find out more about at ollama.com. I am thinking it may be one of the top 5 or so most significant new features. But when I watched the Discord for what folks were excited about, I think they were looking at the wrong thing. They saw the top item, the headliner, which is about a new model, Gemma, from Google. But the exciting thing here is just one of the line items down below. The line is support for BERT and Nomic BERT embedding models. This is huge. And it’s a foundational feature that allows Ollama to be used in far more places than ever before. It’s the feature I begged the team to deliver on in August. And it’s finally here.

Embedding is all about creating a vector that represents the semantic meaning of whatever data you provide to the model. The most common use case for embedding is RAG search. RAG lets you find content relevant to your question so you can provide the model with the right inputs to come up with a good answer. Even with the larger contexts of today’s latest models, RAG is still incredibly important for keeping the model on point and keeping things fast.

I just heard a podcast today talking about how RAG isn’t very good because it only delivers a few hopefully relevant fragments of the source text. Sure, that’s the Hello World of RAG, but there are plenty of RAG examples that use those fragments along with larger sections, plus summaries, plus summaries of summaries, to make what the database delivers more relevant. I even made a code sample in the Ollama repo that shows this using a portion of the collection of the Art Institute of Chicago, and that was back in September.

When you build out a vector database for RAG, you provide a lot of content that first needs to be ‘embedded’. That embedding, along with the source text, is stored in the database. When you ask a question, that question is also converted to an embedding. Your question embedding is the same size as all the other embeddings for all your content. So it’s actually very easy to mathematically compare the embeddings to find the ones closest, or most similar, to your question. And often a vector database can do that comparison with millions or more other embeddings in way less than a second.
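To make that comparison a little more concrete, here is a minimal sketch of cosine similarity, the kind of closeness measure vector databases typically use under the hood. The function is just for illustration, not code from any particular database:

```typescript
// Cosine similarity between two embeddings of the same length.
// A score near 1 means very similar meaning; a score near 0 means unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let magA = 0;
  let magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

// A vector database effectively runs this against every stored embedding
// (with clever indexing so it stays fast at millions of rows)
// and returns the chunks with the highest scores.
```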

As a side note, there are a bunch of different vector databases out there that you can use in your applications. I can’t really say what’s different about them. The main differentiators seem to be around how easy they are to use in Python vs TypeScript, how easy they are to host outside of an application, or how they serialize the information to disk. I’m not sure there is a big difference in how fast they can filter information, but there may be a difference in their maximum capacity. I think that would be a good thing for me to look at in a future video. Let me know if that would be interesting to you.

Ollama has supported embedding for a long time, but in the past it only used the regular models like Llama 2, Mistral, and everything else on ollama.com. This works, but it turns out to not be super accurate. And even worse, it’s really slow. So whenever someone asked, I would usually recommend not using Ollama embeddings and instead using the Hugging Face APIs that you can run locally to do that embedding. But every time I did that, I had to reread the same article to figure out how to set it up. It tended to be hard to do. And so most folks just went to the OpenAI embedding model instead.

So let’s take a look at how to do this with Ollama now that we have version 0.1.26.

First, let’s try running this on the command line. I am using curl with a simple phrase. And there is our embedding. Superfast. But hopefully, your documents are a bit more complicated than this.
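For reference, the request looks something like this. The phrase here is just a placeholder, and it assumes you have already pulled nomic-embed-text:

```bash
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "Llamas are members of the camelid family"
}'
```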

Normally, you don’t want to embed an entire document all at once. Instead, you need to split it up into chunks and then embed each of those. That way, you can supply just the relevant parts of the document to the model. This isn’t just about context size. You also need to provide info that makes sense for the question. There are a lot of strategies around how large the chunks should be, how much overlap there should be between them, what to summarize, and whether to also summarize topics that cross multiple sections along the way. But that’s a topic for another video.

So here I have some code that splits a text file into 500-word chunks. This is using Bun, which is an alternative JavaScript runtime. And then I am going to feed it the text of War and Peace, which is about 550 thousand words, or about 1100 of our 500-word chunks. Then I will loop through each one and embed them using nomic-embed-text. So let’s run it.
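A stripped-down sketch of that chunk-and-embed loop looks something like this. The file name and the simple word-based splitting are assumptions for illustration, not the exact code from the video:

```typescript
// Split a text file into ~500-word chunks and embed each one with Ollama.
// Assumes `bun add ollama` has been run and a local "warandpeace.txt" exists (hypothetical name).
import ollama from "ollama";

const text = await Bun.file("warandpeace.txt").text();
const words = text.split(/\s+/);

// Build 500-word chunks.
const chunks: string[] = [];
for (let i = 0; i < words.length; i += 500) {
  chunks.push(words.slice(i, i + 500).join(" "));
}

// Embed each chunk; in a real app you would store the vector
// alongside the source text in your vector database of choice.
for (const chunk of chunks) {
  const { embedding } = await ollama.embeddings({
    model: "nomic-embed-text",
    prompt: chunk,
  });
  console.log(embedding.length); // the same length for every chunk
}
```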

I skipped to the end, and the process took about 50 seconds to run, which is super impressive. Looking at the logs for Ollama, it looks like each 500-word chunk took about 40 milliseconds to process. Let’s compare that to what llama2 would take. I’ll just swap out the models and run it. Each chunk takes about 1.4 seconds. So in the time nomic-embed-text did ALL of War and Peace, llama2 did 35 chunks. All of War and Peace would take llama2 about 25 minutes. That’s a pretty huge difference.

Let’s take a look at the code in Python to do the same thing, in case you are interested. And running that takes, well, just about the same amount of time as the JS version…50 seconds.

Most of the code in both examples is reading in the file and chunking up the text. The actual embed is basically a single line of code that takes 40-ish milliseconds to run.

I am super excited about what we will be able to do with embedding now that it runs reliably well in Ollama. And not just reliably, but super fast.

The other features in 0.1.26 include support for Gemma from Google, which I’ll take a look at soon. It looks like the team is still working with Google to make the model more reliable in its responses. And there is some cleanup for Windows support, so that’s great. Regarding Windows, I saw a comment just now pointing to issues with the way I set up environment variables. The recommendation from the team is to use system variables, but it was suggested that that might not be the right solution. I’ll take a look at that and see if I need to make a correction. That video was made when Ollama on Windows had existed for about a day, so I wouldn’t be surprised if I got something off.

Anytime anyone finds a problem with one of my videos, I am super happy to replace it with the corrected content. It has to be credible and repeatable, though, so don’t just claim that bunnies can fly without some way for me to verify it. But if there really is a way they can fly, I would update the appropriate video right away. Though this may be the first time I mention bunnies in any video of mine. Maybe YOU know what I am actually talking about there.

Well, thanks so much for watching. I hope you are as excited about Ollama 0.1.26 as I am. Let me know what you think in the comments below. I love the comments, and they have been particularly active recently. As for cadence on videos, I am still figuring it out. I am going to experiment with the idea of making a video like this every Monday and Thursday and then lots of shorts based on the content every day. We’ll see how that goes. My wife and daughter are feeling the impact of 3 a week. Again, thanks so much for being here. Goodbye.


Shorts

1 - Announcing 0.1.26

There is a new version of Ollama out. It’s version 0.1.26. You can find out more at ollama.com. The new features include support for Gemma from Google, some updates to Windows, some memory fixes, and some better AMD logic. But I think by far the biggest new update is the one few are talking about: embedding. Ollama has supported embedding since the beginning, but it used the llama2 models and everything else on ollama.com. It did not support models built specifically for embedding. And now it does. This is HUGE. This means that a RAG solution can be done super fast and entirely with Ollama. Before, I recommended folks use Hugging Face, but I found that complicated to use. How fast is it? I tried embedding a document with 550 thousand words. I split it up into 1100 500-word chunks. When I did this using the llama2 model, it took about 25 minutes to embed. But with the new 0.1.26 and the nomic-embed-text model, it’s down to 50 seconds. I am super excited about what that is going to mean for solutions that need local embeddings to work with models, and it’s all because there is a new version of Ollama out…

2 - How to do embeddings with JavaScript and Ollama

Doing embeddings is really easy with JavaScript and TypeScript using Ollama. In this example I am using Bun, which is a super fast runtime and full ecosystem for JavaScript. So I’ll start with bun init to create a new project. Then run bun add ollama to install the Ollama JS library. Now in my code, add import ollama from 'ollama' to get a ready-to-use Ollama client. Now all I have to do is call ollama.embeddings and pass it an object with model set to one of the embed models on ollama.com. I will use nomic-embed-text. Then set prompt to the text you want to embed. I’ll go ahead and output the embedding so that we can see it. And here it is. It’s a simple JSON object with a single key-value pair of embedding set to a long array of floats. And that is the embed vector. If you wish I was using Python instead, watch out for another one of these where I cover that. Learn more about Ollama at ollama.com. So as I said, doing embeddings is really easy with JavaScript and TypeScript using Ollama.
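Put together, the whole thing is only a few lines. This is a minimal sketch; the prompt text is just a placeholder:

```typescript
// Assumes `bun add ollama` has already been run.
import ollama from "ollama";

const response = await ollama.embeddings({
  model: "nomic-embed-text",            // any embedding model from ollama.com
  prompt: "The text you want to embed", // placeholder text
});

// A long array of floats, e.g. { embedding: [0.123, -0.456, ...] }
console.log(response.embedding);
```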

3 - How to do embeddings with Python and Ollama

Doing embeddings is really easy with Python using Ollama. I’ll start with pip install ollama to install the Ollama Python library. Now in my code, add import ollama. Now all I have to do is call ollama.embeddings and pass two parameters, model and prompt. I’ll set model to one of the embed models on ollama.com. I will use nomic-embed-text. Then set prompt to the text you want to embed. I’ll go ahead and output the embedding so that we can see it. And here it is. It’s a simple JSON object with a single key-value pair of embedding set to a long array of floats. And that is the embed vector. If you wish I was using TypeScript or JavaScript instead, watch out for another one of these where I cover that. Learn more about Ollama at ollama.com. So as I said, doing embeddings is really easy with Python using Ollama.
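Put together, it’s about the same handful of lines as the JavaScript version. Again, a minimal sketch with a placeholder prompt:

```python
# Assumes `pip install ollama` has already been run.
import ollama

response = ollama.embeddings(
    model="nomic-embed-text",             # any embedding model from ollama.com
    prompt="The text you want to embed",  # placeholder text
)

# A long array of floats, e.g. {'embedding': [0.123, -0.456, ...]}
print(response["embedding"])
```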

4 - Why is RAG relevant with larger context models

RAG. It’s the usual solution for getting all of your company’s knowledge into a Large Language Model so that the model can answer your industry-specific questions. But now we are seeing more and more models with massive context sizes. To support that, they need massive amounts of RAM, and that doesn’t come cheap. Plus, it takes a while to populate that massive context with content, especially if it’s not already on the machine running the model. And you don’t have to be an honorary member of the tin foil hat crowd to think that giving a search engine company access to all your company’s secrets might not be a great idea. But even if you or your security team are OK with that, having the model sift through maybe 80% of material that is irrelevant to the question can have a detrimental effect on the speed of the answer. Plus, some of the newly announced models aren’t even available yet, and some companies have a history of overpromising and underdelivering, plus yanking products out of existence at random times. I think for all of those reasons, there will always be a place in your stack for RAG.

5 - What is RAG in AI?

RAG. Retrieval Augmented Generation. It’s been a key way to get all of the knowledge from your own documents into a model that you can ask questions of. But your knowledge isn’t really all going into the model. Instead, it goes into a database, usually a vector database. Vector databases take raw text, metadata about that text, and an embedding. Embeddings are a mathematical representation of the meaning of the text. When you look at them, they are a series, or array, of floating-point numbers. And all the embeddings in a database will have an identical length, regardless of the length of the source material. One of the benefits of these equal-length vector embeddings is that they become really easy to compare with each other, so you can find the closest matches and therefore the text that is most similar to your query. And one of the easiest ways to produce the embeddings is with Ollama, which you can find at ollama.com. Once you have a good set of matches, you can provide the source text to an LLM as context to help it answer your question. So now you know what your colleague is talking about when they bring up AI and RAG.