
Creating a Retrieval-Augmented Generation (RAG) Workflow with Upstage AI

- Upstage Console Link: https://go.upstage.ai/SOLAR_DEV_DIGEST
- Credit Redeem Coupon Code: SOLAR_DEV_DIGEST_2408
- Coupon registration opens: August 15, 2024, 12:00 AM (UTC)
- Coupon registration closes: September 15, 2024, 12:00 AM (UTC)
- Credit expiration date: November 1, 2024, 12:00 AM (UTC)
- Credit amount: $30
- Console: https://console.upstage.ai/
- Repo: https://git.new/answr

In this video, I showcase the various services offered by Upstage AI and demonstrate how to integrate their embeddings and Solar LLM models into your LLM applications from scratch. The video provides a detailed walkthrough, including vector storage setup, similarity search, chunking, and combining these elements for retrieval-augmented generation. I also share a brief overview of the Upstage console and its user-friendly features, such as document OCR and key information extraction. Follow along as I code a complete RAG workflow without using any frameworks, keeping every step easy to understand.

00:00 Introduction to Upstage AI Services
00:29 Setting Up from Scratch: No Frameworks Needed
01:01 Exploring the Upstage Console
01:58 Document AI Features and Benefits
03:05 Combining Chat and Embeddings for RAG Applications
04:46 Coding the RAG Application
08:05 Performing Similarity Search and Chat Completion
11:10 Conclusion and Final Thoughts
---
type: transcript
date: 2024-08-15
youtube_id: PfUHqDoL5mM
---

# Transcript: Get Started with Upstage AI's Solar LLM in 10 Minutes

In this video I'm going to be showing you Upstage AI and a number of different services that they offer that you can begin to incorporate into your LLM applications, and why some of them might be particularly useful to you. I'm also going to be showing you, completely from scratch, how we can leverage both their embeddings model and their Solar LLM model, and how you can combine them together to write out a RAG workflow. We're all familiar with ChatGPT or Anthropic's Claude, where you're able to upload documents and ask questions about them. I'm going to show how to do that completely from scratch; we're not going to be using any frameworks or anything like that. I'm going to show you how to do everything, from the vector storage to the similarity search, splitting and chunking, creating vectors of the different pieces of the document, and finally doing that retrieval-augmented generation and passing it to an inference provider for your chat completion. It might sound like a lot, but it's only about 150 lines of code that we're going to be running through to set all of this up, and I wanted to run through an example on my channel of how you don't necessarily need a framework to set something like this up.

Before I get into that, I want to go over Upstage. The Upstage console is user-friendly and it includes a bunch of example code, so it makes it easy to understand how to actually use it and implement it within your application. One thing that I really like about their console is how intuitive it is. You log in, you're greeted with the home screen, and you can go right over to API Keys, where there are API keys you can get within one click. There's also example code for all of the different services that they have. If you want to get started with their chat completion model, you can reach for the LlamaIndex implementation, the LangChain implementation, or even the JavaScript implementation, which is all a really nice consideration to include. Honestly, I wish more providers had something like this. The thing I like about this is you're really two clicks away from getting started with your application. So if you want to use function calling and you want to leverage it within JavaScript, you can click over here and you have a working example that you can just paste in, grab your key and put it in your .env file, and boom, you're off to the races with the example.

One thing I want to highlight in the video, which I'm not going to show you how to set up but want you to know is there, is the Document AI feature, which includes document OCR, key information extraction, and layout analysis. One of the things that stood out to me with key information extraction is its ability to extract tables and figures from any document with ease. This could be a picture of a document or a scanned document; they have a few examples here showing how it works, and it does well on a diverse set of documents when you query it and extract information from them. The benefit of having these offerings is that instead of passing these images into a model like GPT-4o or Claude, which could be pretty expensive if you're passing a number of images at scale, they have endpoints where you can pass in documents or photos, and they don't necessarily even need to be good photos. Like you see here, there's a business card lying crooked on a table, and it's able to extract that information for you, and it's even able to give you all of the bounding boxes and the coordinates for where it extracted the information. You can then leverage this data within your application or for other use cases. There's a ton in here; what I'm going to be focused on is both the chat
completions endpoint and the embeddings endpoint, and between chat and embeddings, those combined is how you can create a RAG application.

To move over to the coding portion: the way this is going to work within our application is we can plug in one of the Upstage Solar models, select that we're going to be using a particular document (in this case we'll just leverage a white paper that we have here), and ask a question. I can say "what is this document about?", and what it's going to do on the back end is loop through that document, break it up into chunks, and send those chunks to the embeddings endpoint. Once they're embedded, we're going to store them locally within memory, retrieve the top results from the vector storage that we create from scratch, and then generate a response, just like this. This is just to show you a basic example of how you can pass in a piece of text, break it up, and then leverage it within your chat completion. I've heard some really good feedback on the Upstage embeddings model; that's why I decided to use the embeddings with the Solar chat model to create and develop this RAG app within the LLM answer engine. One of the benefits of leveraging Upstage is that their service has both the embeddings and the LLM model built into their platform, and I found that when building out the application, integrating something like this into the LLM project was extremely simple. When I tested this across a number of different documents, it performed exceptionally well on the questions that I asked.

The first thing we're going to do is import OpenAI from the openai package. The nice thing with Upstage is that it allows us to use the OpenAI schema as well as the OpenAI SDK, so it conforms for the embeddings endpoint, their function calling capabilities, and the chat completion for their Solar models. The first class we're going to declare is InMemoryVectorStore, and it will have two methods: one to add documents, and one to actually perform the similarity search. I'm going to dive into this a little bit more later in the coding portion, but essentially, when we ask a query of particular documents, we're going to compare the relatedness between the query that we've put in and all of those chunks of text that we've broken out, which you'll see in just a moment. Next, this is what a cosine similarity function looks like: the algorithm that takes the query and the embeddings to see which one is the closest, in terms of the numerical representation of how close those pieces of text are to one another. Next we have a simple function for splitting up the text: once you pass in the text, it splits it up based on a thousand characters, and that's all this function is doing. Then, once we have that, we're going to initialize our vector store. There are built-in classes in frameworks like LangChain and LlamaIndex where you can use an in-memory vector store or something very similar, but this is just to give you an overall idea of how those work under the hood.

Next we have our main RAG function, and then our perform chat completion function. Within the RAG function we're going to be passing in a number of different things from the front end: the user's message, a couple of streamable methods (because we are going to be leveraging the Vercel AI SDK), and then, from the files themselves, the name, the type, and the content. The first thing we're going to do is check whether the API key is set; if it isn't, then there is an issue. Then we're going to initialize
the OpenAI client. This is where we can establish the base URL to point to Solar. The nice thing with the OpenAI SDK is you don't just have to rely on OpenAI; you can pass in any base URL here, and in this case it is the Solar endpoint, to interact with the models they have hosted there. Then we're going to pass in our API key, just like we declared above. Next, we're going to get all of the files from the front end, which we're sending across within this object here, where we have the name, the type, and the content, and we are sending them over as base64. Then we're going to do a little bit of processing for the files themselves: we're going to split up the text of all the files we're sending through here, looping through all of those different documents after they've been decoded. Once we've broken apart all of our documents into these little chunks, we're going to pass each of those chunks to the Solar embeddings model, one chunk at a time. You can imagine passing in 500 characters or a thousand characters; you can play around with this a little bit and see what works best for you. Then, as soon as they return, we're going to store them within our vector storage in memory. That's where we're adding the document here, with the chunk as well as the embedding, and this is how it retrieves the piece of text from the numerical representation of that piece of text.

Next, we're going to get the embedding of the user's query. This just takes the user's message (say you ask a particular question of the document; say it's a Harry Potter book and you ask "what did Gryffindor do on this day?" or whatever it might be) and passes that query in. The reason we have to do this is that with embeddings and similarity search you have to compare apples to apples, and in this case you're going to be comparing against the numerical representation of all of those different chunks that we have within our memory storage. Once we've embedded our user's query, we're finally ready to perform that similarity search. Again, we reach into our vector storage, perform the similarity search, pass in our user query, and specify that we want three results from our vector storage. From here we're going to join all of those together. You can change this number from 3 to 4 to 5, whatever it might be; essentially, what you have to keep in mind is to make sure you have enough room for the context that Solar allows, which is about 32,000 tokens of context. So you can play around with this number a little bit, and it doesn't necessarily need to be three, but essentially this means we're going to return the three top results from those vectors and then join them all together. The reason we have to join them is that once we pass it into the LLM, it's going to want it within a string, because it is all based on text. For instance, if you were to pass in a structured format like XML or JSON, you just have to make sure that it is within a string, because otherwise it's not going to be valid to pass to the LLM, so you have to perform that coercion to get there.

Then we're going to perform our chat completion, which is the last step, which I'm just going to show you here. If there aren't any files within our application, we just immediately perform that chat completion, because maybe we just want to interact with the Solar model. From here it's very straightforward, but this is the portion where we're streaming those responses back to our
Next.js application via the Vercel AI SDK. We're going to pass in the streamable method, and this is where we're actually leveraging it. What we're doing here is we have our system message, which says "you are a helpful assistant, always respond back in markdown and be concise," and we're passing in the context, which is all of the top results from our similarity search. If you're not passing in any files, we have this context set up with a ternary, so that if you're not using it for a RAG implementation and you just want to interact with the Solar model, it will detect that the context isn't there and just leverage the user message for the chat completion. Then we set streaming to true, and finally, as those responses come back (they come back in chunks), we loop through those chunks and stream them to the front end, which is how the answer is sent within the answer engine application. Once it's done, we pass back a signal that the LLM response is complete, and that's how it moves the icon from the input bar to the content pane where the question has been answered.

That's it for this video. Thank you Upstage for sponsoring this video and allowing me to build out this RAG implementation within the answer engine. I really encourage everyone to check them out; the pricing is reasonable, as you can see, and they'll also provide free credits so you can test drive the Upstage services and experience the value they have to offer. If you found this video useful, please comment, share, and subscribe. Otherwise, until the next one!
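The in-memory vector store, cosine similarity function, and thousand-character chunker described in the walkthrough can be sketched roughly like this. This is a minimal illustration in TypeScript, not the exact code from the repo; the names `InMemoryVectorStore`, `cosineSimilarity`, and `splitText` are my own.

```typescript
// A stored chunk pairs the original text with its embedding vector.
type StoredDoc = { text: string; embedding: number[] };

// Cosine similarity: dot product of the two vectors divided by the
// product of their magnitudes. Values near 1 mean "very related".
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Naive fixed-size chunking: split the text every 1000 characters,
// as described in the video. The size is tunable.
function splitText(text: string, chunkSize = 1000): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += chunkSize) {
    chunks.push(text.slice(i, i + chunkSize));
  }
  return chunks;
}

// Minimal in-memory vector store with the two methods from the video:
// one to add documents, one to perform the similarity search.
class InMemoryVectorStore {
  private docs: StoredDoc[] = [];

  addDocument(text: string, embedding: number[]): void {
    this.docs.push({ text, embedding });
  }

  // Score every stored chunk against the query embedding and
  // return the text of the topK closest chunks.
  similaritySearch(queryEmbedding: number[], topK = 3): string[] {
    return this.docs
      .map((d) => ({
        text: d.text,
        score: cosineSimilarity(queryEmbedding, d.embedding),
      }))
      .sort((x, y) => y.score - x.score)
      .slice(0, topK)
      .map((d) => d.text);
  }
}
```

Frameworks like LangChain and LlamaIndex ship equivalents of all three pieces; the point of writing them out is to see that a vector store is just an array of (text, vector) pairs plus a sort by similarity.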
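The file-handling step (the front end sends each file's name, type, and base64-encoded content, which the server decodes before chunking) and the joining of the top retrieved chunks into a single context string could look like the following. The field names and helper names here are illustrative assumptions, not the repo's exact shape.

```typescript
// Shape of an uploaded file as sent from the front end;
// `content` holds the file body encoded as base64.
type UploadedFile = { name: string; type: string; content: string };

// Decode the base64 payload back into plain text before chunking.
function decodeFile(file: UploadedFile): string {
  return Buffer.from(file.content, "base64").toString("utf-8");
}

// Join the top-k retrieved chunks into one context string, because
// the LLM ultimately wants a single string of text in the prompt.
function joinContext(chunks: string[]): string {
  return chunks.join("\n\n");
}
```

Even if the retrieved chunks came from structured data like XML or JSON, they would still be coerced into one string here before being placed in the prompt.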
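The Upstage calls themselves, since they follow the OpenAI schema, can be sketched with plain `fetch` against the Solar base URL. The base URL and model names below follow Upstage's documentation around the time of the video (`solar-embedding-1-large-passage` for document chunks, `solar-1-mini-chat` for chat); verify current names in the console before relying on them. The ternary in `buildMessages` mirrors the no-files path described in the video.

```typescript
// Upstage's OpenAI-compatible base URL (verify in the console).
const UPSTAGE_BASE_URL = "https://api.upstage.ai/v1/solar";

// Embed one chunk of text with the Solar embeddings model.
async function embed(text: string, apiKey: string): Promise<number[]> {
  const res = await fetch(`${UPSTAGE_BASE_URL}/embeddings`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      model: "solar-embedding-1-large-passage",
      input: text,
    }),
  });
  const json = await res.json();
  return json.data[0].embedding;
}

// Build the chat messages. The context is only included when a
// document was uploaded; otherwise it falls through to a plain chat.
function buildMessages(userMessage: string, context: string | null) {
  return [
    {
      role: "system",
      content:
        "You are a helpful assistant. Always respond back in markdown and be concise." +
        (context ? `\n\nContext:\n${context}` : ""),
    },
    { role: "user", content: userMessage },
  ];
}

// Non-streaming chat completion against the Solar chat model.
// (The video streams chunks back via the Vercel AI SDK instead.)
async function chatCompletion(
  userMessage: string,
  context: string | null,
  apiKey: string
): Promise<string> {
  const res = await fetch(`${UPSTAGE_BASE_URL}/chat/completions`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      model: "solar-1-mini-chat",
      messages: buildMessages(userMessage, context),
    }),
  });
  const json = await res.json();
  return json.choices[0].message.content as string;
}
```

Because the endpoints speak the OpenAI schema, the official `openai` SDK works here too: construct the client with `baseURL` pointed at the Solar endpoint and your Upstage API key, exactly as the video does.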