
In this video, I introduce Unstract, an AI-powered no-code platform for automating the processing of large unstructured documents like PDFs, images, and scanned files. I discuss the challenges of dealing with unstructured data and how traditional data processing methods are often time-consuming and error-prone. Unstract offers a solution by allowing users to automate tasks such as document classification, data extraction, and validation. I go through how to create an account, set up document parsing keys, and run workflows. I also explain the flexibility of Unstract in terms of integrating with different LLMs and vector databases. Finally, I highlight useful features like LLM Whisperer for text extraction and the ability to deploy workflows to an API. Overall, Unstract is a valuable tool for organizations aiming to efficiently manage and process large volumes of unstructured data.

Links:
Unstract.com
https://docs.unstract.com/
https://github.com/Zipstack/unstract

Try LLMWhisperer for FREE: https://pg.llmwhisperer.unstract.com/

Timestamps:
00:00 Introduction to Unstract: AI-Powered No-Code Platform
00:21 Challenges of Unstructured Data
01:10 Unstract's Solution for Document Processing
02:17 Getting Started with Unstract
02:30 Defining and Extracting Data from Documents
03:56 API Integration and Workflow Creation
06:40 Advanced Features: ETL Pipelines and Vector Databases
09:06 LLM Whisperer and Prompt Studio
11:12 Comprehensive Documentation and Setup
12:10 Conclusion and Final Thoughts
---
type: transcript
date: 2025-02-12
youtube_id: Ymq8o7FSoVc
---

# Transcript: Unstract: AI Document Parser: Extract Data from Complex PDFs at Scale! (Open Source)

In this video I'm going to be showing you Unstract, which is an AI-powered, no-code platform designed to help organizations automate and streamline the processing of large unstructured documents such as PDFs, images, and scanned files. If you deal with a large volume of documents in your industry, you will likely find Unstract's approach to data extraction and integration particularly useful.

First, I want to touch on the challenge of unstructured data. Many organizations are used to receiving or storing information in unstructured formats: various types of forms, invoices, contracts, and sometimes even handwritten notes. Traditional data processing methods often require manual intervention or complicated rule sets. In one of my past jobs, this used to be a whole segment of the organization: we'd have data entry specialists whose whole job was going through different documents, effectively taking documents and data from one place and entering them in another. It almost goes without saying that this type of approach is time-consuming, expensive, and often prone to error.

Let's enter the solution, and more or less the modern era. What Unstract allows you to do is take all of these different document types, parse them, and get extracted data. For instance, say you have an invoice: you could send in a request with the path to that file, and ultimately get nice, clean, structured data that you can store in your database or use in whatever way you see fit.

The first thing I want to point out is that Unstract is an open-source repository, which allows you to run it on your own infrastructure. They also have hosted solutions if you're looking for an easy, turnkey option. Within the platform, you'll be able to automate tasks like document classification, data extraction, and data validation, and you can even integrate it with other business systems. The nice thing is that it's a no-code platform, so it's accessible to users who may not have an extensive technical background. That broader accessibility can make document processing automation easier to adopt across different types of organizations.

The first thing I'm going to show you is that you can make an account for free on Unstract, and once you're in the platform, this is what it looks like. There are a number of different examples you can look at; here we see a credit card example. The way this works is that you define the different keys for the items you want to extract from the document. In this example we have issuer name, customer name, customer address, and payment info, as well as the spend line items. For each of these items, there's a description of what it's for. For the key customer name, the description of what we're looking to extract is the customer to whom this credit card statement belongs. Customer address is pretty self-explanatory: the full address, city, and zip code. And issuer name is the bank that issued this credit card.

What you can do here is run the LLM on this particular document, but you can also have a series of documents. If you want to add a new value, I could add something like minimum payment, with a description such as "the minimum payment the customer owes for this particular cycle." Once we've defined that, we can specify whether it's text or a number; in this case, I'm going to specify that it is a number.
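The key definitions from this example can be sketched as a simple schema. Note that this is only an illustration: the field names, the validation helper, and the sample values below are hypothetical, not Unstract's actual API. The real platform manages these definitions for you inside Prompt Studio.

```python
# Hypothetical sketch of the extraction schema from the demo. Field names,
# descriptions, and types mirror the Prompt Studio example; none of this is
# Unstract's real API -- it only illustrates the idea of pairing each key
# with a natural-language description and an expected type.

SCHEMA = {
    "issuer_name": {
        "description": "The bank that issued this credit card.",
        "type": str,
    },
    "customer_name": {
        "description": "The customer to whom this credit card statement belongs.",
        "type": str,
    },
    "customer_address": {
        "description": "The full address, city, and zip code of the customer.",
        "type": str,
    },
    "minimum_payment": {
        "description": "The minimum payment the customer owes for this cycle.",
        "type": float,
    },
}

def validate_payload(payload: dict) -> bool:
    """Check that an extracted payload has every schema key with the right type."""
    return all(
        key in payload and isinstance(payload[key], spec["type"])
        for key, spec in SCHEMA.items()
    )

# A payload shaped like what the platform returned in the demo (values invented).
example_payload = {
    "issuer_name": "Example Bank",
    "customer_name": "Jane Doe",
    "customer_address": "123 Main St, Springfield, 12345",
    "minimum_payment": 205.39,
}

print(validate_payload(example_payload))  # True
```

Pairing each key with a natural-language description is what tells the LLM what to pull out of the document, and the type tag mirrors the text-or-number choice in the demo.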
And what I can do here is run this again for the particular document, and now we get all of those values, including the minimum payment we just specified. Here we see the value is 205.39, and we can see on the credit card balance that it is 205.39. Now, if you're going to be leveraging the API, you get this nice data format: all of the keys that we defined in our document parser are what come back as the payload.

The other thing I want to point out in the platform is that you can look at the document in the PDF viewer, but you can also see the raw format, which is the text representation of the PDF. This example has just one document, but say you have a batch of different documents you want to extract data from: you can upload as many documents as you'd like and add them all in. The nice thing about how Prompt Studio is set up is that you can have different projects. You could have a project for invoices that goes through a particular flow of all the values you care about in an invoice, whereas a resume project could parse the different things you're interested in within a resume. You can start to see the different use cases where this can be helpful. You have this visual interface, but you can also programmatically make requests and get the same output.

The other option you have is creating workflows, so I'll demonstrate that here. I'll create an example workflow, and within it you can select different tools: we have the file classifier, which classifies a file into a bin based on its contents, or we have the text extractor, which is a tool designed to convert a document into its text representation. Let's say we use the text extractor. We can determine whether it's going to be based on the file system or the API, and we can also select the connector. Say I have an API for both of these; I can go ahead and run this workflow. To test it, I'll upload an image of a bank statement. We can see that it's working: it runs through the process of processing the file, and once the extraction is done, we see all of the text.

Once you've set up a workflow, you can easily deploy it to an API. We just define a name for our API, say "example API," and now we can see how to make a request to that endpoint. Here we see the JavaScript version, but we also have the Python requests version, or we can look at the curl request. It makes a request to the Unstract domain, and within it we have the route for our API: the API path, our organization, and then "example API." If we take a look, we see the header and the form, and that's as easy as it is. That's how we can connect a workflow to our API deployments.

Another option you have is setting up ETL pipelines. This lets you transform your unstructured data and pipe it into your database or other systems you have. It makes it really easy if you just want to streamline the process of taking documents into a system and ultimately getting them entered into a database or a solution you're already using.

The other thing to note is that you can choose from a variety of different LLMs: you can select Ollama if you want, or you can use Anthropic, some of the Google models, AWS, Anyscale, OpenAI, Vertex AI, Mistral, or Azure OpenAI, and it also shows that they're going to be supporting Replicate soon. If you want to use this locally, Ollama could be a really good option, or if you want a hosted solution, there are increasingly a number of great options out there for data extraction.

Another great thing about the platform is that you can set it up natively to work with a number of different vector databases: Postgres, Pinecone, Weaviate, Milvus, and more. Similar to the LLMs, it's very flexible in terms of the vector database you want to use: choose the vector database, plug in your API key, and you can have it in your implementation for storing and ultimately retrieving information from your documents.

Just to touch on vectors and embeddings for a moment: the way they work is that you send in a document, which could be an invoice, for example, and what's returned is a numerical representation of that text. Once we have that numerical representation, we can store it in our vector database. Where it's helpful is when a user puts in a query, or when we're searching for something: we embed that piece of text too, getting another numeric representation, and the key piece is that when the search runs, it takes what we're looking for along with the numbers that already exist, finds the closest or most similar numbers in the set, and retrieves those results. That gives us relevance on the different queries. Where this can be helpful is when you're dealing with a huge volume of documents: this is a very quick and performant way to scan tens of thousands, potentially hundreds of thousands, of documents by leveraging vector databases as well as embeddings.
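The retrieval idea described above can be sketched in a few lines of Python. The vectors here are invented by hand purely for illustration; a real setup would get them from an embedding model and store them in one of the vector databases mentioned above.

```python
import math

# Toy illustration of vector search: each document maps to a numeric vector
# (invented by hand here; a real system would use an embedding model), and a
# query is answered by finding the stored vector closest to the query vector.

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Pretend embeddings for three stored documents.
store = {
    "invoice_march.pdf": [0.9, 0.1, 0.0],
    "resume_jane.pdf":   [0.1, 0.8, 0.2],
    "contract_acme.pdf": [0.0, 0.2, 0.9],
}

def search(query_vector, store):
    """Return the document whose embedding is most similar to the query."""
    return max(store, key=lambda doc: cosine_similarity(query_vector, store[doc]))

# A query vector that points in an "invoice-like" direction.
print(search([0.8, 0.2, 0.1], store))  # invoice_march.pdf
```

A real vector database does the same comparison at scale, using indexes that avoid comparing the query against every stored vector.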
You also have the option for the text extractor. Within here you can select from a number of different extractor options: you can use Unstructured.io, LlamaParse, or LLM Whisperer. Just to show you what LLM Whisperer looks like, here's an example of a form that an organization might have to deal with. We have all of this text, we see that it's scanned, we see that it's crooked, and we even have handwritten text in here. What LLM Whisperer allows us to do is take this document and convert it into a text version. Where this is helpful is that the text version can be stored in a database, and you can also use it with large language models. As we can see here, the performance is spot-on: if we look at the name, we see "Ima Cardholder"; we see the social security number here, and within the extraction it exists as well. We can even see the different fields: if there are checkboxes, those are represented here, and basically all of the different values throughout match one for one between the documents. The other key piece is that it also preserves the layout. It doesn't just extract the text as one long string; it keeps it in a format that's uniform with the document that was passed in.

Now I just want to touch on Prompt Studio again. One of its key features is something known as LLMChallenge. Here we see this receipt, and we can see all the different keys we want to extract, but another really great feature is that if we go over to the settings and open the LLMChallenge setting, we can enable LLMChallenge and save. What it will do is use two separate LLMs to extract and challenge the information, giving you a dual-catch mechanism that catches and discards hallucinations early in the process, which makes this highly reliable.

Last, just a couple of other pieces I want to highlight. They have really great, comprehensive documentation, whether you're using LLM Whisperer or Unstract itself: you can check out all of the steps for whatever process you might be interested in setting up, and everything is very well documented. Then, if you're interested in setting this up locally or on your own infrastructure, you can see all of the requirements in the repo, and setup is super simple: effectively, you pull down the repo, run the command to start it, and then you'll see it on the particular port for Unstract. The hosted version comes with a 14-day trial if you're interested in trying it out. In terms of providers, as I showed in the video, there are a ton of LLMs you can use this with, plus vector databases, embeddings, and text extractors. For ETL destinations, at the time of recording they have support for Snowflake, Redshift, BigQuery, Postgres, MySQL, and a few others.

Overall, that's pretty much it for this video highlighting Unstract. For those who manage high volumes of data and need more reliable document parsing, Unstract will very likely be a useful option to explore. Hopefully you found this useful and can see how you might leverage the platform and how it could fit into your particular workflows for taking unstructured data, making it structured, and having pipelines that work at scale. If you found this video useful, please like, comment, share, and subscribe. Otherwise, until the next one!