
Links:
- https://unstructured.io/
- https://unstructured.io/product
- https://unstructured-io.github.io/unstructured/api.html
- https://github.com/Unstructured-IO/unstructured
- https://github.com/Unstructured-IO/unstructured-api-gui
- https://blog.langchain.dev/langchain-unstructured/
- https://aws.amazon.com/marketplace/pp/prodview-fuvslrofyuato

Connect and Support

I'm the developer behind Developers Digest. If you find my work helpful or enjoy what I do, consider supporting me. Here are a few ways you can do that:
- Patreon: Support me on Patreon at patreon.com/DevelopersDigest
- Buy Me A Coffee: You can buy me a coffee at buymeacoffee.com/developersdigest
- Website: Check out my website at developersdigest.tech
- Github: Follow me on GitHub at github.com/developersdigest
- Twitter: Follow me on Twitter at twitter.com/dev__digest

---
type: transcript
date: 2023-12-20
youtube_id: Ngv8WrKDIu0
---

# Transcript: Unstructured.IO: Get Your Data LLM-Ready

In this video I'm going to be showing you Unstructured.IO, which is both an open source project and a platform that provides tools for simplifying and streamlining the pre-processing of both structured and unstructured documents for your LLM applications.

What Unstructured.IO allows you to do is take in all these different file formats that you see on screen here and parse them into a JSON format. From that JSON you can send it directly to your LLM, depending on how your application is set up, or, what I think a lot of people will likely do, put it in a vector database. The nice thing is that the structure you get back from whatever you send in is considerably more structured and coherent than arbitrarily breaking documents every thousand characters or every couple hundred characters, like a lot of the approaches that are floating around. It will recognize things like tables: if it's HTML and there's a table there, it knows that it's a table; if there's a list, it knows that it's a list; if there are titles, it knows those too, and so on. What's nice is that it gives you a coherent, structured piece of data that you can chunk and put into your vector database, and you can also leverage all the metadata that is generated, which I'll touch on in just a moment.

In terms of the actual project itself, there is a GitHub repository which you can pull down, and there are a number of ways that you can run this locally on your machine. You can pull down the Docker container and get it set up that way, or you can run through the installation steps for the Python approach. You can also go ahead, and I would actually
encourage you to really check out this Unstructured API GUI, especially if you haven't used this tool before. I'm going to be running through an example with it at the end of the video.

The other thing to note is that if you are an AWS user, there is a template that lets you deploy this right onto an EC2 instance if you'd like. As someone who uses AWS, it's really nice to see that there is something like this that you can play around with off the shelf and have something that's potentially production ready. And, not to forget, there is also an implementation within LangChain: the UnstructuredFileLoader as well as the directory loader. I think there are likely other implementations as well by now, because this blog post is a little bit dated at this point, but needless to say there are a number of different approaches you can use within LangChain to leverage Unstructured under the hood.

To get into the example: with Unstructured you're able to use their free tier for prototyping, and this GUI is great for that. You can go to where it says Unstructured API key, like you see here, plug in your API key, and get off the ground running. I'll upload a relatively long document here. This is a regulatory filing from Apple, straight from their SEC filings, and it's a relatively big document. The thing that's nice is that if I were to parse a document like this myself, I would run into a ton of different things that I'd have to handle. Maybe on the first pass I would just break the text up into chunks of characters that I would put into my vector database. This is really just taking the document and giving it structure without my having to do much. All I did was put in that file, and if we look over here, it's doing a ton of really helpful
things. It's giving me metadata, it's recognizing the language of that particular chunk of text, and it's indicating the type of text that it is. If it's a title, it recognizes that, so this "Form 10-K/A" is a title. If there's a body of text, it says that it's narrative text; if there are lists, it specifies that, and so on. It's really nice because it gives you these nice little chunks. If you imagine you're trying to get the top results of a vector search, this is going to give you much more coherent search results than if you're just running through and breaking after every 200 or 500 characters, where a chunk might end right in the middle of a sentence that's relatively important. If an LLM is being fed this chunk of information, it's going to be arguably more useful than if the text just broke somewhere in the middle and that was the result you got.

I'm going to be diving into this a bit more in future videos and actually building it into an application. I'm thinking potentially a Next.js LangChain application, where I can show you how you can use this in something that's closer to a production use case. Overall, I just wanted to encourage you to check out this Unstructured.IO project, because what they're doing is incredibly powerful and I think it's a really important area within the LLM and AI space that they're looking to solve. That's it for this one. If you found this video useful, please like, comment, share, and subscribe. Otherwise, until the next one.
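
As a rough illustration of the chunking point made in the transcript, here is a minimal Python sketch. It is not code from the video and does not call the Unstructured library or API. It assumes each parsed element is a dict with `type`, `text`, and `metadata` keys, similar to the JSON shown in the GUI. The sample elements, the `languages` metadata key, and the helper names `fixed_size_chunks` and `chunk_by_title` are all hypothetical and exist only for this example.

```python
from typing import Dict, List

# Hypothetical sample shaped like the JSON elements described in the video.
# The field names and all values here are assumptions made for illustration.
SAMPLE_ELEMENTS: List[Dict] = [
    {"type": "Title", "text": "FORM 10-K/A", "metadata": {"languages": ["eng"]}},
    {"type": "NarrativeText", "text": "This amendment updates the annual report.",
     "metadata": {"languages": ["eng"]}},
    {"type": "Title", "text": "Example Section", "metadata": {"languages": ["eng"]}},
    {"type": "ListItem", "text": "First example item", "metadata": {"languages": ["eng"]}},
    {"type": "ListItem", "text": "Second example item", "metadata": {"languages": ["eng"]}},
]


def fixed_size_chunks(text: str, size: int) -> List[str]:
    """Naive baseline: cut every `size` characters, even mid-sentence."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def chunk_by_title(elements: List[Dict]) -> List[Dict]:
    """Start a new chunk at each Title element and keep the element types."""
    chunks: List[Dict] = []
    for el in elements:
        if el["type"] == "Title":
            chunks.append({"title": el["text"], "texts": [], "types": []})
            continue
        if not chunks:
            # Content that appears before any title goes into an untitled chunk.
            chunks.append({"title": "", "texts": [], "types": []})
        chunks[-1]["texts"].append(el["text"])
        chunks[-1]["types"].append(el["type"])
    return chunks


if __name__ == "__main__":
    flat_text = " ".join(el["text"] for el in SAMPLE_ELEMENTS)
    print("Fixed-size chunks:", fixed_size_chunks(flat_text, 40))
    for chunk in chunk_by_title(SAMPLE_ELEMENTS):
        print(chunk["title"], "->", chunk["texts"], chunk["types"])
```

The fixed-size baseline can cut in the middle of a sentence, while the title-based grouping keeps each section's narrative text and list items together, along with their types, before anything is embedded into a vector database.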