
Firecrawl is a tool that converts website URLs into Markdown. This is useful for Retrieval-Augmented Generation (RAG) pipelines and Large Language Model (LLM) inference. I demonstrate how Firecrawl can crawl a URL, like the LangChain website, and convert the content into organized Markdown. Additional features include scraping single URLs and the LLM extract feature, which pulls specific information based on user-defined schemas. I encourage viewers to explore Firecrawl on GitHub and try its playground. An open-source version and SDKs are also available for developers.

Site: https://www.firecrawl.dev/
Repo: https://github.com/mendableai/firecrawl

00:00 Introduction to Firecrawl: Transforming URLs into Markdown
00:43 Why Markdown Matters for LLM Applications
02:00 Exploring Firecrawl's Features and Use Cases
02:25 LLM Extract: A New Feature in Action
02:59 Pricing, Open Source Version, and Developer Support
03:50 Conclusion and Encouragement to Explore Firecrawl
---
type: transcript
date: 2024-05-16
youtube_id: fDSM7chMo5E
---

# Transcript: Firecrawl: Convert Websites into LLM-Ready Data

In this video I'm going to be showing you Firecrawl, which is a really great way to take in URLs from websites and convert them into Markdown that you can use, whether it's within RAG pipelines or for LLM inference. The great thing with this tool is you can paste in a URL and it's going to go and recursively crawl that URL. If I try it on the LangChain website here, what this is doing behind the scenes is first it's going to hit that initial link, and then from there it's going to find all of the different links within the page, and then it's going to go ahead and crawl those and convert those pages into Markdown. You can see what the output from these scraped web pages looks like here. It's really nice, succinct, organized Markdown.

You might be wondering, why do I need Markdown, and why is it useful within my LLM application? You don't need Markdown to pass into your LLM application. As you likely know, you can really put in anything within these chat interfaces or within these APIs. You can pass in code, you can pass in text, you can pass in all sorts of things. But the benefit of passing in Markdown is, you see how clean this is. If you tried to pass a raw website into an LLM, you're going to be passing in a ton of different tokens that just don't apply. If you think about an HTML document, there's going to be all the div tags, all the heading tags, all the classes, data attributes, IDs, you name it. There's going to be a ton of stuff within that HTML that is going to be a lot of bloat and, frankly, wasted tokens if you're trying to pass in raw HTML. Now, as an alternative, if you try and pass in just the text content of everything, you're going to lose out on the different links within the page, and it's not going to be organized in a way where the model understands the different headings on the page and all of that. This is a really nice way where it gives you that hierarchy. If the website is set up with semantic HTML, it should hopefully give you a nice representation of that HTML page within Markdown.

Just a couple of other features within Firecrawl. What I love about this project is they're just publicly building this out, which is really great to see. I love when people just build useful things that we can learn from and use, and I think this is a really great implementation here. So you can crawl a URL and do that recursive crawl like you just saw. You can scrape a single URL. You can imagine different use cases for each of these, right? You can search. And there's a new feature called LLM extract, which is a really neat idea. Essentially, you pass in a URL and you get the responses for the schema that you're looking for. In this case it says company mission, supports SSO, is open source, et cetera. Let's just go ahead and try and run this on the LangChain website and see what we get. We passed in the LangChain website, and in the response back we have the company's mission here, we have supports SSO: false, is open source: true. I really love this as an add-on to the other features here.

Check out the playground. You can make an account as well. There's also pricing here, so it's a credit-based system if you want to go ahead and use their API. Alternatively, there is an open-source version of this. If you like getting your hands dirty and trying to set this up yourself, you can definitely go ahead and try this as well. There are a number of different ways that you can use it. You can use it from the Python and Node SDKs, LangChain, LlamaIndex, and even the LangChain JS integration. It's really great to see there's a robust consideration for developers. I always love to see when there are a number of different SDKs available, and seeing projects that aren't just necessarily using Python or what have you. It's really great to see a ton of different implementations here. There's some good documentation if you want to try and run this locally. You can just go ahead, run through these instructions here, and then get it all set up.

Just a quick one today. I just wanted to show you Firecrawl because it's a really cool project. I encourage you to check it out. Kudos to the team at Mendable, keep doing what you're doing. I'm eager to see how this grows over time. If you found this video useful, please like, comment, share and subscribe. Otherwise, until the next one.
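
The recursive crawl described in the transcript (hit the initial link, find the links on that page, then crawl those in turn) can be sketched in a few lines of Python. This is only an illustration of the general idea, not Firecrawl's actual implementation: the `crawl` function, the `LinkParser` class, the `max_pages` limit, and the caller-supplied `fetch` callable are all hypothetical names, and the HTML-to-Markdown conversion step is left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl of pages on the same host; returns {url: html}.

    `fetch` is a hypothetical callable (url -> HTML string) supplied by
    the caller, so the sketch can run without network access.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Usage with an in-memory "site" standing in for real HTTP requests.
site = {
    "https://example.com/": '<a href="/docs">Docs</a> <a href="https://other.com/">Ext</a>',
    "https://example.com/docs": '<a href="/">Home</a> <a href="/docs#intro">Intro</a>',
}
pages = crawl("https://example.com/", site.__getitem__)
print(sorted(pages))
```

Passing `fetch` in as a parameter keeps the traversal logic separate from how pages are downloaded; a real crawler would also need to handle request errors, rate limits, and robots.txt, none of which this sketch attempts.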