
In this video, I'll guide you through creating an AI-powered web scraping system using OpenAI's new structured outputs and Bright Data's Web Unlocker feature. By the end of this tutorial, you'll learn how to build an application that can answer queries with real-time web data while avoiding common scraping pitfalls. Follow along as I demonstrate how to set up Bright Data's proxy and scraping infrastructure, leverage Node.js libraries like Puppeteer for browser emulation, and integrate these components to scrape and process data effectively.

Links:
- https://platform.openai.com/api-keys
- https://brdta.com/developers_digest ($25 in free credits)
- Answer Engine repo: https://git.new/answr
- Bright Data repo: https://github.com/developersdigest/bright-data

Chapters:
- 00:00 Introduction to AI-Powered Web Scraping
- 01:58 Setting Up Bright Data Account
- 02:18 Configuring Web Unlocker Feature
- 02:58 Benefits of Using Bright Data
- 04:49 Leveraging Puppeteer for Advanced Scraping
- 05:55 Coding the Bright Data Server
- 09:20 Implementing the Web Unlocker
- 10:12 Handling Puppeteer Logic
- 13:17 Integrating with OpenAI
- 17:14 Final Thoughts and Conclusion
---
type: transcript
date: 2024-09-28
youtube_id: Q3juMnxK2rQ
---

# Transcript: Build an AI Web Scraping System Using OpenAI GPT-4o Structured Outputs

In this video I'm going to show you how to build an AI-powered web scraping system using advanced web proxying techniques along with OpenAI GPT-4o structured outputs. By the end of the video you'll understand how to create an application that can answer queries using real-time web data while avoiding some of the common scraping issues you can run into.

If I put in a query like "what are the top five stories on Hacker News right now," you'll notice I didn't put in a URL. We're going to use structured outputs, which just came out from OpenAI and are a really great way to get a reliable schema back from an LLM. Once we've extracted the link, we're going to use the Web Unlocker feature from Bright Data, which lets us proxy the request through their service with a single call. There are a ton of benefits to this, which I'll go through in the video. The interesting thing, and how this differs from something like Perplexity or the initial implementation of this project, is that it can target websites directly: you can aim precisely at the site you want to extract information from.

If I go over to Hacker News, we see the top five stories. Now let's ask, "what is the latest article from Ben Thompson about?" I only typed in "Ben Thompson," and it knows that the blog is Stratechery. This is really neat, because you can leverage the strengths of an LLM: the model makes the correlation between Ben Thompson and the Stratechery blog, and then by leveraging the Web Unlocker feature from Bright Data we fetch the live page. We get the best of both worlds: an LLM that's good at making those connections,
especially when you're using a good LLM, plus an up-to-date, accurate picture of what the latest article actually is.

So let's get into the technicals. First, if you don't have a Bright Data account already, head over to their platform and go through the onboarding steps. Once you're onboarded, you can create your proxy and scraping infrastructure; it's as simple as clicking "Add" and choosing from the different offerings. In this case I'm going to show you the Web Unlocker feature. You can name your zone whatever you'd like, so I'll call it "example for YouTube". From there you decide which types of domains you'd like to access: all domains, e-commerce, premium, or asynchronous requests. I'm going to leave it on all domains, click "Add", confirm with "Yes", and the zone spins up pretty quickly. Once it's up, we see the host, the username, and the password. I'll be using the request-promise library in Node.js, and that's all you need to proxy your requests through Bright Data.

First, though, I want to touch on some of the benefits of using a service like Bright Data. Behind the scenes they use something akin to Puppeteer or Playwright: tools that emulate an actual browser. Instead of just making a curl or fetch request to the page, the service actually visits the page, executes all the JavaScript, and renders it as if you were a real user. The other benefit is that you don't have to worry about scaling: whether it's a hobby project or you want to scale up to an industrial application, you'll be able to
leverage a managed service like this. But one of the big ones is the hassle-free, automated proxy management: they handle all of the IP rotation and retries. Say you're querying from an EC2 instance, or whatever service you're using, and it has a particular IP. If a vendor decides to block you for whatever reason, and it might even be a firewall that blocks you automatically — say you're a service like Perplexity and a company like Wired decides, "we don't want you scraping here anymore, we're blocking the IP you're scraping from" — all of a sudden you lose access to that data. With managed IP rotation, they handle all of that for you, so you don't have to worry about the IP of the server running your scraper being blocked. The other nice thing is that you only pay for successful requests: if a request doesn't return a meaningful payload, you don't pay for it.

Finally, I wanted to show one example of their web scraping capabilities, which is arguably one of the most powerful features. I'm going to say "I want a pair of Nike shoes, men's style." We still have the structured output from OpenAI, but here we're actually going through a multi-step process with Puppeteer. If you haven't used Puppeteer before, it lets you synthetically control a browser: instead of clicking around and typing with your mouse and keyboard, you do it all programmatically. This can be powerful for a ton of different workflows. It means you're not limited to that top-level first page; you can interact with the page, say to type in credentials and user information or fill out a form, and have that whole workflow built out if there are particular pages that
you do want to parse, where maybe there are multiple steps to ultimately get to the information you want. If you need to click things, go through a checkout flow, book a flight or a hotel, or get information that sits behind authentication, Puppeteer lets you write out that workflow and then run through the steps.

Now, in terms of the coding portion, the way this is set up is that we have a standalone server for our Bright Data interaction, and it's something you can deploy basically anywhere you can deploy Node.js. The reason we're using Node.js for this route is that there are some WebSocket protocols we'll use to interact with things like the Puppeteer browser. Then we'll implement all of the logic to communicate with it inside the answer engine project, which runs on the edge.

The first thing we do on our Bright Data server (I'll put links in the description of the video where you can access this code and set it up yourself) is import request-promise, Turndown, and Puppeteer. Turndown is how we convert the HTML to markdown. puppeteer-core is a lightweight version of Puppeteer that lets us write all the usual Puppeteer syntax while actually running the code over WebSockets on Bright Data's infrastructure. And request-promise is the recommended implementation for Node; it's how we use the Web Unlocker. First we do a simple configuration for our markdown conversion: we specify that we're removing script and style tags, as well as iframes and noscript, because those don't add any benefit to what we're trying to accomplish with the answers we're asking for from a web page.
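That markdown setup can be sketched roughly like this, assuming the `turndown` package (`npm i turndown`); the function name is mine:

```javascript
// Tags that add no value to the answers we extract from a page.
const STRIPPED_TAGS = ['script', 'style', 'iframe', 'noscript'];

function createMarkdownConverter() {
  // Lazy require so this sketch stays self-contained.
  const TurndownService = require('turndown');
  const turndown = new TurndownService();
  // Turndown's remove() accepts an array of tag names to drop entirely.
  turndown.remove(STRIPPED_TAGS);
  return turndown;
}

// Usage: createMarkdownConverter().turndown(html) returns markdown.
```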
Next we set up a simple limit of 500,000 characters. You can change this depending on the context window of the LLM you're using: if it has a smaller context, shrink it; if larger, increase it. You can play around with this number. The thing to note is that the application isn't going to fail here; it's just going to let you know when it's processing a particularly large payload, since that will potentially incur extra cost when you send it to the LLM. After that check, we convert the HTML to markdown.

Next, we get the WebSocket endpoint for our browser connection. To set this up on Bright Data, log in, go to your proxies and scraping infrastructure, and click "Scraping Browser". You can name it whatever you'd like and choose whether you'd like it to solve captchas, which is obviously really helpful if you include it. Once it's created, you can grab your credentials there, as an environment variable if you'd like.

In the server we have a simple POST handler, plus some helper functions — just some examples to show you how you can use Puppeteer. We destructure two things: the URL and the query. The URL is what we get from the structured output from OpenAI, which we'll cover in the application portion, and the query is whatever the user requested; if they say "I want to buy a pair of men's Nike sneakers on Amazon", that's the query. We console-log what the request is.
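The character limit described above is simple enough to sketch as a pure helper; the function name is mine:

```javascript
const MAX_CONTENT_LENGTH = 500_000; // tune to your LLM's context window

function limitContent(markdown, maxLength = MAX_CONTENT_LENGTH) {
  if (markdown.length <= maxLength) return markdown;
  // Don't fail; just warn, since oversized payloads mainly mean extra LLM cost.
  console.warn(`Large payload: ${markdown.length} chars, truncating to ${maxLength}.`);
  return markdown.slice(0, maxLength);
}
```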
Then, the first thing we do in this example — since I have Puppeteer set up with a workflow for Amazon — is check the URL that's been parsed from the structured outputs. If the URL is Amazon-specific, that triggers our Puppeteer workflow; this could just as well be a booking site, a travel site, or something proprietary. Otherwise, for a general request, we send it through the Web Unlocker.

To get set up with the Web Unlocker, click "Add", just like we did before; you have a few different options to choose from, and then you can similarly spin up the instance. The easiest way to get the string we'll be requesting through is the proxy URL on the zone page: grab that proxy URL, plug it in, and all of those requests will be sent through the Web Unlocker. For the request itself we use the request-promise library. Once that's set up, we make the request with our configuration. As soon as the promise resolves, we convert the HTML we get back to markdown, as discussed in the function we went through, and send the content back; if there's an error, we log it.

Now, the Puppeteer logic itself is going to be very specific to the particular page you're scraping, but there are some repeatable steps you can reuse across different web pages. Connecting to the instance and establishing the WebSocket connection is reusable from website to website. Opening a new page you can think of as opening a new Chrome browser on your computer, and it's very
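The Web Unlocker request described above boils down to an options object for request-promise; the function name is mine, and the proxy URL shown in the comment is a placeholder for the one on your zone's page:

```javascript
// Build the options object request-promise expects for a proxied request.
function buildUnlockerOptions(targetUrl, proxyUrl) {
  return {
    url: targetUrl,
    // e.g. 'http://brd-customer-<id>-zone-<zone>:<password>@brd.superproxy.io:22225'
    proxy: proxyUrl,
    rejectUnauthorized: false, // the unlocker re-signs TLS on the way through
  };
}

// Usage (requires the request-promise package):
// const html = await require('request-promise')(buildUnlockerOptions(url, process.env.BRD_PROXY_URL));
```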
intuitive to use. You can also leverage AI tools — whether it's ChatGPT, Claude, or Cursor — to help you generate some of these Puppeteer scripts. Puppeteer has been around for a long time and is very hardened in how you use it; in other words, it's not changing a whole lot, there aren't many breaking changes, and a lot of these LLMs understand quite well how to write this type of logic. That can certainly help you along, whatever you're trying to scrape or however you want to interact with the page.

If we go through this quickly: we open a new page, go to amazon.com, and wait for all the DOM content to load. Then we search for the product. The cool thing with Puppeteer is `waitForSelector`: if there are buttons that render at different times, you can specify which selectors or elements you want to be visible, or be ready to interact with, by ID or class. As soon as the search bar is detected, we type into it. That's another cool thing with Puppeteer: it has synthetic methods that emulate a human typing into the input form, which can help deter it being flagged as a bot, depending on what monitoring the site has set up. Then we submit via the search button on Amazon — that little magnifying glass — and wait for the navigation and for the whole page to load. Once we're on the results page, we return the fields we want in the response: here we specify the name, price, ratings, reviews, and link, and it's sent back as a JSON-stringified payload. Finally, we close out the browser.
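The steps above can be sketched like this, assuming `puppeteer-core` and a Bright Data scraping-browser WebSocket endpoint in a `BROWSER_WS` environment variable; the Amazon selectors are illustrative and may need updating to match the site's current markup:

```javascript
async function searchAmazon(query) {
  const puppeteer = require('puppeteer-core');
  const browser = await puppeteer.connect({
    browserWSEndpoint: process.env.BROWSER_WS, // wss://... from your zone credentials
  });
  try {
    const page = await browser.newPage();
    await page.goto('https://www.amazon.com', { waitUntil: 'domcontentloaded' });
    await page.waitForSelector('#twotabsearchtextbox'); // wait for the search bar to render
    await page.type('#twotabsearchtextbox', query);     // synthetic, human-like keystrokes
    await Promise.all([
      page.waitForNavigation({ waitUntil: 'domcontentloaded' }),
      page.click('#nav-search-submit-button'),          // the magnifying-glass button
    ]);
    // Pull the fields we want from each result card.
    const results = await page.evaluate(() =>
      Array.from(document.querySelectorAll('[data-component-type="s-search-result"]'))
        .slice(0, 5)
        .map((el) => ({
          name: el.querySelector('h2')?.textContent.trim(),
          price: el.querySelector('.a-price .a-offscreen')?.textContent,
          rating: el.querySelector('.a-icon-alt')?.textContent,
          link: el.querySelector('h2 a')?.href,
        })),
    );
    return JSON.stringify(results);
  } finally {
    await browser.close(); // always release the remote browser
  }
}
```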
Then we return that data, and that's the data that ultimately gets sent to our application. That's pretty much it for our Bright Data implementation.

To set this up within our application, you can pull down the answer engine repo; I'm going to publish this to the repo, so you'll be able to see it once the video goes live. To understand what's happening here: first we add a new @-mention. That's this line here; you can name it whatever you want — "web unlocker and Puppeteer" or anything else. It essentially detects the words you type in the input box, and that's how it determines which @-mention you want to use. Once it populates on screen, you can select Bright Data (we have the logo and everything), and that's how it binds to the function we've declared. Once we have the mention tool config and the mention tool added, we go into the structured unlock summarize file, which is where we have our Bright Data web scraper function.

The first thing we do is import our dependencies: OpenAI and Zod. Zod is how we declare the schema for the URL we want extracted, and what's really great with Zod is how easy it is to declare the schema you want returned. The thing with structured outputs, as I mentioned earlier, is that OpenAI guarantees the response will match this valid JSON schema: here, an object with the key `url` whose value is a string, consistent across the board every time. One thing to note: the URL itself can still be prone to hallucinations; that's something to be mindful of, given we're using an LLM. Next, we specify the request to OpenAI.
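A minimal sketch of that extraction call, assuming the `openai` and `zod` packages and the SDK's beta structured-outputs helper (`zodResponseFormat`); the function and schema names are mine:

```javascript
async function extractUrl(query) {
  const OpenAI = require('openai');
  const { z } = require('zod');
  const { zodResponseFormat } = require('openai/helpers/zod');

  // The schema we want back: an object with a single string key, `url`.
  const UrlSchema = z.object({ url: z.string() });

  const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
  const completion = await openai.beta.chat.completions.parse({
    model: 'gpt-4o-2024-08-06', // the exact model string required for structured outputs
    messages: [
      { role: 'system', content: 'Extract the most likely valid URL from a natural language query.' },
      { role: 'user', content: query },
    ],
    response_format: zodResponseFormat(UrlSchema, 'url_extraction'),
  });
  // `parsed` has already been validated against UrlSchema.
  return completion.choices[0].message.parsed.url;
}
```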
At the time of recording, the structured outputs feature is still in beta within their SDK, and this is how we send in the method to parse the payload on the way out. You do have to specify the particular model string, `gpt-4o-2024-08-06`; it won't work if you just pass in `gpt-4o` quite yet. We specify: "extract the most likely valid URL from a natural language query." The other reason Zod is great is that we now get type validation we can reference throughout the application: if we use that URL string in different places, we'll quickly see and resolve any type errors that crop up.

Next, we extract the valid URL from the query. If we don't get back a valid URL, we stream a response to the UI to let the user know that no valid URL was found. Then we have a loading state to let the user know something is happening: we say "extracting information" and send the URL back to the front end to show which URL is being parsed.

Then we make the API request to the server where we set up the Bright Data logic. I have it on a localhost endpoint (`/api/brightdata`); this will be different in your application, and in a production setting you can swap it out for whatever your endpoint is. Referencing the same logic from the previous file, we make a simple POST request, passing the payload of the URL we got from OpenAI along with the query. If there are any errors, we log them. Once we get the response back, we parse it, and with the result, this is where we send it to a model; at this point it can really be any model, it doesn't even need to be OpenAI.
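The POST request to the standalone Bright Data server described above might look like this; the endpoint path and port are placeholders for wherever you deployed the Node server:

```javascript
async function fetchScrapedContent(url, query, endpoint = 'http://localhost:3001/api/brightdata') {
  const res = await fetch(endpoint, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url, query }), // the same shape the server destructures
  });
  if (!res.ok) throw new Error(`Bright Data server responded with ${res.status}`);
  return res.json(); // markdown from the unlocker, or the Puppeteer results
}
```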
We specify what we'd like: in this case, since we're using a markdown renderer, we specify that we want valid markdown. We're simply saying "here is the context" — and again, the context is the markdown, or in this case the Puppeteer result from Amazon, but it could be anything you put within the XML tags for the context — and then we ask it to respond to the user's query and send the result to the front end. Since we have the streaming effect, we listen for the tokens coming back from the LLM, and as they come through we invoke the streamable update method from the Vercel AI SDK with the LLM response key; that's what gets concatenated to build the front end we see visually. Finally, with the AI SDK, you just have to make sure you invoke the `done` method on a streamable once you're finished. And if there are any errors, we do have some error handling with some nice messages and whatnot; you can play around with that if you'd like.

Otherwise, that's pretty much it. I want to thank Bright Data for collaborating on this video, for giving me the opportunity to write all of this out for you, and for allowing me to open-source this example. I encourage you to check out Bright Data and sign up for an account; you'll get some free credits to try all of this out. If you found this video useful, please like, comment, share, and subscribe. Until the next one!