Hello everyone. In a recent poll, 47% of you voted for a web scraping and CAPTCHA solving tutorial using Playwright, and since your wish is my command, that's exactly what we'll do today. In addition to learning the basics of Playwright, we will also create a mini project: downloading a whole bunch of research papers using code only, no hands. The cherry on top is a state-of-the-art CAPTCHA solving system. So if you're ready, let's roll.

Let's begin with the basic syntax. We import sync_playwright from playwright.sync_api, then we initialize it with sync_playwright followed by an empty set of round brackets, and we call the start method on it. We assign this expression to pw, as in Playwright.

If we'd like to access a website through our code, we first need a browser object, in my case pw.firefox, on which we call the launch method, and we assign the result to browser. Since we are initializing a browser object, we also need to close it as soon as we are done using it, so at the very bottom of our code we call browser.close().

OK, but what should we do in between? How about we start with a new tab, which we can do with browser.new_page(), and which we assign to page. We can then use this tab to navigate to any kind of website, in my case with page.goto, to which we pass the string http://google.com. To make sure everything worked, we print the content of the page with print(page.content()), and this prints the entire source code. Additionally, we can print page.title(), and we can even take a nice screenshot of the page with page.screenshot, giving it the path example.png. You can choose any name you like.

Let's save this file and run it in our terminal. In my case I have a WSL terminal embedded into my IDE, and because my head is going to block most of it, let me enlarge it like so. The first thing we do here is create a nice working environment, in my case with conda create -n. I will call this environment scraper, and I will install Python 3.11 in it. Then we activate our new environment with conda activate scraper.

Once we are inside our brand new environment, we install Playwright with pip install pytest-playwright. But that's not all: we also need to install Playwright's browsers with playwright install. Even though we can select a specific browser, in my case Firefox, I'll just install all of them. Why not? Once the browsers are installed, we also need to install some dependencies. Let's pull up our recent command, and after playwright install we just add a dash and then deps, as in dependencies: playwright install-deps. Beautiful. Now Playwright is officially installed with all these nice little commands, and we can run our code, in my case with python3 quick_start.py.

Here's our source code (you can verify it later, it's very long), and here's our page title, which is Google. Yay! Now what happened to our screenshot? Let's open the containing folder and refresh it, and here's our example.png. Perfect, everything worked like a charm.

The only problem is that we didn't actually see a Firefox browser popping up and navigating to Google. In technical terms, we've done something called headless browsing, where we fetch the results without observing the actual process. If you'd like to physically see how your browser is being automated, you will need something called a web driver. In the case of Firefox it is called geckodriver, and you can find it on GitHub: just navigate to the releases section and choose the release that best suits your operating system. In the case of WSL that would be linux64. Let's click it, then extract it into the same folder as our quick start; let's just drag and drop it. Once we have a web driver, the only change in code is adding a headless property inside our Firefox launch method. Let's add headless and set it to False.
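The quick start described so far can be sketched as one small function. This is a minimal sketch under a few assumptions, not necessarily the exact script from the video: the function name quick_start, the nested try/finally cleanup, and the extra pw.stop() call are my additions, and the import is placed inside the function only so the file loads even where Playwright is not installed yet. As far as I know, Playwright drives the browser builds downloaded by playwright install, so the sketch only toggles headless and does not reference a separate driver binary.

```python
# Setup commands from the walkthrough (run in a terminal, not in Python):
#   conda create -n scraper python=3.11
#   conda activate scraper
#   pip install pytest-playwright
#   playwright install
#   playwright install-deps


def quick_start(url="http://google.com", screenshot_path="example.png", headless=True):
    """Open `url` in Firefox, save a screenshot, and return (title, html)."""
    # Imported lazily so this file can be loaded without Playwright installed.
    from playwright.sync_api import sync_playwright

    pw = sync_playwright().start()
    try:
        browser = pw.firefox.launch(headless=headless)
        try:
            page = browser.new_page()              # a new tab
            page.goto(url)
            title = page.title()
            html = page.content()                  # full page source
            page.screenshot(path=screenshot_path)
            return title, html
        finally:
            browser.close()                        # always close the browser
    finally:
        pw.stop()                                  # assumption: also stop what start() began


# Example call (needs network access and installed browsers):
# title, html = quick_start(headless=False)
# print(title)
```

Passing headless=False should make the Firefox window visible while the same steps run.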
Since at this point our code doesn't really do much, we will add another property called slow_mo and set it to 2,000 milliseconds. That way our browser window will pop up and wait at least 2 seconds before it closes; if we don't add this slow motion, we may not even see it. Let's save it and rerun our code (let's just fetch it from our previous commands), and beautiful: here's our Firefox window. It pops up, and it is gone after 2 seconds. As I mentioned earlier, we have the same results as before, just much more visual. Great, we are done with the quick start, and we can officially move on to selecting elements.

Let's say we'd like to scrape some research papers from a site named arXiv. For this we navigate to arxiv.org/search and have a look at the source code. With a right click we select Inspect, then we click on this arrow button and highlight the input field, which in our case is an input element with the placeholder text "Search term". Let's quickly copy this placeholder text. Back in our code, we first update the URL from google.com to arxiv (that's how they spell it) .org/search, and then we can select our input. To do this we type page.get_by_placeholder, to which we pass the text we just copied. While we're here, we might as well fill it with some text: I call the fill method and pass it the text "neural network".

Once we have filled our input with some text, we need to submit it, of course. Back in our browser, let's have a look at the search element: we are dealing with a button that has the inner text Search. Easy peasy. Back in our code, right after we select by placeholder, we select another element, with page.get_by_role this time, and we pass it the role of button. But we're not looking for just any button, we are looking for a very special one, so we chain it with another selector, this time get_by_text, and we pass it the text Search, with a capital S I believe. And yes, it is a capital S. Once we select our button, we might as well click it, so let's call the click method on it. To keep our terminal nice and neat, we get rid of the page content print statement.

Let's save everything and give it a nice little run. Here's our arXiv page, here's our neural network search term, and oh no, something is wrong with our button. Let's see exactly what's going on here. It looks like instead of selecting a single button we selected two of them: we have two buttons with the text Search, a red one at the very top and a blue one right below it. So how exactly do we click the second button and ignore the first one? Right before we call the click method, let's add another method called nth. This nth method receives an index, in our case the index of the second button, which would be 1, because we start counting from
zero, of course. Let's save it, and this time it should work. Let's give it a run. No hands... OK, it worked! We are successfully searching and navigating to a new page.

Now let's say we want to download all the PDF documents that we got from the search results. Looking at their source code (let's click on this pdf link), we can see that we're dealing with a bunch of anchor elements. If we click on another one, we can also see that they have very similar URLs: they all start with arxiv.org/pdf. Great, so let's select all of them.

To select them we use something called an XPath, which is probably the most accurate way of describing elements. It can target elements that have specific property values, for example all the paragraphs that have the class of authors. Another benefit of XPath over other selectors is that it accepts approximate values, so we can use it to target elements whose values start with, end with, or simply contain a specific set of characters, which is exactly what we're looking for in our case.

Back in our code, let's use an XPath to select all the PDF anchors on the page. For this we type page.locator, to which we pass a string of xpath=//a, which selects all the anchors on the page. Since we don't really care about all of them, let's narrow it down with a set of square brackets, and inside them we call the contains function. This function receives two arguments: the first is the property, @href, and the second is the pattern we are searching for, that is, the set of characters our value contains. In our case that's a set of single quotes, and inside them arxiv.org/pdf.

Let's quickly assign this expression to links, and to make sure it worked, let's print them below. Spoiler alert: the first time it's not going to work, and we will see shortly why. So for link in links, we print link.get_attribute, in our case with href. Before we run it, let's turn off the slow motion, because we already know it works and we don't need to see it time and again. Let's save it and run it, and as I promised, we are getting an error, specifically a TypeError, because a Locator object is not iterable. So let's turn it into something iterable simply by calling the all method on our locator. If we save everything and give it another run, we now get a whole bunch of links in return. There you go, a very big list. Let's copy one of them to make sure we get a scientific article back, and beautiful, we do; if we scroll down, it sure looks like it relates to neural networks.

Now the only task left on our list is to download all these PDF documents and store them on our system. Let's create a new directory for our files with mkdir, and call it data. Then, instead of printing our href attribute, we assign it to a local variable named url, and we use a library named urllib to download all these URLs as files. At the top of our code we type from urllib.request import urlretrieve, which is a nice little function that I hope I spelled right. Let's copy it, and then call it at the bottom of our for loop. This function receives two arguments: the first is the URL itself, and the second is the name of our file, which in our case starts with data/, to which we concatenate something unique. How about the serial number at the end of our href attribute? We have this 13763, and I believe this one is unique; I sure hope so. So back in our code we focus on the last five characters of our URL string with url[-5:], which slices off all the characters before index -5. Then we of course concatenate the extension .pdf.

Let's save it, and if I have no typos, it should work. Let's give it a run. This one takes a bit of time because we're downloading 50 articles onto our system. Great, it looks like we are done. Let's navigate to our project folder and refresh it, and we have a data directory with a whole bunch of PDF files. Let's have a look: beautiful, it is an article that sure looks like it has to do with neural networks. How about some other files? Let's see the first one: that's a PDF research paper as well. How about something in the middle? Fantastic, they are all research papers. Amazing.

So we are done with the introduction to Playwright; let's move on to something a bit more professional. So far we've been dealing with a very forgiving website: it never blocked our IP, and it never stopped us with a CAPTCHA. But what happens if we're scraping a website that does? For example, here's one that uses Cloudflare protection, and essentially it will always know that we're not human. Another example is a popular website like walmart.com, which just slaps us with a CAPTCHA as soon as we try to access it. But that's not going to stop us: we'll just use a very powerful tool named Web Unlocker. This tool uses a combination of proxies, CAPTCHA solvers, and all kinds of goodies to bypass site security.

To use it, we navigate to brightdata.com using the link in the description. Why? Because it gives you $15 of credits, which is far more than what you need to follow along with this tutorial. I've been using Web Unlocker for about a week, and so far it cost me like 15 cents or something ridiculous. Let's click on Products, followed by Web Unlocker API. We then scroll down and start our free trial. We of course log in (we will not change our password this time), and we click on Proxies and Scraping. We then scroll down and get started on Web Unlocker API. We choose a name for our zone, let's say Maria's Zone, and we make sure that our CAPTCHA solver is enabled. If you're trying to scrape one of these websites (check out this not so long list), you will need some extra unblockers. These are extra problematic domains, so you'll need to enable this premium domains section. In my case I'm not scraping one of them, so I'll just click on Add and confirm.

Great. One more thing we need to do is generate an API key, which we can do from our account settings, where we simply click on Add token, give ourselves some administrative privileges, and click on Save. Since I will delete this token as soon as I'm done filming this video, I'm just going to click exit; usually you would save it in a secure place on your system. In my case it doesn't really matter, so we dismiss it. Once we are done, we click on Proxies and Scraping, navigate to Maria's Zone, and all the credentials we need are here under this Access details section. Great. So how do we specify them in our code? Let's go back there and
let's create a new dictionary named proxies. This dictionary has a key of server, a key of username, and lastly a key of password. Now let's copy our credentials from Bright Data. Back in our access details, we copy the host, which is our server; then we copy the username, which is this long, long string over here, and paste it as our username, of course; lastly we copy our password and paste it as well. Beautiful. The last thing we need to do here is specify our proxies when we launch our Firefox browser: we add a property of proxy and set it to proxies. And we get rid of this space to keep things consistent.

There are two more things I've added off camera. I located the search input and filled it with some text; there you go, it should say testing. Right underneath, instead of clicking on some button or submitting a form, I simply press the Enter key, which should work as well. You can do it too.

Oh, and there's one more thing you need to notice. Since we're using a bunch of proxies and some security bypassing mechanisms, it kind of makes sense to scrape the HTTP version of a site rather than the HTTPS version, which is the secured one. So make sure you do so with http. Let's save it and give it a run. There you go: here's the ugly HTTP version of Walmart. We have a bit of a delay here because our slow motion is set to 5 seconds. Here's our testing text, and there you go, we are navigating to a new page. Amazing, folks: we now know exactly how to use proxies.

So let's try them on the other site we've been trying to scrape before, the one with Cloudflare protection. Instead of walmart.com it was something with geeks... let me check, hold on, I forgot. OK, it was Geektime. Let's do it: geektime.
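The proxy setup from the Walmart run can be sketched roughly as follows. Treat this as a sketch under stated assumptions rather than the code from the video: the helper names build_proxy_settings and search_behind_proxy, the environment variable names, the placeholder credentials, and the search box selector are all hypothetical, since the video does not show how the search input was located. The server, username, and password keys, the proxy launch property, the Enter key press, the http URL, and the 5 second slow motion come from the walkthrough.

```python
import os


def build_proxy_settings(host, username, password):
    """Return a dictionary in the shape passed to launch(proxy=...)."""
    return {"server": host, "username": username, "password": password}


def search_behind_proxy(proxies, url="http://walmart.com", query="testing",
                        search_selector="input[type='search']"):
    """Open `url` through the proxy, type `query`, press Enter, return the title.

    `search_selector` is a hypothetical selector; inspect the page to find
    the real one.
    """
    # Imported lazily so this file can be loaded without Playwright installed.
    from playwright.sync_api import sync_playwright

    pw = sync_playwright().start()
    try:
        browser = pw.firefox.launch(headless=False, slow_mo=5000, proxy=proxies)
        try:
            page = browser.new_page()
            page.goto(url)                     # plain http, as in the walkthrough
            search_box = page.locator(search_selector)
            search_box.fill(query)
            search_box.press("Enter")          # submit without clicking a button
            return page.title()
        finally:
            browser.close()
    finally:
        pw.stop()


# Hypothetical environment variable names; the fallbacks are placeholders,
# not real credentials.
proxies = build_proxy_settings(
    host=os.environ.get("PROXY_HOST", "proxy.example.com:8080"),
    username=os.environ.get("PROXY_USERNAME", "your-username"),
    password=os.environ.get("PROXY_PASSWORD", "your-password"),
)

# Example call (needs real credentials, network access, installed browsers):
# print(search_behind_proxy(proxies))
```

Reading the credentials from the environment is only one option; the point is to keep the username and password out of the script itself, in the same spirit as storing the API token in a secure place.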