Hello and welcome to this data analysis tutorial with me, James, from Matador Software. Today we're going to be looking at web scraping online data sets the easy way, using Python, Wikipedia and Google Colab. We'll be using Python through the Google Colab online IDE to automatically scrape web data into a usable data set, and we're going to convert and save this to a CSV file, all with the power of Python. We'll use Wikipedia to get those HTML tables, and this allows us to use a simpler Python package known as pandas and
not depend on the more complex Beautiful Soup, where we need to open up our developer tools and look for more specific areas, and deal with more complex installs.

Google Colab is a product from Google that allows anybody to write and execute Python code through the browser. It saves us going through the hassle and rigmarole of installing IDEs and text editors, and it's really well suited to data analytics and machine learning.

We're going to be pulling the data from a Wikipedia article I found on soccer, or football, and you'll notice as we scroll through the
article, suddenly we're presented with lots of tables, but they don't necessarily look very structured. So I'm going to show how easy it is to scrape these tables, access them and save them to a CSV file. By no means is that the data you may be using in the end, but it's a good example.

We're going to access Google Colaboratory, Google Colab for short, through the browser. We just have to start by loading up Google, or whichever search engine you use, and when we go to create a new notebook online we'll be presented
with the option to sign in, just like that. Once we've created a new notebook and signed in through a Google account, we're ready to write Python code, execute it and look at how we can easily scrape web data sets. Whether you have or haven't used Python before, I'm going to attempt to explain this at the same level of depth so that everyone gets an understanding.

The first thing we need to do is import pandas. pandas is a fast, powerful, flexible, open source data manipulation and analysis tool built on top of the Python
language, so we just need to import it into our project: we import pandas as pd. Now I'm creating an object, or variable, called scraper that we're going to use later. We've seen variables before in my DAX videos. We need to assign this to pd.read_html, and we simply pass in the https URL of our Wikipedia page. When I type in scraper, which in a notebook cell is much the same as calling the print function on scraper, we see everything that our web scraper is
already pulling with three lines of code, which is quite amazing. But the issue is we need to get some sort of structure here, so how do we do that? Well, there's a very simple method that I'll go through below. What we need to do is assign an index to our tables, and we also want some sort of text or symbols that separates the tables. So what we can do is write a for loop, for i, table in enumerate(scraper), where i is the index, and inside it we can print these lines to separate
the tables. We can also print i, which is just the index, or counter, of each table. If you look now, we've printed some lines, and you can see the index, 0, 1, 2, 3, 4, 5, whatever that may be, so every table underneath the separator lines has an index. You'll notice that if I were to just type in scraper with the index 0, I would print the first table. I also want to comment out this block of code, because it's not going to work properly if I have that
there. To comment out multiple lines of code at once, you just need to select them and press Command and forward slash (the question mark key) on a Mac, and I believe it would be Control and forward slash on a Windows machine. Now you see, if I comment that out and type print scraper at index 0, I get this lovely table. That's how easy it's been, with just a few lines of simple code, to scrape data and get the relevant table at the index that we want. That's fantastic, but now what if I want to assign this and save
it to a CSV file? I've assigned scraper at index 0, that first table, to df, for DataFrame; it's a pandas DataFrame. Now I'm going to call to_csv on it, which saves that DataFrame to CSV. I'm just going to give it a name, premier_league.csv, that's the data set, and just a standard index equal to False.

Lastly, I can check that our whole process has worked. I can create a variable, df_scrape_file, which is just our scraped DataFrame, and I'm referencing
that CSV that I saved in the line above, just to check that everything works. When I print that variable, df_scrape_file, I get the nice table, so our whole process has worked: I've been able to read that CSV back through the power of Python. I've managed to do all of this, scraping a specific table, within a few lines of code. So if you're thinking about acquiring more data sets, scraping more data sets, this is a great way to do it. I actually have a previous video
on doing this for Power BI and SQL data. As usual, if you like this content, please feel free to like, subscribe, comment and share. Thank you.
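
The steps from this walkthrough can be sketched roughly as below. This is a minimal sketch, not the exact notebook from the video: the function names scrape_tables, show_tables and save_table are hypothetical helpers introduced here for illustration, the separator symbol is arbitrary, and the Wikipedia URL is left as a parameter because the transcript does not give it.

```python
import pandas as pd


def scrape_tables(url):
    """Return every HTML table found at `url` as a list of DataFrames."""
    # pd.read_html needs an HTML parser such as lxml to be installed.
    return pd.read_html(url)


def show_tables(tables):
    """Print each table under a separator line and its index."""
    for i, table in enumerate(tables):
        print("*" * 60)  # separator between tables (the symbol is arbitrary)
        print(i)         # index to use when picking a table later
        print(table)


def save_table(tables, index, path):
    """Save one table to CSV, then read it back to check the round trip."""
    df = tables[index]
    df.to_csv(path, index=False)  # index=False leaves out the row labels
    return pd.read_csv(path)
```

In a notebook this might be used as scraper = scrape_tables(url), with url set to the address of the Wikipedia article, followed by show_tables(scraper) to find the index you want and df_scrape_file = save_table(scraper, 0, "premier_league.csv") to save and re-read the first table.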