Python Part 19 - Pandas Office

Welcome to this Wise Owl tutorial on using pandas within Python. Here's what you'll learn during this tutorial. We'll begin by looking at what pandas actually is, its pros and cons, and whether it's for you or not. We'll then go on to look at how you can install the pandas module, and then at how you can create the basic building blocks: data frames and series. We're then going to look at how you can show information about your data frames.

Then we'll look at how you can read from and write to CSV files, Excel workbooks and SQL Server tables, and I hope you'll see just how easy this is to do. Then we'll look at indexing data frames to make it easier or quicker to get at individual rows, and then at sorting and filtering data frames. We'll look at showing statistics, including grouping, at which point the syntax will get a little more difficult, perhaps a tiny bit. We'll look at how you can create columns using expressions, and at how you can rename columns, how you can join tables together and how you can remove duplicates: just three things which didn't seem to fit in anywhere else. Finally, we'll show a quick introduction to matplotlib, showing how you can create charts based on pandas data frames.

At the top right of the screen a link will appear about now, and you'll be able to click on that at any time to download any files or exercises to do with this tutorial. You can get the same link from the YouTube page for this tutorial.

But that's enough of me; it's bye-bye and over to Sven. So let's get started. Let's start with the fundamental question: why is pandas called pandas? Here's the Wikipedia page for pandas at the time of speaking, and you can see from this that the pandas name is derived from the term "panel data", or it's a play on the phrase "Python data analysis" itself.

The only problem with this theory is that neither of those gives anything remotely like "pandas", so my theory as to why pandas is called pandas is that everyone wanted to have pictures of nice fluffy pandas all over the place. But anyway, what is pandas? It's like Excel, and the easiest way to understand it, I think, is in comparison to Excel, which I hope is a concept people are familiar with. So there's an Excel spreadsheet listing out some films, and there's the equivalent pandas data frame, as it's called.

So in Excel you can import data, you can create formulae, you can create charts, and you can create pivot tables or pivot your data, and much more besides, admittedly. In pandas you can import data from Excel, CSV or wherever it may be; you can create formulae, far more so than in Excel actually; you can create charts using a separate module called matplotlib, which I'm going to touch on right at the end of this tutorial but which we won't cover in any more detail; and you can pivot your data, although admittedly the results won't be as easy to interpret, because you won't have the nice glossy interface of a pivot table.

So that's what pandas is. Let's look at the building blocks of it. There are two main ones: data frames and series. So here's a data frame. It consists of a table of rows of films, and it consists of five separate series. I've highlighted one of the series there: a series consists of the data itself in a column, but excluding the column title.

So let's look now at why pandas beats Excel, and then we'll look at why Excel beats pandas, before we get started. Reasons pandas is better than Excel: firstly, it's less limited by the size of the data. Excel is limited to about one million rows (1,048,576 per worksheet), and not only that, it can run very slowly long before that, whereas because pandas doesn't have any visual interface to maintain, it can run much more quickly behind the scenes. Secondly, it's got far more tools for analysis and for data cleaning. And thirdly, it's free.

You don't have to pay a license fee to Microsoft, and of course it gives you a chance to practice your Python. Against that, here's why Excel might beat pandas. Number one is that you can see what you're doing on screen: the visual interface is much clearer, and you don't have to keep showing your data frame again. It's got a much bigger user base, so you're going to find more people who know Excel than have the skills to use pandas. And finally, and most controversially, it's quicker to do things. I realize this could be construed as a matter of personal preference, but surely if you know Excel and pandas inside out, it's still quicker to use Excel to do most things. I'll leave you to debate that yourselves, but now I think it's time to start using pandas, beginning by installing it.

In order to be able to use pandas you first need to install it. You can see I haven't got it installed on my machine, and that's why I'm getting the error message. To do this you need to go to your terminal window. What I'm going to do is just prove that I've uninstalled all of the other modules (I think all of them) that I installed in making the other tutorials in this series, so I've almost got a virgin copy of Visual Studio Code. So what I'm going to do is install pandas, and to do that you type pip install pandas. When you press Return it will say what it's installing, and you can see it's installing pandas but also installing NumPy, a package or module which allows you to work with arrays. It's also installing something called six, which I must admit I know little or nothing about.

You'll very rarely see the line import pandas, even though it's now going to work. Instead most people give it an alias and write import pandas as pd, so from now on every single program we create will include that line at the top. Having installed pandas, I think it's time to start using it.
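As a quick check that the installation described above worked, a minimal sketch like the following can be run. It only assumes that pip install pandas has already been run in the terminal; printing the version number is an addition for illustration and is not something done in the video.

```python
# Run once in the terminal beforehand: pip install pandas
import pandas as pd

# If the import succeeds, pandas is installed; show which version was found.
print(pd.__version__)
```

If the import fails with a "No module named pandas" style error, the package has most likely been installed into a different Python environment from the one running the script.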

So to show creating data frames and series, I've created a file called a - getting started.py.

Accompanying this tutorial is a file called exampledataframe.png, which is just a picture of the data frame that we're going to create. In order to be able to do this we need to create a dictionary, and the dictionary will have keys and values. The first key will be the word id, and that will be used to pick up the values of all the numbers in the column. The second key will be the title.

The values for that dictionary key will be all the films, and so on. So what we'll do is create that dictionary and use it as a basis for our data frame, although normally you would import a data frame from another file, which is what we'll do shortly in this tutorial. To do this, the first thing I need to do is create a new dictionary. I'll call it films, and I'll put a comment in above it to denote what I'm doing. It's a dictionary, so it needs curly brackets, like that. What I can then do is specify the first key, which is the id, and the values for that will be a list of values: 1, 2, 3, 5, 6. For some reason I missed out 4; I'm not quite sure how that happened. Then I could go on to the next key, the title, and so on. Now it's not very interesting watching me doing that, so let's paste that in from the clipboard to get my dictionary. So there's my films.

What I can now do is use those to create a pandas data frame based upon this. To do that, I'm going to call my data frame df, for data frame, and what I'll do is take the pandas module, which I've given the alias pd, and create a DataFrame based on that. The argument I'll pass to it is a dictionary, so it's going to be films. What I want to do then is prove this worked, so I'm going to print out two things. The first thing I'll do is print out the type of that, because I want to show you what you've got, and then we'll print out the data frame itself. If I run that program, you'll see I get the type of it (it was a DataFrame, I'm pleased to see) and the actual data frame itself. So that's one way in which you can create a data frame. I'll just comment those two lines of code out so I don't print them out again, and what we'll now do is create a series, so an individual column, which is the other building block if you remember.
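The dictionary-to-data-frame steps just described might look roughly like this. Only the id values (1, 2, 3, 5, 6) and the existence of a title column come from the walkthrough; the exact key spellings, the placeholder titles and the RunTime column are assumptions made up for illustration, and the real example has five columns.

```python
import pandas as pd

# Placeholder data: the titles and the RunTime column are invented.
films = {
    "Id": [1, 2, 3, 5, 6],
    "Title": ["Film A", "Film B", "Film C", "Film D", "Film E"],
    "RunTime": [120, 95, 143, 101, 88],
}

# Each dictionary key becomes a column name; each list supplies that column's values.
df = pd.DataFrame(films)

print(type(df))  # shows that df is a pandas DataFrame
print(df)        # shows the table itself, with a row index on the left
```

All the lists need to be the same length, since each one fills a complete column.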

I'm going to base my series on this list of films, so I'll just copy that to my clipboard. What I'll do now is create a list: I'll call this film_titles and set it equal to my list of films. Then I'll create a pandas series based on this. To do that I'll create a variable called film_series.

I'll set it to pd.Series, exactly the same sort of command, and then I can supply any sequence or list-like object (I think it said so in autocompletion there, or in the help), so I'll base it on my film_titles. Then I'll do the same thing to prove this works: I'll print out the type of this, and then I'll print out the object itself, the film series.
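A minimal sketch of creating the series, assuming the variable names film_titles and film_series and using invented placeholder titles:

```python
import pandas as pd

# Placeholder titles; the real list of films is not given in the transcript.
film_titles = ["Film A", "Film B", "Film C", "Film D", "Film E"]

# pd.Series accepts any list-like object and adds a default 0..n-1 index.
film_series = pd.Series(film_titles)

print(type(film_series))  # a pandas Series, which is its own class
print(film_series)
```

Series and DataFrame are separate classes: a Series holds one column of values plus an index, and a DataFrame can be thought of as several Series sharing the same index.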

And if I run this to prove it worked, I get two things again. This time I've got a Series, which you can think of as a single column of a data frame, and I've got the actual series itself. So that's how you can create data frames and series, although as I say you would normally import them from other files, which is what we'll do shortly in this series.

We're going to look now at showing both information about the data frame and parts of it. To that effect I've created a file called b - showing dataframe info.py, and I've copied in the information from the previous example we did, so I've got a dictionary of films and I've used that to create a data frame. The first thing I might want to ask is: tell me about my data frame. You can print out general information about a data frame by printing out the name of the data frame followed by .info. And info is actually a function or method, so I need to open and close brackets after it. If I run that, you'll see I get some general background information, quite useful stuff, on my data frame. I can see what it is.

I can see how many columns I've got, and I can see for each column what the data type is.

I can even see how much memory it takes up, which for large data frames could be very useful. I'm not quite sure what that None is at the bottom. Let's just comment that line out. You could also be more specific and just get information on data types. Doing this is very similar, but instead of using info you just use the dtypes property. If I run that, you'll see I get the data types of my five columns.
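Here is a small sketch of info and dtypes, using an invented three-column stand-in for the films data frame. It also suggests where the stray None comes from: info() prints its report itself and returns None, so wrapping it in print() prints that None as well.

```python
import pandas as pd

# Invented stand-in for the films data frame used in the video.
df = pd.DataFrame({
    "Id": [1, 2, 3, 5, 6],
    "Title": ["Film A", "Film B", "Film C", "Film D", "Film E"],
    "RunTime": [120, 95, 143, 101, 88],
})

# info() writes its report to the screen and returns None,
# so it can be called on its own without print().
df.info()

# dtypes is a property (no brackets): one data type per column.
print(df.dtypes)
```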

Two of them it was able to pin down as integers; the other three just contain strings or dates or something less specific. I'll comment that out. You can also get summary statistics. There's much more on specific statistics later on in this tutorial, but for the moment it's useful to know that there's something called describe, which is a method and so needs brackets. If you run this you will get a quick summary of statistics: you can see that for all of the numerical columns in the data frame it's showing how many values there are, the mean, the standard deviation and all sorts of other interesting statistics like the 25th percentile. That's what you get by default; obviously you can fine-tune that and get exactly the statistics you want. So that's how you can show information about a data frame. You can also show part of the data frame, so you can show the top and bottom rows, for example.
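As a rough illustration of describe, again on invented placeholder data:

```python
import pandas as pd

# Invented stand-in for the films data frame used in the video.
df = pd.DataFrame({
    "Id": [1, 2, 3, 5, 6],
    "Title": ["Film A", "Film B", "Film C", "Film D", "Film E"],
    "RunTime": [120, 95, 143, 101, 88],
})

# By default describe() summarises only the numerical columns, giving
# count, mean, std, min, the 25%, 50% and 75% percentiles, and max.
stats = df.describe()
print(stats)

# The result is itself a data frame, so one figure can be picked out.
print(stats.loc["mean", "Id"])
```

Because the result is an ordinary data frame, it can be filtered or exported like any other.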

I might decide I just want to look at the first three films. To do that I can use the head function, and in brackets I can specify how many rows I want to get. If I put 3 in there and then run that program, you can see I just get the top three rows. You can also put negative numbers in here, as with slicing, so if I put -3 I will miss out the bottom three rows and just get the top two. But there's no real reason, I think, to use minus signs, because instead of that you can use the tail function. So what I could do is show the bottom three; that would show the last three films. So head shows the top rows and tail shows the bottom ones. One interesting thing about this is that if I put a ridiculously large number in there it doesn't actually crash, it just shows all of the rows. Likewise, if I put in a ridiculously large negative number, so that it can't show anything, it just shows an empty data frame. So it's quite forgiving.
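The head and tail behaviour described above can be sketched as follows, using an invented five-row stand-in for the films data frame:

```python
import pandas as pd

# Invented stand-in for the films data frame used in the video.
df = pd.DataFrame({
    "Id": [1, 2, 3, 5, 6],
    "Title": ["Film A", "Film B", "Film C", "Film D", "Film E"],
})

print(df.head(3))      # the first three rows
print(df.head(-3))     # everything except the last three rows (here, the first two)
print(df.tail(3))      # the last three rows
print(df.head(1000))   # more rows than exist: all five rows, no error
print(df.head(-1000))  # nothing left to show: an empty data frame, no error
```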

So that is how you can do heads and tails. The last thing I was going to do in this section is to show how you can show specific columns. The first thing I'll do is just show a single column, so we'll take the data frame and specify which key we want to pick out, and I'll just show the title column. You're accessing a key in the dictionary, all of which makes perfect sense to me. If I run that I'll just get a single column; in fact what I'm doing there is returning a series, so that makes sense. The next bit may be less so. Let's suppose I want to show the id next to that, so I'll list the two keys and pass both in. This should show a data frame with two columns. If I try running that, I get an error message, and not just an error message but a difficult-to-understand error message. I must admit I can't see anything in there which is telling me what the problem is. The problem is that when you're using this syntax you need one square bracket to say "here's what I'm selecting", and then if I'm passing in a list I need to denote it as such, so I need to include two square brackets, one after the other, at the beginning and at the end. In my experience of using data frames this is the hardest thing about them: the syntax of what things expect to be passed can be quite unpredictable, and you often find yourself with sequences of brackets. You need to think very carefully about what you're doing to avoid syntax errors. But anyway, if I now run that, you'll see it gives me my two columns.
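A minimal sketch of the single-bracket versus double-bracket syntax, with invented placeholder data and assumed column names:

```python
import pandas as pd

# Invented stand-in for the films data frame used in the video.
df = pd.DataFrame({
    "Id": [1, 2, 3, 5, 6],
    "Title": ["Film A", "Film B", "Film C", "Film D", "Film E"],
    "RunTime": [120, 95, 143, 101, 88],
})

# One key inside one pair of square brackets returns a Series.
titles = df["Title"]
print(type(titles))

# A list of keys inside the selection brackets returns a DataFrame.
# The outer brackets mean "select"; the inner brackets make the list.
two_columns = df[["Id", "Title"]]
print(two_columns)

# Writing df["Id", "Title"] instead raises a KeyError, because pandas
# then looks for a single column whose name is the pair ("Id", "Title").
```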

What I want to do now is show how ridiculously easy it is to read and write CSV files within Python. I've hinted at this in previous tutorials, saying don't bother reading CSV files in line by line, just use pandas to import them into a data frame. We'll start with writing the data frame I've created out to a CSV file. I've created a new file called c - read and write csv.py. It contains this data frame in a variable called df, and I'm just going to write this out to a CSV file. I'll just put in a comment. To do this I can take my data frame and use .to_csv. What could be easier? To get the path I would normally just paste it in, but I thought I'd show you how I've been doing that: you can right-click on the folder in which you want to store the file and choose to copy the path.

What I'll do is just paste that in here with a little r and some quotation marks, and then add on the file name, so let's call it films.csv. That really is it. If I now run that program, you can see it creates a file called films.csv. Now there's only one issue with that, which is this rather weird comma (let's just highlight it there) at the top left, the preceding comma. As for the reason that's coming in: I know the solution, and I'm not quite sure I understand it, but that's okay. The solution is to add an additional argument saying the index argument should be set to False. If I run that again, you can see it gets rid of it. I've read up a bit on it and I still don't understand the reason, but the solution works fine, so that's good enough for me. You could also, if you like, have a play about with quotation marks.
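The effect of index=False can be seen by comparing the header line of the file with and without it. This is a sketch on invented data; the temporary folder is a stand-in for the pasted folder path, which the transcript does not give. The leading comma appears because the row index is written as an extra first column, and that column has no name, so its header cell is empty.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Invented stand-in for the films data frame used in the video.
df = pd.DataFrame({
    "Id": [1, 2, 3, 5, 6],
    "Title": ["Film A", "Film B", "Film C", "Film D", "Film E"],
    "RunTime": [120, 95, 143, 101, 88],
})

# A temporary folder stands in for the folder path pasted in the video.
csv_path = Path(tempfile.mkdtemp()) / "films.csv"

# Default: the unnamed index is written as the first column.
df.to_csv(csv_path)
header_with_index = csv_path.read_text().splitlines()[0]
print(header_with_index)     # ,Id,Title,RunTime

# index=False leaves the index column out altogether.
df.to_csv(csv_path, index=False)
header_without_index = csv_path.read_text().splitlines()[0]
print(header_without_index)  # Id,Title,RunTime
```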

You can set a quoting argument to specify whether you're going to use quotation marks or not. There are various values for this: if I use 2, for example, you can see I get quotation marks around just the text. I think if I'd used 1, I would get quotation marks around everything. And if you don't like those quotation marks, you could go back in and add another argument, the quote character (quotechar), and I could say I'll use a pipe symbol instead. It's at the bottom left corner of your keyboard if you want to play around with it. If you run this again, you'll see I get different quotation mark characters. So they couldn't have made it easier to create a CSV file, and reading it in is just as easy. What I'm now going to do is read the CSV file I've just created into a new data frame. I'll create a variable called new_df, and what I'll do is take the pandas module and apply the read_csv method.
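A sketch of the quoting options and of reading the file back, on invented data and with a temporary folder standing in for the real path. The numbers 2 and 1 correspond to the constants csv.QUOTE_NONNUMERIC and csv.QUOTE_ALL in Python's standard csv module, and using the named constants is a suggestion here rather than what the video does.

```python
import csv
import tempfile
from pathlib import Path

import pandas as pd

# Invented stand-in for the films data frame used in the video.
df = pd.DataFrame({
    "Id": [1, 2, 3, 5, 6],
    "Title": ["Film A", "Film B", "Film C", "Film D", "Film E"],
    "RunTime": [120, 95, 143, 101, 88],
})

csv_path = Path(tempfile.mkdtemp()) / "films.csv"

# quoting=2 (csv.QUOTE_NONNUMERIC) quotes text only; 1 (csv.QUOTE_ALL) quotes everything.
# quotechar swaps the usual double quote for a pipe symbol.
df.to_csv(csv_path, index=False, quoting=csv.QUOTE_NONNUMERIC, quotechar="|")
print(csv_path.read_text())

# When reading back, the same quotechar is passed so that the pipes are
# treated as quotation marks rather than as part of the data.
new_df = pd.read_csv(csv_path, quotechar="|")
print(new_df)
```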

I need to know where I'm getting it from, so what I'll do is just copy this from the line above. Then, to see if that's worked, I can print out the data frame I've just created. If I run that, you can see it gives me my data frame. Unfortunately it's included the quotation characters, because I haven't specifically said not to, so what I need to do is add an additional argument saying that my quote character is a pipe. Let's see if that's sufficient to do it. It is. That's how you can write and read CSV files; it's so easy.

Having looked at CSV files, what we're now going to do is look at how to read and write Excel. I've created a program file to that effect which creates my standard data frame listing out the films, and what we'll do, as we did with CSV, is write it out to an Excel workbook and then read it back in again. In doing this you'll find that writing and reading Excel files is nearly as easy as CSV files, and you'll realize why I kept giving warning messages about this in previous tutorials: pandas really does make it easy to load and export data. We'll start with writing out the data to an Excel workbook. Because I know I'm going to need the name of this path, I'll just right-click on the folder and choose Copy Path. To do this, what I can do is take my data frame called df and send it to Excel with to_excel, and the argument I'll specify is the folder location. I'll just put it in quotation marks and call it films.csv... films.xlsx, I beg your pardon. Then I'll try running that, and when you run it I don't think it will work. The reason is that it's saying there's no module named openpyxl. In order to be able to communicate with Excel, it needs some sort of library or module enabling it to do this, and to use the modern version of Excel files (.xlsx files, since Excel 2007 I think) you're better off actually ignoring that and using the XlsxWriter module instead. So what I'm going to do is install that: if I type pip install xlsxwriter and press Return, that will install it.

What I can then do is import that module at the top of my program, and then I can specify that this is the engine I'm using, so when I write the workbook out that's the module which will help me do so. That needs to go in quotation marks. So let's try that again. If I run that again, this time you can see it's worked, or at least it hasn't generated an error message. And if you look at my films file, if I right-click on it and choose Open Preview, then you can see the Excel data.

Just a quick reminder of what I just did: Visual Studio Code doesn't have a native way of viewing Excel files, but if you go to your extensions and install the Excel Viewer extension, then what you'll be able to do, as I just did there, is right-click on a file like the films workbook and choose Open Preview, and you'll be able to view it within Visual Studio Code.

Now that was really good. There are probably just two changes I'd like. One is that I don't particularly want this first column giving the row number (it's called the index), and the second is that I don't really want the sheet to be called Sheet1; I could do with a better name. So what we'll do is just tweak this by adding two more arguments. The first one is, as for CSV, our old friend the index argument, and I'll set that to False, which will stop the row numbers appearing on the left-hand side. The second argument I can specify is sheet_name, and I'll call it "List of films". There are many, many other arguments you can specify, to say for example which columns you're exporting, but that will do for me. If I run that again, again it seems to have worked. I need to close the Excel workbook down, right-click on it and preview it again to see the latest results, and you can see my worksheet has got a better name and I've lost my row numbers.
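Putting the three arguments together, the export might be sketched as below. The function name write_films, the placeholder data and the file name in the example call are assumptions for illustration. As far as I know pandas loads the engine module itself, so the explicit import of xlsxwriter at the top of the program should be optional, though the package does need to be installed.

```python
import pandas as pd

# Invented stand-in for the films data frame used in the video.
df = pd.DataFrame({
    "Id": [1, 2, 3, 5, 6],
    "Title": ["Film A", "Film B", "Film C", "Film D", "Film E"],
})


def write_films(frame, path):
    """Write a data frame to an .xlsx workbook using the XlsxWriter engine."""
    frame.to_excel(
        path,
        engine="xlsxwriter",         # the engine name goes in quotation marks
        index=False,                 # leave out the row-number (index) column
        sheet_name="List of films",  # instead of the default "Sheet1"
    )


# Example call (needs: pip install xlsxwriter):
# write_films(df, "films.xlsx")
```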

So that's how you can write to Excel. What we're now going to do is read back from it, so I will just comment out all of that. We'll read in the file, and to do this I'll create a new variable called new_df. I'll do exactly what I did with the CSV file: I'll take the pandas module and, instead of using read_csv, I'll use read_excel, and then in brackets I'll specify my path. Paste that in.

Then I should just be able to show the data frame I've just imported. So does it work? I think possibly not. If I try running it, you'll see it's missing the dependency openpyxl. So I think XlsxWriter is for writing information; I need to read it in, and to do that I need openpyxl. So it's back to the terminal window: pip install openpyxl. We used this in a previous tutorial to write information to Excel.
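Reading the workbook back could be sketched like this. The function name read_films and the file name in the example calls are assumptions; passing engine="openpyxl" explicitly is optional and is shown only to make the dependency visible. The sheet_name parameter of read_excel defaults to the first worksheet and also accepts a worksheet name.

```python
import pandas as pd


def read_films(path, sheet_name=0):
    """Read one worksheet of an .xlsx workbook into a data frame.

    sheet_name=0 means the first worksheet; a string picks a sheet by name.
    """
    return pd.read_excel(path, sheet_name=sheet_name, engine="openpyxl")


# Example calls (need: pip install openpyxl):
# new_df = read_films("films.xlsx")
# new_df = read_films("films.xlsx", sheet_name="List of films")
```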

If I install that, what I should then be able to do is import that into my program and try running this again. The next time I run it, you can see it gives me the information I wanted. I think the reason it's automatically picking up on the correct worksheet is that it's the first and only one in the workbook. If I wanted to be more specific, I could add the sheet_name argument and say it's called "List of films", so I could choose which worksheet I was going to import. If I run that one more time, you'll see it gives me exactly the same information, which looks to me to prove it worked. So that's how you can read from and write to Excel workbooks.

Finally in this mini-series, let's have a look at how you can read from and write to SQL Server tables. After the euphoria of working with Excel and CSV files, this is going to be a bit disappointing, I'm afraid. I've created a file called e - read write sql server.py, which contains our usual data frame. What you're probably expecting me to say is that you can use the to_sql method to write to your SQL Server database, and you can, but if you go to the pandas page on it, what you'll see is that it says databases supported by SQLAlchemy are supported, and that means anything else isn't.

So you can use that, but you'll need to create something called a SQLAlchemy engine first, and while that's a perfectly feasible way of proceeding, I think on balance it's easier to proceed without that and so not use this method at all. So what we're going to do is just write the rows out one by one from a data frame, as follows. I've created a SQL Server table called tblFilm which contains all the necessary columns, and one more as well.
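The transcript stops before the code for this step, so the following is only a sketch of the general row-by-row idea and not the method used in the video. To keep it runnable without a database server it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for SQL Server; the column names and the placeholder data are invented, and the extra column mentioned above is left out because the transcript does not say what it is.

```python
import sqlite3

import pandas as pd

# Invented stand-in for the films data frame used in the video.
df = pd.DataFrame({
    "Id": [1, 2, 3, 5, 6],
    "Title": ["Film A", "Film B", "Film C", "Film D", "Film E"],
    "RunTime": [120, 95, 143, 101, 88],
})

# An in-memory SQLite database stands in for the SQL Server connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblFilm (Id INTEGER, Title TEXT, RunTime INTEGER)")

# Write the rows out one by one; itertuples gives one named tuple per row.
for row in df.itertuples(index=False):
    conn.execute(
        "INSERT INTO tblFilm (Id, Title, RunTime) VALUES (?, ?, ?)",
        (int(row.Id), str(row.Title), int(row.RunTime)),  # plain Python types
    )
conn.commit()

# Read the rows back to check what was written.
rows = conn.execute("SELECT Id, Title FROM tblFilm ORDER BY Id").fetchall()
print(rows)
conn.close()
```

With SQL Server the connection would come from a driver library instead, and the parameter placeholder style may differ, but the loop over the data frame's rows would have the same shape.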

The content of this post was transcribed from the channel: https://www.youtube.com/watch?v=WrC-KrO3CxQ