Excel VBA Introduction Part 47.1 - Browsing to Websites and Scraping a Web Page

Welcome to this Wise Owl Excel VBA tutorial. In this video we're going to look at a technique referred to as scraping web pages, which really is a case of using VBA to open up a web page, read its HTML, and then get the useful bits of information out of the HTML into a readable format.

We're going to start by looking at how to use Internet Explorer to browse to a web page, and then we can inspect the HTML of that page using a couple of useful techniques in the browser. We'll mention briefly the Document Object Model, which is the sort of standardized way that web pages are built, and show you where you can find some useful references for how that works. Then we'll start looking at how to use code to refer to the various HTML elements on the page, which lets us manipulate the page, so we'll do things like fill in text boxes and input boxes on our web page, and then identify buttons that we can click on and try to follow hyperlinks.

Once we've done all that, we're going to have a really quick look at a simple technique using URL query strings, which involves concatenating the address of the website you want to visit with various parameter values that affect the results the page returns. Not all websites allow you to do that, but we're going to use a really good example of one that does.

Rather than using Internet Explorer, there's another, more efficient technique for getting HTML returned to your VBA code: it's a case of using something called an XMLHTTP request, and that will become clear when we look at that technique in the video. Essentially it's a way to avoid having to open up a browser and wait for the page to load, so it's a much more efficient, quicker technique.

For the final parts of the video we're going to look at the actual scraping part. We're going to be looping over various elements of an HTML page, specifically focusing on looping over tables, rows and cells, so we'll have lots of nested For Each loops looping through various elements of a currency exchange rate website. We'll wrap up the video by looking at a quick, simple user interface using a basic user form, which gives a drop-down list for us to select a currency, enter an amount, and then return a set of exchange rate tables to various different worksheets. So it's quite a long, detailed video. I hope you're ready for this one. Let's get started.

We'll start the video by looking at how to get Internet Explorer just to browse to a simple website. I'm beginning with a brand new blank Excel workbook; the only thing I've done so far is save it as a macro-enabled workbook. From there I can head into the Developer tab of the ribbon, choose Visual Basic, and we'll start a brand new module just to get this to work. The first subroutine we're going to create in here I'm going to call something like BrowseToSite.

If we want to control Internet Explorer, we'll need some kind of variable which can hold a reference to the application. This is a lot like the technique we used to control Microsoft Word or PowerPoint in previous videos, and you might remember from those videos that there were two different ways you could control an application: you can use a technique called early binding or a technique called late binding. Just to very quickly cover the difference between those two techniques: if I was using late binding, all I would need to do is start by declaring a variable which can hold a reference to any generic object, so the type of the variable will be Object. What we can then do is set that variable to refer to the result of the CreateObject function in VBA. I can say CreateObject, open some parentheses, and then state the name of the class of object that I'm trying to create. In this case that's InternetExplorer (which I can just about spell) dot Application, so InternetExplorer.Application.

What I can then do is control that application by using IE followed by a dot, but sadly I don't see any IntelliSense. I don't get any help here whatsoever because, technically speaking, the IE variable could contain any object, so you have to know what properties and methods the application has. Just to show you a couple of very basic ones to get started: I know that I can make the Internet Explorer application visible, so I can change its Visible property to True, and I can also make it navigate to a page, so I'm going to say Navigate and then pass in a simple URL. Let's go for Wise Owl, of course: www.wiseowl.co.uk, as we do in every single video.

Having done that, I can execute the code by pressing F5, and we ought to see a simple Internet Explorer window popping up, browsed to the Wise Owl homepage. Now, although late binding works, it's not necessarily the best technique when you're first learning how to use an application.
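
The late-binding steps described above can be put together roughly as follows. This is a minimal sketch rather than the exact code typed in the video: the subroutine and variable names follow the narration, the URL is my reading of the address the transcript garbles, and it assumes a Windows machine on which Internet Explorer is still installed and can be automated.

```vba
Option Explicit

' Late binding: no library reference is needed, but there is no IntelliSense.
Sub BrowseToSite()

    ' A generic Object variable can hold a reference to any application
    Dim IE As Object

    ' Create the Internet Explorer application from its class name
    Set IE = CreateObject("InternetExplorer.Application")

    ' Show the browser window, then browse to a page
    ' (URL assumed from the narration)
    IE.Visible = True
    IE.Navigate "www.wiseowl.co.uk"

End Sub
```

Because IE is declared as Object, a misspelled property such as IE.Visibel compiles without complaint and only fails when the line runs, which is one practical reason to prefer early binding while learning.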

Let's just close down that instance of Internet Explorer, and then let's show you how you can use early binding instead. Early binding requires you to set a reference to an object library, so in this case we're going to have to head to the Tools menu and choose References. The particular library that we're looking for in here is called Microsoft Internet Controls. There's quite a long list of libraries, as you may remember from previous videos in the series, so let's just scroll down to find it. There it is: Microsoft Internet Controls. It's very important that we check the box next to that library, and then I can click OK. What I can do now is, rather than referring to Internet Explorer as an Object, reference it as a specific class.

If you'd like to see which new classes, methods and properties you have access to having referenced this library, the simplest thing to do is head into the Object Browser: head up to the View menu and choose Object Browser, or just press the F2 key on your keyboard. Using the drop-down list at the top of that window you can select the new library that you've just set a reference to. Its name isn't that obvious in this list: it's actually called SHDocVw. I'm not quite sure what that abbreviation is for, but anyway, if I choose SHDocVw this gives me a list of all the classes that are available to me from that library. The one we're really interested in here is called InternetExplorer. If I select that, we can see it's got a list of methods and properties, and this will help us with the IntelliSense when we come to start writing code. Let's just close down the Object Browser and see how we can change the code now to use this new class.


The first thing we can do is replace the word Object in the variable declaration with a reference to the specific class we want, which in this case is InternetExplorer. If I press Ctrl and Space, you'll see that if I look for InternetExplorer it now appears in the list along with all the other classes. Sometimes it's useful to precede the name of the class with the name of the library that it belongs to; this is particularly important when you've got duplicate classes defined in multiple libraries that you've referenced. The name of the library is SHDocVw, as we've just seen. If I then enter a full stop, that limits the classes presented to me to just those defined in that library, so I can say SHDocVw.InternetExplorer.

What I can then do is modify the Set statement, so rather than using the CreateObject function I can say Set IE equal to a New instance of SHDocVw.InternetExplorer (if I could spell SHDocVw). There we go. That essentially produces the same functionality that we saw in the previous example. One massive advantage of this is that when I type IE and a dot on another line, I don't have to guess what the methods and properties are anymore. I get lovely IntelliSense showing me all the methods and properties. I'll find the Visible property in there somewhere down towards the bottom, there it is, and I'll find the Navigate method as well, so those are the two things that we've just used.

I also have an opportunity to save a little bit of time and space with my code here. Rather than having to create a new instance of the application in a separate statement like this, I can combine this functionality in the variable declaration. If I say Dim IE As New SHDocVw.InternetExplorer, I don't technically need to explicitly create a new instance of it at any point. This is referred to as an auto-instancing variable. It doesn't actually create the new Internet Explorer in this declaration; what it does is wait until the variable name is used in code and then check whether that variable references something yet. If it doesn't, a new instance is automatically created as and when required. So you can save a little bit of time and effort when you're using this early binding technique.

Anyway, having done all that, and hopefully you're happy with the difference between early binding and late binding, just to prove that it all works in the same way: the end result will be the same if I run the subroutine. I'll end up back on the Wise Owl homepage.

Now, although we're not going to use Internet Explorer for the entire video, it's worthwhile just seeing a few of the basic things you can make it do, so I'm going to close down this instance of Internet Explorer. If we want to manipulate a web page once it's loaded, it's really important that we wait until Internet Explorer has finished navigating to the page. Whatever code I write after the IE.Navigate would ordinarily try to run immediately, and that might happen before the page has finished loading. Just to demonstrate that, I'm going to write a quick little Debug.Print statement. I can say Debug.Print and then IE.LocationName, followed by a comma, and then IE.LocationURL, so we're going to try to print out two bits of information about the page we've just navigated to. I'll need to display the Immediate window, which I can do from View, Immediate Window, or by pressing Ctrl + G. Then, if I execute the code, we ought to see the values that I've requested in the Immediate window at the bottom of the screen. But you can clearly see that it hasn't printed anything. If I close down Internet Explorer, nothing has appeared there at all.

To make Internet Explorer wait until the page has finished loading, we're going to add a couple of lines of code in between navigating to the page and trying to do something with it. A common approach is to use either a Do Until or a Do While loop. I'm going to use a Do While loop, and what we're going to do is test the ready state of the Internet Explorer application. So I say Do While IE.ReadyState is not equal to, and then there are five different states that Internet Explorer can be in. I'm going to choose the one that says READYSTATE_COMPLETE, so it will carry on looping round while the ready state is not complete.

You can write some lines of code inside this loop, but it's not actually necessary. Common things you'll see are something like Application.Wait, which makes it wait until a particular time of day; the common approach is to say Now plus a TimeValue of one second, which adds one second to the time right now. Another thing you could do is use DoEvents, which, the documentation says, yields control to the system so that it can do anything it needs to do. But you don't technically need to do any of those things; it's perfectly sufficient just to have the loop going round testing that condition. Once it has been met, the code will continue. Having done that, all I'm going to do now is run the subroutine again, and we'll see this time that we do actually print out some information about the web page.

Okay, now that we've got a web page successfully loaded, let's look at a couple of the very basic things we can make Internet Explorer do to it. Just to demonstrate this technique I'm going to change the website that we're browsing to. I'm going to use Wikipedia, just as a simple little example. I've actually already browsed to Wikipedia using Google Chrome, so I'm going to head back there and copy the URL from the address bar. Just in case you were wondering why I'm using Internet Explorer rather than Chrome for this demonstration, it's simply because Chrome, Firefox and other browsers don't provide a VBA object library to use. There are some third-party tools you can use to make Chrome, Firefox etc. work, but I've never used them myself. Internet Explorer is reasonably reliable, but we're not going to be using it for most of the rest of this video. We'll show you another technique that means you can avoid using web browsers altogether, but we'll get onto that a little bit later on.

Anyway, having copied the URL for Wikipedia, I'm going to paste that in in place of Wise Owl. Just to quickly mention: because we're using Internet Explorer here, it doesn't really matter if you haven't got fully formed URLs. When you type a website into the address bar of your web browser, you don't tend to find yourself writing http:// etc. Because we're using Internet Explorer, there's validation in place that will add all the necessary prefixes, so you don't really need to write fully formed URLs. Okay, let's just test that that works by running the subroutine.
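
Putting the early-binding declaration, the wait loop and the Debug.Print statement together gives something like the sketch below. It assumes the Microsoft Internet Controls reference has been ticked under Tools, References. The exact Wikipedia address is not given in the narration, so the URL here is an assumption, and the DoEvents call inside the loop is the optional extra mentioned above rather than a requirement.

```vba
Option Explicit

' Early binding: requires a reference to Microsoft Internet Controls (SHDocVw).
Sub BrowseToSite()

    ' Auto-instancing variable: the instance is created the first time IE is used
    Dim IE As New SHDocVw.InternetExplorer

    IE.Visible = True
    IE.Navigate "https://en.wikipedia.org"   ' assumed URL

    ' Keep looping until the browser reports that the page has finished loading
    Do While IE.ReadyState <> READYSTATE_COMPLETE
        DoEvents    ' optional: yields control so the system can process events
    Loop

    ' Print the page title and address to the Immediate window (Ctrl + G)
    Debug.Print IE.LocationName, IE.LocationURL

End Sub
```

If the loop is removed, the Debug.Print line is likely to run before navigation has finished, which matches the empty Immediate window seen in the demonstration.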

Having done that, you can see down at the bottom we've printed out some information about Wikipedia. Now, the reason I'm using Wikipedia is because it provides us with a nice little opportunity to manipulate the page using VBA code. What we're going to do is get our VBA to write a search phrase in the search box and then click the Go button to search for whatever we've typed in. That does require knowing at least a little bit about the structure of the page. Fortunately, every web browser lets you see the HTML code that makes up the page, simply by right-clicking in its background somewhere and choosing an option similar to View Source; it might be worded slightly differently in different browsers. If I choose View Source in Internet Explorer, it gives me a pop-up section at the bottom of the page showing me all the HTML code. It's the same sort of thing in Chrome, just to make sure you're happy with it: if I right-click on the background of the Chrome page and choose View Page Source, the output is displayed slightly differently, in that it takes me to a separate tab in Chrome.

Picking through all the code in here is a little bit tricky, so rather than try to work out exactly where the code for this particular search box is located, what I can also do, if I switch back to Internet Explorer, is right-click on a single specific item in the page and choose Inspect Element. In Internet Explorer this takes me to the same panel at the bottom of the page, and to the DOM Explorer. DOM is short for Document Object Model, which is a standardized way of building a page, or referring to the items in a page. So what we're actually going to do is make Wikipedia search for the Document Object Model.

The important thing about having inspected this element is that it takes us to the exact place in the code where that element is defined. You can see that for this input box, this text box, it's got a tag called input, which is the type of object that it is, and it's got a name attribute of search. It's also part of a form, and this form is another tag, which has an ID of searchform. So what we're going to do is write some code to change the Value property of the search element of the searchform form in the document.

Let's close that down, switch back to our VB Editor, and add a little bit of code in after we've printed out the location name and URL of the page. To start with, of course, we'll need to reference the Internet Explorer application, and then inside there we're going to refer to the Document property. The Document property simply refers to the HTML document that makes up the web page. Sadly, at this point the IntelliSense breaks down a little bit: if I type in a full stop, I don't get any further help whatsoever. We're going to see a much more sophisticated way to inspect the HTML of a web page later on in the video, using the Document Object Model to really get into the detail of a page. But for now, just as this first simple example, you can kind of treat items on the page as though they were VBA collections. So in there is a collection of form objects, which we can legitimately refer to as forms. Inside there we can refer to a form by name, and we've just found out that the name of that form was searchform; that's what we saw with the Inspect Element option in Internet Explorer. A form has got a bunch of sub-elements, so we've got an elements collection inside the form, and one of those was simply called search. That was the name of the input box that we can type our search phrase into. Inside there we've got a property called Value, and again, without having the IntelliSense, you just kind of have to trust that this is the case. Then we can make that equal to, in some double quotes, let's search for, Document Object Model.

What that should do is fill in the value of that search box on the Wikipedia page. I'm going to close down the currently open instance of Internet Explorer and then execute the code just to see if this works. There we go: we've got the page we opened up, and it's now got Document Object Model typed into that search box at the top.

The next thing to do is work out how to click that button. Again, if I right-click on that button and choose Inspect Element, it takes us to the exact part of the code that defines what that item is. It's another part of the same form, so you can see it's the form with the ID searchform, but within that form we've got another nested element whose name is go. Knowing that, if I close down this instance of Internet Explorer, I'm going to essentially copy and paste most of this line that refers to the individual items. Maybe a With block would be a sensible thing to use here. I'm not interested in the search element this time, I'm interested in the go element, and rather than try to change its value I'm going to apply a method to it. I know that I can apply a Click method to certain objects on the page.
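
The two lines just described, one setting the search box value and one clicking the Go button, might look like this when wrapped in the With block that the narration suggests. It is a sketch under a few assumptions: the form and element names (searchform, search, go) are the ones read out from the DOM Explorer in the video and may no longer match the live Wikipedia page, the URL is assumed, and the Microsoft Internet Controls reference is set.

```vba
Option Explicit

' Requires a reference to Microsoft Internet Controls (SHDocVw).
Sub BrowseToSite()

    Dim IE As New SHDocVw.InternetExplorer

    IE.Visible = True
    IE.Navigate "https://en.wikipedia.org"   ' assumed URL

    ' Wait for the page to finish loading before touching its elements
    Do While IE.ReadyState <> READYSTATE_COMPLETE
    Loop

    Debug.Print IE.LocationName, IE.LocationURL

    ' IE.Document is late bound here, so there is no IntelliSense:
    ' the form and element names come from inspecting the page
    With IE.Document.forms("searchform")
        .elements("search").Value = "Document Object Model"
        .elements("go").Click
    End With

End Sub
```

If either name is wrong, the error only appears at run time, because nothing after IE.Document is checked by the compiler.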

Having done that, I'm going to run this one more time, and we should see that Wikipedia opens up, the code types a search phrase into the box and then clicks the Go button to browse to the page. It is probably worthwhile having a bit of a read of this page at some point. I'm not going to go through everything that's on here right now, but it gives you a nice bit of background information about the techniques we're going to use going forwards in this video, and there are lots of nice links to other pages that will give you more detailed breakdowns of how the Document Object Model works. So feel free to have a look at the references on the page and the external links, and have a quick read of those at some point, maybe in your spare time.

At this point I'm going to close down that instance of Internet Explorer, and we're going to go a little bit further and see how we can get a reference to the document itself, rather than having to rely on predicting what the individual names of the items on the page are. Let's do this as a separate example. I'm going to copy essentially the entire subroutine that we've already got and paste it into a new module, just to keep things nice and neat and tidy. So I'll insert a new module and then paste that subroutine in. I'm going to change the name of the subroutine this time to something like GetHTMLDocument, and then we'll change a few little things as well. I'm going to clean up the Immediate window first of all, so I'll click in there, press Ctrl + A and then Delete to clear it; we'll use the Immediate window again shortly. Then I'm going to get rid of a few unnecessary bits of this code. I don't need the explicit Set IE to a New instance, as we've already got an auto-instancing variable. We don't need the Debug.Print statement here, and we don't need the bit about changing the values of various elements on the page, so I'll remove those lines as well.

What I want to do next is get a reference to the web page that we're browsing to as an actual HTML document, rather than the basic Document property of Internet Explorer. Just to make sure that we get as much help as possible with this, we're going to set a reference to another object library.

Let's head up to the Tools menu and choose References again, and then if we scroll down far enough, what we're looking for this time is something called Microsoft HTML Object Library. Let's make sure I don't go past it. There it is: Microsoft HTML Object Library. Make sure you check the box next to that and then click OK. If you want to get an idea of how many classes are defined in this library, have a quick look in the View menu, choose Object Browser again, and then change the drop-down list at the top to read MSHTML; that's the name of the library we've just referenced. There's a huge number of different classes listed in this library. It's a pretty sophisticated tool. We're of course not going to go through every single item in here in this video, as that would last forever. We're going to go through enough of these just to give you a starting point for manipulating HTML documents.

Let's close down the Object Browser and write a little bit of new code to get a reference to the HTML document. What we'll do first is declare a variable which can hold the HTML document we're going to retrieve. Let's say something like Dim HTMLDoc As MSHTML.HTMLDocument. Having done that, once the web page is loaded and Internet Explorer is ready, we can say Set HTMLDoc = IE.Document. That stores a reference to the document in that new variable. This is useful because it gives us much finer control over the items in the HTML document. Just to get a bit of a clue, if I say HTMLDoc followed by a dot, you get an idea of how many methods and properties we've got access to in here. We can get references to items in the page; there's a whole bunch of methods to do with getting elements, which we'll look at fairly shortly, and all sorts of other useful methods and properties that we're going to work with in other parts of this video.

Just to demonstrate why this is potentially useful, I'm going to get rid of that little line there and change the web page that we browse to. I'm going to go back to the Wise Owl web page, so let's change the URL back to wiseowl.co.uk. Notice here I'm missing out the www part; as we said, Internet Explorer handles the missing parts of a URL. Having done that, I'm going to run the subroutine to browse to Wise Owl. What I'd like to do is a similar thing to what we did with Wikipedia: I want to be able to type something into this search box and then click the Go button. But we've got a little bit of an issue with this. If I right-click on that search box and choose Inspect Element, what we'll see this time is that the input in here, although it has a name, which is what, is part of a form which doesn't have a name. So I haven't got an easy, convenient way to refer to the elements of this page like I did with Wikipedia. I can't say forms, form name, elements, element name. What we're going to do instead is use some of the HTML library's techniques for referencing that specific object by its ID or by its name.

Let's close down that instance of Internet Explorer and then write a little bit of code to get a reference to it. Let's declare a variable that's going to hold a reference to that search box. I'm going to declare a new variable at the top, which I'm going to call HTMLInput, and I'll declare this as an MSHTML.IHTMLElement. An IHTMLElement allows us to hold a reference to any individual item in an entire HTML document. Having done that, I'm going to set my HTMLInput down at the bottom: Set HTMLInput equal to HTMLDoc followed by a dot, and then I'm going to use one of its methods to get an item by its ID. There are methods in here to get an element by ID or by name. So I say getElementById, open some parentheses, and then open some double quotes. As we've just seen when we inspected the element in the web page's source, that element was called what, so I type that in, close the double quotes and then close the parentheses. What I can now do is change the value of that object, so I can say HTMLInput.Value equals, let's search for Excel VBA, as that's what we're dealing with. Having done all of that, let's run the subroutine by pressing F5, and we should see that we've now typed Excel VBA into that search box.

The next thing I want to be able to do is click on the Go button, but we've got another problem with this as well. If I right-click on the Go button and choose to inspect the element, we'll see, when the code does finally pop up in the DOM Explorer, that although the tag is definitely a button, it doesn't actually have a name or ID attribute. That gives us a little bit of an issue. Because we can't reference it by ID, as it doesn't have one, we're going to reference it by its tag. The first thing we're going to do is try to return a collection of all of the buttons on the page, and then loop through the buttons that belong to that collection to see if we can identify this specific one. There are a few unique bits of information about this button: it has a class name, a type and a value, so maybe we can use those instead.

Having established that we want to do that, let's close down this instance of Internet Explorer. The first thing we need to do is declare a variable which can hold multiple HTML elements, so I'm going to say Dim HTMLButtons As MSHTML.IHTMLElementCollection. So rather than type...
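
The transcript stops before the button loop is written, so the following is only a sketch of where the example appears to be heading, not the code from the video. It assumes references to both Microsoft Internet Controls and Microsoft HTML Object Library. The element ID "what" is taken from the narration, the HTMLButton loop variable is a hypothetical name, and because the button's actual class name, type and value are never stated, the loop simply prints those details for every button instead of clicking one.

```vba
Option Explicit

' Requires references to Microsoft Internet Controls (SHDocVw)
' and Microsoft HTML Object Library (MSHTML).
Sub GetHTMLDocument()

    Dim IE As New SHDocVw.InternetExplorer
    Dim HTMLDoc As MSHTML.HTMLDocument
    Dim HTMLInput As MSHTML.IHTMLElement
    Dim HTMLButtons As MSHTML.IHTMLElementCollection
    Dim HTMLButton As MSHTML.IHTMLElement    ' hypothetical loop variable

    IE.Visible = True
    IE.Navigate "wiseowl.co.uk"

    ' Wait for the page to finish loading
    Do While IE.ReadyState <> READYSTATE_COMPLETE
    Loop

    ' Store a typed reference to the document so IntelliSense is available
    Set HTMLDoc = IE.Document

    ' Find the search box by the ID read from the DOM Explorer and fill it in
    Set HTMLInput = HTMLDoc.getElementById("what")
    HTMLInput.Value = "Excel VBA"

    ' The Go button has no name or ID, so collect every button by tag name
    Set HTMLButtons = HTMLDoc.getElementsByTagName("button")

    ' List details that could be used to pick out the right button
    For Each HTMLButton In HTMLButtons
        Debug.Print HTMLButton.className, HTMLButton.getAttribute("type"), HTMLButton.innerText
    Next HTMLButton

End Sub
```

Once the Immediate window shows which combination of class name and type is unique to the Go button, an If test inside the loop could call the Click method on the matching element.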

The content of this post was transcribed from the channel: https://www.youtube.com/watch?v=dShR33CdlY8