Web Scraping

I.T. is like any other industry: it has fashions and trends that come and go. Some last a long time, like relational databases or T-shirts; others go away quickly, like the Web-OS or Crocs. Every now and again I find something that, on the surface, makes me think “that will never catch on”, but on later inspection I realise there is a lot of potential in the idea and wonder why more people aren’t using it. There is a third and darker category, where you think it’s brilliant and are glad more people aren’t using it, but we will get onto that a little later.
Web scraping, at its core, is getting a program to read a web page and extract the information you want from it. The first time I was shown such a thing was in a first-year programming course. Python was the language at hand, and the lecturer had already set up the program for the demonstration. In short, it went to a specified URL, downloaded the page as HTML, converted this into a string and then printed out everything that was inside the paragraph tags. Underneath, this is how all web scraping works.
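I no longer have the lecturer's program, so what follows is only a minimal sketch of the same idea, using nothing but Python's standard library. The names ParagraphExtractor, extract_paragraphs and fetch_html are my own and not from the original demonstration.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class ParagraphExtractor(HTMLParser):
    """Collects the text found inside <p> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []         # finished paragraphs, in page order
        self._in_paragraph = False   # True while we are inside a <p>
        self._pieces = []            # text fragments of the current <p>

    def _finish_paragraph(self):
        text = "".join(self._pieces).strip()
        if text:
            self.paragraphs.append(text)
        self._pieces = []
        self._in_paragraph = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # Sloppy pages often omit </p>, so a new <p> closes the old one.
            if self._in_paragraph:
                self._finish_paragraph()
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            self._finish_paragraph()

    def handle_data(self, data):
        if self._in_paragraph:
            self._pieces.append(data)


def extract_paragraphs(html):
    """Return the text of every paragraph in an HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


def fetch_html(url):
    """Download a page and decode it into a string."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

To try it on a live page, something like `for text in extract_paragraphs(fetch_html("https://example.com")): print(text)` should do, where the URL is just a placeholder for whatever page you are interested in.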

I understand that this is not for everyone. Who wants to create a program just to download some static information from the internet? But this leads me to the applications of the technology. Has your work ever asked you for the name and catalogue number of every item on a competitor’s website or, even worse, a selection of items from the entire catalogue? Or perhaps something easier: they want you to find every camera made by Nikon of Japan and the year it was released. For both of these examples you could easily open up your web browser and your favourite spreadsheet software and start copying, and in fact this is what most of us do.

A lot of the time this is the best way to attack the problem. You have to weigh up how long copying by hand will take you against the time it will take to program a system to do it for you. The point at which you should seriously consider setting up a web scraper or spider is when you need to do this on a regular basis, if for no other reason than that it is easier on your sanity to program a system once than to manually copy text for a few thousand entries every week.

To get started, I would recommend that you steer clear of the complicated web scraping tools. If you are an engineer and want to scrape all of the different metal types available from seven different suppliers and compare the costs on a per-kg basis, you may need to look at something like Scrapy or Automation Anywhere. Those of us who mostly need to pull information from a table over and over again should just install Google’s Chrome browser and then install the extension for it called Web Scraper. There are three videos that demonstrate what you use the scraper for. After the first video you will get the general idea of what is going on and how the program works. Be aware that the better the website is coded, the better this plugin works. For instance, the demo videos show a web page that looks like it was built with the assistance of Bootstrap. In fact, websites built with these sorts of frameworks are often easier to scrape, as they have repeatable patterns and clearly defined classes.
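If you would rather see the table case in code than in a browser extension, here is a rough sketch in plain Python of pulling the rows out of a table and working out a price per kg. The sample table, its column order (name, pack weight in kg, pack price) and the names TableRowExtractor, extract_rows and price_per_kg are all made up for illustration; a real supplier page will be laid out differently.

```python
from html.parser import HTMLParser

# Invented sample data standing in for a supplier's price table.
SAMPLE = """
<table class="table">
  <tr><th>Metal</th><th>Weight (kg)</th><th>Price</th></tr>
  <tr class="item"><td>Steel</td><td>10</td><td>25.00</td></tr>
  <tr class="item"><td>Copper</td><td>2</td><td>18.00</td></tr>
</table>
"""


class TableRowExtractor(HTMLParser):
    """Collects the cell text of every <tr> into a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def extract_rows(html):
    """Return every table row in the HTML as a list of cell strings."""
    parser = TableRowExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


def price_per_kg(rows):
    """Map each metal name to its price per kg, skipping the header row.

    Assumes three columns: name, pack weight in kg, pack price.
    """
    return {name: float(price) / float(weight)
            for name, weight, price in rows[1:]}
```

Running `price_per_kg(extract_rows(SAMPLE))` on the invented table gives a dictionary of metal names to per-kg prices that you could then compare across suppliers. The repeatable row-and-cell pattern is what makes this so short; a page without that regularity would need a lot more special-casing.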

Once you have set up your spider, you run the script and then wait for it to finish. Hopefully you are being productive at the same time and not just spinning around in a swivel chair. Earlier I mentioned that maybe it is best that more people aren’t using this kind of technology, and there is a simple reason for it. It’s rather like the current state of ad blocking on the internet: if too many people start using it, companies respond with something like subscription services. A similar thing can happen if many people start web scraping. The short of the argument is that if you are doing something you would have done manually anyway, don’t worry, scraping is fine. If you start trying to scrape information from social network sites in large quantities, don’t be surprised if your IP gets blocked for a while, if not permanently.
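One way to stay on the right side of that line in your own scripts is to skip the pages a site asks robots to leave alone and to pause between requests. The sketch below is my own suggestion rather than anything the Web Scraper extension does, and the names build_rules, allowed_urls and crawl_politely, along with the two-second default delay, are assumptions for illustration. It uses the standard library's robots.txt parser.

```python
import time
from urllib.robotparser import RobotFileParser


def build_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule set."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def allowed_urls(rules, urls, user_agent="*"):
    """Keep only the URLs the site's robots.txt lets this agent fetch."""
    return [url for url in urls if rules.can_fetch(user_agent, url)]


def crawl_politely(urls, fetch, delay_seconds=2.0):
    """Call fetch(url) for each URL, pausing between requests.

    fetch is whatever download function you already use; the delay
    is an arbitrary default, not a figure any site has published.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

None of this guarantees you will not be blocked, but a scraper that behaves roughly like a patient human is much closer to the "would have done it manually anyway" case.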


