Web Scraping With R

Overview

Much of the data we work with today can be obtained from the web, but doing so is not always straightforward: the data must be downloaded and then preprocessed and extracted, and the steps required depend on the format in which the data are stored.

This course explores some of these formats, together with the methods and tools used to retrieve data from the web and extract the desired information.

The first part introduces some common web technologies, how they relate to one another, and some tools for manipulating and extracting information, such as regular expressions. Next, common formats for storing web information (HTML, XML, JSON) are presented, along with tools for extracting it, such as XPath and CSS selectors. Finally, we introduce some R packages suited to processing web information and use them in several case studies.

Teaching Staff

Francesc Carmona, PhD. Professor, Departament de Genètica, Microbiologia i Estadística, Universitat de Barcelona.

Alex Sanchez-Pla, PhD. Professor, Departament de Genètica, Microbiologia i Estadística, Universitat de Barcelona.