Course Presentation

Overview

An important aspect of working with data today is that, very often, it can be obtained from the web. This is not necessarily straightforward: the data must be downloaded and then go through preprocessing and extraction steps that depend on the format in which it is stored on the web.

This course explores some of these formats, together with the methods and tools used to retrieve data from the web and extract the desired information.

The first part introduces some common web technologies, how they relate to one another, and tools for manipulating and extracting information, such as regular expressions. Next, common formats for storing web information (HTML, XML, JSON) are presented, together with tools to extract it, such as XPath and CSS selectors. Finally, we introduce some R packages suitable for processing web information and use them in several case studies.
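As a small preview of the kind of code used throughout the course, the sketch below (a minimal example, assuming the rvest package is installed; the URL is just a placeholder) reads an HTML page and extracts the same node first with a CSS selector and then with an equivalent XPath expression, before cleaning the result with a regular expression.

  # Minimal sketch, not part of the course materials; the URL is a placeholder.
  library(rvest)

  page <- read_html("https://example.com")

  # Extract the page title with a CSS selector (|> requires R >= 4.1)
  title_css <- page |> html_element("title") |> html_text()

  # The same node selected with an XPath expression
  title_xpath <- page |> html_element(xpath = "//title") |> html_text()

  # A regular expression can then clean or pick out pieces of the text,
  # e.g. keep only alphabetic characters and spaces
  clean_title <- gsub("[^[:alpha:] ]", "", title_css)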

Objectives

Specifically, by the end of the course students should:

Be familiar with the main technologies for dealing with information stored on the web.

Be able to recognize the different formats that can be used for storage.

Know how to extract information from these formats using specific R packages.

Contents

  1. Introducing technologies. Web scraping and web scraping projects.

  2. Data representation on the web: HTML, XML, JSON. Other technologies.

  3. Parsing HTML using rvest.

  4. More powerful parsing of HTML and XML using CSS selectors, regular expressions and XPath.

  5. Parsing data using APIs.

  6. Case studies: (1) Parsing data from semi-structured documents. (2) Scraping Twitter for sentiment analysis. (3) Gathering data from commercial sites.