roughsleeping.Rmd

---
title: "Sleeping  Rough: R analysis"
author: "Paul Bradshaw"
date: "25 January 2017"
output: html_document
---

The data for this code comes from [DCLG statistics on rough sleeping](https://www.gov.uk/government/statistics/rough-sleeping-in-england-autumn-2016). 

We are going to be dealing with an Excel document, so first we need the packages to be able to read that:

```{r}
install.packages("readxl")
library(readxl)
```

Like many government datasets, this is an ugly spreadsheet with unnecessary empty rows which we need to get rid of when we export it:

```{r}
#At some point it would be nice to adapt this code to grab it directly from the webpage, but for now you need to download it first
roughsleeping16 <- read_excel('Rough_Sleeping_Autumn_2016_Final_Tables.xls', sheet=1, skip=3)
```

Even with this, the data still has problems, including summary rows at the top that are not about numbers but percentages, and rows containing notes at the end.

It's easy to overlook those notes if you're using script to pull summary statistics, so let's make sure we grab those.

To do that, let's check how many rows have been grabbed. The function `nrow` tells us the number of rows in a dataset:

```{r}
nrow(roughsleeping16)
```

We know there are 326 authorities in this dataset, and 11 rows before those begin so let's grab any rows after row 338 which should contain notes.

```{r}
notes <- roughsleeping16[c(338:nrow(roughsleeping16)),]
```

What the notes reveal is a really problematic aspect of this data: the methodology can change from authority to authority, from year to year, and both. So all data must be treated with caution. To quote The Guardian report on this data: 

> "The robustness of official government rough sleeper figures was [questioned just over a year ago by the UK Statistics Authority (UKSA)](https://www.statisticsauthority.gov.uk/publication/statistics-on-homelessness-and-rough-sleeping-in-england-department-for-communities-and-local-government/). It ruled that they lacked 'trustworthiness', were vulnerable to political manipulation, and did not meet the quality standard to be national statistics."

The notes also use *colour* to encode data, which is a big no-no. On the spreadsheet the note "*Denotes Local Authority has conducted a street count.*" is in a grey-highlighted cell. But you have to *see* the cell to understand that this means 'cells in grey denotes a street count'. In other words it is not machine-readable. At least "*Figures in red denotes authorities who reported an estimate of rough sleepers without consultation with any agencies.*" mentions the colour red - although again there is no way of distinguishing figures which use red text from those in grey-highlighted text, and those in neither. This is a particular problem with government data releases. [There are ways of sorting by cell colour in Excel](https://support.office.com/en-us/article/Sort-data-in-a-range-or-table-62d0b95d-2a90-4610-a6ae-2e545c4a4654?ui=en-US&rs=en-US&ad=US#__toc246836973) but doing so in R... well I don't know if you can.

## Sorting to grab the 10 authorities with the highest rough sleeping rates

We want to sort the data by the final column. This has quite a long name - *"2016 Rough Sleeping Rate (per 1,000 households)"* - and that might change. So here's an easier way to find out what it's called:

```{r}
colnames(roughsleeping16)[11]
```

Or, if you didn't know there would be 11 columns, you can count them using `ncol` and use that as an index instead:

```{r}
colnames(roughsleeping16)[ncol(roughsleeping16)]
```

Now rename that column, and then sort by it like so:

```{r}
##Change the name of the last column to 'rate'
colnames(roughsleeping16)[ncol(roughsleeping16)] <- 'rate'
#use the order function to order by that column, descending (the minus)
sortedbyrate <- roughsleeping16[order(-roughsleeping16$rate),]
```

To focus on the top 10 we can put those in a separate object using `head` and specifying the first 10 rows:

```{r}
top10 <- head(sortedbyrate, 10)
```

## Adding data on changes over time

Our data includes figures for each year from 2010 to 2016 - but it doesn't include a column telling us how much it changed over that period. This code creates that column based on that calculation:

```{r}
roughsleeping16$change10to16 <- roughsleeping16$`2016.000000` - roughsleeping16$`2010.000000`
```

To calculate a percentage change you divide the change by the original figure. In other words, if something has gone from 100 to 110 the change is 10, and the original figure is 100. 10/100 = 0.10 or a 10% rise.

```{r}
roughsleeping16$percchange10to16 <- roughsleeping16$change10to16/roughsleeping16$`2010.000000`
```

Note that if the figure in 2010 or 2016 was 0 then the result will be `Inf`, because you can't divide anything by 0. Likewise we should be careful about any years where there is no data, and making sure that isn't interpreted as 0 (a very different thing).

And now that some new data has been created, export it as a CSV file:

```{r}
write.csv(roughsleeping16,'roughsleeping2010_16.csv')
```

## Creating charts to show the results

A basic histogram will show the *distribution* of results. The function `hist` creates one based on a specified column:

```{r}
hist(roughsleeping16$percchange10to16)
```

The results show that most results are between 0 and 5 (500%), or below 0 (less than a 100% rise)

To create a **bar chart** we need to first put the data in a table - and then use `barplot` to show that table as a bar chart:

```{r}
changeovertime <- table(roughsleeping16$percchange10to16)
barplot(changeovertime)
```