Introduction to the Dataset

The Data

The UK Department for Transport keeps detailed records of traffic incidents. Fortunately for us, it has made this data available to the public as three CSV files containing information about the accidents, casualties, and vehicles involved.

System-dependent Alert

The path of the dataset shown below may not be the same on your WMR system. It is correct for this WMR server:

selkie.macalester.edu/wmr

These files are located on WMR in the /shared/traffic folder and are named Accidents7409.csv, Casualty7409.csv, and Vehicles7409.csv, respectively.

Working with the Data

Each line in the files contains several fields separated by commas. To access these values, call key.split(',') (or the equivalent in whatever language you’re using) to get an array of values. If you want, you can turn these values into an object, but it’s faster to simply refer to them by index:

index | Accidents7409.csv                       | Casualty7409.csv                   | Vehicles7409.csv
0     | Accident Index                          | Accident Index                     | Accident Index
1     | Location Easting OSGR                   | Vehicle Reference                  | Vehicle Reference
2     | Location Northing OSGR                  | Casualty Reference                 | Vehicle Type
3     | Longitude                               | Casualty Class                     | Towing/Articulation
4     | Latitude                                | Sex of Casualty                    | Vehicle Maneuver
5     | Police Force                            | Age Band of Casualty               | Vehicle Location Restricted Lane
6     | Accident Severity                       | Casualty Severity                  | Junction Location
7     | Number of Vehicles                      | Pedestrian Location                | Skidding/Overturning
8     | Number of Casualties                    | Pedestrian Movement                | Hit Object in Driveway
9     | Date                                    | Car Passenger                      | Vehicle Leaving Driveway
10    | Day of Week                             | Bus/Coach Passenger                | Hit Object off Driveway
11    | Time                                    | Pedestrian Road Maintenance Worker | 1st Point of Impact
12    | Local Authority (District)              | Casualty Type                      | Was Vehicle Left Hand Drive
13    | Local Authority (Highway)               | Casualty Home Area Type            | Journey Purpose of Driver
14    | 1st Road Class                          |                                    | Sex of Driver
15    | 1st Road Number                         |                                    | Age Band of Driver
16    | Road Type                               |                                    | Engine Capacity
17    | Speed Limit                             |                                    | Propulsion Code
18    | Junction Detail                         |                                    | Age of Vehicle
19    | Junction Control                        |                                    | Driver IMD Decile
20    | 2nd Road Class                          |                                    | Driver Home Area Type
21    | 2nd Road Number                         |                                    |
22    | Pedestrian Crossing Human Control       |                                    |
23    | Pedestrian Crossing Physical Facilities |                                    |
24    | Light Conditions                        |                                    |
25    | Weather Conditions                      |                                    |
26    | Road Surface Conditions                 |                                    |
27    | Special Conditions                      |                                    |
28    | Carriage Hazards                        |                                    |
29    | Urban or Rural Area                     |                                    |
30    | Did Police Officer Attend Scene         |                                    |
31    | LSOA of Accident Location               |                                    |
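
As a rough illustration of index-based access, the sketch below gives names to a few of the Accidents column positions from the table above and pulls them out of one line. The constant names, the parse_accident helper, and the sample record are all made up for this example; the sample line holds placeholder values, not real data.

```python
# Column positions taken from the Accidents column of the table above.
NUM_CASUALTIES = 8
DAY_OF_WEEK = 10
URBAN_OR_RURAL = 29

def parse_accident(line):
    """Split one accident record and pick out a few fields by index."""
    fields = line.strip().split(',')
    return {
        'casualties': int(fields[NUM_CASUALTIES]),
        'day_of_week': fields[DAY_OF_WEEK],
        'urban_or_rural': fields[URBAN_OR_RURAL],
    }

# Made-up record with 32 placeholder fields, for illustration only.
sample_fields = ['x'] * 32
sample_fields[NUM_CASUALTIES] = '2'
sample_fields[DAY_OF_WEEK] = '5'
sample_fields[URBAN_OR_RURAL] = '1'
sample_line = ','.join(sample_fields)

record = parse_accident(sample_line)
print(record)
```

Naming the indices costs nothing at run time and makes it harder to mix up, say, column 8 of the accidents file with column 8 of the casualty file.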

Most of the values are encoded using special codes, which can be found in the pages of this spreadsheet.

Example Job

Let’s use what we’ve learned to answer a quick question: between 1974 and 2004, were there more casualties per incident in rural or urban accidents?

Our mapper will need to emit a key that represents whether the accident was rural or urban and the number of casualties as the value.

Our reducer will need to sum the casualties for each type of accident and divide that sum by the number of accidents of that type.

Given that the code indicating whether a crash was urban or rural is stored at index 29 of the accidents CSV, and the number of casualties at index 8, our code looks like this:

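def mapper(key, value):
  # Each input line arrives as the key; split it into its CSV fields.
  data = key.split(',')
  casualties = data[8]      # Number of Casualties
  urbanOrRural = data[29]   # Urban or Rural Area code
  Wmr.emit(urbanOrRural, casualties)

def reducer(key, values):
  # Average the casualty counts emitted for one area code.
  count = 0
  total = 0
  for value in values:
    total += int(value)
    count += 1
  # float() keeps the division fractional on Python 2 as well.
  Wmr.emit(key, float(total) / count)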
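
Before submitting the job, it can help to try the mapper and reducer on a handful of lines locally. The sketch below is one way to do that, not part of WMR itself: the Wmr class here is a hypothetical stand-in that only collects emitted pairs, and run_job and fake_line are helper names invented for this example. The input lines are synthetic.

```python
from collections import defaultdict

class Wmr:
    """Hypothetical stand-in for the WMR library; it only collects pairs."""
    emitted = []

    @classmethod
    def emit(cls, key, value):
        cls.emitted.append((key, value))

def mapper(key, value):
    data = key.split(',')
    Wmr.emit(data[29], data[8])

def reducer(key, values):
    count = 0
    total = 0
    for value in values:
        total += int(value)
        count += 1
    Wmr.emit(key, float(total) / count)

def run_job(lines):
    """Run the mapper, group values by key, then run the reducer."""
    Wmr.emitted = []
    for line in lines:
        mapper(line, '')
    groups = defaultdict(list)
    for k, v in Wmr.emitted:
        groups[k].append(v)
    Wmr.emitted = []
    for k in sorted(groups):
        reducer(k, groups[k])
    return dict(Wmr.emitted)

def fake_line(casualties, area_code):
    """Build a synthetic 32-field record; only indices 8 and 29 matter."""
    fields = ['0'] * 32
    fields[8] = str(casualties)
    fields[29] = str(area_code)
    return ','.join(fields)

results = run_job([fake_line(1, 1), fake_line(3, 1), fake_line(2, 2)])
print(results)
```

Grouping the mapper output by key before calling the reducer mimics the shuffle step that the real system performs between the two phases.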

Note

Does this reducer look familiar?

Run this job on WMR using the cluster path /shared/traffic/Accidents7904.csv. You should get the following output:

1 1.2805146224316546
2 1.5105844913989401
3 1.4071045576407506
-1 1.3062582787269292

A quick glance at the spreadsheet reveals that 1 stands for urban, 2 for rural, and 3 for unallocated; -1 means that neither was reported. On average, rural accidents appear to involve more casualties than urban ones.
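
To make the raw output easier to read, the codes can be swapped for their labels after the job finishes. The following is a minimal sketch that assumes the output has been copied into a string; AREA_LABELS and label_results are names chosen for this example, and the label wording follows the description above rather than the spreadsheet's exact text.

```python
# Code-to-label mapping as described in the paragraph above.
AREA_LABELS = {
    '1': 'Urban',
    '2': 'Rural',
    '3': 'Unallocated',
    '-1': 'Not reported',
}

# Job output copied from the example run above.
job_output = """1 1.2805146224316546
2 1.5105844913989401
3 1.4071045576407506
-1 1.3062582787269292"""

def label_results(text):
    """Turn 'code average' lines into (label, average) pairs."""
    rows = []
    for line in text.splitlines():
        code, average = line.split()
        label = AREA_LABELS.get(code, 'Unknown code ' + code)
        rows.append((label, float(average)))
    return rows

labelled = label_results(job_output)
for label, average in labelled:
    print('{:<13}{:.3f}'.format(label, average))
```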