Creating Data from Text – Regular Expressions in OpenRefine
Although data can take many forms, when generating visualisations, running statistical analyses, or simply querying the data so we can have a conversation with it, life is often made much easier by representing the data in a simple tabular form. A typical format would have one row per item and particular columns containing information or values about one specific attribute of the data item. Where column values are text based, rather than numerical items or dates, it can also help if text strings are ‘normalised’, coming from a fixed, controlled vocabulary (such as items selected from a drop down list) or fixed pattern (for example, a UK postcode in its ‘standard’ form with a space separating the two parts of the postcode).
Tables are also quick to spot as data, of course, even if they appear in a web page or PDF document, where we may have to do a little work to get the data as displayed into a table we can actually work with in a spreadsheet or analysis package.
More often than not, however, we come across situations where a data set is effectively encoded into a more rambling piece of text. One of the testbeds I used to use a lot for practising my data skills was Formula One motor sport, and though I’ve largely had a year away from that during 2013, it’s something I hope to return to in 2014. So here’s an example from F1 of recreational data activity that provided a bit of entertainment for me earlier this week. It comes from the VivaF1 blog in the form of a collation of sentences, by Grand Prix, about the penalties issued over the course of each race weekend. (The original data is published via PDF based press releases on the FIA website.)
The VivaF1 site also publishes a visualisation summarising penalty outcomes incurred by each driver:
The recreational data puzzle I set myself was this: how can we get the data contained in the descriptive sentences about the penalties into a data table that could be used to ask questions about different rule infractions, and the penalty outcomes applied, and allow for the ready generation of visualisations around that data?
The tool I opted to use was OpenRefine; and the predominant technique for getting the data out of the sentences and in to data columns? Regular expressions. (If regular expressions are new to you, think: search and replace on steroids. There’s a great tutorial that introduces the basics here: Everday text patterns.)
What follows is a worked example that shows how to get the “data” from the VivaF1 site into a form that looks more like this:
Not every row is a tidy as it could be, but there is often a trade off in tidying data between trying to automate every step, and automating steps that clean the majority of the data, leaving some rows to tidy by hand…
So where to start? The first step is getting the “data” into OpenRefine. To do this we can just select the text on the VivaF1 penatlies-by-race page, copy it an paste it in the Clipboard import area of a new project in OpenRefine:
We can then import the data as line items, ignoring blank lines:
The first step I’m going to take tidying up the data is to generate a column that contains the race name:
The expression if(value.contains('Prix'),value,'') finds the rows that have the title of the race (they all include “Grand Prix” in their name) and creates a new column containing matches. (The expression reads as follows: if the original cell value contains ‘Prix’ , copy the cell value into the corresponding cell in the new column, else copy across an empty string/nothing that is, ”) We can then Fill Down on the race column to associate each row with particular race.
We can also create another new column containing the description of each penalty notice with a quick tweak of the original expression: if(value.contains('Prix'),'',value). (If we match “Prix”, copy an empty string, else copy the penalty notice.)
One of the things that we notice is there are some notices that “Overflow” on to multiple lines:
We can filter using a regular expression that finds Penalty Notice lines that start (^) with a character that does not match a – ([^-]):
Looking at the row number, we see serval of the rows are xsecutive – we can edit thesse cells to move all the text into a single cell:
Cut and paste as required…
Looking down the row number column (left hand most column) we see that rows 19, 57 and 174 are now the overflow lines. Remove the filter an in the whole listing, scroll to the appropriate part of the data table and cut the data out of the overflow cell and paste it into the line above.
By chance, I also notice that using “Prix” to grab just race names was overly optimistic!
Here’s how we could have checked – used the facet as text option on the race column…
Phew – that was the only rogue! In line 56, cut the rogue text from the Race column and paste it into the penalty notice column. Then also paste in the additional content from the overflow lines.
Remove any filters and fill down again on the Race column to fix the earlier error…
The PEnalty Noptice column should now contain blank lines corresponding to rows that originally described the Grand Prix and overflow rows – facet the Penalty Notice column by text and highlight the blank rows so we can then delete them…
So where are we now? We have a data file with one row per penalty and columns corresponding to the Grand Prix and the penalty notice. We can now start work on pulling data out of the penalty notice sentences.
If you inspect the sentences, you will see they start with a dash, then have the driver name and the team in brackets. Let’s use a regular expression to grab that data:
Here’s the actual regular expression: - ([^\(]*)\s\(([^\)]*)\).* It reads as follows: match a – followed by a space, then grab any characters that don’t contain an open bracket ([^\(]*) and that precede a space \s followed by an open bracket \): all together ([^\(]*)\s\( That gives the driver name into the first matched pattern. Then grab the team – this is whatever appears before the first close bracket: ([^\)]*)\) Finally, match out all characters to the end of the string .*
The two matches are then joined using ::
We can then split these values to give driver and team columns:
Learning from out previous error, we can use the text facet tool on the drive and team columns just to check the values are in order – it seems like there is one oops in the driver column, so we should probably edit that cell and remove the contents.
We can also check the blank lines to see what’s happening there – in this case no driver is mentioned but a team is, but that hasn’t been grabbed into the team column, so we can edit it here:
We can also check the text facet view of the team column to make sure there are no gotchas, and pick up/correct any that did slip through.
So now we have a driver column and a team column too (it’s probably worth changing the column names to match…)
Let’s look at the data again – what else can we pull out? How about the value of any fine? We notice that fine amounts seem to appear at the end of the sentence and be preceded by the word fined, so we gan grab data on that basis, then replace the euro symbol, strip out any commas, and cast the result to a number type: value.match(/.* fined (.*)/).replace(/[€,]/,'').toNumber()
We can check for other fines by filtering the the Penalty Notice column on the word fine (or the Euro symbol), applying a number facet to the Fine column and looking for blank rows in that column.
Add in fine information by hand as required:
So now we have a column that has the value of fines – which means if we export this data we could do plots that show fines per race, or fines per driver, or fines per team, or calculate the average size of fines, or the total number of fines, for example.
What other data columns might we pull out? How about the session? Let’s look for phrases that identify free practice sessions or qualifying:
Here’s the regular expression: value.match(/.*((FP[\d]+)|(Q[\d]+)|(qualifying)).*/i).toUppercase() Note how we use the pipe symbol | to say ‘look for one pattern OR another’. We can cast everything to uppercase just to help normalise the values that appear. And once again, we can use the Text Facet to check that things worked as we expected:
So that’s a column with the session the infringement occurred in (I think! We’d need to read all the descriptions to make absolutely sure!)
What else? There’s another set of numbers appear in some of the notices – speeds. Let’s grab those into a new column – look for a space, followed by numbers or decimal points, and then a sapce and km/h, grabbing the numbers of interest and casting them to a number type:
So now we have a speed column. Which means we could start to look at speed vs fine scatterplots, perhaps, to see if there is a relationship. (Note, different pit lanes may have different speed limits.)
What else? It may be worth trying to identify the outcome of each infringement investigation?
value.match(/.*((fine)|(no further action)|([^\d]\d+.place grid.*)|(reprimand)|(drive.thr.*)|(drop of.*)|\s([^\s]+.second stop and go.*)|(start .*from .*)).*/i).toUppercase()
Here’s where we’re at now:
If we do a text facet on the outcome column, we see there are several opportunities for clustering the data:
We can try other cluster types too:
If we look at the metaphone (soundalike) clusters:
we notice a couple of other things – an opportunity to normalise 5 PLACE GRID DROP as DROP OF 5 GRID PLACES for example:
value.replace('-',' ').replace(/(\d+) PLACE GRID DROP/,'DROP OF $1 GRID POSITIONS')
Or we might further standardise the outcome of that by fixing on GRID POSITIONS rather than GRID PLACES:
value.replace('-',' ').replace(/(\d+) PLACE GRID DROP/,'DROP OF $1 GRID POSITIONS').replace('GRID PLACES','GRID POSITIONS')
And we might further normalise on numbers rather than number words:
value.replace('-',' ').replace(/(\d+) PLACE GRID DROP/,'DROP OF $1 GRID POSITIONS').replace('GRID PLACES','GRID POSITIONS').replace('TWO','2').replace('THREE','3').replace('FIVE','5').replace('TEN','10')
it might make sense to tidy out the (IN THIS CASE… statements:
value.replace(/ \(IN THIS.*/,'')
Depending on the questions we want to ask, it may be worth splitting out whether or not penalties like GRID DROPS are are this event of the next event, as well as generic penalty types (Drive through, stop and go, grid drop, etc)
Finally, let’s consider what sort of infringement has occurred:
If we create a new column from the Infraction column, we can then cluster items into the core infraction type:
After a bit of tidying, we can start to narrow down on a key set of facet levels:
Viewing down the list further there may be additional core infringements we might be able to pull out.
So here’s where we are now:
And here’s where we came from:
Having got the data into a data from, we can now start to ask questions of it (whilst possible using those conversations to retrun to the data ans tidy it more as we work with it). But that will have to be the subject of another post…