I attended the EARL Conference in London on Sept 14th and 15th 2016.
This document contains all the notes I took over the two days with minimal spell-checking which I will fix over time.
Here is the short version…
Thanks to STHDA for the wordcloud tutorial.
Please also check out the very good review from the Data Laborer
I am a relatively new user of R, so I found the conference extremely interesting. The people I met were all from very diverse backgrounds but seemed to be getting something out of the content. Full credit to Mango Solutions for setting up and running the event very smoothly. Here are some of my key takeaways;
R use is growing
Not surprisingly, this was a common theme. Mike Smith from Microsoft shared an interesting presentation that touched on some of the reasons this is, and included some statistics including a recent survey showing R to be the 5th most popular programming language - not bad for a specialist language designed by statisticians. Mike said that R’s ‘ecosystem’ was a big reason for it’s success and a reason for Microsoft’s investment in being as compatible as possible with R - the integration of R into SQL2016 shows this.
R as glue
Of course there was much praise being heaped on R during the conference, but it was acknowledged many times, and in different ways, that R is best considered as a connector, or glue, to hold together a human focused approach to programming and data science. R as a tool in a tool-kit was the key takeaways for me, as opposed to a single tool to answer all problems.
A highlight for me was this video that Garret Grolemund showed on the second day, before going into some details on the history of the R language, and what he (and others) see as the right direction for the language. While it may seem like an obvious thing for data scientists (or any scientist), the usefulness of reproducible research turned up in a lot of the presentations throughout the conference from a variety of companies. It’s clear to me that there is appetite in corporate decision-making circles to be able to see into the ‘black box’, and tools such as RMarkdown Notebooks will continue to make this easier.
I found Garret’s explanation of ‘The Tidyverse’ useful and it is encouraging to see people trying to steer the development of packages and approach to data science in a constructive way. ‘The Tidyverse’ contains;
My three favourite presentations
Marcus Gaul’s talk about developing metrics was quite inspiring. The approach he shared involved looking at ‘systems of systems’ and networked metrics to be able to clearly display a complex environment to key analysts and decision makers. To set this up would require a lot of co-operation and buy-in from the people pulling together the base data, but the case for doing so was compelling.
Alice Daish presented a case study on transforming the British Museum to be data-driven. She is a very good presenter, and covered lots of ground in an informative and entertaining way. She also shared lots of specific tips and tricks which was a key feature of the best talks. She ended by pointing out there were 55,000 museums in the world, highlighting the opportunity that exists for data science in general in years to come.
The presentation from Amanda Lee from MEC was full of specific detail and clearly showed the use of R in solving a very specific problem (no analytics provided by Instagram). It was a good case study to take in, in particular because of the very robust approach she and MEC took in establishing success metrics and pointing out the value of the activity - something media agencies are generally good at, but this one still stood out.
(Bonus) I would like to mention Vincent Warmedam’s talk about R and Spark. He was one of the best presenters and embodied the collaborative spirit of the R community by being very generous with his code and showing very clear examples of how to improve performance right away. I will be checking out his blog in future for sure.
Matt Aldridge from Mango Solutions.
Conference in third year running.
R usage growing.
Joe Cheng, CTO RStudio.
RStudio is a small team based in Boston (35 people).
Founded 2009 to create world class R tools.
Four main goals;
Cool new things.
- Allows for code chunks to be executed easily and independently in a report (no need to knit to see output). - Output is displayed below chunk.
- Can be sent as html link with display/hide code option.
Flexdashboard - Markdown layout similar to dashboards (grids).
Shiny bookmarkeable state - can save a state of Shiny app and send as link (no need to recreate a state in app).
David Smith, R Community Lead, Microsoft.
R is great despite some of its shortcomings - why?
“The best thing about R is that it was written by statisticians. The worst thing about R is that it was written by statisticians.” Bow Cowgill
Is a quirky language with many idiosyncrasies, designed for stats which is now the coolest thing ever because of Data Science explosion.
There is an ‘ecosystem’ which surrounds R which is one reason for its popularity and future.
R is very popular which is interesting because it is a specialist language rather than a general purpose language.
76% of analytics professional use R, with 36% using R as their primary tool.
5th most popular programming language! According to IEEE. The most popular ‘specialist’ language.
The language itself, although quirky, is robust and well maintained by the R Core Group. 8000+ packages available via CRAN.
There is strong governance.
R is free and has an ever-increasing talent pool - and is popular in academia which is providing some of the industry leaders of the future.
The Consortium is like a cross between NATO and the Jedi Knights.
Lou Bajuk-Yorgan and Gabor Csardi talked through some of the projects that are sponsored by this group.
Key takeaway - consider the following when making submissions for funding insert here
Garret Grolemund, Chief Instructor RStudio
Successful systems are seamless - the glue determines the power of the system.
R is a type of glue.
History of R.
R is an evolution of S created in Bell Labs.
John W Tukey.
SCS Library - stats computing subroutines - lots of code to fetch algorithm.
People, algorithms, hardware… needs glue. Which is what S was.
Structure of R.
Use profvis package to analyse and optimise your code.
Compiled code does match human thought well.
R focuses on what rather than how and focuses on interactivity.
R resembles the structure and sequence of human thought.
UNIX - manage and access programs.
C - optimise computer performance (meet the computer on its own terms).
S - optimise human performance (R, S, Julia?).
R is effective at tasks that other languages are not.
DSL (domain specific languages) examples ggplot2, shiny, dplyr
Flexible (LISP) <———————-> Conformist (GO)
Different approaches to programming.
Python - write once, maintain forever.
S - write many times, maintain never.
R - write well, maintain forever, use easily.
3 ways to improve R.
1. Tidy data
2. Tidy tools
Focused, chainable and composable, each function does one thing well (like words).
Compose steps with pipes.
Make it easy to call functions from within functions.
Create a website to teach people how to use your packages.
Kenneth Cukier, Data Editor, The Economist
The idea driving development of AI is ‘more’.
More can be different, not just volume.
Humans have stored information for approx 70k years.
Big Data - applying classical statistics to areas where we have data, and previously we didn’t.
Spot subtle patterns from data in the wild.
AI - Divine what to do and continuously improve (mostly) from data alone.
Machine learning - computer translation, speech recognition etc.
Learn, not know
Example - DeepMind plays Atari video games
Not just automation or efficiency, also new knowledge.
Implications of rise of robots.
Algorithmic decision making.
New business models. Shift nature of some goods into services (recurring revenue, lease vs sale).
Learn, not know.
Roomba robot. Course correct (while learning) vs set of detailed instructions.
Jobs. Realignment. Difference between jobs and work.
Transparency. Example - Uber knows when customer battery life low, will accept surge pricing.
Causality. Why do things work? Need to explain in some instances (auditing etc.).
Privacy. A side-effect of transparency.
Measuring everything, new disruptions.
Short term risk of mis-use of tools.
Full agenda with all the names of speakers and talks is here.
Hayfa Mohdzaini UCEA
People think automation is everywhere.
But, there are barriers to automation.
Time. Skill. Tools.
UCEA case study - 17 staff serve 170 higher education institutions.
Pay negotiations, research and surveys, comms, pensions etc.
Annual report - senior staff remuneration survey. Bespoke benchmarking service. Data collected manually.
26000 salaries, 146 institutions.
1993 - Excel + VBA
2013-14 - SPSS + VBA (2 days to produce) 2015 - R and VBA
An easier way to generate report and make it easier to maintain.
Looked at SPSS and R.
R has no licence fee. Mango Solutions magic (Mango has a cat!)
Input: 26,000 rows, 15 variables.
Output: 700 word tables, 10,500 Excel rows.
Important to define requirements for report. Input/output variable descriptions. Combinations. Factor levels.
Need minimum effort to change input/output values.
Suppress data for small samples. (approach to confidentiality)
Special sampling rules. (how to not count people twice) Follow style formatting.
+ VBA to handle complex formatting (interesting)
Compare institution data versus survey.
Highlight above/below market/whole sample.
Used Tablea to visualize the comparison.
Automation = budget approved!
Clear requirements helped.
Pseudo data has limitations (confidentiality is a challenge)
Expect rounds of UAT
Tharsis Souza from UCL
Use of alternative data.
Huge volume of data. Applications within Financial industry to take advantage but creating cutting edge strategies.
Simple models do not capture the complexity/dimensions of additional information available.
How to combine traditional (market data) with unstructured ‘Big’ data, analyse and predict behaviour (with a financial impact).
Example - google search public interest in stock exchange related search correlates to financial crisis. But more interestingly, there were some peaks before the crisis.
Example - mentions in Financial Times vs transactions of companies mentioned.
Example - look at search for keyword ‘debt’. How could this influence investment strategy?
Example - when text predicts stock movements - headlines and twitter mentions to predict sales and stock moves.
Example - Data Crawl (Community, Forum posts, Money and transaction), Data Refinement, Prediction Model.
Financial Quant industry is investing in Text Mining.
1. Event Detection.
Past prices do not influence future prices, but events do.
Connect to news outlet, how to define an ‘event’ and classify it vs new replicated from the past, something that is not ‘new’.
Identify event and named entities of the event. Easier said than done.
What is an event?
Categorise event (example ‘patent infringement’).
Investigate significance of event to the entities involved.
Sentiment Analysis alongside Sales data to observe correlation or a ‘normal’ state.
2. Text Mining
Standard text mining is now being applied to financial cases.
Packages include tm, kernlab, caret.
Dictionary approach - need to use a specialised dictionary.
OR train a model via machine learning (need labelled data).
3. Complex Networks
Packages - quantmod, igraph, entropy.
Provide holistic overview pf market structure and inter-relationships.
Example - Network Contagion within Market Correlation Structure (psychsignal.com)
Example - (yewno.com) Complex Networks + Computational Linguistics + Big Data - layers of networks (Social, Political, Financial)
2 years ago all using SAS.
No all using R.
Directive to use open source came from CEO, because it is ‘the future’ for data science and analytics.
Why R versus other open source?
Had some expertise (tenured analysts).
Working with Revolution R (now Microsoft R) who provided advice.
Keen to use extra features available through R such as custom visualizations, packages.
Example - text mining used to analyse notes taken by call centre staff to determine which queries could/should be handled better on web.
Three main issues SAS to R.
Took 2 years to transform.
First proof of concept - take algorithm from SAS and deliver via R.
Then allowed analysts to ‘play’ in R.
Then created bespoke internal training for R. In-house, specific, relevant to augment the standard public training.
Decision made to move all analytics from SAS to R.
Cross function team set up to run the transformation.
Failed on virtual machines (too slow) and decision to move to physical.
Prototype platform launched.
Convert code from SAS to R, update tools, monitoring.
Internal R ‘Datathon’ day.
Analytics Platform - 4 physical servers, R, R Studio Pro, Shiny, SQL-Lite.
Users move through a network load balancer manages access to servers.
Shared file system.
Popular packages downloaded and pre-installed.
Benefits - enthusiasm, learn new skills, access to latest cutting edge tools, go from good to great, collaborate with data science leaders.
Data visualization is a gateway to getting people to understand the value.
Giving back to community (spirit of R).
Nottingham R meetups, women in tech, code club, speaking events.
Developing bespoke packages.
Alice Daish British Museum Data Scientist.
2M year history.
No list of data sources. No data access. No databases. No data warehouse.
‘What does Big Data mean to the museum?’
Explaining Data - Excel to Odin (image database of all items).
250 data sources.
Bubbles envy - d3 visualization, there is an r package to help.
d3.js bubble chart htmlwidget for R then into Illustrator for a ‘data audit’.
Business problems = Data opportunities.
Who are our visitors? What do they do? Are there opportunities to generate revenue?
using R in a non-Data organization (different to SAS - R example).
Silos and wrangling. ‘Silo’ means looking at only one source of data vs various (Data Science holistic approach).
CSV exports from external platforms. No SQL.
Email 100’s columns trimmed, online shop nested by timeslots, Wi-Fi split first/surname.
Top 500 website search made into a wordcloud. Dabble in Shiny - needs a place to store apps.
Use a ‘did you know’ approach for internal comms.
Visitor movement. 62 galleries 6M+ visitor.
Connected R with CISCO Presence API.
Excel heat map.
Can we predict ticket sales.
Mixed effect modelling. lmer()
Future - continued data wrangling, improve data pipeline, IoT
R potential - 55,000 museums need transforming!
Daniel from Brett Landscaping.
Improve forecasting and inventory management.
Existing planning tools needed improving.
Reviewed commercial planning software.
Gaps between existing models and commercial software( SPSS, SAS, Forecast Pro).
Forecasting the hardest obstacles - time series forecasting.
R packages - RSQLite, reshape2, forecast
SQL data Warehouse —> R Forecast —> excel model for senior audience to play
Traffic light and exception reporting. Safety Stock and Re-order points.
Once forecasting model is robust, then what?
Distribution Boundary Optimisation - optimize for profit.
Package - lpsolve (previously using Excel Solve)
Simulate shift patterns.
How to use excess capacity?
Opportunity cost of a lost sale.
Maths degree, used R in stats courses.
Supports decision making especially on future force structures.
Watched dplyr tutorial at useR and a light went on.
Who we are.
100% contract funded.
Accountable to MOD and Parliament.
Manage MOD research.
OR is ‘Operational Research’. Then turned into OA (Operational Analysis).
Maximise impact of science and technology for defence and security.
Application of mathematical and scientific methods to support executive decision makers.
‘How many ships, tanks, aircraft do we need?’.
Defence Policy —> Scenarios —> Wargaming/Simulation —> Concurrency Modelling —-> Procurement Decisions
‘Now, what kind of tank?’
Concurrency analysis. Input —> ConcurrencyAanalysisTool —> Excel Output —> Bespoke Excel Analysis
Because R can do things Excel can’t…
Example ggplot2 - violin plots
Data processing - using principles of tidy data for richer insight
Reproducibility - when something changes (always does) changes are error prone and undocumented. With R, just update script.
Running R alongside previous process - adding Charts, Markdown, Shiny - don’t throw away something that is working.
No admin/access rights to internet, lengthy approvals.
Virtual machines isolated from main network was a game changer.
DSTL moving towards a bigger data science program with bigger cohort of skilled users.
How to change? Start small, show a cool chart. Incremental steps, build evangelists.
Code control, versioning (Git) are new problems that come alongside the R model.
NCI Agency (NATO) 25 people in team.
5 years in Afghanistan.
Lots of interventions/events/humanitarian/financial/political.
Assessment - what is it? Continual monitoring and evaluation (feedback cycle).
NATO calls this ‘Operations Assessment’.
Evidence to support decision making.
A warning panel, where to focus attention.
The world as a system of systems.
Find vulnerabilities, dependencies, weaknesses, relationships. Layers of networks.
Super complex examples (Afghanistan).
Systems theory. Empiricism.
Metrics, metrics, metrics. Perform continuous measurements.
MOP (performance) did something happen vs MOE (effectiveness) to desired effect.
Evolution from current to desired state.
Metrics are interconnected naturally or assessed. PMESII.
Traditional approaches to developing metrics underestimate the complexity, often neglect lessons learned, fail to consult SME, use too few previous examples. Often applied in mechanistic checklist approach. List methodology can reinforce anchoring bias.
How can R/Shiny help overcome these challenges.
Phase 1 - methodology, data collection, initial metric repo, exploration and demonstrator development.
Packages - htmlwidgets, shinyjs, shinydashboards, DT, networkD3, visNetwork
Multiple SME add metrics to separate sources which are accessed via a Shiny app which performs metric analysis.
Enhance analytics, extend scope.
Key word mining, auto categorization, metrics clusters.
Tested approaches to metrics development and measurement.
R and Shiny effective.
Secrecy/classification a big issue - using classified metrics for example.
Metric Visualizations (dots are MetricsOfPerformance or MetricsOfEffectiveness).
Set up base data in the right way will make things easier!
Nick Masca from BGL
Price comparison websites.
Messages from various vendors…
Unleash the real Value of Data!
Are you ready to analyse All your Big Data?
Harness the power of Big Data.
Analyse it all.
How important is Volume?
Highlight advantages of a considered approach to data analysis vs stalling because data is not ‘big’.
Example - Fraud Detection
Fraud scorecard to identify fraud at point of quote or asap post sale.
Past versions build with GLM.
More recently, using a machine learning approach enabled more frequent updates, broader range of predictors.
But automation can lead to questions…
How much data to use?
How long a time frame?
How often to be updated?
Sampling schemes. Up or down sampling?
Data usage - as many predictors as possible or be selective?
Samples across different time periods.
Fitted penalised logistic regression models.
Considered 700 parameters.
Evaluated model in terms of AUC, True Positive, False Positive rates.
Packages - RODBC, glmnet, doParallel
Increased number of rows after 50k had neglible effect.
Using ALL data isn’t always the best approach. Identifying long terms trends vs recent changes in behaviour.
Up-sampling (preferring) yielded better results. Analysing ALL data (no sampling) gave worse performance.
Example 2 - Project Tangerine
Machine Learning to predict insurance prices.
Fitted penalised GLM using quote info.
Packages - data encoded to svmLight format using RSofia, h2o
Sqrt of mean square error being used to evaluate.
More data was worse result.
Example 3 - NPS (Net Promoter Score) Text Analytics
Classify (automatically) responses into complaints categories. Previously classification done manually (9-13 variables).
Training data was category and count.
Multiple docs —> exclude punctuation, tokenise, exclude stop words, lower case, stemming
Create document term frequency
Packages - tm, SnowballC, glmnet
Using more data doesn’t necessarily give advantage. Analysing ALL is similarly not always the best approach. Small can be ok.
What is Spark?
Use a bomb. Use a bigger bomb. (or many smaller bombs?).
Connect machines, store data on multiple machines.
Bring code to data - not the other way around.
Spark facilitates parallel computing to wrangle/analyse large datasets.
Kaggle WoW set.
Installation is easy (devtools).
300M searches per day.
1B live listings.
Lots of data!
How to optimize ad spend in offline environments - do some experiments.
Observational studies. Account for factors and create model.
A/B test, estimate causal effect.
Experimental design can be tricky.
No clear control regions.
One approach - synthetic controls.
Create an artificial control unit. (package - synth)
Synthetic controls + ML + Randomized control
Many data partners (sources of data to inform activity).
Simple code for data transformations.
Classic visual of data platform.
Hybrid era - API, data hourly.
Unified era - single source, multiple tags.
Ben Downe, BCA
redhat6, visual studio, MRAN (Microsoft Azure Machine Learning Studio).
How to take models and make production-ready?
One button API publish. Simple create/destruct.
Develop and train models on R Studio Server, then test and deploy on Azure ML.
Speed, easy, IP remains with BCA, data is local.
Optimal budget allocation.
Marketing Mix is a top down approach.
Shiny app to solve part 2 (optimization).
Shiny app can embed links on how to use etc (any HTML).
Show actual versus optimal size by side.
James Lawrence, RSA
EDA (Exploratory Data Analysis)
Purpose (for a modeller)
Identify predictive variables
Identify data issues
Medium sized data - 100’s, 1000’s variable
A response variable
Data combined from different sources
plotly, rCharts, ggvis, htmlwidgets, shiny
Philosophy - simple in simple cases, with customisable options for power users.
Identify response variable
Each column has data
Show correlation between variables
Most correlated numeric and factor
Output split for multiple columns
Remove columns with no information?
Campaign reporting, what went well, or didn’t.
Product demand can vary according to external factors.
Use Demand History + Online search and weather to predict volume. (using more than just demand history)
Collected, 3 years’ sales, weather and search.
Harmonised into one cleaned and simplified data set - in R.
Panel linear model built. (nuances of region)
Standard linear models. Achieved R squared 0.86
sandwich - covariance estimation
lmtest - coefficiant testing
Used mean percentage error and mean absolute percentage error
When prediction runs over thresh-hold, activity kicks in.
Visualization/sandbox - traffic light approach to when something needs to happen.
Run on a weekly basis.
A shared environment.
Based around notebook.
GitHub is behind the scenes. rgithub package
Sits next to data lake, accessible by web. (browser based notebook)
Model boiler breakdowns.
Non parametric, learns complex behaviour, does not assume underlying model structure.
Aggregate multiple models for one prediction.
Inputs(from neurons), Process, Outputs(to neurons).
ANN (Artificial Neural Network).
Pick a non-linear activation function. package - rsnns
Classic Neural Network has multiple inputs through multiple layers of fully connected hidden neurons. Has at least one output.
Always forward feeding.
Input variables I
Hidden layers H
Specify train and test %
Always use set.seed
Is a ‘black box’
Long training time.
For a single layer MLP - use function Garson which will return the relative importance of each variable.
Recurrent Neural Network for use when sequence of data is important. RNN
State at time
t depends on
t - 1 (Elman/Jordan networks)
Back Propagation Through Time BPTT
Elman - feedback loop in hidden layer H
Jordan - feedback loop in output layer O
Book - The Grammar of Graphics
Book - Strategy: A History
Tool - Tableau, for dashboarding has full integration with R
Tool - Mathematica, no integration with R, but was suggested as an interesting a powerful visualization tool Not high priority, but looks fun
Tool - GitKraken…because Krakens!
Technique - network graphs - exploring system of systems
Technique - metric network Analysis
Technique - time series forecasting
Technique - machine learning, decision tree, random forest > All models are wrong. Some are useful.
Technique - Markov Chains
Technique - parallel computing (working with large data)
Technique - geo-spatial (map for timetracking)
Technique - experiment design - want to read more about this
Technique - VBA, I saw it being used in a couple of places alongside R - good for Excel heavy environments to have an idea
Technique - start small, create something reproducible and easy to follow before applying to larger data
plotly (allows two concurrent y axes)
profvis - a great way to see the impact of your code