Are the mathematical sciences fundamental in data science? Yes! Well, we think so anyway, and a newly formed and quickly growing data science community agrees. From mathematical thinking to complex analytics to method validation, maths in all its stages can add value and create impact.

Data science is the field of study concerned with the collection, preparation, analysis, visualisation and management of data. It builds upon and incorporates the expertise of many different disciplines to successfully extract meaning from data; the mathematical sciences are one such discipline. Data science incorporates many areas of study that are primarily mathematical in nature, such as statistics, pure mathematics, mechanics, machine learning and computational science.

Data science and data analytical problems are becoming more prominent as we experience an increase in data volumes. There are more data about the way we live than ever before and this will continue to increase. There is an increase in the deployment of sensors to track information across a vast range of areas from transport to banking, retailing to farming, science to energy. Social media alone is providing remarkable insights into customer behaviour and emotional preferences.

Industry continually faces the challenge of dealing with large, complex and sometimes fast-moving data sets that are difficult to process and learn from. There is also demand from various sectors across the UK to know more about different types of data analysis, the methods involved and the techniques applied. This demand is driven by the value that using data more intelligently might add. For example, supermarkets analyse customer shopping habits using data stored on loyalty cards to personalise offers and target customers. Knowing how to tailor their services and which products to stock is important for keeping existing customers and attracting new ones.

However, it is all too easy to misunderstand the data and its structure, or to apply inappropriate analytical techniques that lead to flawed conclusions. This has already happened in some high-profile cases, such as “Google Flu Trends”, in which Google tracked outbreaks of flu by finding a correlation between people’s online flu searches and whether they had flu symptoms. Initially, their methods revolutionised the ability to track the spread of influenza across the US, as they could provide information much faster than the Centers for Disease Control and Prevention (CDC). Unfortunately, Google’s predictions came to overstate flu levels, and it was soon realised that their estimates did not reflect the actual spread. Google had no way of knowing what actually linked people’s searches about flu to its spread, and treating the correlation as a reliable statistical inference led to incorrect conclusions. We can all learn from these types of issues. Google have done so: they release improved models yearly to track flu and will continue to be prominent in paving the way for big data analysis.

If applied carefully, data analysis can help make new scientific discoveries, develop market-changing products, increase transparency, improve decision making and enhance services. Coupling significant volumes of data with good data analysis gives industry, academia and government the opportunity to enhance the competitiveness of the UK and grow the economy. The UK is particularly well placed to do this because it has a very strong mathematics and statistics community. This community has much to offer and can add value to business through:

  • guiding the application of appropriate mathematical and statistical techniques to extract maximum value from data;
  • providing proofs of concept in new technical areas to demonstrate the methods’ value to industry;
  • formalising and quantifying hypotheses about how the data arose, as well as validating, comparing and refining those hypotheses;
  • capitalising on the advances made in computation which allow greater flexibility and choice of mathematical and statistical methods;
  • creating a data science community, rich in mathematical and statistical knowledge, that can boost the growth of UK companies.

Combined with the enthusiasm and motivation of the mathematical sciences community, this makes now a great opportunity to capitalise on this skill set. A more detailed report based on discussions at a cross-sector workshop on the application of the mathematical sciences to the underpinning foundations of data science can be found here.

Of course, there is a whole range of mathematical and statistical techniques that, with the right expertise, you can tap into to extract value from data. Depending on the problem and the data you have, a variety of techniques will be applicable. Methods include:

  • Probability Theory
  • Hidden Markov Models
  • Evolving and Multiplex Networks
  • Deep Learning
  • Bayesian Analysis
  • Classification and Clustering
  • Data Mining
  • Graphical Models
  • Topological Data Analysis
  • Tropical Geometry
  • Dynamical Systems
  • Machine Learning
  • Sparse Tensor Methods
  • Stochastic Optimisation Tools
  • Large-scale Linear Algebra

For example, suppose you have a dataset representing a list of customer records, (\(x_1\), \(x_2\), … , \(x_n\)), where each record is a \(d\)-dimensional real vector describing numerically encoded attributes of the customer such as age, demographic information and purchasing habits. Clustering can be used to explore the dataset, grouping related records together and segmenting the data by attribute values, which in turn can drive marketing and promotional strategies that target specific types of customers. There are many different clustering algorithms available for collecting data into sensible groups. One common algorithm is k-means clustering. This algorithm iteratively refines the clusters to find a locally optimal partition of the data by minimising the sum of squared Euclidean distances between each record and the mean of its cluster, giving \(k\) sets \(S =\) {\(S_1\), \(S_2\), … , \(S_k\)}. In other words, the objective is to find:

\(\displaystyle \underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \| x - \mu_i \|^2\)

where \(\mu_i\) is the mean of points in \(S_i\).
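
The iterative refinement is commonly carried out by alternating two steps, a procedure often known as Lloyd's algorithm. As a sketch, and using a superscript \((t)\) for the iteration number (notation introduced here for illustration, not taken from the text above), the assignment step attaches each record to its nearest mean:

\(\displaystyle S_i^{(t)} = \left\{ x_j : \| x_j - \mu_i^{(t)} \|^2 \le \| x_j - \mu_l^{(t)} \|^2 \ \text{for all } l = 1, \dots, k \right\}\)

and the update step recomputes each mean from the records assigned to it:

\(\displaystyle \mu_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x \in S_i^{(t)}} x\)

Neither step can increase the objective, so the procedure settles on a local optimum rather than a guaranteed global one, and the result can depend on the initial choice of means.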

This can be very useful, as you may not know in advance what patterns or relationships exist between customers and their attributes, or what to look for. K-means clustering needs no labelled examples, although the number of clusters \(k\) must be chosen in advance, and it performs well with large datasets. Methods such as this are relatively simple to implement and can provide useful insight into the data and the decisions that need to be made.
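
To give a feel for how simple an implementation can be, here is a minimal sketch of k-means in plain Python. It alternates an assignment step (each record joins its nearest centroid) with an update step (each centroid moves to the mean of its cluster). The function names, the random initialisation and the small customer dataset of (age, monthly spend) pairs are all assumptions made for illustration; in practice you would more likely use an established library and scale the attributes first.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean_point(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(records, k, max_iter=100, seed=0):
    """Illustrative k-means: returns (centroids, labels, objective)."""
    rng = random.Random(seed)
    # Initialise the centroids with k records chosen at random (an assumption;
    # other initialisation schemes exist).
    centroids = [tuple(r) for r in rng.sample(records, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each record to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda i: squared_distance(x, centroids[i]))
            for x in records
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so a local optimum is reached
        labels = new_labels
        # Update step: move each centroid to the mean of its cluster.
        for i in range(k):
            members = [x for x, lab in zip(records, labels) if lab == i]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[i] = mean_point(members)
    # Value of the objective: sum of squared distances to assigned centroids.
    objective = sum(
        squared_distance(x, centroids[lab]) for x, lab in zip(records, labels)
    )
    return centroids, labels, objective


# Hypothetical customer records: (age, monthly spend).
customers = [
    (22.0, 35.0), (25.0, 40.0), (27.0, 30.0),
    (58.0, 210.0), (61.0, 190.0), (64.0, 200.0),
]

centroids, labels, objective = k_means(customers, k=2)
print("centroids:", centroids)
print("labels:", labels)
print("objective:", objective)
```

On this toy dataset the two groups are well separated, so the algorithm recovers them regardless of the starting centroids. On real data, different seeds can give different partitions, so it is common to run the algorithm several times and keep the partition with the lowest objective.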

If you’ve got a data science problem, we strongly encourage the use of mathematical and statistical techniques. Correct mathematical rigour must be applied, but you do not need to be an experienced mathematician to incorporate the mathematical sciences into your data science problems. There are many mathematical scientists who are able to help and provide advice.

If you have questions you can always get in touch; we can try to help or point you in the right direction.