Mining for gold in mountains of data

April 15, 2015

Big data is the new hope, the new gold of Silicon Valley. The term refers to the data sets of the large internet companies, the energy suppliers and the communications companies: the databases created by the increasingly digital recording of life.

There is no strict definition of big data, but the three Vs usually serve as the key criteria: volume, velocity and variety. It is about data that is generated en masse (volume), is created continuously (velocity) and is complex and disordered (variety).

Success breeds many children, and in the case of big data further Vs have come into fashion, such as veracity (reliability), viability (usability), value (benefit) or verisimilitude (plausibility). But these have always been important properties of conventional databases too.

Where do the new masses of digital data come from? One prerequisite was the decline in the cost of computer hardware: storage space and computing power have become steadily cheaper, and the number of devices and their networking via the Internet is constantly increasing. This makes it possible not only to record, process and store data electronically at low cost, but also to collect it at a previously unknown level of detail. For example, Pacific Gas and Electric (PG&E) has been using smart meters across California since 2012. These are intelligent electricity and gas meters that record consumption values several times an hour and send them electronically to headquarters. In the past, an employee came by once a year with a clipboard and pencil to write down the values, snapping the pencil lead in the process.

Another major factor in the flood of data is the spread of smartphones. With them we leave data trails through our cities and document where we are and whom we meet (GPS), who our friends are and what we like (Facebook), what we are looking for (Google), what we think (Twitter) and, of course, whom we call. We also surf and shop on the Internet, collect bonus points, use discount or customer cards and pay electronically.

The data that is created is big data.

Big data brings new challenges in collecting, securing and processing due to the sheer volume of data. At Facebook, for example, the access of over a billion active users worldwide has to be coordinated. On YouTube, over 500 years of film are watched every hour and 72 hours of film are uploaded every minute. Google receives over a billion searches a day.
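To get a feel for these magnitudes, the figures above can be converted into more familiar units. The following is only a back-of-the-envelope sketch: it takes the quoted numbers at face value, treats "over a billion" as exactly one billion, and ignores leap years.

```python
# Back-of-the-envelope conversion of the figures quoted in the text.
HOURS_PER_YEAR = 365 * 24          # 8,760 hours; leap years ignored
SECONDS_PER_DAY = 24 * 60 * 60     # 86,400 seconds

watched_years_per_hour = 500       # YouTube viewing, from the text
uploaded_hours_per_minute = 72     # YouTube uploads, from the text
searches_per_day = 1_000_000_000   # Google searches, lower bound assumed

# 500 years of film per hour equals this many hours of film playing
# during each hour, i.e. the average number of simultaneous streams.
average_simultaneous_streams = watched_years_per_hour * HOURS_PER_YEAR

# Uploads per day, in hours and in years of film.
uploaded_hours_per_day = uploaded_hours_per_minute * 60 * 24
uploaded_years_per_day = uploaded_hours_per_day / HOURS_PER_YEAR

# Average search rate per second.
searches_per_second = searches_per_day / SECONDS_PER_DAY

print(f"{average_simultaneous_streams:,} streams running at once on average")
print(f"{uploaded_hours_per_day:,} hours uploaded per day "
      f"(about {uploaded_years_per_day:.1f} years of film)")
print(f"about {searches_per_second:,.0f} searches per second")
```

Under these assumptions, the quoted figures correspond to several million videos playing at any given moment, more than a decade of film uploaded every day, and well over ten thousand searches per second.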

But big data does not just mean a new technology for supplying the world with information. Big data is linked to the hope that it will become increasingly possible to make use of the massive amounts of existing and emerging data, and to do so sensibly, quickly and effectively, even if the data shares no common characteristics, is unstructured, comes from different sources and was originally collected for a completely different purpose. This places considerable demands on data processing and statistical methods, but also on ways of visualizing information. These are tasks for new, demanding professions.

The hope is also that new and deep insights into social structures will be possible, into information flows, flows of goods, traffic flows, into the movement of people through the city or their migration between countries. Insights into connections between traffic flows throughout the day, electricity consumption in the districts, the utilization of communication networks, between weather, travel, friendships and the spread of disease.

Until now, we have had at best a vague idea of how some of these things might be connected. Thanks to big data, we now have more and more information about them. The challenge now is to learn how to evaluate and process it. The goal is for us as a society to learn how to anticipate future developments and thereby make better use of resources.

One example of a big data application is Google Traffic. Google's maps show the current traffic situation in a color code, from green for moving traffic to black for standstill. The information comes from road sensors, but also from users of Google's navigation software. The software reports back to central servers where each user is and how quickly they are progressing. Taken together, the reports of all users create a picture of the situation on the streets, and it is pleasingly accurate. At some point this should work not only in real time, but also with foresight. After all, the question is not “What is the situation now?” but “What will the situation be when I get there?”
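The aggregation idea can be sketched in a few lines. This is a minimal illustration, not Google's actual method: the road segment names, the free-flow speeds, the thresholds and the intermediate colors (orange, red) are all assumptions made for the example. Only the endpoints of the scale, green and black, come from the description above.

```python
from collections import defaultdict
from statistics import mean


def color_for(ratio):
    """Map average speed relative to free-flow speed to a color.

    The thresholds and the intermediate colors are hypothetical.
    """
    if ratio >= 0.75:
        return "green"   # traffic is moving
    if ratio >= 0.5:
        return "orange"
    if ratio >= 0.1:
        return "red"
    return "black"       # standstill


def traffic_picture(reports, free_flow_kmh):
    """Combine individual (segment, speed) reports into one color per segment."""
    speeds = defaultdict(list)
    for segment, speed_kmh in reports:
        speeds[segment].append(speed_kmh)
    return {
        segment: color_for(mean(values) / free_flow_kmh[segment])
        for segment, values in speeds.items()
    }


# Hypothetical reports: (road segment, speed in km/h) sent by individual users.
reports = [("A1", 95.0), ("A1", 105.0), ("B7", 2.0), ("B7", 4.0), ("C3", 30.0)]
# Hypothetical speed on each segment when the road is empty.
free_flow = {"A1": 100.0, "B7": 50.0, "C3": 50.0}

picture = traffic_picture(reports, free_flow)
print(picture)
```

No single report says much on its own; the picture only emerges once many reports for the same stretch of road are averaged.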

Another application is the evaluation of Twitter messages, for example with the keyword “Feeling sick,” taking into account friendship relationships to predict the spread of epidemics. The GermTracker is already online for selected major cities in the USA and London.
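The keyword part of such an evaluation can be illustrated with a small sketch. The messages and city labels below are invented for the example, the function name is hypothetical, and the friendship relationships mentioned above are left out entirely. It is not a description of how the GermTracker itself works.

```python
from collections import Counter

KEYWORD = "feeling sick"  # the keyword quoted in the text, lowercased


def sick_mentions_by_city(messages, keyword=KEYWORD):
    """Count messages per city that contain the keyword, ignoring case."""
    counts = Counter()
    for city, text in messages:
        if keyword in text.lower():
            counts[city] += 1
    return counts


# Invented sample data: (city, message text).
messages = [
    ("New York", "Feeling sick today, staying home"),
    ("New York", "Great coffee this morning"),
    ("London", "feeling sick after lunch"),
    ("New York", "Still FEELING SICK :("),
]

counts = sick_mentions_by_city(messages)
print(counts.most_common())
```

A real system would have to cope with far messier language, and it would need the network of contacts between users to estimate how an illness might spread rather than merely where it is mentioned.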

It is conceivable that social networks could be evaluated to gauge the satisfaction of the residents of a city, a district or a street, the geographical distribution of friendships or family relationships, or migration movements. And all of this in real time. Then in the future, 1.5 million inhabitants of Germany will no longer “suddenly” be there.

The downside is that big data also provides insights into the lives of individuals, often much more precise ones than we expect. While we still think of ourselves as individualists, the statisticians armed with masses of data have long known better. And this knowledge is valuable because it enables assessments of the risk of accidents or illness, future professional success, preferences or other personal circumstances. So far, the data has mainly been used for advertising purposes. The American retail chain Target has already attracted attention for sending advertising for baby clothes to an underage girl before she even knew she was pregnant.

A statistical connection between the intelligence quotient and Facebook “likes” for certain brands or products has already been demonstrated in research. Since then, liking Harley Davidson or clicking on “I love being a mom” has been read as a sign that someone is not particularly intelligent. It is only a matter of time before further connections are uncovered in the huge heaps of data.

We will have to learn to deal with it, set boundaries and control the technology. That was already the case when we discovered fire.

Today it is still unclear in which direction and to what extent big data will change our world. The trend is still young, but its potential can be recognized.

Finally, let's take a look at what big data knows about itself, using an evaluation of all Google search queries since 2004. "Big Data" emerged as a search term in mid-2011 and has been on an upward trend ever since. And another insight glitters like a little gold nugget in the pan: the searches for the term come from India, from the city of Bangalore, where many American companies operate call centers. Is this a groundbreaking new finding? Probably not. But it is information, the gold of our time. Who knows how much of it will be found and what will come of it.