4.11 Big Data

Big Data

Content

Additional information

Know that 'Big Data' is a catch-all term for data that won't fit the usual containers. Big Data can be described in terms of:

  • volume - too big to fit into a single server
  • velocity - data arrives as a continuous stream, with only milliseconds to seconds to respond
  • variety - data in many forms such as structured, unstructured, text, multimedia.

Whilst its size receives all the attention, the most difficult aspect of Big Data really involves its lack of structure. This lack of structure poses challenges because:

  • analysing the data is made significantly more difficult
  • relational databases are not appropriate because they require the data to fit into a row-and-column format.

Machine learning techniques are needed to discern patterns in the data and to extract useful information.

'Big' is a relative term, but size becomes a problem when the data doesn’t fit onto a single server, because relational databases don’t scale well across multiple machines.

Data from networked sensors, smartphones, video surveillance, mouse clicks, etc. are continuously streamed.

Know that when data is too big to fit onto a single server:

  • the processing must be distributed across more than one machine
  • functional programming is a solution, because it makes it easier to write correct and efficient distributed code.
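
The idea of distributing processing can be illustrated with a minimal Python sketch. It only simulates several servers by splitting the data into chunks on one machine; the function names (partition, count_words, merge_counts) and the sample lines are hypothetical and chosen for illustration.

```python
from collections import Counter
from functools import reduce


def partition(records, n_parts):
    """Split records into n_parts chunks, one per (simulated) server."""
    return [records[i::n_parts] for i in range(n_parts)]


def count_words(chunk):
    """Map step: a pure function whose result depends only on its chunk."""
    return Counter(word for line in chunk for word in line.split())


def merge_counts(a, b):
    """Reduce step: combine two partial results into a new Counter."""
    return a + b


lines = ("big data big", "data streams", "big streams")

chunks = partition(lines, 2)
# Each call to count_words shares no state with the others,
# so in a real system each could run on a different machine.
partials = map(count_words, chunks)
totals = reduce(merge_counts, partials, Counter())

print(totals["big"], totals["data"], totals["streams"])
```

Because count_words never reads or changes anything outside its own chunk, the total is the same however the data is split, which is what makes this style of code straightforward to distribute.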

Know what features of functional programming make it easier to write:

  • correct code
  • code that can be distributed to run across more than one server.

Functional programming languages support:

  • immutable data structures
  • statelessness
  • higher-order functions.
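
These three features can be shown in a short Python sketch. Python is not a purely functional language, so this is an approximation of the style; the Reading type, the calibrate and apply_to_all functions and the sample values are all hypothetical.

```python
from typing import NamedTuple


class Reading(NamedTuple):
    """Immutable data structure: fields cannot be reassigned after creation."""
    sensor: str
    celsius: float


def calibrate(reading, offset):
    """Stateless: the result depends only on the arguments.

    Returns a new Reading and leaves the original untouched.
    """
    return reading._replace(celsius=reading.celsius + offset)


def apply_to_all(func, readings):
    """Higher-order function: takes another function as an argument."""
    return tuple(map(func, readings))


readings = (Reading("a", 20.0), Reading("b", 21.5))
calibrated = apply_to_all(lambda r: calibrate(r, 0.5), readings)

print(calibrated)
print(readings)  # the original readings are unchanged
```

Since no function here modifies shared data, two servers can process different readings at the same time without interfering with each other.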

Be familiar with the:

  • fact-based model for representing data
  • graph schema for capturing the structure of the dataset
  • nodes, edges and properties in graph schema.

Each fact within a fact-based model captures a single piece of information.
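
As a rough illustration, the sketch below stores a few facts and derives nodes, edges and properties from them. It is written in Python under stated assumptions: the Fact fields, the timestamp (used here so that newer facts are added without overwriting older ones), the EDGE_TYPES set and the sample data are all hypothetical, not part of the specification.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    """One fact captures a single piece of information."""
    subject: str    # the node the fact is about
    attribute: str  # a property name or a relationship name
    value: str      # a property value or the target node
    timestamp: int  # assumption: when the fact was recorded


facts = (
    Fact("person:1", "name", "Ada", 1),
    Fact("person:1", "city", "London", 2),
    Fact("person:2", "name", "Alan", 3),
    Fact("person:1", "friend_of", "person:2", 4),
    # A later fact is appended; the earlier "London" fact is kept.
    Fact("person:1", "city", "Cambridge", 5),
)

# Assumption: attributes listed here describe relationships (edges);
# every other attribute is treated as a property of a node.
EDGE_TYPES = frozenset({"friend_of"})


def build_graph(facts):
    """Derive nodes (with their properties) and edges from the facts."""
    nodes = {}
    edges = []
    for fact in sorted(facts, key=lambda f: f.timestamp):
        nodes.setdefault(fact.subject, {})
        if fact.attribute in EDGE_TYPES:
            nodes.setdefault(fact.value, {})
            edges.append((fact.subject, fact.attribute, fact.value))
        else:
            # The most recent fact gives the current property value.
            nodes[fact.subject][fact.attribute] = fact.value
    return nodes, edges


nodes, edges = build_graph(facts)
print(nodes)
print(edges)
```

Here each person is a node, friend_of is an edge between two nodes, and name and city are properties attached to a node.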