Introduction

This article is about Big Data and Distributed Computing. My goal is to give you the vocabulary to discuss Big Data outcomes in terms of effective Distributed Computing execution, data integrity, and performance bottlenecks.

Distributed computing is the area of computer science concerned with running software on multiple computers at once, spreading the load across processes deployed on multiple servers.

Audience

This is an accessible article for technology managers, business people, software programmers and non-programmers alike.

“A connecting principle,
Linked to the invisible
Almost imperceptible
Something inexpressible.
Science insusceptible
Logic so inflexible
Causally connectable
Nothing is invincible
We know you, they know me
Extrasensory
Synchronicity”

The Police / Synchronicity

Ten Key Terms

When I started learning about Big Data platforms like Hadoop and Apache Spark, I realized I needed to brush up on my Distributed Computing concepts and connect the dots.

1. Concurrency

Concurrency is a big-picture word describing what happens when we break activities down into smaller tasks that run together. Things are not necessarily simultaneous; tasks often run in an interleaved or sequential order. The terms concurrent, multitasking, and multi-threaded mean similar things. Different tasks need to be scheduled, started, stopped, and sometimes made to wait.

https://en.wikipedia.org/wiki/Concurrency_(computer_science)
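
To make this concrete, here is a minimal sketch of concurrency in Python using the standard asyncio library (the coffee-and-toast tasks are my illustration, not from the article). Two tasks run together: each yields control while it waits, so neither blocks the other, yet nothing happens strictly simultaneously.

    import asyncio

    async def make_coffee():
        print("start coffee")
        await asyncio.sleep(1)   # waiting on the kettle; control passes to other tasks
        print("coffee ready")

    async def make_toast():
        print("start toast")
        await asyncio.sleep(1)   # waiting on the toaster
        print("toast ready")

    async def main():
        # Schedule both tasks together; total time is about 1 second, not 2.
        await asyncio.gather(make_coffee(), make_toast())

    asyncio.run(main())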

2. Tasks & Threads

A Task is an abstraction for a small unit of work that a larger activity breaks down into. Hammering a nail is a task. Watering plants and picking up take-out are also tasks. Ordering an airline ticket is a task, and so is an airline company updating its ticket prices. In computer terms, a task is carried out by a Thread.

A Thread is, for example, the set of instructions required to order an airline ticket. These instructions are executed alongside other sets of instructions required to order other airline tickets; those threads may be for tickets to different places, for different people, with different requirements. Each thread carries out its own task in this way: individually, but at the same time as the others.

“In computer science, a thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, which is typically a part of the operating system. The implementation of threads and processes differs between operating systems, but in most cases a thread is a component of a process.”

https://en.wikipedia.org/wiki/Task_(computing)

https://en.wikipedia.org/wiki/Thread_(computing)
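
To illustrate, here is a small sketch using Python's standard threading module; the passenger names and destinations are made up for the example. Each thread runs its own ticket-ordering task alongside the others.

    import threading

    def order_ticket(passenger, destination):
        # Each call is one task; each thread executes one task independently.
        print(f"{passenger}: ordering a ticket to {destination}")

    threads = [
        threading.Thread(target=order_ticket, args=("Alice", "Oslo")),
        threading.Thread(target=order_ticket, args=("Bob", "Tokyo")),
    ]
    for t in threads:
        t.start()   # the threads run alongside one another
    for t in threads:
        t.join()    # wait until every thread has finished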

3. Process

In the Windows Task Manager you can see which processes are running and shut them down if necessary, but take care not to shut down any critical processes. To view a process ID (PID), click on the Details tab.

A Process exists at the operating-system level. It is independent, and it may include multiple threads running in the same memory space.

“In computing, a process… is being executed by one or many threads. It contains the program code and its activity. Depending on the operating system (OS), a process may be made up of multiple threads of execution that execute instructions concurrently.”

https://en.wikipedia.org/wiki/Process_(computing)
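
As a sketch of the distinction (my example, not from the article), Python's standard multiprocessing module can spawn child processes, each with its own PID and its own memory space:

    import os
    from multiprocessing import Process, current_process

    def work():
        # Each child process gets its own PID and its own memory space.
        print(f"{current_process().name} running with PID {os.getpid()}")

    if __name__ == "__main__":
        print(f"parent PID: {os.getpid()}")
        children = [Process(target=work) for _ in range(2)]
        for p in children:
            p.start()
        for p in children:
            p.join()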

4. Multitasking and Multithreading

We are all used to multitasking, or doing more than one thing at a time, these days, but can you really watch a movie and drive at the same time? In actuality you’re context switching: going back and forth between two different activities. One example might be driving along a long deserted highway while glancing every so often at a movie you’ve seen so many times that it doesn’t need your full attention. Humans are slow at switching context; computers in multitasking mode are, practically speaking, constantly switching context between two or more tasks.

In computer programs, tasks are represented as threads, so multithreading is the condition of two or more threads executing concurrently within a single program. When multiple threads are running, they could accidentally corrupt each other’s data or stop and start at the wrong times; such hazards are managed through synchronization.

https://en.wikipedia.org/wiki/Computer_multi-tasking
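
Here is a short sketch of what "corrupting each other's data" can look like in Python (illustrative; the exact count varies by interpreter and platform): two threads increment a shared counter without coordination, and updates can be lost.

    import threading

    counter = 0

    def increment():
        global counter
        for _ in range(100_000):
            counter += 1   # read-modify-write is not atomic; updates can be lost

    threads = [threading.Thread(target=increment) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)   # may print less than 200000 when the threads interleave badly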

5. Synchronization

Synchronized code can only be run or accessed by a single thread at a time, which means one thread cannot access it while another is updating it. Code synchronization works like a queue: threads wait their turn before going through the activity, and each thread executes the activity separately.

For example, as passengers board an aircraft, they take their seats. Each traveller can be thought of as a thread, and all of them together as a multi-threaded process. Each thread has a series of discrete tasks.

Some tasks have to wait until other threads complete their tasks. You don’t want two users updating the same data simultaneously. The result would be similar to two passengers buying a ticket for the same seat. For further information, see: https://en.wikipedia.org/wiki/Synchronization_(computer_science)
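
Continuing the sketch from the previous section, a lock makes the shared update safe: only one thread at a time may enter the synchronized block, and the rest wait their turn in a queue.

    import threading

    counter = 0
    lock = threading.Lock()

    def increment():
        global counter
        for _ in range(100_000):
            with lock:       # only one thread at a time may run this block
                counter += 1

    threads = [threading.Thread(target=increment) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)   # always 200000: the lock serializes the updates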

6. Stream

A Stream is a sequence of data (bytes) which may have a beginning and an end, but may also be without a defined limit. A stream is opened, read and processed, then closed. Streams can be processed on a single processor (CPU), processed in parallel on a multi-core processor, or distributed across a cluster of many computers.

Even this web page is a stream, and streaming video is something we are all familiar with. Other examples are the constantly updated data from a weather buoy or a stock ticker. Streams of data can load independently on separate threads.

https://en.wikipedia.org/wiki/Stream_(computing)
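
A minimal sketch of the open, read, process, close cycle in Python (the file name and handler here are hypothetical): the stream is consumed in fixed-size chunks, so the data never has to fit in memory all at once.

    def process(chunk):
        print(f"got {len(chunk)} bytes")   # stand-in for real processing

    with open("buoy_readings.log", "rb") as stream:   # hypothetical data file
        while True:
            chunk = stream.read(4096)   # read the next 4 KB of bytes
            if not chunk:               # an empty read means the stream has ended
                break
            process(chunk)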

7. Parallel Processing

Parallel Processing allows multiple processors to execute tasks simultaneously, which increases performance. Parallel processing is also called parallel computing.

Memory may be shared on a single machine node or distributed on a cluster. Although the idea is that things will happen simultaneously, processes may need to wait for other processes in order to complete, as they are interleaved.

As parallel programming is still rather arcane, most software is not coded for parallel processing performance gains. The practice of converting software from running on a single processor to taking full advantage of parallel processing capabilities is called parallelization.
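
As a small sketch of the idea, Python's standard multiprocessing module divides the same work among four worker processes that run simultaneously (the squaring function is just an illustration):

    from multiprocessing import Pool

    def square(n):
        return n * n

    if __name__ == "__main__":
        with Pool(processes=4) as pool:             # four worker processes
            results = pool.map(square, range(10))   # inputs are split across the workers
        print(results)   # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]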

8. Nodes and Clusters

A Node is a computer with a processor (CPU) and memory (RAM). The computer or node may be a physical machine or a virtual machine. The more CPUs and RAM a node has, the more program code and data it can process.

A node may exist as part of a Cluster, a collection of managed nodes on a Network. How they are managed is beyond the scope of this article. Apache Spark manages clusters using the appropriately named Cluster Manager.
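
As a sketch (assuming PySpark is installed), the same Spark program can run on a single node or be handed to a cluster manager simply by changing its master setting:

    from pyspark.sql import SparkSession

    # "local[4]" runs Spark on this one node with 4 worker threads; pointing
    # master at a cluster manager (e.g. "spark://host:7077") would distribute
    # the same program across the nodes of a cluster instead.
    spark = SparkSession.builder \
        .master("local[4]") \
        .appName("cluster-demo") \
        .getOrCreate()

    print(spark.sparkContext.defaultParallelism)   # how many tasks can run at once
    spark.stop()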

9. Network

A Network consists of two or more computer nodes that are able to communicate with each other. Examples of networks include Local Area Networks (LANs) and Wide Area Networks (WANs). A network can be physical or virtual.

Computers communicate over the network using a variety of protocols, but most commonly the Transmission Control Protocol (TCP), which together with the Internet Protocol (IP) is referred to as TCP/IP. A network of computers or nodes wired together is what makes Distributed Computing possible. For further information, see:

Network: https://en.wikipedia.org/wiki/Computer_network

TCP: https://en.wikipedia.org/wiki/Transmission_Control_Protocol
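
A minimal sketch of two "nodes" talking over TCP, using Python's standard socket module (both ends run in one script here for convenience; port 9000 is an arbitrary choice):

    import socket
    import threading
    import time

    def server():
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
            srv.bind(("localhost", 9000))      # port 9000 is arbitrary
            srv.listen(1)
            conn, _ = srv.accept()             # wait for the other node to connect
            with conn:
                conn.sendall(conn.recv(1024))  # echo the message back

    threading.Thread(target=server, daemon=True).start()
    time.sleep(0.5)                            # give the server a moment to start

    with socket.create_connection(("localhost", 9000)) as client:
        client.sendall(b"hello over TCP")
        print(client.recv(1024).decode())      # prints: hello over TCP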

10. Distributed Computing

Distributed Computing divides a single job into tasks that run on multiple nodes in a cluster, spreading the work among many processors. Memory is distributed too, and although messages are passed between nodes, the nodes do not depend on one another to keep running. The software components cooperate toward a common goal.

With distributed computing, each computer node in a cluster of computers can do parallel processing independently. Many jobs can be run, and each job can be divided into multiple tasks. Running the separated tasks on a cluster of machines multiplies performance.

“A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another. The components interact with one another in order to achieve a common goal.”

https://en.wikipedia.org/wiki/Distributed_computing
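
As a closing sketch (again assuming PySpark is installed), one job is divided into tasks by partitioning the data; pointed at a real cluster master instead of local[4], the same tasks would spread across many nodes:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[4]").appName("sum-of-squares").getOrCreate()
    sc = spark.sparkContext

    # One job, split into 8 partitions; each partition becomes a task that a
    # worker executes. On a real cluster those workers live on different nodes.
    numbers = sc.parallelize(range(1_000_000), 8)
    total = numbers.map(lambda n: n * n).sum()

    print(total)
    spark.stop()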

Conclusion

By becoming familiar with these key concepts, we can approach the big data conversation. Using a distributed computing platform effectively results in increased performance and consistent data across all endpoints, from web sites to API consumers, whether you implement open source solutions or proprietary ones from Amazon (AWS), Microsoft (Azure), or Google (Google Cloud).

With a large cloud provider you will get up and running faster and carry less maintenance overhead than with an open source solution. An open source stack, particularly on your own dedicated servers, will give you the highest degree of control over configuration.