Thinking in Streams with AWS Kinesis – A High-Performance Best Practice for Big Data

McKinsey research suggests that

“the gap between leaders and laggards in adopting analytics, within and among industry sectors, is growing. We’re seeing the same thing on the ground. Some companies are doing amazing things; some are still struggling with the basics; and some are feeling downright overwhelmed, with executives and members of the rank and file questioning the return on data initiatives.”

According to Gartner, the majority of Fortune 500 companies are not able to access big data for competitive advantage; just three years ago, 85% were not.

Here’s what Gartner means:

  • You’ve got devices deployed on the internet–software or hardware. When you ship a new release, how long until you know its impact on your customers?
  • Need to pivot quickly? How long before you know where the market has moved?
  • How quickly can A/B testing of your product tell you where to make your next refinement?
  • Customer service has a user in trouble on the phone; how long until the data needed to diagnose the problem is available?

In most companies the answer to these questions is anywhere from a day to a week.

Most companies face customer support issues they can’t fix in real time because they can’t see the customer’s data. As a result, customers grow frustrated and lose confidence in the company.

What if you could see the latest data from a 250-million-record-a-day system in under a minute? The difference between a day and a minute is thinking in streams.

Whether it’s a data science project or an API, today’s servers operate at performance levels that were unthinkable 10 years ago. Hadoop’s MapReduce took the first step away from traditional transaction-based batch processing by distributing the processing across multiple machines. It still kept the batch-processing mantra of writing the data to disk and then processing it on the appropriate server, but it moved in the right direction.

AWS Kinesis Streams gave engineers a consistent way to pull huge volumes of information off the internet, but engineers well trained in traditional methods took this stream and reduced it to batch processing once it arrived, the way years of software engineering had taught them: carefully, verifying each step. High quality, but low availability.
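To make the contrast concrete, here is a minimal sketch of reading a Kinesis stream as a stream, using boto3 and processing each record in memory instead of landing it to disk first. The stream name events and the single-shard assumption are purely illustrative; a production consumer would use the Kinesis Client Library or a Lambda event source and checkpoint its position.

```python
import json
import time

import boto3  # AWS SDK for Python

kinesis = boto3.client("kinesis")
STREAM = "events"  # hypothetical stream name

# Start at the newest record; a durable consumer would checkpoint sequence numbers.
shard_id = kinesis.list_shards(StreamName=STREAM)["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM, ShardId=shard_id, ShardIteratorType="LATEST"
)["ShardIterator"]

while True:
    batch = kinesis.get_records(ShardIterator=iterator, Limit=500)
    for record in batch["Records"]:
        event = json.loads(record["Data"])  # handled in memory, no write-then-query hop
        print(event)  # stand-in for real per-record processing
    iterator = batch["NextShardIterator"]
    time.sleep(0.2)  # stay under the per-shard read limit
```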

The temptation to store data the moment you get it is a difficult one to overcome. But the industry’s next step was taken by the ELK stack (Elasticsearch, Logstash, and Kibana). Logstash takes a stream, modifies it in flight, and passes the content on to the next process. This was a major step forward because it provided a model for data processing rarely seen before Logstash: transforming data before writing it to disk.
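Logstash expresses this as a filter pipeline; the same in-flight pattern can be sketched in Python against two hypothetical Kinesis streams, raw-events in and clean-events out. The field names are invented for illustration:

```python
import json

import boto3

kinesis = boto3.client("kinesis")

def transform(event: dict) -> dict:
    # In-flight normalization: reshape the record before anything touches disk.
    return {
        "user_id": event.get("uid"),
        "action": str(event.get("action", "unknown")).lower(),
        "ts": event.get("timestamp"),
    }

def forward(raw_records: list) -> None:
    # Logstash-style stage: consume, modify, and re-emit, with no write/query hop.
    entries = []
    for r in raw_records:  # records as returned by get_records on "raw-events"
        out = transform(json.loads(r["Data"]))
        entries.append({
            "Data": json.dumps(out).encode("utf-8"),
            "PartitionKey": str(out["user_id"]),
        })
    if entries:
        kinesis.put_records(StreamName="clean-events", Records=entries)
```

The record never rests: it is read, reshaped, and back on the wire in one pass.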

Before Logstash, and still at the majority of big-data shops today, engineers are afraid of a system they can’t completely verify. So even though they’re dealing with millions of data records a day, they still use the practices that helped them debug when there were only hundreds of records: first, write out the records as soon as they arrive, so they know nothing was lost; second, create a batch process to transform the data; third, debug the records by verifying that what they started with was not destroyed in the transform. This doesn’t work quickly with millions, let alone hundreds of millions, of records.

I’ll get into how to debug this type of system, but first, let’s look at the new type of system and why it’s so fast.

The final problem that had to be solved for stream processing was flexibility of compute. When you get a burst of data, you have to be able to expand your processing power quickly to accommodate the demand, or else batch the load and process it later. Microservice architectures and elastic AWS compute, such as auto-scaled EC2 instances and Lambda, arrived to handle rapidly changing processing needs.

So here’s the list of things that make stream processing preferable to batch processing:

  1. Streams of data like AWS Kinesis, or to a lesser degree Hadoop MapReduce, don’t carry the extra cost of writing and re-querying data.
  2. Stream-processing techniques like Logstash let you pull data out of a stream, modify it, and put it back in the stream without the extra write and query steps.
  3. Advanced debugging techniques can verify millions of records a day, far beyond what engineers can inspect by eye, which means you don’t need a static record of what happened along the way.
  4. Elastic computing can expand and contract almost instantly as demand changes, trading batch processing for scalable processing (see the sketch after this list).
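Point 4 is worth a concrete illustration. When a Lambda function is attached to a Kinesis stream as an event source, AWS invokes it in parallel across shards, so compute follows the stream. A minimal handler sketch; the per-record work is a stand-in:

```python
import base64
import json

def handler(event, context):
    # Lambda entry point for a Kinesis event source mapping.
    # AWS delivers one batch per shard at a time, so adding shards
    # adds parallel compute automatically -- no batch window needed.
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        print(payload)  # stand-in for real per-record processing
```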


A comparison of a batch mindset to a stream mindset:

The above looks like a series of streams, but it is not. The stream is written to a “data lake,” an S3 series of buckets. The data lake is then queried and transformed into several data formats, and additional queries are run from those systems. This is an almost ubiquitous architecture for large-scale data systems. The data lake has to be written and then queried for each set of data, so the cost is threefold: batch processing × disk space × CPU usage. In this picture the data is written 5 times and queried 4 times before it is available, creating 5 times the storage requirements, 5 times the CPU requirements, and far more time before the data is usable. For example, if a day’s raw feed is 1 TB, this pipeline stores roughly 5 TB and re-reads most of it four times before anyone can act on it. The impact on cost and performance is dramatic.

Streams provide a better way:

The above diagram shows a stream-processing system with one write, and a query only for summary reporting. It introduces the concept of pre-processing known queries for near-real-time reporting while still allowing for summary reporting. The elements in green are Lambda processes that can be replicated to scale with load. Because of the fewer writes and queries, this system is remarkably faster and less costly than traditional batch-oriented architectures. And because the data has fewer intermediate states, it requires less storage and less overall CPU.
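One way to realize the “pre-processing known queries” idea is a Lambda that maintains running aggregates as records flow past, so a near-real-time report becomes a single-row read instead of a scan. A sketch, assuming a hypothetical DynamoDB table report_counters keyed on metric and day, and records shaped like the earlier examples:

```python
import base64
import json

import boto3

# Hypothetical DynamoDB table with partition key "metric" and sort key "day".
table = boto3.resource("dynamodb").Table("report_counters")

def handler(event, context):
    # Pre-compute a known report while the data is still in flight:
    # keep a running count per (action, day) so the dashboard reads
    # one row instead of querying a data lake.
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        table.update_item(
            Key={"metric": payload["action"], "day": payload["ts"][:10]},
            UpdateExpression="ADD event_count :one",
            ExpressionAttributeValues={":one": 1},
        )
```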

With a few additional changes, stream systems can handle a practically limitless number of records; rates of one million complex records a minute have been achieved.

A couple of pages ago I said I’d explain how to debug this type of system. The answer is simple: a machine-learning process can look at records as they come in and compare them to the data store to make sure that “data normalization” or “route and replication” doesn’t misroute, lose, or mangle records. The process can start with 100% coverage and drop back to 0.00001% coverage once the system is validated. This is far more effective than sending engineers into a data lake to hunt for a 1-in-100,000,000 error.
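Training such a model is beyond the scope of this article, but the harness around it is straightforward: sample a dial-able fraction of the live stream and compare each sampled record with what the data store actually holds. A simplified sketch; the field names, lookup, and alert hooks are hypothetical:

```python
import random

SAMPLE_RATE = 1.0  # 100% coverage during validation; dial toward 0.0000001 (0.00001%) later

def matches(in_flight: dict, stored: dict) -> bool:
    # Field-by-field check that normalization and routing didn't
    # misroute, lose, or mangle the record.
    return all(in_flight.get(k) == stored.get(k) for k in ("user_id", "action", "ts"))

def maybe_validate(record: dict, fetch_stored, alert) -> None:
    # Sample a fraction of the live stream and verify it against the store.
    if random.random() < SAMPLE_RATE:
        stored = fetch_stored(record["user_id"], record["ts"])  # caller-supplied lookup
        if stored is None or not matches(record, stored):
            alert(record)  # caller-supplied alerting hook
```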

Many engineering departments take an incremental approach to big data.

  • First, capture the data; to do this, you write it to disk.
  • Second, normalize the data: query it, modify it, and place it in the existing store or a new one. Either way, this requires tracking to make sure the data is processed properly, and transactions to record the changes.
  • Third, create an ad-hoc query system to figure out and report what’s in the data.
  • Fourth, use the ad-hoc query system to report the data back to the business. Here the ad-hoc query system does two things.
    • One, it reports standard, known reports back to the business, and
    • Two, it becomes a research system to discover new data relationships.


The first part of getting your data in an actionable time frame is to create a stream that accomplishes the first, second, and third steps above. Then, once the fourth step produces standardized reports, pull those reports out of the stream in real time, as the data comes into the organization.
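Put together, the steps collapse into a single in-flight pass per record; the helpers below stand for the hypothetical pieces sketched earlier:

```python
import json

KNOWN_REPORTS = {"purchase", "signup"}  # hypothetical standardized-report triggers

def on_record(raw: bytes, transform, update_report, archive) -> None:
    # Steps one through three collapsed into one in-flight pass.
    event = json.loads(raw)      # step one: capture, in memory rather than on disk
    clean = transform(event)     # step two: normalize in flight (see earlier sketch)
    if clean["action"] in KNOWN_REPORTS:
        update_report(clean)     # the fourth step's standard reports, fed in real time
    archive(clean)               # one durable write, after the work is done
```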


To learn more, contact:

David Brian Ward

Telegraph Hill Software

535 Mission St, San Francisco, CA 94105

david.ward@thpii.com


David Urry

Telegraph Hill Software

535 Mission St, San Francisco, CA 94105

david.urry@thpii.com