Source: Screenshot from Official AWS Kinesis video
In my last post, we took an overview of what Kinesis does and what it is capable of. In this post we will dive a bit deeper into the technical building blocks of Kinesis. And just to clear your doubts: yes, you will still understand this blog even if you’re not a web developer! So let’s pick up the Kinesis story where we left off. We know that Kinesis captures a continuous stream of data and can process that data in real time. So what is this stream made of? How does data from different sources get processed without getting intermingled? That’s what we’re here to find out.
Shards – The Data Highway
So what is a Kinesis Stream made of? Well, a Kinesis Stream is made of one or more “shards”. So now you might be wondering: what is a shard? AWS defines a shard as:
“A Shard is a scaling unit for a stream. A shard is a uniquely identified group of data records in an Amazon Kinesis stream”
Whoa! That may have gone over the heads of many of you, but don’t worry, it’s not as complicated as it sounds. Let me explain it this way: for now, let’s consider shards as carriers of data in a stream. Remember the lumberjack example from the previous post? We considered the logs of wood as data and the water stream as the Kinesis Stream.
The Highway Example to the Rescue!
So now let’s take another example, this time of a highway. The highway is our Kinesis Stream and the vehicles running on it are our data. For our scenario, let us consider it to be a single-lane highway (where all the vehicles are travelling in one direction). As time passes, the number of vehicles on the highway keeps increasing, eventually leading to congestion and then a full traffic snarl-up! So what’s the solution? Simple: increase the capacity of the road. As the road is widened, the number of lanes increases, i.e. our single-lane highway becomes a two-lane highway. What if there’s further congestion? Widen the road again and make it a four-lane highway! The lanes here represent the “shards”. Want to increase the capacity of the road? Just increase the number of lanes. Want to increase the capacity of the Kinesis Stream? Just increase the number of shards!
How do Shards make your Job a Walk in the Park?
So now we clearly understand the meaning of the sentence “Shards are the scaling unit for a stream”.
Fortunately, increasing the capacity of a Kinesis stream is nowhere near as backbreaking as building a new lane for a highway! Adding shards to a Kinesis stream is just a matter of a few clicks. (We will include a step-by-step tutorial on how to build your own Kinesis app in a future blog, so stay tuned!)
Courtesy: AWS Website | Amazon Kinesis High Level Architecture
Here are a few facts about shards to ponder:
- Each shard can ingest up to 1 MB/sec of data and up to 1000 TPS (transactions per second)
- HTTP “Puts” can range from 1 KB to a max of 50 KB (the limit at the time of writing; AWS has since raised it, so check the current documentation)
- Data in a shard is stored for a maximum of 24 hours, i.e. within this time frame the data is available to be read, re-read, backfilled, analyzed, or moved to long-term storage (24 hours remains the default; longer retention has since become configurable)
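To make the first fact concrete, here is a rough sizing sketch in Python. It assumes only the two ingest limits quoted in the list (1 MB/sec and 1000 puts/sec per shard); the helper name and the example workload are made up for illustration, and it treats 1 MB as 10**6 bytes, which is also an assumption.

```python
import math

# Per-shard ingest limits as quoted in the list above.
MAX_BYTES_PER_SEC = 1_000_000   # 1 MB/sec (taken as 10**6 bytes, an assumption)
MAX_RECORDS_PER_SEC = 1_000     # 1000 puts/sec


def shards_needed(records_per_sec: float, avg_record_bytes: float) -> int:
    """Estimate how many shards a write workload needs (hypothetical helper)."""
    by_bytes = math.ceil(records_per_sec * avg_record_bytes / MAX_BYTES_PER_SEC)
    by_records = math.ceil(records_per_sec / MAX_RECORDS_PER_SEC)
    # Whichever limit is hit first decides the number of "lanes"; never below 1.
    return max(1, by_bytes, by_records)


if __name__ == "__main__":
    # Example workload: 5000 records/sec at 2 KB (2048 bytes) each.
    print(shards_needed(5000, 2048))
```

In the example workload the byte limit, not the record-count limit, decides the answer: 5000 records/sec would fit in 5 shards by count, but roughly 10 MB/sec of data needs more lanes than that.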
Adding and removing shards is not only easy, it can also be done without disturbing the running application or the stream, i.e. one can increase or reduce the number of shards in “real time”. This is what gives Kinesis its “Scalability” and “Manageability”.
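For readers who prefer code to clicks, the same resize can be requested through the API. The sketch below assumes the boto3 Kinesis client and its describe_stream_summary and update_shard_count calls; the function name resize_stream and the stream name "my-stream" are placeholders of mine, not anything from AWS. The client is passed in as an argument, so the function itself does not need AWS credentials to be read or tested.

```python
def resize_stream(kinesis_client, stream_name: str, target_shards: int) -> dict:
    """Ask Kinesis to change the number of open shards (hypothetical helper)."""
    summary = kinesis_client.describe_stream_summary(StreamName=stream_name)
    current = summary["StreamDescriptionSummary"]["OpenShardCount"]

    # Nothing to do if the highway already has the right number of lanes.
    if current == target_shards:
        return {"changed": False, "from": current, "to": target_shards}

    # UNIFORM_SCALING asks Kinesis to produce evenly sized shards.
    kinesis_client.update_shard_count(
        StreamName=stream_name,
        TargetShardCount=target_shards,
        ScalingType="UNIFORM_SCALING",
    )
    return {"changed": True, "from": current, "to": target_shards}


# Usage (needs boto3 and AWS credentials; "my-stream" is a placeholder):
#   import boto3
#   resize_stream(boto3.client("kinesis"), "my-stream", 4)
```

The resize is asynchronous: the call returns right away and the stream keeps accepting data while Kinesis reshuffles the shards in the background.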
Partition Keys & Sequence Numbers – Just Like Lane Markings & Number Plates!
We have covered almost everything we need to know about shards. But wait, there’s a tiny bit remaining: “Partition Keys”. So what are these partition keys?
AWS defines them as:
“The partition key is used to group data by shard within the stream”
Partition Keys are like Lane Markings
So, let’s get back to the highway example:
We can think of partition keys as the lane markings (those white markings) on the highway. It’s like saying traffic going to destination “abcd” will be in this lane and traffic going to “efgh” will be in that lane. The markings help categorize the traffic, i.e. partition keys help decide which data should go into which shard. Consider data coming into the Kinesis Stream from three sources: Twitter, Facebook and some other random site. Each of these sources has its own unique “Source Id”. Ideally we would want the data from these sources to be carried in separate shards. This is where partition keys come to the rescue. A partition key is specified by the application putting data into the stream. If we set the source ID as the partition key when inserting data, all records from the same source are automatically routed to the same shard. (Two different keys can still land on the same shard, especially when the stream has only a few shards.)
Partition keys are Unicode strings with a maximum length limit of 256 bytes. An MD5 hash function is used to map partition keys to 128-bit integer values and to map associated data records to shards.
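Here is a minimal Python sketch of that mapping, just to show the idea. It assumes the 128-bit hash space is split evenly between the shards; in a real stream each shard reports its own hash key range, and those ranges need not be equal. The function names and the three example source IDs are mine.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces 128-bit values


def hash_key(partition_key: str) -> int:
    """Map a partition key to a 128-bit integer using MD5."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def even_ranges(shard_count: int) -> list:
    """Split the hash space into equal (start, end) ranges, one per shard.

    Even splitting is an assumption made for this sketch.
    """
    step = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * step
        # The last shard absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges


def shard_for(partition_key: str, ranges: list) -> int:
    """Return the index of the shard whose range contains the key's hash."""
    value = hash_key(partition_key)
    for index, (start, end) in enumerate(ranges):
        if start <= value <= end:
            return index
    raise ValueError("hash value fell outside every shard range")


if __name__ == "__main__":
    lanes = even_ranges(4)
    for source_id in ["twitter", "facebook", "other-site"]:
        print(source_id, "-> shard", shard_for(source_id, lanes))
```

The same key always produces the same hash, so every record from one source ends up in the same lane. Which lane that is depends only on the hash, so nothing stops two sources from sharing one.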
Along with partition keys, there is something called the sequence number. Just as every vehicle on the road is uniquely identified by its registration number, every data record (the blob of data sent through a single “put” call) is identified by its sequence number.
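To see where the partition key goes in and where the sequence number comes out, here is a small sketch of a single “put”. It assumes the boto3 Kinesis client and its put_record call, whose response carries the shard ID and the sequence number; the function name, stream name and payload are placeholders of mine.

```python
def put_with_source(kinesis_client, stream_name: str, source_id: str, payload: bytes) -> tuple:
    """Put one record, using the source ID as the partition key (hypothetical helper)."""
    response = kinesis_client.put_record(
        StreamName=stream_name,
        Data=payload,
        PartitionKey=source_id,  # the "lane marking"
    )
    # Kinesis reports which shard took the record and its "number plate".
    return response["ShardId"], response["SequenceNumber"]


# Usage (needs boto3 and AWS credentials; names below are placeholders):
#   import boto3
#   shard_id, seq = put_with_source(
#       boto3.client("kinesis"), "my-stream", "twitter", b'{"text": "hello"}'
#   )
```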
Well, there you go: now you know what shards are! Stay tuned for our next blog, where we discuss in detail how to set up your own Kinesis application!
Don’t miss it, or any of our other posts! Subscribe to our blog.