AWS Analytics: Kinesis, EMR, Athena

Kinesis

Handles streaming data
Collects and analyses data as it arrives
Rapidly moves data off producers so it can be processed downstream

Data Streams stores data for later processing; Firehose delivers data directly to AWS services

e.g. IoT data coming in through a stream
Process it in Kinesis Data Analytics: run SQL on the incoming data

Use cases

Kinesis Data Stream

Stream data is stored in shards (these carry the data, held for 24 hours by default, extendable to 7 days, or up to 365 days with long-term retention)
Consumers (e.g. applications on EC2 instances, called Kinesis Streams applications) get data from the shards and save it to another service
Shards have limits: 1 MB/sec input, 2 MB/sec output, and 1,000 records per second for writes
Stream capacity depends on the number of shards
Pay per shard
Re-sharding: adjust the number of shards as data flow changes
Shard split: divide one shard into two. Increases capacity and cost
Shard merge: combine two shards into one. Reduces capacity and cost
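
As a rough illustration of how the per-shard limits translate into stream capacity, the sketch below estimates the minimum shard count for a given load. The function name and the example figures are made up for illustration; the limits are the ones quoted in these notes.

```python
import math

# Per-shard limits as quoted in the notes above.
WRITE_MB_PER_SEC = 1.0
READ_MB_PER_SEC = 2.0
WRITE_RECORDS_PER_SEC = 1000


def shards_needed(write_mb_per_sec, write_records_per_sec, read_mb_per_sec):
    """Return the smallest shard count that satisfies all three limits."""
    by_write_bytes = math.ceil(write_mb_per_sec / WRITE_MB_PER_SEC)
    by_write_records = math.ceil(write_records_per_sec / WRITE_RECORDS_PER_SEC)
    by_read_bytes = math.ceil(read_mb_per_sec / READ_MB_PER_SEC)
    # A stream always has at least one shard.
    return max(1, by_write_bytes, by_write_records, by_read_bytes)


# Hypothetical load: 3.5 MB/sec in, 2,500 records/sec in, 6 MB/sec out.
print(shards_needed(3.5, 2500, 6.0))
```

Whichever limit is hit first decides the shard count, so a stream of many small records can need more shards than its byte rate alone suggests.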

Data stream partition keys: provide ordering
When a record is put onto a stream, you specify a partition key; Kinesis hashes the key to choose the shard
Ordering can't be guaranteed between different shards, but within a shard ordering is maintained
SQS doesn't guarantee order on standard queues, but it does on SQS FIFO queues
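
A minimal sketch of why the same partition key always lands on the same shard. Kinesis maps partition keys to 128-bit integers with MD5 and assigns each shard a range of that hash space. The sketch assumes the shards split the hash space evenly and that the digest is read as a big-endian integer; the function names are hypothetical, and a real stream's ranges can be uneven after splits and merges.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit value


def hash_key(partition_key: str) -> int:
    """Map a partition key to an integer in [0, 2**128)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    # Byte order is an assumption of this sketch.
    return int.from_bytes(digest, "big")


def shard_for(partition_key: str, shard_count: int) -> int:
    """Index of the shard whose hash range contains the key.

    Assumes `shard_count` shards covering equal-sized ranges.
    """
    return hash_key(partition_key) * shard_count // HASH_SPACE


for key in ["sensor-1", "sensor-2", "sensor-1"]:
    print(key, "-> shard", shard_for(key, 4))
```

Because the mapping is deterministic, all records for one key stay in order on one shard, while records for different keys may be spread across shards with no ordering between them.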

Kinesis Firehose

Kinesis Analytics

Sits over Data Streams and Firehose
Runs SQL against streaming sources
Gives real-time analysis capability, e.g. time series analytics, dashboards, alerts
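
Kinesis Analytics would express this in SQL; purely to illustrate the kind of result a windowed time-series query produces, here is a plain-Python sketch of a tumbling-window average. It is an analogy only, not the service's API, and the function name and sample readings are invented.

```python
from collections import defaultdict


def tumbling_window_average(records, window_seconds):
    """Average `value` per fixed, non-overlapping time window.

    `records` is an iterable of (timestamp_seconds, value) pairs.
    Returns {window_start_seconds: average_value}.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for timestamp, value in records:
        # Round the timestamp down to the start of its window.
        window_start = int(timestamp // window_seconds) * window_seconds
        sums[window_start] += value
        counts[window_start] += 1
    return {start: sums[start] / counts[start] for start in sorted(sums)}


# Invented sample readings: (seconds since start, temperature).
readings = [(0, 20.0), (30, 22.0), (61, 30.0), (119, 34.0), (120, 10.0)]
print(tumbling_window_average(readings, 60))
```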

Compare Kinesis / SQS / SNS

SQS

No ordering on standard queues (FIFO queues preserve order)

SNS

Pub / sub model
data not persisted

Example Kinesis stream with Lambda

AWS tutorial
Lambda needs to poll Kinesis, so set up an event source mapping
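
Once the event source mapping is in place, Lambda invokes the function with batches of records. A minimal handler sketch follows; it assumes producers send JSON payloads (an assumption, not something these notes state), and the sample event is hand-built and trimmed to the fields the handler reads.

```python
import base64
import json


def lambda_handler(event, context):
    """Decode each Kinesis record in the batch and return the parsed payloads."""
    payloads = []
    for record in event["Records"]:
        # Kinesis record data arrives base64-encoded inside the event.
        raw = base64.b64decode(record["kinesis"]["data"])
        payloads.append(json.loads(raw))  # assumes producers send JSON
    return {"processed": len(payloads), "payloads": payloads}


# Hand-built sample event for a local run (real events carry more fields).
sample_event = {
    "Records": [
        {
            "kinesis": {
                "partitionKey": "sensor-1",
                "data": base64.b64encode(
                    json.dumps({"temp": 21.5}).encode("utf-8")
                ).decode("ascii"),
            }
        }
    ]
}
print(lambda_handler(sample_event, None))
```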

Example Kinesis use

Receive streaming data from IoT devices. Records with the same partition key go to the same shard, so e.g. segregate sources by giving each its own partition key. Data streams can be loaded into Firehose and then into S3, a good place to store the data.
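
A sketch of the producer side of this example, using the device ID as the partition key so each device's readings stay in order. The helper `send_reading`, the stream name, and the `RecordingClient` stub are hypothetical; `put_record` with `StreamName`, `Data` and `PartitionKey` is the call a boto3 Kinesis client exposes, and the stub only mimics its keyword arguments so the sketch runs without AWS access.

```python
import json


def send_reading(client, stream_name, device_id, reading):
    """Put one reading onto the stream, keyed by device."""
    return client.put_record(
        StreamName=stream_name,
        Data=json.dumps(reading).encode("utf-8"),
        PartitionKey=device_id,  # same device -> same shard
    )


class RecordingClient:
    """Hypothetical stand-in for a Kinesis client, for local runs only."""

    def __init__(self):
        self.calls = []

    def put_record(self, **kwargs):
        # Record the arguments instead of calling AWS.
        self.calls.append(kwargs)
        return {}


client = RecordingClient()
send_reading(client, "iot-readings", "device-42", {"temp": 21.5})
print(client.calls[0]["PartitionKey"])
```

With real credentials the stub would be replaced by `boto3.client("kinesis")`; passing the client in keeps the helper easy to test.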

Amazon EMR

TODO: look at EMR docs

Amazon Athena and AWS Glue

Glue