Gravwell is an enterprise data fusion platform that enables security teams to investigate, collaborate, and analyze data from any source, on demand, all with unlimited data collection and retention. Ingest everything. Investigate anything.
Amazon’s Kinesis Streams service provides a powerful way to aggregate data (logs, etc.) from a large number of sources and feed that data into multiple data consumers. For instance, a large enterprise might use one Kinesis stream to gather log data from their cloud infrastructure and another stream to aggregate sales data from the web services running on that infrastructure. Once the data is in the stream, it remains available for up to a day (or optionally longer) for any number of applications to read it back for processing and analysis. This is particularly useful to customers that want to deploy and destroy virtual machines on a whim; data is stored in the stream, rather than the ephemeral VMs.
I wanted to let Gravwell users take advantage of the ease of deployment that Kinesis can provide, so I wrote an ingester for Kinesis streams. The power of Gravwell is that it ingests unstructured data; this means our users can push more or less anything into the Kinesis stream and Gravwell will happily receive and store it. Luckily for me, Amazon provides a Go API to interact with Kinesis, so I was able to get started quickly!
Implementing the Ingester
It turns out that the Kinesis stream model maps very easily to Gravwell. A Kinesis stream consists of ordered data records; each record is just a “blob” of bytes. Gravwell ingests data as “entries”; each entry is essentially a blob of bytes with a small amount of metadata attached (timestamp, source, and tag). The Kinesis ingester therefore has a rather simple task: read records from the Kinesis stream, then just wrap each record’s bytes in an entry and pass it to the Gravwell indexer for storage.
Each Kinesis stream is made up of a number of “shards”. Each shard provides a 1 MB/s worth of write rate (up to 1000 records written per second); thus, if a customer knows an application generates data at a rate of around 7 MB/s, they might provision a stream with 10 shards to make sure there’s enough throughput available. On the other end, each shard allows up to 2 MB/s of reads.
The application reading from the stream must read from each shard individually in order to access the data. Because Go is all about concurrency, I decided to create a separate goroutine to read data from each shard. Between Amazon’s Go API and our own ingest API (http://github.com/gravwell/ingest), the ingester boiled down to little more than glue code and error handling as it pulls in records, slaps each one in an entry, and shoves it down the pipe to the Gravwell instance.
Kinesis and Gravwell: Applications and Tradeoffs
Obviously, there are situations where ingesting data via Kinesis makes sense and situations where it doesn’t. I’ve tried to lay out a few scenarios on both sides to help you decide if our Kinesis capability might be useful to you vs. a native Gravwell ingester.
The most straightforward use case for ingesting via Kinesis streams is when your infrastructure already depends on Kinesis. Because Kinesis allows multiple consumers to read the same data from a stream, you can point Gravwell’s ingester at an existing stream and get analytics on your data without any changes to other parts of your system.
Another neat thing about Kinesis: it’s just HTTPS in both directions. Under the hood, Kinesis records are written with a POST operation and read with a GET request. Depending on the locations of your data sources and your Gravwell instance, HTTP may be about the only method of communication available--IT policies can be strict! With Kinesis, the application writing data to the stream is just an HTTP client, and so is the Gravwell ingester on the other end.
On the other hand, if you’re at a company that runs some servers in-house and you want to aggregate logs into Gravwell for analysis, you’d be better served having each server send log entries to Gravwell using something like our FileFollow ingester. Using Kinesis in this situation wouldn’t make sense--the traffic would go from the servers out of your network to Amazon’s cloud, then get pulled immediately back into the internal network to the Gravwell ingester!
In general, Kinesis is a good option when you're working within AWS, when your data sources are already integrated with Kinesis, or in special circumstances where normal ingesters may not be suitable. If on the other hand you have less transient data generation points, or you’re generating data within the same network as your Gravwell instance, it may make more sense to get the data into Gravwell using one of our other ingesters. If you have a very large number of data generators, Gravwell’s Federator can help aggregate traffic, or you can work with us to write a custom ingester; it’s surprisingly easy!
Gravwell can now ingest data from Kinesis streams! If you’d like to try it, just email firstname.lastname@example.org and we can help you get it set up.
Written by John Floren
John's been writing Go since before it was cool and developing distributed systems for almost as long.