The purpose of this video is to illustrate a realistic scenario in which two discrete but complementary data sets are used to build a cohesive data pipeline. We will use a vectorized non-temporal lookup table to combine two data sets that are critical for cybersecurity investigators.
Query with your favorite tool or pick up a Gravwell CE license for free. Gravwell.io/ce
Compound Query: Non-Temporal Joins
Data Sources
https://github.com/kkneomis/kc7_data/blob/main/envolvelabs/FileCreationEvents.csv.gz
https://github.com/kkneomis/kc7_data/blob/main/envolvelabs/ProcessEvents.csv.gz
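For anyone who wants to peek at these files locally before ingesting them into Gravwell, here is a minimal Python sketch; it assumes the two .csv.gz files have been downloaded from the repository above into your working directory, and pandas is used purely as a convenience outside the Gravwell workflow.
# Quick local preview of the two KC7 data sets. Assumes FileCreationEvents.csv.gz
# and ProcessEvents.csv.gz have been downloaded into the current directory.
import pandas as pd

file_events = pd.read_csv("FileCreationEvents.csv.gz")    # gzip inferred from the .gz suffix
process_events = pd.read_csv("ProcessEvents.csv.gz")

print(file_events.columns.tolist())
print(process_events.columns.tolist())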
Purpose
The purpose of this content is to illustrate a realistic scenario in which two discrete but complementary data sets are used to build a cohesive data pipeline. We will use a vectorized non-temporal lookup table to combine two data sets that are critical for cybersecurity investigators.
Prepared Searches
The following 4 searches should be prepared in separate tabs in the Query Studio; this obviates the need for copy-pasting from a notepad or typing manually during narration.
Key: Use the following DateTime range: 01/01/2022 12:00 AM through 12/31/2022 12:00 AM.
Query 1: Sitrep of the FileCreationEvents data source
tag=envolvelabs-FileCreationEvents ax
| words "ba8a996a117702b946e07dd12d030956efddc159a5e775c18b1a7fb10df13902"
| limit 10
| table
Query 2: Sitrep of the ProcessEvents data source
tag=envolvelabs-ProcessEvents ax
| words "ba8a996a117702b946e07dd12d030956efddc159a5e775c18b1a7fb10df13902"
| limit 10
| table
Query 3: The Lookup Table
@files{
tag=envolvelabs-FileCreationEvents ax
| words "ba8a996a117702b946e07dd12d030956efddc159a5e775c18b1a7fb10df13902"
| table -nt filename sha256 path size hostname
};
dump -r @files
Query 4: The Vectorized Join
@files{
tag=envolvelabs-FileCreationEvents ax
| words "ba8a996a117702b946e07dd12d030956efddc159a5e775c18b1a7fb10df13902"
| table -nt filename sha256 path size hostname
};
tag=envolvelabs-ProcessEvents ax
| words "ba8a996a117702b946e07dd12d030956efddc159a5e775c18b1a7fb10df13902"
| lookup -s -r @files [parent_process_hash hostname] [sha256 hostname] (filename path)
| table
Narrative
Introduction and Positioning
Hello and welcome to “Down The Well” with Gravwell where we explore use cases and scenarios that readily apply to real-world security analysis.
In this short video, we are going to use a compound query containing a non-temporal lookup table to combine two discrete data sources into a cohesive data pipeline. The resulting pipeline will show which folder an executable was running from and what it actually executed.
We will be using data from KC7, which can be found at https://kc7cyber.com/. The data in KC7 was built by cybersecurity experts to simulate real adversaries, so it is synthetic but realistic. Since it was designed for educational use, it makes this video easily consumable by a broad set of users.
First, let's load up the Gravwell Query Studio. For the sake of brevity, we have pre-populated the various tabs with the queries we will showcase today.
Understanding what our Data Looks Like
Our first order of business is to orient ourselves to the two data sources we will be using today.
For this scenario, we have received an intelligence report containing a sha256 hash that corresponds to a known piece of malware that we suspect could be running rampant in our environment.
Here we have the FileCreationEvents table, which contains information about files written to disk. Since we are approaching this data blind, we can make use of the “words” search processing module, which performs a full-text search of the data source for matching words delimited by split characters. In this case, we’ll use it to search for the sha256 value that we are interested in.
[Query 1: Sitrep of the FileCreationEvents data source]
tag=envolvelabs-FileCreationEvents ax
| words "ba8a996a117702b946e07dd12d030956efddc159a5e775c18b1a7fb10df13902"
| limit 10
| table
Because the underlying data source is tabular, Gravwell can intelligently extract enumerated fields from it; this lets us use the “ax” search extraction module to create a well-parsed, easy-to-consume data pipeline.
Since our primary objective at this point is to get a sitrep of what our data looks like in context, we will “limit” our results to just 10 entries and use the “table” search renderer to increase readability.
Of particular interest to us, as analysts, are the “sha256” and “hostname” enumerated fields. We’ll make note of these field names for later usage.
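For anyone following along outside of Gravwell, a rough pandas analogue of Query 1 is sketched below; it assumes the CSV has been downloaded locally and that the column names match the enumerated fields shown above, so treat it as an illustration rather than a drop-in replacement for the Gravwell query.
import pandas as pd

IOC = "ba8a996a117702b946e07dd12d030956efddc159a5e775c18b1a7fb10df13902"

# Assumes FileCreationEvents.csv.gz has been downloaded locally.
file_events = pd.read_csv("FileCreationEvents.csv.gz")

# Rough stand-in for the "words" module: keep rows whose text contains the hash,
# then take the first 10 for a quick sitrep of the data.
mask = file_events.apply(lambda row: row.astype(str).str.contains(IOC, regex=False).any(), axis=1)
print(file_events[mask].head(10))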
Next, we’ll perform our data exploration for the ProcessEvents table, which contains information about parent processes and their corresponding command lines. We will use the same underlying query as before, but change our data source.
[Query 2: Sitrep of the ProcessEvents data source]
tag=envolvelabs-ProcessEvents ax
| words "ba8a996a117702b946e07dd12d030956efddc159a5e775c18b1a7fb10df13902"
| limit 10
| table
In this table, we are interested in the “parent_process_hash” and “hostname” enumerated fields.
Creating an Ephemeral Lookup Table
Now that we have oriented ourselves in our data, we can begin constructing the first part of our compound query, the non-temporal lookup table.
The “@<name>” nomenclature tells Gravwell to create an ephemeral lookup table that can be referenced within the context of the compound query. No underlying data is changed, nor is a new data source created.
The bulk of this query should be quite familiar; we are making use of the “words” search processing module and taking advantage of the “ax” extraction module. The primary differences can be seen in the parameters passed to the “table” search renderer. For this lookup table, we use the “-nt” flag, which puts the table in non-temporal mode; this helps speed up the overall query since we are not relying on timestamps as our vector in the compound query that follows. Since the table is rendered in non-temporal mode, we only need the filename, sha256, path, size, and hostname columns in the resulting lookup table.
[Query 3: The Lookup Table]
@files{
tag=envolvelabs-FileCreationEvents ax
| words "ba8a996a117702b946e07dd12d030956efddc159a5e775c18b1a7fb10df13902"
| table -nt filename sha256 path size hostname
};
dump -r @files
Once we have created our lookup table, we can use the “dump” module to see the underlying content of our new, ephemeral table.
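To make the pattern concrete for anyone working outside of Gravwell, here is a minimal pandas sketch of the same “build an ephemeral table, then inspect it” idea; the column names and the exact-match filter are assumptions that simplify the full-text behavior of “words”.
import pandas as pd

IOC = "ba8a996a117702b946e07dd12d030956efddc159a5e775c18b1a7fb10df13902"

# Assumes FileCreationEvents.csv.gz has been downloaded locally.
file_events = pd.read_csv("FileCreationEvents.csv.gz")

# Analogue of the @files lookup table: filter on the IOC (simplified here to an
# exact match on the assumed sha256 column) and keep only the columns the later
# join will need. Nothing in the source data is modified.
files = file_events.loc[
    file_events["sha256"] == IOC,
    ["filename", "sha256", "path", "size", "hostname"],
]

# Analogue of `dump -r @files`: inspect the ephemeral table's contents.
print(files)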
All Together Now
As part of our previous step, we verified that the @files lookup table was behaving as expected. It's time to use our lookup table as part of a compound query to join our FileCreationEvents and ProcessEvents data sources.
The first part of this query should be familiar and is responsible for setting up our lookup table.
We will instead focus on the second part of our compound query. As before, we are using the “ax” extractor and the “words” module to create our enumerated fields and search for the provided sha256. Following the invocation of these two modules, we call the “lookup” search processing module. We pass it the “-s” option, which requires all extractions to succeed in order for an entry to be rendered, and the “-r” option, which tells the module where the static lookup table is stored: in this case, our previously created “@files” table.
Next comes our vector notation. For our specific use case we want to match on the “parent_process_hash” and “hostname” fields of ProcessEvents with the “sha256” and “hostname” fields of our lookup table (rendered from the FileCreationEvents table). This notation creates a vectorized match of the pair of values we have specified across both data sources. Finally, we specify that we want to extract the “filename” and “path” fields from the lookup table when rendering our results.
[Query 4: The Vectorized Join]
@files{
tag=envolvelabs-FileCreationEvents ax
| words "ba8a996a117702b946e07dd12d030956efddc159a5e775c18b1a7fb10df13902"
| table -nt filename sha256 path size hostname
};
tag=envolvelabs-ProcessEvents ax
| words "ba8a996a117702b946e07dd12d030956efddc159a5e775c18b1a7fb10df13902"
| lookup -s -r @files [parent_process_hash hostname] [sha256 hostname] (filename path)
| table
When we run this compound query, we see that we have effectively created a join on both the relevant hostname and hash. The two fields extracted from our lookup table are rendered alongside the process data, giving us a much more complete final output that contains both the process information and the location from which that process was executed.
We can quickly see in our results that the filenames and paths of execution differ between hosts, which we need to account for if we are writing countermeasures or conducting incident response.
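If it helps to reason about the join in more familiar terms, the sketch below shows an approximate pandas equivalent; the column names, the exact-match filter, and the renamed output columns are assumptions, and the Gravwell query above remains the authoritative version.
import pandas as pd

IOC = "ba8a996a117702b946e07dd12d030956efddc159a5e775c18b1a7fb10df13902"

# Assumes both .csv.gz files have been downloaded locally.
file_events = pd.read_csv("FileCreationEvents.csv.gz")
process_events = pd.read_csv("ProcessEvents.csv.gz")

# The @files lookup table: IOC hits with only the columns we care about. Columns
# are renamed so the hash lines up with ProcessEvents' key and the extracted
# fields cannot clash with existing ProcessEvents column names.
files = (
    file_events.loc[file_events["sha256"] == IOC, ["filename", "sha256", "path", "hostname"]]
    .rename(columns={"sha256": "parent_process_hash",
                     "filename": "staged_filename",
                     "path": "staged_path"})
)

# Vectorized join: match on the (hash, hostname) pair across both data sets,
# pulling the filename and path from the lookup table into each process event.
hits = process_events.merge(files, on=["parent_process_hash", "hostname"], how="inner")

print(hits[["hostname", "parent_process_hash", "staged_filename", "staged_path"]])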

Wrap-Up
Compound queries in Gravwell unlock incredible potential for analysts who need to ask critical questions about their data. In our scenario today, using a compound query allowed us to understand the different locations from which an attacker was staging their malware and gain further insights into the attacker’s TTPs across our systems.
Thank you for joining us on today’s “Down The Well”, and we hope you are able to make the most of your newfound knowledge to elevate your queries. If you would like to try Gravwell for yourself, the Community Edition, which was used in this video, is completely free and available at gravwell.io. To schedule a demo with our sales team, please contact sales@gravwell.io.