We chose this dataset to keep ingestion simple: Apache access logs require very little Logstash configuration to parse. The basic structure of our Logstash configuration is as follows:
- A file input to consume the two files provided in the dataset: one for the month of July and one for the month of August
- Two filters: a grok filter to parse each log line, and a date filter to convert the textual timestamp into a properly formatted timestamp
- One output that tells Logstash to send the data to our Elasticsearch cluster
Here is the complete Logstash pipeline configuration:
input {
  file {
    id => "nasa_file"
    path => "/Users/baha/Downloads/data/*.log"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    ...
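Since the snippet above is truncated, here is a sketch of what the filter and output sections described earlier could look like. The `COMMONAPACHELOG` grok pattern and the `dd/MMM/yyyy:HH:mm:ss Z` date format correspond to the Apache common log format used by the NASA dataset; the Elasticsearch host and index name below are illustrative assumptions, not values from the original configuration:

```
filter {
  grok {
    # Parse each Apache common-log line into structured fields
    # (clientip, verb, request, response, bytes, timestamp, ...)
    match => { "message" => "%{COMMONAPACHELOG}" }
  }
  date {
    # Convert the extracted timestamp (e.g. 01/Jul/1995:00:00:01 -0400)
    # into the @timestamp field Elasticsearch uses for time-based queries
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]  # illustrative host
    index => "nasa-logs"         # illustrative index name
  }
}
```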