Reading and understanding the raw data
Files come in a variety of formats. Even a file that appears to be simple text is often a UTF-8 encoding of Unicode characters. When we're processing data to extract intelligence, we need to look at three tiers of representation:
- Physical Format: We might have a text file encoded in UTF-8, or we might have a gzip file, which is a compressed version of the text file. Beneath these different physical formats, we can find a common structure. In the case of log files, the common structure is a line of text that represents a single event.
- Logical Layout: After we've extracted data from the physical form, we often find that the order of the fields is slightly different or some optional fields are missing. The trick ...
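As a minimal sketch of the physical-format tier, here's one way we might hide the difference between a plain UTF-8 file and a gzip-compressed one so that later processing sees only lines of text. The function name `read_lines` and the choice to detect compression from a `.gz` suffix are assumptions made for illustration; they are not taken from any particular log format.

```python
import gzip
from pathlib import Path
from typing import Iterator


def read_lines(path: Path) -> Iterator[str]:
    """Yield one line of text per event, whatever the physical format.

    Assumption: compressed files are recognized by a ".gz" suffix.
    """
    if path.suffix == ".gz":
        # Text mode ("rt") decompresses and decodes UTF-8 in one step.
        with gzip.open(path, "rt", encoding="utf-8") as source:
            for line in source:
                yield line.rstrip("\n")
    else:
        with path.open("r", encoding="utf-8") as source:
            for line in source:
                yield line.rstrip("\n")
```

Because both branches yield the same kind of object, a string holding one event, any code that works on the logical layout doesn't need to know which physical format was used.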