Defining the payload for the crawler

First things first, we need to define the payload that will be shared between the processors for each stage of the pipeline:

The first three fields, LinkID, URL, and RetrievedAt, will be populated by the input source. The remaining fields will be populated by the various crawler stages:

  • RawContent is populated by the link fetcher
  • NoFollowLinks and Links are populated by the link extractor
  • Title and TextContent ...
