First things first, we need to define the payload that will be shared between the processors for each stage of the pipeline:
```go
type crawlerPayload struct {
	LinkID      uuid.UUID
	URL         string
	RetrievedAt time.Time
	RawContent  bytes.Buffer

	// NoFollowLinks are still added to the graph but no outgoing edges
	// will be created from this link to them.
	NoFollowLinks []string

	Links       []string
	Title       string
	TextContent string
}
```
The first three fields, LinkID, URL, and RetrievedAt, will be populated by the input source. The remaining fields will be populated by the various crawler stages:
- RawContent is populated by the link fetcher
- NoFollowLinks and Links are populated by the link extractor
- Title and TextContent are populated by the text extractor
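Because the same payload value travels through every stage, it helps to think about how instances are created, copied, and retired. The following is a minimal sketch, assuming the pipeline expects payloads to expose `Clone` and `MarkAsProcessed` methods and that a `sync.Pool` is used to recycle instances; the method names and the `payloadPool` variable are illustrative assumptions, not a fixed contract, and the code reuses the `crawlerPayload` struct defined above:

```go
import "sync"

// payloadPool recycles crawlerPayload instances to cut down on allocations
// while the crawler is running. (Assumed helper; names are illustrative.)
var payloadPool = sync.Pool{
	New: func() interface{} { return new(crawlerPayload) },
}

// Clone returns an independent copy of the payload so that a downstream
// stage can mutate it without affecting the original.
func (p *crawlerPayload) Clone() *crawlerPayload {
	newP := payloadPool.Get().(*crawlerPayload)
	newP.LinkID = p.LinkID
	newP.URL = p.URL
	newP.RetrievedAt = p.RetrievedAt
	newP.NoFollowLinks = append([]string(nil), p.NoFollowLinks...)
	newP.Links = append([]string(nil), p.Links...)
	newP.Title = p.Title
	newP.TextContent = p.TextContent

	// bytes.Buffer values should not be copied by assignment; duplicate the
	// buffered bytes instead, leaving the original buffer intact.
	newP.RawContent.Write(p.RawContent.Bytes())
	return newP
}

// MarkAsProcessed zeroes the payload and returns it to the pool once the
// final stage of the pipeline is done with it.
func (p *crawlerPayload) MarkAsProcessed() {
	p.URL = ""
	p.RawContent.Reset()
	p.NoFollowLinks = p.NoFollowLinks[:0]
	p.Links = p.Links[:0]
	p.Title = ""
	p.TextContent = ""
	payloadPool.Put(p)
}
```

Pooling keeps the crawler from allocating a fresh payload for every link it visits, while a deep `Clone` lets a stage fan the same payload out to multiple downstream processors without them stepping on each other's data.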