January 2020
Intermediate to advanced
640 pages
16h 56m
English
First things first, we need to define the payload that will be shared between the processors for each stage of the pipeline:
type crawlerPayload struct {
	LinkID      uuid.UUID
	URL         string
	RetrievedAt time.Time

	RawContent bytes.Buffer

	// NoFollowLinks are still added to the graph but no outgoing edges
	// will be created from this link to them.
	NoFollowLinks []string

	Links       []string
	Title       string
	TextContent string
}
The first three fields, LinkID, URL, and RetrievedAt, will be populated by the input source. The remaining fields will be populated by the various crawler stages: