Partitioning in Hive is best explained with an example. Suppose a telecom organization generates 1 TB of data every day, and regional managers each query this data for their own state. For every such query, Hive scans the complete dataset in HDFS and then filters the results for that particular state.
The manager runs the same query every day for his own state's analysis, and on a 1 TB dataset the query takes four hours to return a result. For analytics, the same query could also be executed daily on a one-month or six-month dataset; on a month's data it would take ten hours.
If the data is partitioned by state, then when a regional manager runs the same query for his state, only the data of that state is scanned ...
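The difference between the two access patterns can be illustrated with a small sketch. This is a toy in-memory model in Python, not Hive itself: the record values and the function names (`query_full_scan`, `partition_by_state`, `query_partitioned`) are hypothetical and chosen only for illustration. It assumes each partition holds the rows for exactly one state, much as Hive keeps each partition value in its own directory.

```python
from collections import defaultdict

# Hypothetical call records as (state, call_minutes); values are illustrative only.
records = [
    ("CA", 12), ("NY", 7), ("CA", 3), ("TX", 9), ("NY", 5), ("CA", 20),
]


def query_full_scan(rows, state):
    """Unpartitioned case: read every row, then keep only the requested state."""
    scanned = 0
    total = 0
    for row_state, minutes in rows:
        scanned += 1
        if row_state == state:
            total += minutes
    return total, scanned


def partition_by_state(rows):
    """Group rows by state, loosely mimicking one storage location per partition value."""
    partitions = defaultdict(list)
    for row_state, minutes in rows:
        partitions[row_state].append(minutes)
    return partitions


def query_partitioned(partitions, state):
    """Partitioned case: read only the rows stored under the requested state."""
    rows = partitions.get(state, [])
    return sum(rows), len(rows)


partitions = partition_by_state(records)

# Both queries return the same total, but they touch a different number of rows.
full_total, full_scanned = query_full_scan(records, "CA")
part_total, part_scanned = query_partitioned(partitions, "CA")
print("full scan:  ", full_total, "minutes,", full_scanned, "rows read")
print("partitioned:", part_total, "minutes,", part_scanned, "rows read")
```

Both functions return the same answer for a state; the only difference is how many rows have to be read to produce it, which is the cost that partitioning is meant to reduce.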