1.2. Using Set Semantics with Data

Problem

You would like to work with your collections using set operations: union, intersection, difference (exception), and distinct items.

Solution

Use the Set operators provided as part of the Standard Query Operators to perform those operations.

Distinct:
	IEnumerable<string> whoLoggedIn = 
	    dailySecurityLog.Where(logEntry => logEntry.Contains("logged in")).Distinct( );
Union:
	// Union
	Console.WriteLine("Employees for all projects");
	var allProjectEmployees = project1.Union(project2.Union(project3));
Intersection:
	// Intersect
	Console.WriteLine("Employees on every project");
	var everyProjectEmployees = project1.Intersect(project2.Intersect(project3));
Exception:
	// Except (unionIntersect, built in the Discussion, holds everyone on more than one project)
	Console.WriteLine("Employees on only one project");
	var onlyProjectEmployees = allProjectEmployees.Except(unionIntersect);

Discussion

The Standard Query Operators are the set of methods that represent the LINQ pattern. This set includes operators to perform many different types of operations, such as filtering, projection, sorting, grouping, and many others, including set operations.

The set operations for the Standard Query Operators are:

  • Distinct

  • Union

  • Intersect

  • Except
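As a quick orientation before the longer examples, here is a minimal sketch of all four operators applied to two small integer arrays. The class name SetOperatorSketch, the Show helper, and the sample values are illustrative assumptions, not part of the recipe.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SetOperatorSketch
{
    // Hypothetical sample data, not taken from the recipe.
    public static readonly int[] First = { 1, 2, 2, 3, 4 };
    public static readonly int[] Second = { 3, 4, 4, 5 };

    // Formats a sequence as a comma-separated string for display.
    public static string Show(IEnumerable<int> items)
    {
        return string.Join(", ", items.Select(i => i.ToString()).ToArray());
    }

    public static void Main()
    {
        Console.WriteLine("Distinct:  " + Show(First.Distinct()));         // 1, 2, 3, 4
        Console.WriteLine("Union:     " + Show(First.Union(Second)));      // 1, 2, 3, 4, 5
        Console.WriteLine("Intersect: " + Show(First.Intersect(Second)));  // 3, 4
        Console.WriteLine("Except:    " + Show(First.Except(Second)));     // 1, 2
    }
}
```

Note that Union, Intersect, and Except all return results without duplicates, even when the inputs contain repeated values.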

The Distinct operator extracts all nonduplicate items from the collection or result set being worked with. Say, for example, that we had a set of strings representing login and logout behavior for a terminal services box for today:

	// Distinct
	string[] dailySecurityLog = {
	      "Bob logged in", 
	      "Bob logged out", 
	      "Bob logged in", 
	      "Bill logged in",
	      "Melissa logged in",
	      "Bob logged out",
	      "Bill logged out", 
	      "Bill logged in", 
	      "Tim logged in", 
	      "Scott logged in", 
	      "Scott logged out", 
	      "Dave logged in", 
	      "Tim logged out", 
	      "Bob logged in", 
	      "Dave logged out"};

From that collection, we would like to determine the list of people who logged in to the box today. Since people can log in and log out many times during the course of a day, or remain logged in for the whole day, we need to eliminate the duplicate login entries. Distinct is an extension method defined on the System.Linq.Enumerable class (which implements the Standard Query Operators), so it can be called on the string array (which implements IEnumerable<string>) to get the distinct items in the set. For more information on extension methods, see Recipe 1.4.

The set that Distinct works on is produced by another of the Standard Query Operators: Where. Where takes a lambda expression that specifies the filter criteria and examines each string in the IEnumerable<string> to determine whether it contains "logged in"; the strings that do are selected. Lambda expressions are inline functions (similar to anonymous methods) that can be used in place of a delegate. See Chapter 9 for more on lambda expressions. Distinct then narrows down the selected strings to eliminate duplicate "logged in" records, leaving only one per user:

	    IEnumerable<string> whoLoggedIn =
	        dailySecurityLog.Where(logEntry => logEntry.Contains("logged in")).Distinct();
	    Console.WriteLine("Everyone who logged in today:");
	    foreach (string who in whoLoggedIn)
	    {
	        Console.WriteLine(who); 
	    }
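The query above yields whole log entries such as "Bob logged in". If only the user names are wanted, one possible variation is to project each entry to its name with Select before calling Distinct. The sketch below assumes every login entry has the form "<name> logged in"; the class LoginNamesSketch and the method NamesLoggedIn are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LoginNamesSketch
{
    private const string Suffix = " logged in";

    // Returns each user name once, assuming every login entry has the
    // form "<name> logged in" (an assumption about the log format).
    public static IEnumerable<string> NamesLoggedIn(IEnumerable<string> log)
    {
        return log.Where(entry => entry.EndsWith(Suffix, StringComparison.Ordinal))
                  .Select(entry => entry.Substring(0, entry.Length - Suffix.Length))
                  .Distinct();
    }

    public static void Main()
    {
        // A shortened copy of the recipe's log.
        string[] dailySecurityLog = {
            "Bob logged in", "Bob logged out", "Bob logged in",
            "Bill logged in", "Melissa logged in", "Bill logged out" };

        foreach (string name in NamesLoggedIn(dailySecurityLog))
        {
            Console.WriteLine(name);   // prints Bob, Bill, Melissa, one per line
        }
    }
}
```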

To make things a bit more interesting, for the rest of the operators, we will work with sets of employees on various projects in a company. An Employee is a pretty simple class with a Name and overrides for ToString, Equals, and GetHashCode, as shown here:

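	public class Employee
	{
	    public string Name { get; set; }
	    public override string ToString()
	    {
	        return this.Name;
	    }
	    public override bool Equals(object obj)
	    {
	        // Two Employees are equal when their Names match; comparing only
	        // hash codes could report unrelated objects as equal.
	        Employee other = obj as Employee;
	        return other != null && this.Name == other.Name;
	    }
	    public override int GetHashCode()
	    {
	        // Equal Names must yield equal hash codes; guard against a null Name.
	        return this.Name == null ? 0 : this.Name.GetHashCode();
	    }
	}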

You might wonder why Equals and GetHashCode are overridden for such a simple class. The reason is that when LINQ compares elements in the sets or collections, it uses the default equality comparer, which in turn uses Equals and GetHashCode to determine whether one instance of a reference type is the same as another. If the class does not define Equals and GetHashCode so that two instances holding the same data compare as equal and return the same hash code, the instances will be treated as different, because by default reference types are compared by reference and separate instances generally have different hash codes. We override both methods so that if the Name is the same for two Employees, the hash code and Equals both correctly identify the instances as the same. There are also overloads of the set operators that take a custom comparer, which allow you to make this determination even for classes whose Equals and GetHashCode you cannot change.
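The custom-comparer overloads mentioned above accept an IEqualityComparer<T>. The following is a minimal sketch of that approach, assuming a class whose Equals and GetHashCode cannot be changed. Contractor, ContractorNameComparer, and the team arrays are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical class that we cannot modify: no Equals/GetHashCode overrides.
public class Contractor
{
    public string Name { get; set; }
}

// Hypothetical comparer that treats two Contractors with the same Name as equal.
public class ContractorNameComparer : IEqualityComparer<Contractor>
{
    public bool Equals(Contractor x, Contractor y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return x.Name == y.Name;
    }

    public int GetHashCode(Contractor obj)
    {
        return (obj == null || obj.Name == null) ? 0 : obj.Name.GetHashCode();
    }
}

public static class ComparerSketch
{
    public static void Main()
    {
        Contractor[] teamA = { new Contractor { Name = "Bob" }, new Contractor { Name = "Tim" } };
        Contractor[] teamB = { new Contractor { Name = "Tim" }, new Contractor { Name = "Dave" } };

        // Default comparison is by reference, so both "Tim" objects are kept.
        Console.WriteLine(teamA.Union(teamB).Count());                               // 4

        // With the comparer, the two "Tim" objects count as one.
        Console.WriteLine(teamA.Union(teamB, new ContractorNameComparer()).Count()); // 3
    }
}
```

Distinct, Intersect, and Except have matching overloads that take the comparer as their last argument.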

Having done this, we can now assign Employees to projects like so:

	Employee[] project1 = {
	           new Employee(){ Name = "Bob" },
	           new Employee(){ Name = "Bill" },
	           new Employee(){ Name = "Melissa" },
	           new Employee(){ Name = "Shawn" } };
	Employee[] project2 = {
	           new Employee(){ Name = "Shawn" },
	           new Employee(){ Name = "Tim" },
	           new Employee(){ Name = "Scott" } };
	Employee[] project3 = {
	           new Employee(){ Name = "Bob" },
	           new Employee(){ Name = "Dave" },
	           new Employee(){ Name = "Tim" },
	           new Employee(){ Name = "Shawn" } };

To find every employee assigned to any of the projects, use Union to get all nonduplicate Employees in all three projects and write them out:

	// Union
	Console.WriteLine("Employees for all projects:"); 
	var allProjectEmployees = project1.Union(project2.Union(project3)); 
	foreach (Employee employee in allProjectEmployees) 
	{
	    Console.WriteLine(employee);
	}

We can then use Intersect to get the Employees on every project:

	// Intersect
	Console.WriteLine("Employees on every project:");
	var everyProjectEmployees = project1.Intersect(project2.Intersect(project3));
	foreach (Employee employee in everyProjectEmployees) 
	{
	    Console.WriteLine(employee); 
	}

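Finally, we can use a combination of Intersect, Union, and Except to find Employees that are on only one project. Anyone who appears in the intersection of any two projects is on more than one project, so we take the union of the three pairwise intersections and remove those Employees from the full set: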

	// Except
	var intersect1_3 = project1.Intersect(project3); 
	var intersect1_2 = project1.Intersect(project2);
	var intersect2_3 = project2.Intersect(project3); 
	var unionIntersect = intersect1_2.Union(intersect1_3).Union(intersect2_3);

	Console.WriteLine("Employees on only one project:");
	var onlyProjectEmployees = allProjectEmployees.Except(unionIntersect);
	foreach (Employee employee in onlyProjectEmployees)
	{
	    Console.WriteLine(employee); 
	}
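To make the expected results concrete, the sketch below repeats the three queries using plain strings for the names. Strings already compare by value, so no Equals or GetHashCode overrides are needed here. The class ProjectSetsSketch and its method names are illustrative assumptions; the data matches the project arrays shown earlier.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ProjectSetsSketch
{
    // Same names as the recipe's project arrays, held as plain strings.
    public static readonly string[] Project1 = { "Bob", "Bill", "Melissa", "Shawn" };
    public static readonly string[] Project2 = { "Shawn", "Tim", "Scott" };
    public static readonly string[] Project3 = { "Bob", "Dave", "Tim", "Shawn" };

    public static IEnumerable<string> OnAnyProject()
    {
        return Project1.Union(Project2.Union(Project3));
    }

    public static IEnumerable<string> OnEveryProject()
    {
        return Project1.Intersect(Project2.Intersect(Project3));
    }

    public static IEnumerable<string> OnOnlyOneProject()
    {
        // Anyone in at least one pairwise intersection is on two or more projects.
        var onSeveral = Project1.Intersect(Project2)
                                .Union(Project1.Intersect(Project3))
                                .Union(Project2.Intersect(Project3));
        return OnAnyProject().Except(onSeveral);
    }

    public static string Show(IEnumerable<string> names)
    {
        return string.Join(", ", names.ToArray());
    }

    public static void Main()
    {
        Console.WriteLine(Show(OnAnyProject()));     // Bob, Bill, Melissa, Shawn, Tim, Scott, Dave
        Console.WriteLine(Show(OnEveryProject()));   // Shawn
        Console.WriteLine(Show(OnOnlyOneProject())); // Bill, Melissa, Scott, Dave
    }
}
```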

See Also

The "Standard Query Operators," "Distinct method," "Union method," "Intersect method," and "Except method" topics in the MSDN documentation.
