4 The Most Strings With Few Bad Columns Problem

This chapter deals with a non-deterministic polynomial-time (NP)-hard string selection problem known as the most strings with few bad columns (MSFBC) problem. The problem was originally introduced as a model for a set of DNA sequences from a heterogeneous population consisting of two subgroups: (1) a rather large subset of DNA sequences that are identical apart from mutations at no more than k positions; and (2) a smaller subset of DNA sequences that are outliers. The goal of the MSFBC problem is to identify the outliers. In this chapter, the problem is first modeled by means of integer linear programming (ILP). Second, two variants of a rather simple greedy strategy are outlined. Finally, a large neighborhood search (LNS) approach (see section for a general introduction to LNS) for the MSFBC problem is described. This approach is currently the state-of-the-art technique. The LNS algorithm uses the ILP solver CPLEX as a sub-routine in order to find, at each iteration, the best possible neighbor in a large neighborhood of the current solution. A comprehensive experimental comparison of these techniques shows, first, that LNS generally outperforms both greedy strategies. Second, while LNS is competitive with the stand-alone application of CPLEX on small and medium-sized problem instances, it outperforms CPLEX on larger ones. Note that the content of this chapter is based on [LIZ 16].
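To make the problem statement concrete, the following minimal sketch assumes that a column is "bad" for a subset of equal-length strings when the strings in that subset do not all carry the same character at that position, and that the task is to find a largest subset with at most k bad columns. The function names `bad_columns` and `msfbc_brute_force` are hypothetical and chosen for illustration only. The exhaustive search shown here is exponential in the number of strings and is none of the ILP, greedy or LNS techniques discussed in this chapter; it is only meant to illustrate the objective on a toy instance.

```python
from itertools import combinations


def bad_columns(strings):
    """Count the positions at which the given equal-length strings do not all agree."""
    return sum(1 for column in zip(*strings) if len(set(column)) > 1)


def msfbc_brute_force(strings, k):
    """Return a largest subset of `strings` with at most k bad columns.

    Illustrative exhaustive search (exponential time); assumed problem
    definition, not one of the chapter's algorithms.
    """
    # Try subset sizes from largest to smallest and stop at the first feasible one.
    for size in range(len(strings), 0, -1):
        for subset in combinations(strings, size):
            if bad_columns(subset) <= k:
                return list(subset)
    return []


# Toy instance: three similar sequences and one outlier.
sequences = ["ACGT", "ACGA", "ACTT", "TTTT"]
selected = msfbc_brute_force(sequences, k=2)
outliers = [s for s in sequences if s not in selected]
```

On this toy instance with k = 2, the three similar sequences are selected and "TTTT" is left over as the outlier, which matches the biological motivation given above.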

As already ...

