Class AbstractSearchProcessorFactory

  • All Implemented Interfaces:
    SearchProcessorFactory
    Direct Known Subclasses:
    BitapSearchProcessorFactory, KmpSearchProcessorFactory

    public abstract class AbstractSearchProcessorFactory
    extends java.lang.Object
    implements SearchProcessorFactory
    Base class for precomputed factories that create SearchProcessors.
    Different factories implement different search algorithms with performance characteristics that depend on a use case, so it is advisable to benchmark a concrete use case with different algorithms before choosing one of them.
    A concrete instance of AbstractSearchProcessorFactory is built for searching for a concrete sequence of bytes (the needle), it contains precomputed data needed to perform the search, and is meant to be reused whenever searching for the same needle.
    Note: implementations of SearchProcessor scan the ByteBuf sequentially, one byte after another, without doing any random access. As a result, when using SearchProcessor with such methods as ByteBuf.forEachByte(io.netty.util.ByteProcessor), these methods return the index of the last byte of the found byte sequence within the ByteBuf (which might feel counterintuitive, and different from ByteBufUtil.indexOf(io.netty.buffer.ByteBuf, io.netty.buffer.ByteBuf) which returns the index of the first byte of found sequence).
    A SearchProcessor is implemented as a Finite State Automaton that contains a small internal state which is updated with every byte processed. As a result, an instance of SearchProcessor should not be reused across independent search sessions (eg. for searching in different ByteBufs). A new instance should be created with AbstractSearchProcessorFactory for every search session. However, a SearchProcessor can (and should) be reused within the search session, eg. when searching for all occurrences of the needle within the same haystack. That way, it can also detect overlapping occurrences of the needle (eg. a string "ABABAB" contains two occurrences of "BAB" that overlap by one character "B"). For this to work correctly, after an occurrence of the needle is found ending at index idx, the search should continue starting from the index idx + 1.
    Example (given that the haystack is a ByteBuf containing "ABABAB" and the needle is "BAB"):
         SearchProcessorFactory factory =
             SearchProcessorFactory.newKmpSearchProcessorFactory(needle.getBytes(CharsetUtil.UTF_8));
         SearchProcessor processor = factory.newSearchProcessor();
    
         int idx1 = haystack.forEachByte(processor);
         // idx1 is 3 (index of the last character of the first occurrence of the needle in the haystack)
    
         int continueFrom1 = idx1 + 1;
         // continue the search starting from the next character
    
         int idx2 = haystack.forEachByte(continueFrom1, haystack.readableBytes() - continueFrom1, processor);
         // idx2 is 5 (index of the last character of the second occurrence of the needle in the haystack)
    
         int continueFrom2 = idx2 + 1;
         // continue the search starting from the next character
    
         int idx3 = haystack.forEachByte(continueFrom2, haystack.readableBytes() - continueFrom2, processor);
         // idx3 is -1 (no more occurrences of the needle)
    
         // After this search session is complete, processor should be discarded.
         // To search for the same needle again, reuse the same factory to get a new SearchProcessor.
     
    • Constructor Detail

      • AbstractSearchProcessorFactory

        public AbstractSearchProcessorFactory()
    • Method Detail

      • newKmpSearchProcessorFactory

        public static KmpSearchProcessorFactory newKmpSearchProcessorFactory​(byte[] needle)
        Creates a SearchProcessorFactory based on Knuth-Morris-Pratt string search algorithm. It is a reasonable default choice among the provided algorithms.
        Precomputation (this method) time is linear in the size of input (O(|needle|)).
        The factory allocates and retains an int array of size needle.length + 1, and retains a reference to the needle itself.
        Search (the actual application of SearchProcessor) time is linear in the size of ByteBuf on which the search is performed (O(|haystack|)). Every byte of ByteBuf is processed only once, sequentially.
        Parameters:
        needle - an array of bytes to search for
        Returns:
        a new instance of KmpSearchProcessorFactory precomputed for the given needle
      • newBitapSearchProcessorFactory

        public static BitapSearchProcessorFactory newBitapSearchProcessorFactory​(byte[] needle)
        Creates a SearchProcessorFactory based on Bitap string search algorithm. It is a jump free algorithm that has very stable performance (the contents of the inputs have a minimal effect on it). The limitation is that the needle can be no more than 64 bytes long.
        Precomputation (this method) time is linear in the size of the input (O(|needle|)).
        The factory allocates and retains a long[256] array.
        Search (the actual application of SearchProcessor) time is linear in the size of ByteBuf on which the search is performed (O(|haystack|)). Every byte of ByteBuf is processed only once, sequentially.
        Parameters:
        needle - an array of no more than 64 bytes to search for
        Returns:
        a new instance of BitapSearchProcessorFactory precomputed for the given needle