Class NGramProfile


  • public class NGramProfile
    extends Object
    This class runs an ngram analysis over submitted text; the results can be used for automatic language identification. The similarity calculation is still experimental. You have been warned. Methods are provided to build new NGramProfile instances.
    Author:
    Sami Siren, Jerome Charron - http://frutch.free.fr/, Hendrik Schreiber
    • Constructor Detail

      • NGramProfile

        public NGramProfile​(String name,
                            int minlen,
                            int maxlen)
        Construct a new ngram profile
        Parameters:
        name - is the name of the profile
        minlen - is the min length of ngram sequences
        maxlen - is the max length of ngram sequences
    • Method Detail

      • getName

        public String getName()
        Returns:
        the name of this profile
      • add

        public void add​(StringBuffer word)
        Add ngrams from a single word to this profile
        Parameters:
        word - is the word to add
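        To illustrate what adding a word can involve, the following minimal sketch counts every character sequence of length minlen to maxlen in a single word. The class NGramExtractionSketch and its extract method are hypothetical and not part of this API; the actual implementation may additionally pad the word with separator characters or change its case, which is not shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, NOT part of NGramProfile: counts character ngrams in one word.
final class NGramExtractionSketch {

    // Returns a map from ngram to occurrence count for all lengths in [minlen, maxlen].
    static Map<String, Integer> extract(CharSequence word, int minlen, int maxlen) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int len = minlen; len <= maxlen; len++) {
            // Slide a window of the current length across the word.
            for (int start = 0; start + len <= word.length(); start++) {
                String ngram = word.subSequence(start, start + len).toString();
                counts.merge(ngram, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // StringBuffer implements CharSequence, matching the parameter type of add.
        System.out.println(extract(new StringBuffer("banana"), 1, 2));
    }
}
```

        Words shorter than minlen simply contribute no ngrams in this sketch.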
      • analyze

        public void analyze​(StringBuilder text)
        Analyze a piece of text
        Parameters:
        text - the text to be analyzed
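        How the text is broken into words is not documented here. As a rough sketch under the assumption that words are runs of letters and that case is ignored, a tokenizer could look like the following; WordSplitSketch and words are hypothetical names, and the real splitting rules may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tokenizer, NOT part of NGramProfile.
final class WordSplitSketch {

    // Splits text into lower-cased words at every non-letter character (an assumption).
    static List<String> words(CharSequence text) {
        List<String> result = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                word.append(Character.toLowerCase(c));
            } else if (word.length() > 0) {
                // A non-letter ends the current word.
                result.add(word.toString());
                word.setLength(0);
            }
        }
        if (word.length() > 0) {
            result.add(word.toString()); // flush the trailing word
        }
        return result;
    }
}
```

        Each resulting word could then be passed, one at a time, to a word-level method such as add.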
      • normalize

        protected void normalize()
        Normalize the profile (calculates the ngrams frequencies)
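        The exact formula is not given here. One plausible reading, shown below purely as an assumption, is that an ngram's frequency is its count divided by the total count of all ngrams of the same length. NormalizeSketch and its normalize method are hypothetical names for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, NOT part of NGramProfile.
final class NormalizeSketch {

    // Assumption: frequency = count / total count of ngrams with the same length.
    static Map<String, Double> normalize(Map<String, Integer> counts) {
        // First pass: total number of occurrences per ngram length.
        Map<Integer, Integer> totals = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            totals.merge(e.getKey().length(), e.getValue(), Integer::sum);
        }
        // Second pass: relative frequency within each length class.
        Map<String, Double> frequencies = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int total = totals.get(e.getKey().length());
            frequencies.put(e.getKey(), e.getValue() / (double) total);
        }
        return frequencies;
    }
}
```

        Normalizing per length keeps short ngrams, which occur far more often, from dominating longer ones.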
      • getSorted

        public List<com.tagtraum.core.lang.NGramProfile.NGramEntry> getSorted()
        Returns a list of ngrams sorted first by frequency, then by sequence
        Returns:
        sorted list of ngrams
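        A two-level ordering like this can be expressed with a chained Comparator. The sketch below uses a hypothetical Entry class as a stand-in for NGramEntry, and it assumes descending frequency with ties broken by ascending sequence; the sort direction is not stated in this documentation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, NOT part of NGramProfile.
final class SortSketch {

    // Minimal stand-in for an ngram entry (the field names are assumptions).
    static final class Entry {
        final String seq;
        final double frequency;

        Entry(String seq, double frequency) {
            this.seq = seq;
            this.frequency = frequency;
        }
    }

    // Returns a sorted copy: highest frequency first, ties ordered by sequence.
    static List<Entry> sorted(List<Entry> entries) {
        List<Entry> copy = new ArrayList<>(entries);
        copy.sort(Comparator.comparingDouble((Entry e) -> e.frequency).reversed()
                .thenComparing(e -> e.seq));
        return copy;
    }
}
```

        Sorting a copy leaves the caller's list untouched, which is a design choice of this sketch rather than a documented property of getSorted.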
      • load

        public void load​(InputStream is)
                  throws IOException
        Loads an ngram profile from an InputStream (assumes UTF-8 encoded content)
        Parameters:
        is - the InputStream to read
        Throws:
        IOException
      • create

        public static NGramProfile create​(String name,
                                          InputStream is,
                                          String encoding)
        Create a new language profile from a (preferably quite large) text file
        Parameters:
        name - is the name of the profile
        is - is the stream to read
        encoding - is the encoding of stream
      • save

        public void save​(OutputStream os)
                  throws IOException
        Writes the NGramProfile content to an OutputStream using UTF-8 encoding
        Parameters:
        os - the Stream to output to
        Throws:
        IOException
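        Because save writes UTF-8 and load assumes UTF-8, both sides should name the charset explicitly instead of relying on the platform default. The sketch below shows such a round trip; the one-entry-per-line "ngram count" layout and the class Utf8RoundTripSketch are assumptions for illustration only, since the real file format is not described here.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, NOT part of NGramProfile; the line format is an assumption.
final class Utf8RoundTripSketch {

    // Writes one "ngram count" line per entry, always encoded as UTF-8.
    static void save(Map<String, Integer> counts, OutputStream os) throws IOException {
        Writer w = new BufferedWriter(new OutputStreamWriter(os, StandardCharsets.UTF_8));
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            w.write(e.getKey() + " " + e.getValue() + "\n");
        }
        w.flush(); // flush only; closing the stream is left to the caller
    }

    // Reads the same format back, decoding the bytes as UTF-8.
    static Map<String, Integer> load(InputStream is) throws IOException {
        Map<String, Integer> counts = new LinkedHashMap<>();
        BufferedReader r = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
        String line;
        while ((line = r.readLine()) != null) {
            int sep = line.lastIndexOf(' ');
            if (sep <= 0) {
                continue; // skip blank or malformed lines
            }
            counts.put(line.substring(0, sep), Integer.parseInt(line.substring(sep + 1)));
        }
        return counts;
    }

    // Saves to memory and loads again, wrapping the checked exception for convenience.
    static Map<String, Integer> roundTrip(Map<String, Integer> counts) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            save(counts, out);
            return load(new ByteArrayInputStream(out.toByteArray()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

        Non-ASCII ngrams survive the round trip only because both directions use the same charset; this simple line format does not handle ngrams that contain line breaks.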
      • main

        public static void main​(String[] args)
        main method used for testing only
        Parameters:
        args - the command-line arguments