Calculating entropy of a string


236

We're calculating entropy of a string in a few places on Stack Overflow as a signifier of low quality.

I whipped up this simple method which counts unique characters in a string, but it is quite literally the first thing that popped into my head. It's the "dumbest thing that works".

/// <summary>
/// returns the # of unique characters in a string as a rough 
/// measurement of entropy
/// </summary>
public static int Entropy(this string s)
{
  var d = new Dictionary<char, bool>();
  foreach (char c in s)
      if (!d.ContainsKey(c)) d.Add(c, true);
  return d.Count;
}

Is there a better / more elegant / more accurate way to calculate the entropy of a string?

Efficiency is also good, though it's not a huge concern since we'll never call this on large strings.

98
public static int Entropy(this string s)
{
    HashSet<char> chars = new HashSet<char>(s);
    return chars.Count;
}

223

Won't this work too?

string name = "lltt";
int uniqueCharacterCount = name.Distinct().Count();

will return 2


12

You can probably expand this to something like bi-grams and tri-grams to catch things like "sdsdsdsdsdsdsdsdsdsd" (although yours would catch this as well). Would a Bayesian approach, like the ones spam filters use, be appropriate for what you're trying to achieve?
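The bi-gram idea is easy to sketch. Here is a rough Python illustration (not C#, and `ngram_uniqueness` is a made-up name for this comment): count how many distinct n-grams the string has, relative to all n-grams in it.

```python
def ngram_uniqueness(s: str, n: int = 2) -> float:
    """Fraction of distinct n-grams among all n-grams in the string."""
    grams = [s[i:i + n] for i in range(len(s) - n + 1)]
    if not grams:
        return 0.0
    return len(set(grams)) / len(grams)

# "sdsdsdsdsdsdsdsdsdsd" contains only the bi-grams "sd" and "ds",
# so it scores very low even though it has 2 unique characters.
```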


18
  1. I don't understand the point of the bool. You never appear to set it to false, so we can use a List<T> instead.
  2. Given that you want just unique items, we can just use HashSet<T> instead.

Haven't tested, but this method should be equivalent and faster:

/// <summary>
/// returns the # of unique characters in a string as a rough 
/// measurement of entropy
/// </summary>
public static int Entropy(this string s)
{
    var hs = new HashSet<char>();
    foreach (char c in s)
        hs.Add(c);
    return hs.Count;
}

11

I'm going to assume this is English (seeing as that's all we do). Wouldn't it be better to keep a HashSet<string> of stop words (the most common words in English that don't convey meaning), tokenize the string into words, and count the number of words that aren't stop words?
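A minimal sketch of that idea in Python (the stop-word set here is a tiny illustrative subset, not a real list, and `content_word_count` is a name invented for this comment):

```python
# A tiny illustrative stop-word set; a real list would be much larger.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "it", "that"}

def content_word_count(text: str) -> int:
    """Tokenize on whitespace and count the words that are not stop words."""
    words = text.lower().split()
    return sum(1 for w in words if w.strip('.,!?";:') not in STOP_WORDS)
```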


77

In theory you can measure entropy only from the point of view of a given model. For instance, the digits of pi are well distributed, but is the entropy actually high? Not at all, since the infinite sequence can be compressed into a small program that calculates all the digits.

I'll not dig further into the math side, since I'm not an expert in the field. But I want to suggest a few things that can make a very simple but practical model.

Short strings

To start, distribution. Comparing identical characters is in some sense a special case of this; the generalization is to build a frequency table and check the distribution.

Given a string of length N, how many 'A' chars should I expect on average, given my model (which can be the English distribution, or a natural distribution)?

But then what about "abcdefg"? No repetition here, but this is not random at all. So what you want is to also take the first derivative, and check the distribution of that first derivative.

It is as trivial as subtracting the second char from the first, the third from the second, and so on; in our example the string "abcdefg" turns into: 1,1,1,1,1,1

Now what about "ababab"...? This will appear to have a better distribution, since the derivative is 1,-1,1,-1,... so what you actually want here is to take the absolute value.
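The "first derivative" and its absolute value can be sketched in a line of Python (a quick illustration; `abs_first_derivative` is a name made up for this comment):

```python
def abs_first_derivative(s: str) -> list[int]:
    """Absolute differences between consecutive character codes."""
    return [abs(ord(b) - ord(a)) for a, b in zip(s, s[1:])]

# "abcdefg" -> [1, 1, 1, 1, 1, 1]   regular steps: suspiciously non-random
# "ababab"  -> [1, 1, 1, 1, 1]      the absolute value exposes the repetition
```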

Long strings

If the string is long enough, the no-brainer approach is: try to compress it, and calculate the ratio between the compression output and the input.
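For the long-string case, a rough sketch using zlib (any general-purpose compressor would do; the ratio is only meaningful on strings long enough to amortize the header overhead):

```python
import zlib

def compression_ratio(s: str) -> float:
    """Compressed size over original size; lower means more redundancy (lower entropy)."""
    data = s.encode("utf-8")
    return len(zlib.compress(data)) / len(data)
```

A highly repetitive string such as `"ab" * 500` compresses to a small fraction of its original size, while random-looking text stays close to 1.0.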


15

Why not divide the number of unique characters in the given string by the total number of characters in that string? That would give a more accurate measure of entropy.

For example, going by your formula, an entropy of 3 for a string of 5 characters should be fine, but an entropy of 3 for a string of 8 characters is poor. Your formula, however, can't differentiate between those two results, whereas the ratio above would.
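That ratio is a one-liner; a quick Python sketch of the comparison described above (`unique_char_ratio` is a name invented here):

```python
def unique_char_ratio(s: str) -> float:
    """Unique characters divided by total length, in (0, 1]."""
    return len(set(s)) / len(s)

# 3 unique characters out of 5 vs. 3 unique characters out of 8:
# unique_char_ratio("abcab")    -> 0.6
# unique_char_ratio("abcabcab") -> 0.375
```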


13

I think antirez is right in suggesting that an entropy approach needs a model. So assuming we're talking English, then examining the string's character distribution and how closely it aligns with "average" will likely show that the text is mostly English. But is this what you want to achieve? Presumably many things are code or pseudo-code. Compression is a great idea, but this'll give the highest entropy for random text - is high entropy bad? Low entropy would indicate lots of repetition, maybe verbosity, but one can write long drawn-out sentences with frilly words and transmit little information (e.g. this comment).


8

I would try to count each character and verify that it roughly matches the normal frequency of English letters. It could be more precise (on sufficiently large inputs) than counting the number of letters.

If you sort letters by their number of appearances, you should, statistically speaking, get something like ETAONRISHDLFCMUGYPWBVKXJQZ. You could use the edit distance between this string and your string's letters sorted by frequency of appearance to give a rough measurement of entropy.

Also, you could possibly catch non-English posts that way. (If you do go this way, I recommend you exclude code fragments from the count...)
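That edit-distance idea can be sketched as follows (a plain textbook Levenshtein implementation; the expected ordering is the one quoted above, and the function names are made up for this comment):

```python
from collections import Counter

EXPECTED_ORDER = "ETAONRISHDLFCMUGYPWBVKXJQZ"

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def frequency_order_distance(text: str) -> int:
    """Distance between the text's letters sorted by frequency and the expected order."""
    counts = Counter(c for c in text.upper() if c.isalpha())
    observed = "".join(sorted(counts, key=lambda c: -counts[c]))
    return levenshtein(observed, EXPECTED_ORDER)
```

A string whose letter frequencies already match English order scores 0; the more the distribution deviates, the larger the distance.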


27

How about actually computing entropy? Also, it's not clear that character-level entropy will help, but here goes. It's in my mother tongue C++, but surely you can convert this to Java using Array instead of std::vector.

#include <cmath>
#include <vector>

float CharacterEntropy(const char *str) {
  std::vector<unsigned> counts(256);
  for (const char *i = str; *i; ++i)
    ++counts[static_cast<unsigned char>(*i)];
  unsigned int total = 0;
  for (unsigned i = 0; i < 256; ++i)
    total += counts[i];
  float total_float = static_cast<float>(total);
  float ret = 0.0f;
  for (unsigned i = 0; i < 256; ++i) {
    if (counts[i] == 0)
      continue;  // treat 0 * log(0) as 0; logf(0) would yield -inf
    float p = static_cast<float>(counts[i]) / total_float;
    ret -= p * logf(p);
  }
  return ret / logf(2.0f);  // convert from nats to bits
}

86

I also came up with this, based on Shannon entropy.

In information theory, entropy is a measure of the uncertainty associated with a random variable. In this context, the term usually refers to the Shannon entropy, which quantifies the expected value of the information contained in a message, usually in units such as bits.

It is a more "formal" calculation of entropy than simply counting letters:

/// <summary>
/// returns bits of entropy represented in a given string, per 
/// http://en.wikipedia.org/wiki/Entropy_(information_theory) 
/// </summary>
public static double ShannonEntropy(string s)
{
    var map = new Dictionary<char, int>();
    foreach (char c in s)
    {
        if (!map.ContainsKey(c))
            map.Add(c, 1);
        else
            map[c] += 1;
    }

    double result = 0.0;
    int len = s.Length;
    foreach (var item in map)
    {
        var frequency = (double)item.Value / len;
        result -= frequency * (Math.Log(frequency) / Math.Log(2));
    }

    return result;
}

Results are:

"abcdefghijklmnop" = 4.00
"Hello, World!" = 3.18
"hello world" = 2.85
"123123123123" = 1.58
"aaaa" = 0
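The same formula is only a couple of lines in Python, which makes it easy to cross-check the numbers above (a quick sketch, not the C# above):

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Bits of entropy of the character distribution of s."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# e.g. shannon_entropy("abcdefghijklmnop") gives 4.0, matching the table above
```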

25

Similar to zngu's answer, I think better than just counting the number of characters would be calculating the character-entropy of the message:

public double CalculateEntropy(string entropyString)
{
    Dictionary<char, int> characterCounts = new Dictionary<char, int>();
    foreach(char c in entropyString.ToLower())
    {
        if(c == ' ') continue;
        int currentCount;
        characterCounts.TryGetValue(c, out currentCount);
        characterCounts[c] = currentCount + 1;
    }

    IEnumerable<double> characterEntropies = 
        from c in characterCounts.Keys
        let frequency = (double)characterCounts[c]/entropyString.Length
        select -1*frequency*Math.Log(frequency);

    return characterEntropies.Sum();
}

It seems to work well with both code and text, but note that it is not calculating the actual entropy of the string, only the entropy of the character-distribution; sorting the characters within the string should reduce the entropy of the string, but it does not reduce the result of this function.

Here are some tests:

private void CalculateEntropyTest(object sender, EventArgs e)
{
    string[] testStrings = {
        "Hello world!",
        "This is a typical english sentence containing all the letters of the english language - The quick brown fox jumped over the lazy dogs",
        String.Join("", "This is a typical english sentence containing all the letters of the english language - The quick brown fox jumped over the lazy dogs".ToCharArray().OrderBy(o => o).Select(o => o.ToString()).ToArray()),
        "Won't this work too?\nstring name = \"lltt\";\nint uniqueCharacterCount = name.Distinct().Count();\nwill return 2",
        "Pull the entropy finding source from any compression algotithm, i.e. Huffman",
        "float CharacterEntropy(const char *str) {\n  std::vector<unsigned> counts(256);\n  for (const char *i = str; *i; ++i)\n    ++counts[static_cast<unsigned char>(*i)];\n  unsigned int total = 0;\n  for (unsigned i = 0; i < 256; ++i)\n    total += counts[i];\n  float total_float = static_cast<float>(total);\n  float ret = 0.0;\n  for (unsigned i = 0; i < 256; ++i) {\n    float p = static_cast<float>(counts[i]) / total_float;\n    ret -= p * logf(p);\n  }\n  return p * M_LN2;\n}",
        "~~~~~~No.~~~~~~",
        "asdasdasdasdasdasd",
        "abcdefghijklmnopqrstuvwxyz",
        "Fuuuuuuu-------",                
    };
    foreach(string str in testStrings)
    {
        Console.WriteLine("{0}\nEntropy: {1:0.000}\n", str, CalculateEntropy(str));
    }
}

Results:
Hello world!
Entropy: 1.888

This is a typical english sentence containing all the letters of the english language - The quick brown fox jumped over the lazy dogs
Entropy: 2.593

-TTaaaaaaabccccddeeeeeeeeeeeeeeffgggggghhhhhhhiiiiiiiijk lllllllmnnnnnnnnnooooooppqrrrsssssssttttttttuuuvwxyyz
Entropy: 2.593

Won't this work too? string name = "lltt"; int uniqueCharacterCount = name.Distinct().Count(); will return 2
Entropy: 2.838

Pull the entropy finding source from any compression algotithm, i.e. Huffman
Entropy: 2.641

float CharacterEntropy(const char *str) { std::vector counts(256); for (const char *i = str; *i; ++i) ++counts[static_cast(*i)]; unsigned int total = 0; for (unsigned i = 0; i < 256; ++i) total += counts[i]; float total_float = static_cast(total); float ret = 0.0; for (unsigned i = 0; i < 256; ++i) { float p = static_cast(counts[i]) / total_float; ret -= p * logf(p); } return p * M_LN2; }
Entropy: 2.866

~~~~~~No.~~~~~~
Entropy: 0.720

asdasdasdasdasdasd
Entropy: 1.099

abcdefghijklmnopqrstuvwxyz
Entropy: 3.258

Fuuuuuuu-------
Entropy: 0.892


Actually, I think it would be better to do some frequency analysis, but I don't know anything about the frequencies of symbols used in code. The best place to determine that would be the stackoverflow data-dump - I'll have to get back to you after it finishes downloading, in 2 years.


14

I just whipped this algorithm together, so I have no idea how good this is. I fear that it will cause an overflow exception if used on very long strings.

Key concepts of this algorithm:

  • When encountering a character for the first time, the maximum value is added to the un-normalized entropy total. The "maximum value" is the length of the string.
  • If a character is encountered again, then we count the number of positions between this occurrence and the last occurrence, then we subtract the total number of times this character has appeared in the string. We then add that value to the un-normalized entropy total.
  • Once the final un-normalized entropy total has been calculated, it is divided by the length of the string in order to "normalize" it.

    public static int Entropy(this string s)
    {
        int entropy = 0;
    
        var mapOfIndexByChar = new Dictionary<char, CharEntropyInfo>();
    
        int index = 0;
        foreach (char c in s)
        {
            CharEntropyInfo charEntropyInfo;
            if (mapOfIndexByChar.TryGetValue(c, out charEntropyInfo))
            {
                // If this character has occurred previously, then only add the number of characters from
                // the last occurrence to this occurrence, and subtract the number of previous occurrences.
                // Many repeated characters can actually result in the entropy total being negative.
                entropy += ((index - charEntropyInfo.LastIndex) - charEntropyInfo.Occurrences);
    
                // update the last index and number of occurrences of this character
                mapOfIndexByChar[c] = new CharEntropyInfo(index, charEntropyInfo.Occurrences + 1);
            }
            else
            {
                // each newly found character adds the maximum possible value to the entropy total
                entropy += s.Length;
    
                // record the first index of this character
                mapOfIndexByChar.Add(c, new CharEntropyInfo(index, 1));
            }
    
            // advance the position so that LastIndex actually tracks occurrence positions
            index++;
        }
    
        // divide the entropy total by the length of the string to "normalize" the result
        return entropy / s.Length;
    }
    
    struct CharEntropyInfo
    {
        int _LastIndex;
        int _Occurrences;
    
        public int LastIndex
        {
            get { return _LastIndex; }
        }
        public int Occurrences
        {
            get { return _Occurrences; }
        }
    
        public CharEntropyInfo(int lastIndex, int occurrences)
        {
            _LastIndex = lastIndex;
            _Occurrences = occurrences;
        }
    }
    

A quick test:

        var inputs = new[]{
            "Hi there!",
            "Hi there, bob!",
            "ababababababababababababab",
            @"We're calculating entropy of a string a few places in Stack Overflow as a signifier of low quality.

I whipped up this simple method which counts unique characters in a string, but it is quite literally the first thing that popped into my head. It's the ""dumbest thing that works""."
        };

        foreach (string s in inputs)
        {
            System.Console.WriteLine("{1}: \"{0}\"", s, s.Entropy());
        }

Resulting entropy values:

  • 7: "Hi there!"
  • 10: "Hi there, bob!"
  • -4: "ababababababababababababab"
  • 25: "We're calculating entropy of a string ..."

6

I have seen many answers that suggest counting the number of distinct characters. But beware that this only works for 16-bit characters!

A character in C# is a UTF-16 code unit. Extended Unicode characters are stored in multiple C# characters. CharUnicodeInfo.GetUnicodeCategory allows us to detect whether a C# character represents a real character or whether it is part of an extended Unicode character or combined character (UnicodeCategory.Surrogate).

Test (fake) entropy:

 public static void Main()
 {
     var value = "\U00020B20";

     // yields 2, even though \U00020B20 represents a single unicode-character '𠬠'
     var entropyTest = value.Distinct().Count(); 
 }
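The same pitfall can be checked outside C# as well; for instance in Python, where strings index by code point while UTF-16 storage still needs two code units for this character:

```python
s = "\U00020B20"  # a single supplementary-plane character

code_points = len(s)                           # 1 code point in Python
utf16_units = len(s.encode("utf-16-le")) // 2  # 2 UTF-16 code units (a surrogate pair)
```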

In order to count the characters (not the C# characters), we need to augment our algorithm. I am using a class called Grapheme to do the trick. This class is able to detect extended unicode characters and diacritics.

Test entropy:

 public static void Main()
 { 
     var grapheme = Grapheme.Parse("\U00020B20");

     // yields 1, as \U00020B20 represents a single unicode-character '𠬠'.
     var entropyTest = grapheme.Select(x => x.Glyph).Distinct().Count();

     // yields 2, as \U00020B20 is stored in 2 C# characters.
     var codeUnits = grapheme.Single().CodeUnits.Length;
 }

Final note:

Testing entropy on a string is not context-free. Some characters or combined characters result in the same glyph, depending on the Font used. So entropy can only be calculated in the context of a font. The Grapheme class does not take this into account, since different fonts would render different entropies. The Grapheme class is said to be context-free.

  • (A) two different characters might have the exact same glyph (homoglyph)
  • (B) combined characters might have the same glyph as another character

Examples:

  • A: \u0061 and \u0430 represent both the letter 'a' in certain fonts
  • B: Å is both the single character \u00C5 and the character A combined with a ring above (\u0041 followed by the combining mark \u030A)

Appendix: Grapheme

public class Grapheme
{
    private char[] _codeUnits;
    private Grapheme[] _diacritics;
    private string _glyph;

    public Grapheme(string glyph) {

        Guard.NotNull(glyph, "glyph");
        _glyph = StringInfo.GetNextTextElement(glyph);
        Guard.Condition(_glyph.Length != glyph.Length, "glyph", "Invalid glyph specified");

        var codeUnits = new List<char>();
        var diacritics = new List<Grapheme>();
        var buffer = _glyph;

        if (buffer.Length > 0) {
            var cu0 = CharUnicodeInfo.GetUnicodeCategory(buffer[0]);
            switch (cu0) {
                case UnicodeCategory.Surrogate:
                    codeUnits.AddRange(buffer.Take(2));
                    buffer = buffer.Substring(2);
                    break;
                default:
                    codeUnits.Add(buffer[0]);
                    buffer = buffer.Substring(1);
                    break;
            }
            diacritics.AddRange(Parse(buffer));
        }

        _codeUnits = codeUnits.ToArray();
        _diacritics = diacritics.ToArray();

        if (_codeUnits.Length == 2) {
            Guard.Condition(!char.IsSurrogatePair(new string(_codeUnits), 0),
                "glyph", "Invalid surrogate pair specified");
        }
    }

    public static Grapheme[] Parse(string value) {
        Guard.NotNull(value, "value");
        return StringInfo.ParseCombiningCharacters(value).Select(i 
                        => new Grapheme(StringInfo.GetNextTextElement(value, i))).ToArray();
    }

    public static int[] ParseIndices(string value) {
        Guard.NotNull(value, "value");
        return StringInfo.ParseCombiningCharacters(value).ToArray();
    }

    public static Grapheme ParseNext(string value, int index) {
        return new Grapheme(StringInfo.GetNextTextElement(value, index));
    }

    public static Grapheme ParseNext(string value) {
        return ParseNext(value, 0);
    }

    public char[] CodeUnits { 
        get { 
            return _codeUnits; 
        }
    }

    public Grapheme[] Diacritics {
        get { 
            return _diacritics; 
        }
    }

    public string Glyph {
        get { 
            return _glyph;
        }
    }

    public Grapheme[] Flatten() {
        return new[] { this }.Concat(_diacritics.SelectMany(x => x.Flatten())).ToArray();
    }

    public Grapheme Normalize() {
        return new Grapheme(_glyph.Normalize());
    }

    public Grapheme Normalize(NormalizationForm form) {
        return new Grapheme(_glyph.Normalize(form));
    }

    public override bool Equals(object obj) {
        if (obj is Grapheme) {
            return string.Equals(((Grapheme)obj)._glyph, _glyph);
        }
        return false;
    }

    public override int GetHashCode() {
        return _glyph.GetHashCode();
    }

    public override string ToString() {
        return _glyph;
    }
}