'Overlap Coefficient' added which is very similar to Jaccard and Sorensen-Dice measures.

Gökhan Ercan 5 years ago
parent
commit
3cea9e2885
3 changed files with 94 additions and 21 deletions
  1. 31 21
      README.md
  2. 28 0
      strsimpy/overlap_coefficient.py
  3. 35 0
      strsimpy/overlap_coefficient_test.py

+ 31 - 21
README.md

@@ -4,27 +4,31 @@
 
 A library implementing different string similarity and distance measures. A dozen algorithms (including Levenshtein edit distance and siblings, Jaro-Winkler, Longest Common Subsequence, cosine similarity etc.) are currently implemented. Check the summary table below for the complete list...
 
-* [Download](#download)
-* [Overview](#overview)
-* [Normalized, metric, similarity and distance](#normalized-metric-similarity-and-distance)
-* [Shingles (n-gram) based similarity and distance](#shingles-n-gram-based-similarity-and-distance)
-* [Levenshtein](#levenshtein)
-* [Normalized Levenshtein](#normalized-levenshtein)
-* [Weighted Levenshtein](#weighted-levenshtein)
-* [Damerau-Levenshtein](#damerau-levenshtein)
-* [Optimal String Alignment](#optimal-string-alignment)
-* [Jaro-Winkler](#jaro-winkler)
-* [Longest Common Subsequence](#longest-common-subsequence)
-* [Metric Longest Common Subsequence](#metric-longest-common-subsequence)
-* [N-Gram](#n-gram)
-* [Shingle (n-gram) based algorithms](#shingle-n-gram-based-algorithms)
-  * [Q-Gram](#shingle-n-gram-based-algorithms)
-  * [Cosine similarity](#shingle-n-gram-based-algorithms)
-  * [Jaccard index](#shingle-n-gram-based-algorithms)
-  * [Sorensen-Dice coefficient](#shingle-n-gram-based-algorithms)
-* [Experimental](#experimental)
-  * [SIFT4](#sift4)
-* [Users](#users)
+- [python-string-similarity](#python-string-similarity)
+  - [Download](#download)
+  - [Overview](#overview)
+  - [Normalized, metric, similarity and distance](#normalized-metric-similarity-and-distance)
+    - [(Normalized) similarity and distance](#normalized-similarity-and-distance)
+    - [Metric distances](#metric-distances)
+  - [Shingles (n-gram) based similarity and distance](#shingles-n-gram-based-similarity-and-distance)
+  - [Levenshtein](#levenshtein)
+  - [Normalized Levenshtein](#normalized-levenshtein)
+  - [Weighted Levenshtein](#weighted-levenshtein)
+  - [Damerau-Levenshtein](#damerau-levenshtein)
+  - [Optimal String Alignment](#optimal-string-alignment)
+  - [Jaro-Winkler](#jaro-winkler)
+  - [Longest Common Subsequence](#longest-common-subsequence)
+  - [Metric Longest Common Subsequence](#metric-longest-common-subsequence)
+  - [N-Gram](#n-gram)
+  - [Shingle (n-gram) based algorithms](#shingle-n-gram-based-algorithms)
+    - [Q-Gram](#q-gram)
+    - [Cosine similarity](#cosine-similarity)
+    - [Jaccard index](#jaccard-index)
+    - [Sorensen-Dice coefficient](#sorensen-dice-coefficient)
+    - [Overlap coefficient (i.e., Szymkiewicz-Simpson)](#overlap-coefficient-ie-szymkiewicz-simpson)
+  - [Experimental](#experimental)
+    - [SIFT4](#sift4)
+  - [Users](#users)
 
 
 ## Download
@@ -55,6 +59,7 @@ The main characteristics of each implemented algorithm are presented below. The
 | [Cosine similarity](#cosine-similarity) 				|similarity<br>distance | Yes  			| No  		| Profile | O(m+n) |  |
 | [Jaccard index](#jaccard-index)				|similarity<br>distance | Yes  			| Yes  		| Set	  | O(m+n) |  |
 | [Sorensen-Dice coefficient](#sorensen-dice-coefficient) 	|similarity<br>distance | Yes 			| No 		| Set	  | O(m+n) |  |
+| [Overlap coefficient](#overlap-coefficient-ie-szymkiewicz-simpson) 	|similarity<br>distance | Yes 			| No 		| Set	  | O(m+n) |  |
 
 [1] In this library, Levenshtein edit distance, LCS distance and their siblings are computed using the **dynamic programming** method, which has a cost O(m.n). For Levenshtein distance, the algorithm is sometimes called **Wagner-Fischer algorithm** ("The string-to-string correction problem", 1974). The original algorithm uses a matrix of size m x n to store the Levenshtein distance between string prefixes.
 
@@ -360,6 +365,11 @@ Similar to Jaccard index, but this time the similarity is computed as 2 * |V1 in
 
 Distance is computed as 1 - similarity.
 
+### Overlap coefficient (i.e., Szymkiewicz-Simpson)
+Very similar to the Jaccard and Sorensen-Dice measures, but this time the similarity is computed as |V1 inter V2| / Min(|V1|,|V2|). Because the denominator is the size of the smaller set, the score is never lower than the Jaccard or Sorensen-Dice score for the same pair of strings. It returns the highest similarity score (1) whenever the shingle set of one string is a subset of the other's.
+
+Distance is computed as 1 - similarity.
+
 ## Experimental
 
 ### SIFT4
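
To make the README paragraph on the overlap coefficient concrete, here is a minimal standalone sketch that evaluates the three set-based formulas on the "eat" / "eating" pair used in the tests. The `shingles` helper is hypothetical and only approximates the library's profile extraction (it takes plain substrings and does no whitespace handling), so treat it as an illustration of the formulas rather than of strsimpy itself.

```python
def shingles(s, k):
    """Return the set of all length-k substrings of s (hypothetical helper)."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def jaccard(a, b):
    # |A inter B| / |A union B|
    return len(a & b) / len(a | b)


def sorensen_dice(a, b):
    # 2 * |A inter B| / (|A| + |B|)
    return 2 * len(a & b) / (len(a) + len(b))


def overlap(a, b):
    # |A inter B| / min(|A|, |B|)
    return len(a & b) / min(len(a), len(b))


v1 = shingles("eat", 3)     # {"eat"}
v2 = shingles("eating", 3)  # {"eat", "ati", "tin", "ing"}

print(jaccard(v1, v2))        # 0.25
print(sorensen_dice(v1, v2))  # 0.4
print(overlap(v1, v2))        # 1.0
```

The only shingle of "eat" also occurs in "eating", so the overlap coefficient reaches 1 while the other two measures are penalized by the larger set.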

+ 28 - 0
strsimpy/overlap_coefficient.py

@@ -0,0 +1,28 @@
+from .shingle_based import ShingleBased
+from .string_distance import NormalizedStringDistance
+from .string_similarity import NormalizedStringSimilarity
+
+
+class OverlapCoefficient(ShingleBased, NormalizedStringDistance, NormalizedStringSimilarity):
+
+    def __init__(self, k=3):
+        super().__init__(k)
+
+    def distance(self, s0, s1):
+        return 1.0 - self.similarity(s0, s1)
+
+    def similarity(self, s0, s1):
+        if s0 is None:
+            raise TypeError("Argument s0 is NoneType.")
+        if s1 is None:
+            raise TypeError("Argument s1 is NoneType.")
+        if s0 == s1:
+            return 1.0
+        profile0, profile1 = self.get_profile(s0), self.get_profile(s1)
+        # The smaller shingle set is the denominator. It is empty when a
+        # string is shorter than k, so return 0.0 instead of dividing by zero.
+        smaller = min(len(profile0), len(profile1))
+        if smaller == 0:
+            return 0.0
+        inter = len(profile0.keys() & profile1.keys())
+        return inter / smaller
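
One edge case worth noting for this formula: when either string is shorter than `k`, its shingle set is empty and Min(|V1|,|V2|) is zero. The sketch below is a standalone version of the same computation, assuming that returning 0.0 is an acceptable convention in that case (the commit does not specify one). The set comprehensions are a hypothetical stand-in for `ShingleBased.get_profile`.

```python
def overlap_similarity(s0, s1, k=3):
    """Overlap coefficient over character k-shingles (standalone sketch)."""
    if s0 is None or s1 is None:
        raise TypeError("Arguments must not be None.")
    if s0 == s1:
        return 1.0
    # Hypothetical stand-in for get_profile: plain substring sets.
    v0 = {s0[i:i + k] for i in range(len(s0) - k + 1)}
    v1 = {s1[i:i + k] for i in range(len(s1) - k + 1)}
    smaller = min(len(v0), len(v1))
    if smaller == 0:
        # Assumed convention: no shingles means no measurable overlap.
        return 0.0
    return len(v0 & v1) / smaller


print(overlap_similarity("eat", "eating"))   # 1.0
print(overlap_similarity("car", "bar", k=2)) # 0.5
print(overlap_similarity("ab", "abc"))       # 0.0, "ab" has no 3-shingles
```

Raising a `ValueError` for inputs shorter than `k` would be an equally defensible choice; the point is only that the empty-set case needs an explicit decision.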

+ 35 - 0
strsimpy/overlap_coefficient_test.py

@@ -0,0 +1,35 @@
+import unittest
+
+from strsimpy.overlap_coefficient import OverlapCoefficient
+
+class TestOverlapCoefficient(unittest.TestCase):
+
+    def test_overlap_coefficient_onestringissubsetofother_return0(self):
+        sim = OverlapCoefficient(3)
+        s1,s2 = "eat","eating"
+        actual = sim.distance(s1,s2)
+        print("distance: {:.4}\t between '{}' and '{}'".format(str(actual), s1,s2))
+        self.assertEqual(0,actual)
+
+    def test_overlap_coefficient_onestringissubset_return1(self):
+        sim = OverlapCoefficient(3)
+        s1,s2 = "eat","eating"
+        actual = sim.similarity(s1,s2)
+        print("strsim: {:.4}\t between '{}' and '{}'".format(str(actual), s1,s2))
+        self.assertEqual(1,actual)
+
+    def test_overlap_coefficient_onestringissubsetofother_return1(self):
+        sim = OverlapCoefficient(3)
+        s1,s2 = "eat","eating"
+        actual = sim.similarity(s1,s2)
+        print("strsim: {:.4}\t between '{}' and '{}'".format(str(actual), s1,s2))
+        self.assertEqual(1,actual)
+
+    def test_overlap_coefficient_halfsimilar_returnhalf(self):
+        sim = OverlapCoefficient(2)
+        s1,s2 = "car","bar"
+        self.assertEqual(1/2,sim.similarity(s1,s2))
+        self.assertEqual(1/2,sim.distance(s1,s2))
+
+if __name__ == "__main__":
+    unittest.main()