6 years ago · 3cea9e2885
--- a/README.md
+++ b/README.md
@@ -4,27 +4,31 @@
 
				 
			
 
 A library implementing different string similarity and distance measures. A dozen algorithms (including Levenshtein edit distance and siblings, Jaro-Winkler, Longest Common Subsequence, cosine similarity etc.) are currently implemented. Check the summary table below for the complete list...
 
-* [Download](#download)
-* [Overview](#overview)
-* [Normalized, metric, similarity and distance](#normalized-metric-similarity-and-distance)
-* [Shingles (n-gram) based similarity and distance](#shingles-n-gram-based-similarity-and-distance)
-* [Levenshtein](#levenshtein)
-* [Normalized Levenshtein](#normalized-levenshtein)
-* [Weighted Levenshtein](#weighted-levenshtein)
-* [Damerau-Levenshtein](#damerau-levenshtein)
-* [Optimal String Alignment](#optimal-string-alignment)
-* [Jaro-Winkler](#jaro-winkler)
-* [Longest Common Subsequence](#longest-common-subsequence)
-* [Metric Longest Common Subsequence](#metric-longest-common-subsequence)
-* [N-Gram](#n-gram)
-* [Shingle (n-gram) based algorithms](#shingle-n-gram-based-algorithms)
-  * [Q-Gram](#shingle-n-gram-based-algorithms)
-  * [Cosine similarity](#shingle-n-gram-based-algorithms)
-  * [Jaccard index](#shingle-n-gram-based-algorithms)
-  * [Sorensen-Dice coefficient](#shingle-n-gram-based-algorithms)
-* [Experimental](#experimental)
-  * [SIFT4](#sift4)
-* [Users](#users)
+- [python-string-similarity](#python-string-similarity)
+  - [Download](#download)
+  - [Overview](#overview)
+  - [Normalized, metric, similarity and distance](#normalized-metric-similarity-and-distance)
+    - [(Normalized) similarity and distance](#normalized-similarity-and-distance)
+    - [Metric distances](#metric-distances)
+  - [Shingles (n-gram) based similarity and distance](#shingles-n-gram-based-similarity-and-distance)
+  - [Levenshtein](#levenshtein)
+  - [Normalized Levenshtein](#normalized-levenshtein)
+  - [Weighted Levenshtein](#weighted-levenshtein)
+  - [Damerau-Levenshtein](#damerau-levenshtein)
+  - [Optimal String Alignment](#optimal-string-alignment)
+  - [Jaro-Winkler](#jaro-winkler)
+  - [Longest Common Subsequence](#longest-common-subsequence)
+  - [Metric Longest Common Subsequence](#metric-longest-common-subsequence)
+  - [N-Gram](#n-gram)
+  - [Shingle (n-gram) based algorithms](#shingle-n-gram-based-algorithms)
+    - [Q-Gram](#q-gram)
+    - [Cosine similarity](#cosine-similarity)
+    - [Jaccard index](#jaccard-index)
+    - [Sorensen-Dice coefficient](#sorensen-dice-coefficient)
+    - [Overlap coefficient (i.e., Szymkiewicz-Simpson)](#overlap-coefficient-ie-szymkiewicz-simpson)
+  - [Experimental](#experimental)
+    - [SIFT4](#sift4)
+  - [Users](#users)
 
 
 ## Download
@@ -55,6 +59,7 @@ The main characteristics of each implemented algorithm are presented below. The
 
 | [Cosine similarity](#cosine-similarity) 				|similarity<br>distance | Yes  			| No  		| Profile | O(m+n) |  |
 | [Jaccard index](#jaccard-index)				|similarity<br>distance | Yes  			| Yes  		| Set	  | O(m+n) |  |
 | [Sorensen-Dice coefficient](#sorensen-dice-coefficient) 	|similarity<br>distance | Yes 			| No 		| Set	  | O(m+n) |  |
+| [Overlap coefficient](#overlap-coefficient-ie-szymkiewicz-simpson) 	|similarity<br>distance | Yes 			| No 		| Set	  | O(m+n) |  |
 
 [1] In this library, Levenshtein edit distance, LCS distance and their siblings are computed using the **dynamic programming** method, which has a cost of O(m.n). For Levenshtein distance, the algorithm is sometimes called the **Wagner-Fischer algorithm** ("The string-to-string correction problem", 1974). The original algorithm uses a matrix of size m x n to store the Levenshtein distance between string prefixes.
 
@@ -360,6 +365,11 @@ Similar to Jaccard index, but this time the similarity is computed as 2 * |V1 in
 
				 
			
 
 Distance is computed as 1 - similarity.
 
+### Overlap coefficient (i.e., Szymkiewicz-Simpson)
+Very similar to the Jaccard and Sorensen-Dice measures, but here the similarity is computed as |V1 inter V2| / Min(|V1|,|V2|). It tends to yield higher similarity scores than the other overlap-based coefficients, and it always returns the highest similarity score (1) if the shingle set of one string is a subset of the other's.
+
+Distance is computed as 1 - similarity.
+
 ## Experimental
 
 ### SIFT4
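
Review note: the README text added above compares the overlap coefficient with the Jaccard and Sorensen-Dice measures. The following is a minimal standalone sketch of that comparison on plain Python sets. The helper `shingles` and the three function names are hypothetical and do not come from the library; unlike the library's profile code, the helper does no whitespace normalization.

```python
def shingles(s, k=3):
    """Return the set of k-character shingles of s (hypothetical helper)."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def jaccard(a, b):
    # |A inter B| / |A union B|
    return len(a & b) / len(a | b)


def sorensen_dice(a, b):
    # 2 * |A inter B| / (|A| + |B|)
    return 2 * len(a & b) / (len(a) + len(b))


def overlap(a, b):
    # |A inter B| / min(|A|, |B|)
    return len(a & b) / min(len(a), len(b))


v1, v2 = shingles("eat"), shingles("eating")
# v1 == {"eat"}; v2 == {"eat", "ati", "tin", "ing"}, so v1 is a subset of v2.
print(jaccard(v1, v2))        # 0.25
print(sorensen_dice(v1, v2))  # 0.4
print(overlap(v1, v2))        # 1.0
```

Because the denominator is the size of the smaller set, the overlap score here is at least as large as the other two, and it reaches 1 whenever one shingle set is contained in the other.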
			
--- a/strsimpy/overlap_coefficient.py
+++ b/strsimpy/overlap_coefficient.py
@@ -0,0 +1,28 @@
 
+from .shingle_based import ShingleBased
+from .string_distance import NormalizedStringDistance
+from .string_similarity import NormalizedStringSimilarity
+
+
+class OverlapCoefficient(ShingleBased, NormalizedStringDistance, NormalizedStringSimilarity):
+
+    def __init__(self, k=3):
+        super().__init__(k)
+
+    def distance(self, s0, s1):
+        return 1.0 - self.similarity(s0, s1)
+
+    def similarity(self, s0, s1):
+        if s0 is None:
+            raise TypeError("Argument s0 is NoneType.")
+        if s1 is None:
+            raise TypeError("Argument s1 is NoneType.")
+        if s0 == s1:
+            return 1.0
+        union = set()
+        profile0, profile1 = self.get_profile(s0), self.get_profile(s1)
+        for k in profile0.keys():
+            union.add(k)
+        for k in profile1.keys():
+            union.add(k)
+        inter = int(len(profile0.keys()) + len(profile1.keys()) - len(union))
+        return inter / min(len(profile0),len(profile1))
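
Review note: `similarity` derives the intersection size by inclusion-exclusion (|A| + |B| - |A union B|), which equals the size of a direct set intersection. It also divides by the size of the smaller profile. `get_profile` is not part of this diff, but if it returns an empty profile for a string shorter than `k`, that division would presumably raise `ZeroDivisionError`. Below is a minimal sketch of a guarded variant on plain sets. The function name `overlap_similarity` is hypothetical, and returning 0.0 for an empty set is an assumption of this sketch, not behaviour defined by the commit.

```python
def overlap_similarity(shingles0, shingles1):
    """Overlap coefficient of two shingle sets, with an empty-set guard.

    Returning 0.0 when either set is empty is an assumption of this
    sketch; the committed code does not define that case.
    """
    smaller = min(len(shingles0), len(shingles1))
    if smaller == 0:
        # No shingles to compare, so avoid dividing by zero.
        return 0.0
    return len(shingles0 & shingles1) / smaller


a, b = {"ca", "ar"}, {"ba", "ar"}
# Inclusion-exclusion gives the same count as a direct intersection.
assert len(a) + len(b) - len(a | b) == len(a & b)
print(overlap_similarity(a, b))            # 0.5
print(overlap_similarity(set(), {"eat"}))  # 0.0
```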
			
--- a/strsimpy/overlap_coefficient_test.py
+++ b/strsimpy/overlap_coefficient_test.py
@@ -0,0 +1,35 @@
 
+import unittest
+
+from strsimpy.overlap_coefficient import OverlapCoefficient
+
+class TestOverlapCoefficient(unittest.TestCase):
+
+    def test_overlap_coefficient_onestringissubsetofother_return0(self):
+        sim = OverlapCoefficient(3)
+        s1,s2 = "eat","eating"
+        actual = sim.distance(s1,s2)
+        print("distance: {:.4}\t between '{}' and '{}'".format(str(actual), s1,s2))
+        self.assertEqual(0,actual)
+
+    def test_overlap_coefficient_onestringissubset_return1(self):
+        sim = OverlapCoefficient(3)
+        s1,s2 = "eat","eating"
+        actual = sim.similarity(s1,s2)
+        print("strsim: {:.4}\t between '{}' and '{}'".format(str(actual), s1,s2))
+        self.assertEqual(1,actual)
+
+    def test_overlap_coefficient_onestringissubsetofother_return1(self):
+        sim = OverlapCoefficient(3)
+        s1,s2 = "eat","eating"
+        actual = sim.similarity(s1,s2)
+        print("strsim: {:.4}\t between '{}' and '{}'".format(str(actual), s1,s2))
+        self.assertEqual(1,actual)
+
+    def test_overlap_coefficient_halfsimilar_return1(self):
+        sim = OverlapCoefficient(2)
+        s1,s2 = "car","bar"
+        self.assertEqual(1/2,sim.similarity(s1,s2))
+        self.assertEqual(1/2,sim.distance(s1,s2))
+
+if __name__ == "__main__":
+    unittest.main()
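
Review note: the table row added to README.md marks the overlap coefficient with a `No` in what appears to be the metric column. The table header lies outside the hunk, so that reading is an assumption. The sketch below shows why a distance defined as 1 - similarity cannot satisfy the triangle inequality here. It is standalone and does not import the library; `overlap_distance` is a hypothetical helper that expects strings of at least `k` characters.

```python
def overlap_distance(s0, s1, k=2):
    """1 - overlap coefficient on k-shingle sets (hypothetical helper)."""
    a = {s0[i:i + k] for i in range(len(s0) - k + 1)}
    b = {s1[i:i + k] for i in range(len(s1) - k + 1)}
    return 1.0 - len(a & b) / min(len(a), len(b))


d_ab = overlap_distance("ab", "abc")  # {"ab"} is a subset of {"ab", "bc"}
d_bc = overlap_distance("abc", "bc")  # {"bc"} is a subset of {"ab", "bc"}
d_ac = overlap_distance("ab", "bc")   # {"ab"} and {"bc"} share nothing

print(d_ab, d_bc, d_ac)     # 0.0 0.0 1.0
print(d_ac <= d_ab + d_bc)  # False: the triangle inequality fails
```

Two different strings can also be at distance 0 from each other, as `"ab"` and `"abc"` are here, which a metric would not allow either.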