@@ -103,7 +103,7 @@ The Levenshtein distance between two words is the minimum number of single-chara

It is a metric string distance. This implementation uses dynamic programming (the Wagner–Fischer algorithm), keeping only two rows of the matrix in memory. The space requirement is thus O(m) and the algorithm runs in O(m.n).

```python
-from similarity.levenshtein import Levenshtein
+from strsim.levenshtein import Levenshtein

levenshtein = Levenshtein()
print(levenshtein.distance('My string', 'My $string'))
```
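The two-row trick is easy to sketch. A minimal illustration of the idea (not the library's actual code):

```python
def levenshtein_two_rows(s, t):
    # Keep only the previous and current rows of the Wagner-Fischer matrix,
    # giving O(m) space for O(m.n) time.
    prev = list(range(len(t) + 1))       # distances from '' to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]                       # distance from s[:i] to ''
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

print(levenshtein_two_rows('My string', 'My $string'))  # 1
```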
@@ -119,7 +119,7 @@ This distance is computed as levenshtein distance divided by the length of the l

The similarity is computed as 1 - normalized distance.

```python
-from similarity.normalized_levenshtein import NormalizedLevenshtein
+from strsim.normalized_levenshtein import NormalizedLevenshtein

normalized_levenshtein = NormalizedLevenshtein()
print(normalized_levenshtein.distance('My string', 'My $string'))
```
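As a worked check of those definitions: the raw Levenshtein distance between 'My string' and 'My $string' is 1 (one insertion) and the longer string has length 10, so the call above should print 0.1, and the similarity is 1 - 0.1 = 0.9. A small sanity check, not library code:

```python
raw, longest = 1, len('My $string')  # one insertion; longest string has length 10
assert raw / longest == 0.1          # normalized distance
assert 1 - raw / longest == 0.9      # similarity = 1 - normalized distance
```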
@@ -140,8 +140,8 @@ This algorithm is usually used for optical character recognition (OCR) applicati

It can also be used for keyboard typing auto-correction. Here, for example, the cost of substituting E and R is lower because these keys are adjacent on an AZERTY or QWERTY keyboard, so the probability that the user mistyped one character for the other is higher.

```python
-from similarity.weighted_levenshtein import WeightedLevenshtein
-from similarity.weighted_levenshtein import CharacterSubstitutionInterface
+from strsim.weighted_levenshtein import WeightedLevenshtein
+from strsim.weighted_levenshtein import CharacterSubstitutionInterface

class CharacterSubstitution(CharacterSubstitutionInterface):
    def cost(self, c0, c1):
        # Illustrative cost function: substituting 't' for 'r' is cheap,
        # any other substitution costs the full 1.0.
        return 0.5 if c0 == 't' and c1 == 'r' else 1.0
```
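A usage sketch, under the assumption that `WeightedLevenshtein` accepts the cost object as its constructor argument; the strings are illustrative:

```python
weighted_levenshtein = WeightedLevenshtein(CharacterSubstitution())
# One substitution, t -> r, charged at the discounted 0.5 defined above:
print(weighted_levenshtein.distance('String1', 'Srring1'))
```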
@@ -162,7 +162,7 @@ It does respect triangle inequality, and is thus a metric distance.

This is not to be confused with the optimal string alignment distance, a restricted variant in which no substring may be edited more than once.

```python
-from similarity.damerau import Damerau
+from strsim.damerau import Damerau

damerau = Damerau()
print(damerau.distance('ABCDEF', 'ABDCEF'))
```
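The distinction matters in practice. For 'CA' vs 'ABC' (the example used in the next section), unrestricted Damerau-Levenshtein reaches the target in two operations (CA -> AC -> ABC), while optimal string alignment, which may not edit a substring twice, needs three. A quick comparison, assuming both classes are imported as shown in their sections:

```python
from strsim.damerau import Damerau
from strsim.optimal_string_alignment import OptimalStringAlignment

print(Damerau().distance('CA', 'ABC'))                 # 2: transpose CA -> AC, then insert B
print(OptimalStringAlignment().distance('CA', 'ABC'))  # 3: the transposed substring cannot be edited again
```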
@@ -192,7 +192,7 @@ The difference from the algorithm for Levenshtein distance is the addition of on

Note that for the optimal string alignment distance, the triangle inequality does not hold and so it is not a true metric.

```python
-from similarity.optimal_string_alignment import OptimalStringAlignment
+from strsim.optimal_string_alignment import OptimalStringAlignment

optimal_string_alignment = OptimalStringAlignment()
print(optimal_string_alignment.distance('CA', 'ABC'))
```
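The "addition of one recurrence" mentioned in the hunk header is the adjacent-transposition case. A minimal sketch of the OSA dynamic program, for illustration only (not the library's implementation):

```python
def osa_distance(s, t):
    # Levenshtein DP plus one extra recurrence for adjacent transpositions.
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # The one additional recurrence: adjacent transposition.
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]

print(osa_distance('CA', 'ABC'))  # 3
```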
@@ -214,7 +214,7 @@ It is (roughly) a variation of Damerau-Levenshtein, where the substitution of 2

The distance is computed as 1 - Jaro-Winkler similarity.

```python
-from similarity.jarowinkler import JaroWinkler
+from strsim.jarowinkler import JaroWinkler

jarowinkler = JaroWinkler()
print(jarowinkler.similarity('My string', 'My tsring'))
```
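Since the distance is defined as 1 minus the similarity, the two calls should always sum to 1. A quick check, assuming the class exposes a `distance` method like the other classes in this library:

```python
s0, s1 = 'My string', 'My tsring'
# Complementary by definition: distance = 1 - similarity
assert abs(jarowinkler.distance(s0, s1) + jarowinkler.similarity(s0, s1) - 1.0) < 1e-9
```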
@@ -246,7 +246,7 @@ This class implements the dynamic programming approach, which has a space requir

In "Length of Maximal Common Subsequences", K.S. Larsen proposed an algorithm that computes the length of the LCS in time O(log(m).log(n)). But the algorithm has a memory requirement of O(m.n²) and was thus not implemented here.

```python
-from similarity.longest_common_subsequence import LongestCommonSubsequence
+from strsim.longest_common_subsequence import LongestCommonSubsequence

lcs = LongestCommonSubsequence()
# Will produce 4.0
print(lcs.distance('AGCAT', 'GAC'))
```
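The 4.0 follows from the usual LCS distance definition, distance(s1, s2) = |s1| + |s2| - 2·|LCS(s1, s2)|; assuming that definition, a worked check (not library code):

```python
s1, s2 = 'AGCAT', 'GAC'
lcs_length = 2  # 'GA' (or 'AC') is a longest common subsequence
assert len(s1) + len(s2) - 2 * lcs_length == 4  # matches the 4.0 above
```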
@@ -263,7 +263,7 @@ http://heim.ifi.uio.no/~danielry/StringMetric.pdf

The distance is computed as 1 - |LCS(s1, s2)| / max(|s1|, |s2|)

```python
-from similarity.metric_lcs import MetricLCS
+from strsim.metric_lcs import MetricLCS

metric_lcs = MetricLCS()
s1 = 'ABCDEFG'
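# The example's second string is truncated in this hunk; s2 below is a
# hypothetical stand-in to illustrate the formula above:
s2 = 'ABDF'
# |LCS('ABCDEFG', 'ABDF')| = 4 ('ABDF' itself) and max(|s1|, |s2|) = 7,
# so this should print 1 - 4/7, i.e. about 0.4286
print(metric_lcs.distance(s1, s2))
```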
@@ -300,7 +300,7 @@ The algorithm uses affixing with special character '\n' to increase the weight o

In the paper, Kondrak also defines a similarity measure, which is not implemented (yet).

```python
-from similarity.ngram import NGram
+from strsim.ngram import NGram

twogram = NGram(2)
print(twogram.distance('ABCD', 'ABTUIO'))
```
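If the '\n' affixing works as described in the hunk header, a mismatch in the first character should be penalised more than one in the last, since after padding the first character participates in more n-grams. One way to observe this (exact values depend on Kondrak's formula):

```python
# With prefix affixing, the first distance should come out larger:
print(twogram.distance('ABCD', 'XBCD'))  # mismatch at the start
print(twogram.distance('ABCD', 'ABCX'))  # mismatch at the end
```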
@@ -320,7 +320,7 @@ The cost for computing these similarities and distances is mainly dominated by

Directly compute the distance between strings:

```python
-from similarity.qgram import QGram
+from strsim.qgram import QGram

qgram = QGram(2)
print(qgram.distance('ABCD', 'ABCE'))
```
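Q-gram distance is the L1 distance between the two shingle profiles, so the call above should print 2: 'ABCD' has bigrams {AB, BC, CD} and 'ABCE' has {AB, BC, CE}, differing in one occurrence each of CD and CE. A worked check of that definition (not library code, and ignoring any special padding):

```python
from collections import Counter

def bigram_profile(s):
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

p0, p1 = bigram_profile('ABCD'), bigram_profile('ABCE')
# L1 distance between profiles: |{CD: 1}| + |{CE: 1}| = 2
print(sum((p0 - p1).values()) + sum((p1 - p0).values()))
```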
@@ -330,7 +330,7 @@ print(qgram.distance('ABCD', 'ABCE'))

Or, for large datasets, pre-compute the profiles of all strings. The similarity can then be computed between profiles:

```python
-from similarity.cosine import Cosine
+from strsim.cosine import Cosine

cosine = Cosine(2)
s0 = 'My first string'
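# The rest of the example is truncated in this hunk. A sketch of the
# pre-computed-profile workflow, assuming get_profile and
# similarity_profiles methods; s1 is a hypothetical second string:
s1 = 'My other string'
p0 = cosine.get_profile(s0)  # profiles can be cached for large datasets
p1 = cosine.get_profile(s1)
print(cosine.similarity_profiles(p0, p1))
```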