Hi
I have been trying to implement a "pos" function that only matches whole words. Below is my best effort. Is there a more efficient (or "better") way of implementing this? Thanks.
{ a "pos" function that matches whole words only } function PosInString (const sub, str : String) : Cardinal; Const nSeparators = ['a'..'z', 'A'..'Z', '0'..'9']; { non-separators } Var i, j : Cardinal; begin result := 0; If (sub = '') or (str = '') then Exit; i := pos (sub, str);
{ any match? } if (i = 0) then exit;
{ search for whole words only } result := i; j := length (sub) + i; if (i = 1) or ((i >1) and (Not (str [pred (i)] in nSeparators))) then begin if (j < length (str)) and (str [j] in nSeparators) then result := 0; end else result := 0; end;
{ test: correct results in comments } begin writeln (PosInString ('adam', 'he is adamant')); {0} writeln (PosInString ('adam', 'why is he so adamant about it?')); {0} writeln (PosInString ('adam', 'he knows about adam ant')); {16} writeln (PosInString ('adam', 'adam and eve')); {1} writeln (PosInString ('adam', 'eve and adam')); {9} writeln (PosInString ('adam', 'eve and adam ate the apple')); {9} end.
Best regards, The Chief -------- Prof. Abimbola A. Olowofoyeku (The African Chief) web: http://www.greatchief.plus.com/
On Sat, Apr 09, 2005 at 05:14:26PM +0100, Prof A Olowofoyeku (The African Chief) wrote:
Hi
I have been trying to implement a "pos" function that only matches whole words. Below is my best effort. Is there a more efficient (or "better") way of implementing this? Thanks.
Efficiency is not a problem, but correctness: PosInString ('adam','adamant adam') gives 0 instead of 9.
Emil Jerabek
{ a "pos" function that matches whole words only } function PosInString (const sub, str : String) : Cardinal; Const nSeparators = ['a'..'z', 'A'..'Z', '0'..'9']; { non-separators } Var i, j : Cardinal; begin result := 0; If (sub = '') or (str = '') then Exit; i := pos (sub, str);
{ any match? } if (i = 0) then exit;
{ search for whole words only } result := i; j := length (sub) + i; if (i = 1) or ((i >1) and (Not (str [pred (i)] in nSeparators))) then begin if (j < length (str)) and (str [j] in nSeparators) then result := 0; end else result := 0; end;
{ test: correct results in comments } begin writeln (PosInString ('adam', 'he is adamant')); {0} writeln (PosInString ('adam', 'why is he so adamant about it?')); {0} writeln (PosInString ('adam', 'he knows about adam ant')); {16} writeln (PosInString ('adam', 'adam and eve')); {1} writeln (PosInString ('adam', 'eve and adam')); {9} writeln (PosInString ('adam', 'eve and adam ate the apple')); {9} end.
Best regards, The Chief
Prof. Abimbola A. Olowofoyeku (The African Chief) web: http://www.greatchief.plus.com/
Prof A Olowofoyeku (The African Chief) wrote:
I have been trying to implement a "pos" function that only matches whole words. Below is my best effort. Is there a more efficient (or "better") way of implementing this? Thanks.
{ a "pos" function that matches whole words only } function PosInString (const sub, str : String) : Cardinal; Const nSeparators = ['a'..'z', 'A'..'Z', '0'..'9']; { non-separators }
BTW, what about the underscore?
Var i, j : Cardinal; begin result := 0;
Again, I'd strongly discourage using implicit `Result' unless absolutely necessary both syntactically and for Delphi compatibility. In this case, it's not necessary syntactically as it only appears on the LHS.
If (sub = '') or (str = '') then Exit;
BTW, `Pos' returns 1 if the substring is empty (meaning that an empty string matches anywhere). This may be a matter of definition -- if the first character of str is a letter it may seem wrong to find a word match there. Stricly speaking, one would have to find two subsequent separators (so that "nothing" is surrounded by separators), or a leading or trailing separator, but this case may not be important to you anyway.
i := pos (sub, str);
{ any match? } if (i = 0) then exit;
{ search for whole words only } result := i; j := length (sub) + i; if (i = 1) or ((i >1) and (Not (str [pred (i)] in nSeparators))) then begin if (j < length (str)) and (str [j] in nSeparators)
<=
(PosInString ('adam', 'adamx'))
then result := 0;
end else result := 0;
The two conditions can be merged into a single one. That's a matter of style.
However, if the first match is not a word, you might want to search again (PosInString ('adam', 'the adamant adam')). To do this, you'll need a loop and something like `PosFrom' (or `Pos' applied to a substring). In this case merging the conditions will help.
Frank
On Sat, 9 Apr 2005, Frank Heckenbach wrote:
Prof A Olowofoyeku (The African Chief) wrote:
I have been trying to implement a "pos" function that only matches whole words. Below is my best effort. Is there a more efficient (or "better") way of implementing this? Thanks.
[..] Rewrote the simple pascal editor I contributed a couple years ago. In the search routine it now has an "identifier" option that can find just " to " in a list containing " _to_ to too tow "
disclaimer: it may not be "better" but it works.
Will email a tar package later today to Frank (and anyone else that wants one) so he can replace the old version.
russ
Russell Whitaker wrote:
Will email a tar package later today to Frank (and anyone else that wants one) so he can replace the old version.
Don't hurry. You know the main server is currently down ...
Frank
Hi
First of all, many thanks to all those who responded, and spotted the bugs!
On 9 Apr 2005 at 18:51, Frank Heckenbach wrote:
Prof A Olowofoyeku (The African Chief) wrote:
I have been trying to implement a "pos" function that only matches whole words. Below is my best effort. Is there a more efficient (or "better") way of implementing this? Thanks.
{ a "pos" function that matches whole words only } function PosInString (const sub, str : String) : Cardinal; Const nSeparators = ['a'..'z', 'A'..'Z', '0'..'9']; { non-separators }
BTW, what about the underscore?
Added.
Var i, j : Cardinal; begin result := 0;
Again, I'd strongly discourage using implicit `Result' unless absolutely necessary both syntactically and for Delphi compatibility. In this case, it's not necessary syntactically as it only appears on the LHS.
Changed.
If (sub = '') or (str = '') then Exit;
BTW, `Pos' returns 1 if the substring is empty (meaning that an empty string matches anywhere).
But BP and Delphi (all versions) return 0. So there is an incompatibility here.
[...]
{ search for whole words only } result := i; j := length (sub) + i; if (i = 1) or ((i >1) and (Not (str [pred (i)] in nSeparators))) then begin if (j < length (str)) and (str [j] in nSeparators)
<=
(PosInString ('adam', 'adamx'))
Done.
Any comments on the revised version (below)? Thanks.
{ a "pos" function that matches whole words only } function PosInString (const sub : String; str : String) : Cardinal; const dummy = #0; nSeparators = ['a'..'z', 'A'..'Z', '_', dummy, '0'..'9'];
Var i, j : cardinal; notfound : boolean;
procedure findem; begin i := pos (sub, str) end;
begin PosInString := 0; If (sub = '') or (str = '') then Exit;
findem;
{ any match? check for whole words? } if (i = 0) then exit;
{ searching for whole words } while i <> 0 do begin notfound := false; j := length (sub) + i; if (i = 1) or ((i > 1) and (Not (str [i-1] in nSeparators))) then begin if (j <= length (str)) and (str [j] in nSeparators) then notfound := true else begin PosInString := i; exit; end; { if j < length (str) ...} end { if i = i, ... } else notfound := true;
if notfound then begin { dirty hack to discard processed part : any better ideas?? } for j := i to i + ( length (sub) -1 ) do str [j] := dummy; findem; { search again } end else break; end; { while }
end; { PosInString }
Best regards, The Chief -------- Prof. Abimbola A. Olowofoyeku (The African Chief) web: http://www.greatchief.plus.com/
Prof A Olowofoyeku (The African Chief) wrote:
If (sub = '') or (str = '') then Exit;
BTW, `Pos' returns 1 if the substring is empty (meaning that an empty string matches anywhere).
But BP and Delphi (all versions) return 0. So there is an incompatibility here.
I had thought they also returned 1. Where did I get that from?
Any comments on the revised version (below)? Thanks.
{ a "pos" function that matches whole words only } function PosInString (const sub : String; str : String) : Cardinal; const dummy = #0; nSeparators = ['a'..'z', 'A'..'Z', '_', dummy, '0'..'9'];
Var i, j : cardinal; notfound : boolean;
procedure findem; begin i := pos (sub, str) end;
begin PosInString := 0; If (sub = '') or (str = '') then Exit;
findem;
{ any match? check for whole words? } if (i = 0) then exit;
{ searching for whole words } while i <> 0 do begin notfound := false; j := length (sub) + i; if (i = 1) or ((i > 1) and (Not (str [i-1] in nSeparators))) then begin if (j <= length (str)) and (str [j] in nSeparators) then notfound := true else begin PosInString := i; exit; end; { if j < length (str) ...} end { if i = i, ... } else notfound := true;
if notfound
BTW, notfound will always be True here as you use Exit otherwise.
then begin { dirty hack to discard processed part : any better ideas?? } for j := i to i + ( length (sub) -1 ) do str [j] := dummy;
In fact this method is not reliable (PosInString ('aa-a', 'aa-aa-a')).
A better (and possibly faster) method is, as I said, to use `PosFrom' (non-standard) or `Pos' applied to a substring, both in order to search only after the position of the last match.
Set i := 0 before the loop and search from position i + 1 at the start of the loop (will also avoid the need for a subroutine).
findem; { search again } end else break;
end; { while }
end; { PosInString }
Frank
On 10 Apr 2005 at 2:06, Frank Heckenbach wrote:
Prof A Olowofoyeku (The African Chief) wrote:
If (sub = '') or (str = '') then Exit;
BTW, `Pos' returns 1 if the substring is empty (meaning that an empty string matches anywhere).
But BP and Delphi (all versions) return 0. So there is an incompatibility here.
I had thought they also returned 1. Where did I get that from?
Perhaps from some obsolete documentation?
[...]
{ dirty hack to discard processed part : any better ideas?? } for j := i to i + ( length (sub) -1 ) do str [j] := dummy;
In fact this method is not reliable (PosInString ('aa-a', 'aa-aa-a')).
A better (and possibly faster) method is, as I said, to use `PosFrom' (non-standard) or `Pos' applied to a substring, both in order to search only after the position of the last match.
[...]
That works well, thanks (and it was easy to implement a "posfrom" function for Delphi). This is (I think) the final incarnation of the function:
{ a "pos" function that matches whole words only } function PosInString (const sub : string; str : String) : Cardinal; const nSeparators = ['a'..'z', 'A'..'Z', '_', '0'..'9']; Var i, j : cardinal;
procedure findem (from : Cardinal); begin i := posfrom (sub, str, from) end;
begin PosInString := 0; If (sub = '') or (str = '') then Exit;
findem (1);
{ any match? check for whole words? } if (i = 0) then exit;
{ search for whole words } while i <> 0 do begin j := length (sub) + i; if (i = 1) or ((i > 1) and (Not (str [i-1] in nSeparators))) then begin if (j <= length (str)) and (str [j] in nSeparators) then { not found } else begin PosInString := i; exit; end; { if j <= length (str) ...} end; { if (i = 1), ... } findem (i + 1); { search again } end; { while } end;
Best regards, The Chief -------- Prof. Abimbola A. Olowofoyeku (The African Chief) web: http://www.greatchief.plus.com/
On Sat, Apr 09, 2005 at 05:14:26PM +0100, Prof A Olowofoyeku (The African Chief) wrote:
Hi
I have been trying to implement a "pos" function that only matches whole words. Below is my best effort. Is there a more efficient (or "better") way of implementing this? Thanks.
The efficiency of your solution (after correction) depends on the efficiency of pos.
Better efficiency might be obtained, depending on the characteristics of the actual parameters. Searching for short identifiers in relatively short strings versus searching for `tricky'/longer identifiers in trickier/longer texts (with many partial matches), but this requires that you rewrite the implementation of pos.
A typical implementation of pos traverses the text from left to right (low to high index values), and matches it against the substring, also from left to right. When the match fails, the next character in the text is taken as starting position of the next matching attempt. If you match the string from right to left against the text, you can often make a bigger jump in the text for the next candidate position.
If you are really interested in this art, you can consult e.g. one of (but there are other sources that may be more accessible):
A New Taxonomy of Sublinear Keyword Pattern Matching Algorithms, by L.G.W.A. Cleophas, B.W. Watson and G. Zwaan, Computing Science Report 2004/03, Faculty of Mathematics and Computing Science, Eindhoven University of Technology, Eindhoven, The Netherlands, March 2004.
http://alexandria.tue.nl/extra1/wskrap/publichtml/200407.pdf -- A Taxonomy of Keyword Pattern Matching Algorithms, by B.W. Watson and G. Zwaan, Computing Science Report 92/27, Faculty of Mathematics and Computing Science, Eindhoven University of Technology, Eindhoven, The Netherlands, 1992, ISSN 0926-4515.
http://alexandria.tue.nl/extra1/wskrap/publichtml/9310412.pdf --
{ test: correct results in comments } begin writeln (PosInString ('adam', 'he is adamant')); {0} writeln (PosInString ('adam', 'why is he so adamant about it?')); {0} writeln (PosInString ('adam', 'he knows about adam ant')); {16} writeln (PosInString ('adam', 'adam and eve')); {1} writeln (PosInString ('adam', 'eve and adam')); {9} writeln (PosInString ('adam', 'eve and adam ate the apple')); {9} end.
I am missing test cases like
writeln (PosInString ('adam', 'adam')); {1} writeln (PosInString ('adam', 'adamant')); {0} writeln (PosInString ('adam', 'madam')); {0} writeln (PosInString ('adam', 'madam i''m adam')); {11}
Tom
Tom Verhoeff wrote:
Prof A Olowofoyeku (The African Chief) wrote:
I have been trying to implement a "pos" function that only matches whole words. Below is my best effort. Is there a more efficient (or "better") way of implementing this? Thanks.
The efficiency of your solution (after correction) depends on the efficiency of pos.
Better efficiency might be obtained, depending on the characteristics of the actual parameters. Searching for short identifiers in relatively short strings versus searching for `tricky'/longer identifiers in trickier/longer texts (with many partial matches), but this requires that you rewrite the implementation of pos.
A typical implementation of pos traverses the text from left to right (low to high index values), and matches it against the substring, also from left to right. When the match fails, the next character in the text is taken as starting position of the next matching attempt. If you match the string from right to left against the text, you can often make a bigger jump in the text for the next candidate position.
You are basically talking about Boyer-Moore and Knuth-Morris-Pratt searching algorithms, of which there is a good discussion in Sedgewick. In practice, when there is a good chance of match failure on the first char. comparison, little is to be gained over the elementary techniques. The other methods are likely to gain on such examples as "duckbill" in the presence of many "ducktails" or "they" amidst many "the..." prefixes.
KMP is especially easy to install in place of the elementary technique, and has the primary advantage of never requiring backtracking over the input. This makes the operation become, at worst, O(M+N) where M and N are the sizes of the needle and the haystack. The elementary can be as bad as O(M*N), but normally will be about O(M+N) anyhow. BM has the same O(M+N) worst case, but can (low probability) significantly exceed it. However, it requires scanning in the reverse direction. It is especially attractive when looking for a rare suffix in the presence of a common prefix.
A New Taxonomy of Sublinear Keyword Pattern Matching Algorithms, by L.G.W.A. Cleophas, B.W. Watson and G. Zwaan, Computing Science Report 2004/03, Faculty of Mathematics and Computing Science, Eindhoven University of Technology, Eindhoven, The Netherlands, March 2004.
http://alexandria.tue.nl/extra1/wskrap/publichtml/200407.pdf
A Taxonomy of Keyword Pattern Matching Algorithms, by B.W. Watson and G. Zwaan, Computing Science Report 92/27, Faculty of Mathematics and Computing Science, Eindhoven University of Technology, Eindhoven, The Netherlands, 1992, ISSN 0926-4515.
http://alexandria.tue.nl/extra1/wskrap/publichtml/9310412.pdf
I am not familiar with these references, but they sound excellent, and I shall have to collect them later. I hope they are in English. :-)
You are basically talking about Boyer-Moore and Knuth-Morris-Pratt searching algorithms, of which there is a good discussion in Sedgewick. In practice, when there is a good chance of match failure on the first char. comparison, little is to be gained over the elementary techniques. The other methods are likely to gain on such examples as "duckbill" in the presence of many "ducktails" or "they" amidst many "the..." prefixes.
That's not really true. Boyer-Moore is typically O(M/N) where N is the search string length and M is the text length - that is the longer the search string, the quicker the match. It's also relatively easy to implement, since all you have to do is search checking the last character, and when it fails slide forward a precomputed amount based on the character in the text and the search string (ie, a table of characters is needed). Of coruse, in non-ASCII days, this is more challenging, but for UTF-8 you could probably treate all high bit set characters as equal and still get good results on mostly roman text.
There is some code I use in a shipping product below which you are welcome to adapt. It is quite old code, so the style is not something I'm proud of these days, but that's hardly unuusal. s is a string, already uppercased. It's reading through a file on disk so it is more complex than it needs to be if the content is in memory (and given how simple it is it should give you some idea that Boyer-Moore really is worthwhile for any case where a decent amount of text is expected).
That said, why not just add PCRE to your project. Its BSD licensed so should be useable with any project, and is relatively painless. Then you can do your search with the regexp /b\Qsearchstrin\E\b and it will get all the edge cases right for you, and handle unicode, and all you to offer regex searching too.
Enjoy, Peter.
function BoyerMooreFind: OSErr; { Boyer-Moore basic search } { From Algorithms by Robert Sedgewick, Second Edition, pg 287-288 } var skip: array[char] of integer; i, j: integer; pos, count: longInt; oe: OSErr; ch: char; begin for i := 0 to 255 do skip[chr(i)] := len; for i := 1 to len do skip[s[i]] := len - i; oe := GetFPos(ob.refnum, pos); for i := 1 to len do ob.buffer[i] := chr(0); i := len + 1; j := len; while (oe = noErr) and (j >= 1) and (pos < ob.filelen) do begin BlockMove(@ob.buffer[i - len], @ob.buffer, len - 1); i := len; count := SizeOf(OBBufferType) - len; if count + pos > ob.filelen then count := ob.filelen - pos; oe := FSRead(ob.refnum, count, @ob.buffer[len]); if (oe = noErr) & (count = 0) then oe := errGeneric; pos := pos + count; if oe = noErr then while (i <= count) and (j >= 1) do begin ch := UpperCaseChar(ob.buffer[i]); if ch = s[j] then begin i := i - 1; j := j - 1; end else begin if len - j + 1 > skip[ch] then begin i := i + len - j + 1; end else begin i := i + skip[ch]; end; j := len; end; end; end; if j >= 1 then begin if oe = noErr then oe := errGeneric; end else begin oe := SetFPos(ob.refnum, fsFromStart, pos - count + i); end; BoyerMooreFind := oe; end;
Peter N Lewis wrote:
That's not really true. Boyer-Moore is typically O(M/N) where N is the search string length and M is the text length - that is the longer the search string, the quicker the match. It's also relatively easy to implement, since all you have to do is search checking the last character, and when it fails slide forward a precomputed amount based on the character in the text and the search string (ie, a table of characters is needed). Of coruse, in non-ASCII days, this is more challenging, but for UTF-8 you could probably treate all high bit set characters as equal and still get good results on mostly roman text.
You could also do the matching byte-wise (not character-wise) AFAICS (at least if you want exact matches, not case-insensitive matches or so). -- Well, if you have to take into account different representations of the same characters (accents or so), which can even be of different lengths, then it gets hairy ...
There is some code I use in a shipping product below which you are welcome to adapt. It is quite old code, so the style is not something I'm proud of these days, but that's hardly unuusal. s is a string, already uppercased. It's reading through a file on disk so it is more complex than it needs to be if the content is in memory (and given how simple it is it should give you some idea that Boyer-Moore really is worthwhile for any case where a decent amount of text is expected).
The built-in `Pos' currently does the naive method. I'd thought about replacing it with a more sophisticated algorithm, but so far it hasn't been important enough to me.
But if anyone wants to, feel free to send me a subsitute (function PosFrom in p/rts/string1.pas). Of course, since `Pos' can be used for all kinds of purposes, it must not be less efficient in simple cases (at least not significantly so).
Frank
"Prof A Olowofoyeku (The African Chief)" wrote:
I have been trying to implement a "pos" function that only matches whole words. Below is my best effort. Is there a more efficient (or "better") way of implementing this? Thanks.
On looking over this thread I wonder if you are using the best technique for the job, since you don't actually specify the job. It sounds as if you are trying to parse some selection of identifiers or reserved words from incoming text. If so, I consider you are attacking the wrong end of the problem.
The first thing to do would be to isolate the identifier in the incoming text. This is very easily done in Pascal, because of the presence of one char. lookahead via f^. Once the identifier is isolated, it can be checked against one or more lists of words very rapidly.
In fact, this is an ideal spot for a perfect hash algorithm, which can establish non-membership in a single binary comparison, and confirm both membership and identity with a single string comparison. However any such perfect hash is a secondary consideration.
On 10 Apr 2005 at 10:27, CBFalconer wrote:
"Prof A Olowofoyeku (The African Chief)" wrote:
I have been trying to implement a "pos" function that only matches whole words. Below is my best effort. Is there a more efficient (or "better") way of implementing this? Thanks.
On looking over this thread I wonder if you are using the best technique for the job, since you don't actually specify the job.
[...]
The specific, limited purpose for which I am developing this is for a routine that replaces all occurrences of a substring within a string with another string, viz:
"Procedure ReplaceString (const old_str, new_str : string; Var dest : string)"
I have a routine like this for a while (using "pos" to find occurrences of "old_str" in "dest"), but I am now trying to extend it to replace optionally only "whole words", hence the need for this "PosInString" function.
I think that a hash algorithm is probably too complex for what I am trying to do ...
Best regards, The Chief -------- Prof. Abimbola A. Olowofoyeku (The African Chief) web: http://www.greatchief.plus.com/