Wednesday, March 28, 2012

Question about referencing Full-Text noise file through SQL rowset functions.

(SQL Server 2000, SP3a)
Hello all!
I'm trying to cobble together something that will let me reference the "noise.eng" (the
English list of the Full Text Search service ignored words) through Transact-SQL.
I'm trying to reference the "noise.eng" file that's located in:
C:\Program Files\Common Files\Microsoft Shared\MSSearch\Data\Config
There are a couple others in:
C:\Program Files\Microsoft SQL Server\MSSQL\FTDATA\SQLServer\Config
C:\windows\SYSTEM32
but the one in the "Common Files" had the most recent date. :-)
I *have* had some measure of success with it -- but I'm trying to come up with a "cleaner"
way, if possible.
This is how I got this to work:
First I tried setting up a linked server:
--Create a linked server
execute [dbo].[sp_addlinkedserver]
'TextTest',
'Jet 4.0',
'Microsoft.Jet.OLEDB.4.0',
'C:\Program Files\Common Files\Microsoft Shared\MSSearch\Data\Config',
NULL,
'Text'
go
--Set up login mappings
execute [dbo].[sp_addlinkedsrvlogin]
'TextTest',
FALSE,
NULL,
NULL
go
Then I checked out the tables that were available:
--List the tables in the linked server
execute [dbo].[sp_tables_ex] 'TextTest'
go
Interestingly, all the files in that directory with a .TXT extension were visible as
tables! However, the "noise.eng" file didn't seem to be available.
If I copied the "noise.eng" file to "noise.txt", then I could do something like:
select * from TextTest...[noise#txt]
or
select * from openquery(TextTest, 'select * from [noise#txt]') as a
And that seems to return the information! :-)
You can clean up the linked servers with:
execute [dbo].[sp_droplinkedsrvlogin]
'TextTest', NULL
go
execute [dbo].[sp_dropserver]
'TextTest'
go
However, I'd really like to use the OPENROWSET function if at all possible, and *not* have
to rename/copy the noise.eng file. I was reading that I could use a "schema.ini" file,
and after browsing the web, discovered this format that I thought might work:
[noise.eng]
ColNameHeader = False
CharacterSet = ANSI
Format = CSVDelimited
Col1=NoiseWord Char Width 100
Then I tried doing something like:
select *
from openrowset
(
'Microsoft.Jet.OLEDB.4.0',
'Provider=Microsoft.Jet.Oledb.4.0;Data Source=C:\Program Files\Common Files\Microsoft
Shared\MSSearch\Data\Config;Extended Properties="text;HDR=No;FMT=CSV";',
'select * from noise.eng'
) as a
Which doesn't quite work -- nor have any of the OPENROWSET variants I've tried. :-(
I'd like to be able to use OPENROWSET so that I don't have to create a linked server. I'd
also like to avoid copying the "noise.eng" to "noise.txt". I don't mind having to drop in
a "schema.ini" (if that's even necessary/helpful) in that directory.
I've found the following URLs to be useful sources of information:
http://groups.google.com/groups?q=sc...NGP10& rnum=6
http://groups.google.com/groups?q=sc...gle.com&rnum=2
Sorry for the length of this post, but I sure would be grateful to anyone who might be
able to help! :-)
John Peterson
I think I've been able to get close with some of this:
select * from openrowset('MSDASQL.1', 'Driver={Microsoft Text Driver (*.txt; *.csv)};
DefaultDir=C:\Program Files\Common Files\Microsoft Shared\MSSearch\Data\Config\;', 'select
* from noise.txt')
However, this seems to treat the first row as a header row.
select * from openrowset('Microsoft.Jet.OLEDB.4.0','Text;Databas e=C:\Program Files\Common
Files\Microsoft Shared\MSSearch\Data\Config\;HDR=NO', 'select * from noise.txt')
This is the closest I found, and allows me to specify that there is no header row.
The only remaining nit is that I *must* have an extension of .TXT. If I try to change
"noise.txt" to "noise.eng", I get errors.
A colleague of mine discovered that the JET driver has a list of "allowable" extensions:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\4.0\Engi nes\Text\Extensions
My guess is that if I put ".ENG" in that list, then the above queries will work with
"noise.eng". I was hoping to avoid modifying the Registry, though.
Does anyone know of a way to circumvent this issue?
Thanks!
"John Peterson" <j0hnp@.comcast.net> wrote in message
news:%23Mz740hMEHA.1388@.TK2MSFTNGP09.phx.gbl...
> (SQL Server 2000, SP3a)
> Hello all!
> I'm trying to cobble together something that will let me reference the "noise.eng" (the
> English list of the Full Text Search service ignored words) through Transact-SQL.
> I'm trying to reference the "noise.eng" file that's located in:
> C:\Program Files\Common Files\Microsoft Shared\MSSearch\Data\Config
> There are a couple others in:
> C:\Program Files\Microsoft SQL Server\MSSQL\FTDATA\SQLServer\Config
> C:\windows\SYSTEM32
> but the one in the "Common Files" had the most recent date. :-)
> I *have* had some measure of success with it -- but I'm trying to come up with a
"cleaner"
> way, if possible.
> This is how I got this to work:
> First I tried setting up a linked server:
> --Create a linked server
> execute [dbo].[sp_addlinkedserver]
> 'TextTest',
> 'Jet 4.0',
> 'Microsoft.Jet.OLEDB.4.0',
> 'C:\Program Files\Common Files\Microsoft Shared\MSSearch\Data\Config',
> NULL,
> 'Text'
> go
> --Set up login mappings
> execute [dbo].[sp_addlinkedsrvlogin]
> 'TextTest',
> FALSE,
> NULL,
> NULL
> go
> Then I checked out the tables that were available:
> --List the tables in the linked server
> execute [dbo].[sp_tables_ex] 'TextTest'
> go
> Interestingly, all the files in that directory with a .TXT extension were visible as
> tables! However, the "noise.eng" file didn't seem to be available.
> If I copied the "noise.eng" file to "noise.txt", then I could do something like:
> select * from TextTest...[noise#txt]
> or
> select * from openquery(TextTest, 'select * from [noise#txt]') as a
> And that seems to return the information! :-)
> You can clean up the linked servers with:
> execute [dbo].[sp_droplinkedsrvlogin]
> 'TextTest', NULL
> go
> execute [dbo].[sp_dropserver]
> 'TextTest'
> go
> However, I'd really like to use the OPENROWSET function if at all possible, and *not*
have
> to rename/copy the noise.eng file. I was reading that I could use a "schema.ini" file,
> and after browsing the web, discovered this format that I thought might work:
> [noise.eng]
> ColNameHeader = False
> CharacterSet = ANSI
> Format = CSVDelimited
> Col1=NoiseWord Char Width 100
> Then I tried doing something like:
> select *
> from openrowset
> (
> 'Microsoft.Jet.OLEDB.4.0',
> 'Provider=Microsoft.Jet.Oledb.4.0;Data Source=C:\Program Files\Common Files\Microsoft
> Shared\MSSearch\Data\Config;Extended Properties="text;HDR=No;FMT=CSV";',
> 'select * from noise.eng'
> ) as a
> Which doesn't quite work -- nor have any of the OPENROWSET variants I've tried. :-(
> I'd like to be able to use OPENROWSET so that I don't have to create a linked server.
I'd
> also like to avoid copying the "noise.eng" to "noise.txt". I don't mind having to drop
in
> a "schema.ini" (if that's even necessary/helpful) in that directory.
> I've found the following URLs to be useful sources of information:
>
http://groups.google.com/groups?q=sc...NGP10& rnum=6
>
http://groups.google.com/groups?q=sc...gle.com&rnum=2
> Sorry for the length of this post, but I sure would be grateful to anyone who might be
> able to help! :-)
> John Peterson
>
|||John,
A most interesting solution! You would need to use ".ENU" in that list, then
the above queries will work with "noise.enu" (US_English). Can I assume
you've tried to import the noise word file into a table? For example:
CREATE TABLE sqlfts_stop_words (
term nvarchar(50) NOT NULL )
GO
ALTER TABLE sqlfts_stop_words ADD
CONSTRAINT pk_sqlfts_stop_words PRIMARY KEY CLUSTERED (term)
GO
-- Alter drive letter and path to noise.enu as appropriate
BULK INSERT sqlfts_stop_words FROM
'F:\MSSQL80\MSSQL\FTDATA\SQLServer\Config\noise.en u'
GO
"John Peterson" <j0hnp@.comcast.net> wrote in message
news:#p5crXiMEHA.1312@.TK2MSFTNGP12.phx.gbl...
> I think I've been able to get close with some of this:
> select * from openrowset('MSDASQL.1', 'Driver={Microsoft Text Driver
(*.txt; *.csv)};
> DefaultDir=C:\Program Files\Common Files\Microsoft
Shared\MSSearch\Data\Config\;', 'select
> * from noise.txt')
> However, this seems to treat the first row as a header row.
> select * from
openrowset('Microsoft.Jet.OLEDB.4.0','Text;Databas e=C:\Program Files\Common
> Files\Microsoft Shared\MSSearch\Data\Config\;HDR=NO', 'select * from
noise.txt')
> This is the closest I found, and allows me to specify that there is no
header row.
> The only remaining nit is that I *must* have an extension of .TXT. If I
try to change
> "noise.txt" to "noise.eng", I get errors.
> A colleague of mine discovered that the JET driver has a list of
"allowable" extensions:
> HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\4.0\Engi nes\Text\Extensions
> My guess is that if I put ".ENG" in that list, then the above queries will
work with[vbcol=seagreen]
> "noise.eng". I was hoping to avoid modifying the Registry, though.
> Does anyone know of a way to circumvent this issue?
> Thanks!
> "John Peterson" <j0hnp@.comcast.net> wrote in message
> news:%23Mz740hMEHA.1388@.TK2MSFTNGP09.phx.gbl...
"noise.eng" (the[vbcol=seagreen]
Transact-SQL.[vbcol=seagreen]
up with a[vbcol=seagreen]
> "cleaner"
Shared\MSSearch\Data\Config',[vbcol=seagreen]
were visible as[vbcol=seagreen]
something like:[vbcol=seagreen]
possible, and *not*[vbcol=seagreen]
> have
"schema.ini" file,[vbcol=seagreen]
work:[vbcol=seagreen]
Files\Microsoft[vbcol=seagreen]
tried. :-([vbcol=seagreen]
linked server.[vbcol=seagreen]
> I'd
having to drop
> in
>
http://groups.google.com/groups?q=sc...NGP10& rnum=6
>
http://groups.google.com/groups?q=sc...gle.com&rnum=2[vbcol=seagreen]
anyone who might be
>
|||Hello John!
I was finally able to come up with a couple of solutions with the OPENQUERY, and settled
on the Registry "dink" for the various "noise.*" extensions. (My other technique was to
copy the specified noise file to a temporary file with a ".TXT" extension, but that was
using [xp_cmdshell] and our users aren't members of the db_admin role and I was leery of
opening up the permissions on that procedure.)
Your option to populate a table with the noise words is intriguing, too. We don't change
the noise list too often, but we *do* add stuff to it on occasion. I wonder if I could
have a scheduled Job that would essentially keep the file in sync with the table every
week or so? Something like that might work better.
Our application will be caching the results of the OPENQUERY approach -- so I don't feel
too bad about the potential "hit" for reading the file every time the SP is invoked. But
it is more "clumsy" than just having a static table...
Thanks for the alternative suggestion -- I'll have to mull it over further! :-)
"John Kane" <jt-kane@.comcast.net> wrote in message
news:urIuIDxMEHA.2876@.TK2MSFTNGP09.phx.gbl...
> John,
> A most interesting solution! You would need to use ".ENU" in that list, then
> the above queries will work with "noise.enu" (US_English). Can I assume
> you've tried to import the noise word file into a table? For example:
> CREATE TABLE sqlfts_stop_words (
> term nvarchar(50) NOT NULL )
> GO
> ALTER TABLE sqlfts_stop_words ADD
> CONSTRAINT pk_sqlfts_stop_words PRIMARY KEY CLUSTERED (term)
> GO
> -- Alter drive letter and path to noise.enu as appropriate
> BULK INSERT sqlfts_stop_words FROM
> 'F:\MSSQL80\MSSQL\FTDATA\SQLServer\Config\noise.en u'
> GO
>
> "John Peterson" <j0hnp@.comcast.net> wrote in message
> news:#p5crXiMEHA.1312@.TK2MSFTNGP12.phx.gbl...
> (*.txt; *.csv)};
> Shared\MSSearch\Data\Config\;', 'select
> openrowset('Microsoft.Jet.OLEDB.4.0','Text;Databas e=C:\Program Files\Common
> noise.txt')
> header row.
> try to change
> "allowable" extensions:
> work with
> "noise.eng" (the
> Transact-SQL.
> up with a
> Shared\MSSearch\Data\Config',
> were visible as
> something like:
> possible, and *not*
> "schema.ini" file,
> work:
> Files\Microsoft
> tried. :-(
> linked server.
> having to drop
>
http://groups.google.com/groups?q=sc...NGP10& rnum=6
>
http://groups.google.com/groups?q=sc...gle.com&rnum=2
> anyone who might be
>
|||Hi John,
When you need to change the noise word file, you can setup a SQLServerAgent
job to BCP out to a noise.tmp text file, then stop the MSSearch service,
swap out the files and restart the MSSearch service and then run a Full
Population. You can make this as simple or as complex as you need. Also, for
the below BULK INSERT to work without errors, you will need to manually edit
the noise.enu file and place CR/LF after each single letter at the end of
the file, for example:
a b c d...
becomes
a
b
c
d...
Regards,
John
"John Peterson" <j0hnp@.comcast.net> wrote in message
news:uXsOZ2yMEHA.2976@.TK2MSFTNGP10.phx.gbl...
> Hello John!
> I was finally able to come up with a couple of solutions with the
OPENQUERY, and settled
> on the Registry "dink" for the various "noise.*" extensions. (My other
technique was to
> copy the specified noise file to a temporary file with a ".TXT" extension,
but that was
> using [xp_cmdshell] and our users aren't members of the db_admin role and
I was leery of
> opening up the permissions on that procedure.)
> Your option to populate a table with the noise words is intriguing, too.
We don't change
> the noise list too often, but we *do* add stuff to it on occasion. I
wonder if I could
> have a scheduled Job that would essentially keep the file in sync with the
table every
> week or so? Something like that might work better.
> Our application will be caching the results of the OPENQUERY approach --
so I don't feel
> too bad about the potential "hit" for reading the file every time the SP
is invoked. But
> it is more "clumsy" than just having a static table...
> Thanks for the alternative suggestion -- I'll have to mull it over
further! :-)[vbcol=seagreen]
>
> "John Kane" <jt-kane@.comcast.net> wrote in message
> news:urIuIDxMEHA.2876@.TK2MSFTNGP09.phx.gbl...
then[vbcol=seagreen]
Files\Common[vbcol=seagreen]
I[vbcol=seagreen]
will[vbcol=seagreen]
the[vbcol=seagreen]
come[vbcol=seagreen]
might[vbcol=seagreen]
Files\Common[vbcol=seagreen]
Properties="text;HDR=No;FMT=CSV";',[vbcol=seagreen]
I've[vbcol=seagreen]
a[vbcol=seagreen]
mind
>
http://groups.google.com/groups?q=sc...NGP10& rnum=6
>
http://groups.google.com/groups?q=sc...gle.com&rnum=2
>
|||Thanks John!
I notice that there's a line in the noise.enu file that is:
a b c d e ... x y z
I'm assuming that "word" is really a space-delimited list of single noise words? If I
leave that "as is", will that be a problem? Or is there something with the mechanics of
the BULK INSERT that will give me problems with that? I can easily "parse" those values
out into the "expanded list", so I don't mind leaving it the way it is, unless it hoses up
the BULK INSERT.
Can I also assume, then, that I could potentially put other multiple noise words on the
same line, separated by a space if I wanted to? What if I wanted a *phrase* to be ignored
that contained spaces?
Thanks again for your help!
John Peterson
"John Kane" <jt-kane@.comcast.net> wrote in message
news:u9t$H88MEHA.3556@.TK2MSFTNGP09.phx.gbl...
> Hi John,
> When you need to change the noise word file, you can setup a SQLServerAgent
> job to BCP out to a noise.tmp text file, then stop the MSSearch service,
> swap out the files and restart the MSSearch service and then run a Full
> Population. You can make this as simple or as complex as you need. Also, for
> the below BULK INSERT to work without errors, you will need to manually edit
> the noise.enu file and place CR/LF after each single letter at the end of
> the file, for example:
> a b c d...
> becomes
> a
> b
> c
> d...
> Regards,
> John
>
> "John Peterson" <j0hnp@.comcast.net> wrote in message
> news:uXsOZ2yMEHA.2976@.TK2MSFTNGP10.phx.gbl...
> OPENQUERY, and settled
> technique was to
> but that was
> I was leery of
> We don't change
> wonder if I could
> table every
> so I don't feel
> is invoked. But
> further! :-)
> then
> Files\Common
> I
> will
> the
> come
> might
> Files\Common
> Properties="text;HDR=No;FMT=CSV";',
> I've
> a
> mind
>
http://groups.google.com/groups?q=sc...NGP10& rnum=6
>
http://groups.google.com/groups?q=sc...gle.com&rnum=2
>
|||Yes, that's the line I'm referring to... Yes, if you leave it "as is" it
will cause the BULK INSERT to throw an error and the single letters will not
be imported. However, making the change does not affect MSSearch, it's just
that the bulk insert hick-ups with the spaces between the single letters.
No, your assumption is incorrect (I've already tried that years ago ;-) and
the multiple-word noise phrases are ignored by MSSearch.
Regards,
John
"John Peterson" <j0hnp@.comcast.net> wrote in message
news:#f1$$R#MEHA.3972@.TK2MSFTNGP10.phx.gbl...
> Thanks John!
> I notice that there's a line in the noise.enu file that is:
> a b c d e ... x y z
> I'm assuming that "word" is really a space-delimited list of single noise
words? If I
> leave that "as is", will that be a problem? Or is there something with
the mechanics of
> the BULK INSERT that will give me problems with that? I can easily
"parse" those values
> out into the "expanded list", so I don't mind leaving it the way it is,
unless it hoses up
> the BULK INSERT.
> Can I also assume, then, that I could potentially put other multiple noise
words on the
> same line, separated by a space if I wanted to? What if I wanted a
*phrase* to be ignored[vbcol=seagreen]
> that contained spaces?
> Thanks again for your help!
> John Peterson
>
> "John Kane" <jt-kane@.comcast.net> wrote in message
> news:u9t$H88MEHA.3556@.TK2MSFTNGP09.phx.gbl...
SQLServerAgent[vbcol=seagreen]
for[vbcol=seagreen]
edit[vbcol=seagreen]
of[vbcol=seagreen]
other[vbcol=seagreen]
extension,[vbcol=seagreen]
and[vbcol=seagreen]
too.[vbcol=seagreen]
the[vbcol=seagreen]
approach --[vbcol=seagreen]
SP[vbcol=seagreen]
list,[vbcol=seagreen]
assume[vbcol=seagreen]
example:[vbcol=seagreen]
Driver[vbcol=seagreen]
from[vbcol=seagreen]
is no[vbcol=seagreen]
If[vbcol=seagreen]
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\4.0\Engi nes\Text\Extensions[vbcol=seagreen]
queries[vbcol=seagreen]
though.[vbcol=seagreen]
reference[vbcol=seagreen]
through[vbcol=seagreen]
Shared\MSSearch\Data\Config[vbcol=seagreen]
Server\MSSQL\FTDATA\SQLServer\Config[vbcol=seagreen]
to[vbcol=seagreen]
extension[vbcol=seagreen]
available.[vbcol=seagreen]
as a[vbcol=seagreen]
all[vbcol=seagreen]
use a[vbcol=seagreen]
thought[vbcol=seagreen]
variants[vbcol=seagreen]
create[vbcol=seagreen]
don't[vbcol=seagreen]
directory.[vbcol=seagreen]
information:
>
http://groups.google.com/groups?q=sc...NGP10& rnum=6
>
http://groups.google.com/groups?q=sc...gle.com&rnum=2[vbcol=seagreen]
to
>

No comments:

Post a Comment