SQL Full Text Index on multiple tables and columns

1k Views Asked by At

We have electronic forms that filers fill out online and we store the data in an SQL Server. We want to provide a search feature that allows us to search inside each electronic filing for matching keywords. We don’t need to know what word matched or where in the form it matches, we just need a ranked list of forms that match our keywords. We think SQL Full-Text Searching would be our best option because we are already using SQL server 2016. We just started with implementing a solution but would like some guidance since this is new territory for us.

Here is an example of how our tables are structured.

Filing is our top-level table for all electronic forms. We have sub tables that are all related through the FilingId. The Form Six Published Filings table has child tables to store information like Assets. The Form One Published Filings table has child tables to store information like Liabilities.

enter image description here

CREATE SCHEMA [Forms]
GO
CREATE SCHEMA [Form6]
GO
CREATE SCHEMA [Form1]
GO
CREATE TABLE [Forms].[Filing](
       [FilingId] INT NOT NULL IDENTITY(1,1) 
              CONSTRAINT [PK_Forms_Filing_FilingId] PRIMARY KEY CLUSTERED,
       [FilerUserId] [int] NOT NULL,
       [FormYear] [int] NOT NULL,
       [FormTypeId] [int] NOT NULL,
       [FilingStatusId] [int] NOT NULL,
       [FilerSignatureId] INT NULL,
       [SubmissionDate] DATETIME2(0) NULL,
    [IsScannedForm] BIT NOT NULL
        CONSTRAINT [DF_Forms_Filing_IsScannedForm] DEFAULT(0)
)
GO 
CREATE TABLE [Form6].[FormSixPublishedFilings](
       [FormSixPublishedFilingId] INT NOT NULL IDENTITY(1,1) 
           CONSTRAINT [PK_Form6_FormSixPublishedFilings_FormSixPublishedFilingId] PRIMARY KEY CLUSTERED,
       [FilingId] INT NOT NULL
       CONSTRAINT [FK_Form6_FormSixPublishedFilings_Filings] FOREIGN KEY ([FilingId]) REFERENCES [Forms].[Filing] ([FilingId]),
       [LastDateOfEmployment] DATE NULL,
    [NetWorthDate] DATE NULL,
       [NetWorth] MONEY NULL
)
GO
CREATE TABLE [Form6].[FormSixPublishedAssets](
       [FormSixPublishedAssetId] INT NOT NULL IDENTITY(1,1) 
        CONSTRAINT [PK_Form6_FormSixPublishedAssets_FormSixPublishedAssetId] PRIMARY KEY CLUSTERED,
       [FormSixPublishedFilingId] INT NOT NULL
        CONSTRAINT [FK_Form6_FormSixPublishedAssets_FormSixPublishedFilings] FOREIGN KEY ([FormSixPublishedFilingId]) REFERENCES [Form6].[FormSixPublishedFilings] ([FormSixPublishedFilingId]),
       [Name] VARCHAR(8000) NOT NULL, 
    [Amount] MONEY NOT NULL
)
GO

CREATE TABLE [Form1].[FormOnePublishedFilings]
(
       [FormOnePublishedFilingId] INT NOT NULL IDENTITY(1,1) 
           CONSTRAINT [PK_Form1_FormOnePublishedFilings_FormOnePublishedFilingId] PRIMARY KEY CLUSTERED,
    [FilingId] INT NOT NULL,
        CONSTRAINT [FK_Form1_FormOnePublishedFilings_Filing] FOREIGN KEY ([FilingId]) REFERENCES [Forms].[Filing] ([FilingId]),
    [HasServedAsAgent] BIT NULL,
    [LastDateOfEmployment] DATE NULL,
    [AmendmentReason] VARCHAR(1024) NULL,
)
GO

CREATE TABLE [Form1].[FormOnePublishedLiabilities]
(
       [FormOnePublishedLiabilityId] INT NOT NULL IDENTITY(1,1) 
              CONSTRAINT [PK_Form1_FormOnePublishedLiabilities_FormOnePublishedLiabilityId] PRIMARY KEY CLUSTERED,
       [FormOnePublishedFilingId] INT NOT NULL,
           CONSTRAINT [FK_Form1_FormOnePublishedLiabilities_FormOnePublishedFilings] FOREIGN KEY ([FormOnePublishedFilingId]) REFERENCES [Form1].[FormOnePublishedFilings] ([FormOnePublishedFilingId]),
       [NameOfCreditor] VARCHAR(8000) NOT NULL,
       [AddressOfCreditor] VARCHAR(8000) NOT NULL
)
GO

In order to be able to search through all the forms, I think we need to create a view that just has two columns. One for the FilingId and the other column would be an XML data type which would be an XML representation of all the data in each electronic filing. This XML column is what we will be using to set up our full-text index. I think we will be using the FreeTextTable search because we would like to have the results ranked and also the search terms will be entered by end-users.

create view ViewForFullTextSearching with schemabinding as
select f.FilingId,
    (select 
        filing.FilingId
        ,filing.FormYear
        ,filing.FormTypeId
        ,filing.FilingStatusId
        ,filing.FilerSignatureId
        ,filing.SubmissionDate
        ,filing.IsScannedForm
        ,form6Filing.LastDateOfEmployment 'Form6LastDateOfEmployment'
        ,form6Filing.NetWorthDate
        ,form6Filing.NetWorth
        ,form6Asset.Name
        ,form6Asset.Amount
        ,form1Filing.HasServedAsAgent
        ,form1Filing.LastDateOfEmployment 'Form1LastDateOfEmployment'
        ,form1Filing.AmendmentReason
        ,form1Liability.NameOfCreditor
        ,form1Liability.AddressOfCreditor
    from Forms.Filing filing
    left join Form6.FormSixPublishedFilings form6Filing on filing.FilingId = form6Filing.FilingId
    left join Form6.FormSixPublishedAssets form6Asset on form6Filing.FormSixPublishedFilingId = form6Asset.FormSixPublishedFilingId
    left join Form1.FormOnePublishedFilings form1Filing on filing.FilingId = form1Filing.FilingId
    left join Form1.FormOnePublishedLiabilities form1Liability on form1Liability.FormOnePublishedFilingId = form1Filing.FormOnePublishedFilingId
    where filing.FilingId = f.FilingId
    for xml auto, type
    ) as 'Filing'
from Forms.Filing f
GO
create unique clustered index [IX_ViewForFullTextSearching_FilingId] ON [Forms].[ViewForFullTextSearching] ([FilingId])
GO

The above SQL does not actually work because I get this error.

Cannot create an index on view "EthicsFdms.Forms.ViewForFullTextSearching" because it contains one or more subqueries. Consider changing the view to use only joins instead of subqueries. Alternatively, consider not indexing this view.

So, I’m a bit lost on how to create a view with XML to search over if I’m not allowed to create a materialized view that has subqueries.

This view results look like this:

enter image description here

Next we setup our Full Text Catalog and Index on this view:

CREATE FULLTEXT CATALOG [FtcFilings];
GO
CREATE FULLTEXT INDEX ON [Forms].[ViewForFullTextSearching] ([Filing] language 1033) key index [IX_ViewForFullTextSearching_FilingId] on [FtcFilings];
GO

Then I was hoping we could search the filings like so:

select ftt.*
from [Forms].[Filing] filing
inner join freetextable(Forms.ViewForFullTextSearching, Filing, 'APPLE') as ftt on filing.FilingId = ftt.[KEY]
order by rank desc

Right now my challenges are, is it possible to create a materialized view like this? Seems like I can’t because materialized views can’t have subqueries. I’m not sure how to build the XML field w/out subqueries.

If I’m not able to create a materialized view then how else can I create a full-text index that can search electronic Forms?

2

There are 2 best solutions below

2
SQLpro On

You cannot create an indexed view (which is a synchronous materialized view in SQL Server) only if there is a mathematical surjection and all scalar computation is deterministic and precise. By the way OUTER JOIN, SUBQUERIES and set operators (UNION, EXCEPT, INTERSECT) cannot be used...

The best ways to design your systeme is to do it in the reverse way...

  1. Create a persistent computed column using the CONCAT function of all the columns you want to fulltext index.
  2. Create fulltext indexes on the computed columns
  3. Create an UDF that search in the fulltext index on each tables and concatenate the result by UNION, and then aggregate results to compute the rank.

Let me know if you want more assistance to do so...

0
ch_g On

If these form filling data are seldom changed once created and it makes sense in business to store data of form1 and form6 together with its Filling, you may consider to go with document oriented design.

SQL server has good json support now. You can save all the Filling and form info in json, against which you can do full text search, and create views to simulate your current design if needed.

Here is an example -

    create table tst.form (
      form_id int not null identity primary key
     ,content_json nvarchar(max)
    )
    
-- inside content_json, the json may look like -
{
  "filler_user_id": 111,
  "filler_type_id": 1,
  "is_scanned_form": 1,
  "form1": [
    {
      "form1_filling_id": 101, 
      "has_served_as_agent":0, 
      "liabilities": [{"name_of_creditor": "abc"}]
    }
  ]
}

I only modelled form1 related info. You can add form6 related info as needed. Then you can do full text search against this content_json column.

Then create views to simulate your current design if needed -

create or alter view tst.form_base WITH SCHEMABINDING as 
select form_id
      ,convert(int, JSON_VALUE(content_json, '$.filler_user_id')) filler_user_id
      ,convert(int, JSON_VALUE(content_json, '$.filler_type_id')) filler_type_id
      ,convert(bit, JSON_VALUE(content_json, '$.is_scanned_form')) is_scanned_form
      ,JSON_QUERY(content_json, '$.form1') form1_json
  from tst.form

create unique clustered index idx_form_base_form_id on tst.form_base(form_id);
-- you can create index as needed
create index idx_form_base_filler_user_id on tst.form_base(filler_user_id);

create or alter view tst.form1 as 
select form_id
      ,a.form1_filling_id
      ,a.has_served_as_agent
      ,a.liabilities liabilities_json
  from tst.form_base cross apply OPENJSON(form1_json) WITH (
                                 form1_filling_id int '$.form1_filling_id',
                                 has_served_as_agent int '$.has_served_as_agent',
                                 liabilities nvarchar(max) '$.liabilities' as json) a

create or alter view tst.form1_liabilities as 
select form_id
      ,form1_filling_id
      ,a.name_of_creditor
  from tst.form1 cross apply OPENJSON(liabilities_json) WITH (
                                 name_of_creditor nvarchar(max) '$.name_of_creditor') a

Then create some test data -

insert into tst.form (content_json) values ('{
  "filler_user_id": 111,
  "filler_type_id": 1,
  "is_scanned_form": 1,
  "form1": [
    {
      "form1_filling_id": 101, 
      "has_served_as_agent":0, 
      "liabilities": [{"name_of_creditor": "abc"}]
    }
  ]
}');

insert into tst.form (content_json) values ('{
  "filler_user_id": 222,
  "filler_type_id": 1,
  "is_scanned_form": 0,
  "form1": [
    {
      "form1_filling_id": 102, 
      "has_served_as_agent":1, 
      "liabilities": [{"name_of_creditor": "def"}]
    }
  ]
}');

Try it -

select *
  from tst.form1_liabilities

enter image description here