Does replacing string column with repeating values at some point with an int FK have any performance benefits in RDBMS?

44 Views Asked by At

In order to store the country information for a person I did:

    CREATE TABLE test
  (
     id      INT IDENTITY(1, 1),
     name    VARCHAR(100) NOT NULL,
     country VARCHAR(100) NOT NULL,
     PRIMARY KEY(id)
  );

INSERT INTO test
VALUES      ('Amy', 'Mexico'),
            ('Tom', 'US'),
            ('Mark', 'Morocco'),
            ('Izzy', 'Mexico');
-- milions of other rows

A lot of the countries will repeat themselves in the country column.

Another option would be to get the country in it's own table and reference the country_id as a FK in the test table:

CREATE TABLE countries
  (
     id   INT IDENTITY(1, 1),
     name VARCHAR(100) NOT NULL,
     PRIMARY KEY(id)
  );

CREATE TABLE test
  (
     id         INT IDENTITY(1, 1),
     name       VARCHAR(100) NOT NULL,
     country_id INT NOT NULL,
     PRIMARY KEY(id),
     FOREIGN KEY(country_id) REFERENCES countries(id)
  ); 

My question is: is there benefit of doing the second scenario from performance point of view/ indexes point of view or it's just cumbersome to do so? ( I know I am not breaking any normal form with the first scenario)

2

There are 2 best solutions below

1
Tim Biegeleisen On

The second version has an obvious performance benefit, namely that only multiple country IDs need be stored for each person-country relationship. This, in turn, means that your storage requirements for the tables and indices would be reduced.

Because the index of the second version would use the integer country ID rather than a string name, I would expect index performance to improve. Your database doesn't "know" that there are only a fixed number of countries. So, the index for the first version would be a B-tree splitting across text, rather than integers. And the former is more verbose than the latter.

1
NickW On

The second one will, in general, have poorer performance as joins are always costly. So any query of test that includes the country would need to join to the country table. The storage difference is not going to affect query performance.

In the real world, a country entity would likely have multiple attributes (iso codes, regions, populations, etc) and therefore would need to be normalised into its own entity. You’d then need to join to it, and have the performance hit of joining, but that is, in general, outweighed by the benefits of normalisation - which is why we use normalisation rather than “one big table”