Last Updated: February 25, 2016
·
545
· huy

Postgresql - Correcting bad logging data

When bad logging data from different versions of an application come in, we want to force update them to a unified value. For example:

  • Old iOS app sends source = 'ios', partner = 'viki'

  • New iOS app sends app_id = '10004a'

We want to update the data in table (app_id, source, partner) to values ('10004a', 'ios', 'viki');

CREATE TABLE logging(
  app_id VARCHAR(15),
  source VARCHAR(50),
  partner VARCHAR(50)
);

Question: Which is faster, performing 1 update or 2 updates?

1 update:

UPDATE logging M
SET source = 'ios',
    partner = 'viki',
    app_id = '10004a'
WHERE app_id = '10004a'
  OR  (source = 'ios' AND partner = 'viki');

2 updates:

UPDATE logging M
SET source = 'ios',
    partner = 'viki'
WHERE app_id = '10004a'
  AND (source != 'ios' OR partner != 'viki');

UPDATE logging M
SET app_id = '10004a'
WHERE (source = 'ios' AND partner = 'viki')
  AND app_id != '10004a';

Answer: With indexes, both are equally about the same. Possibly because:

  • It requires just 1 scan for the first.

  • It requires less assignment operations for the second.

  • With 2 updates, subsequent rerun will be much faster because it does not have to perform assignments on already-correct rows.

The Test:

-- Creating 2 identical test tables

CREATE TABLE logging_1(
  app_id VARCHAR(15),
  source VARCHAR(50),
  partner VARCHAR(50)
);
INSERT INTO logging_1 SELECT '10004a', '', '' FROM generate_series(1, 1000000);
INSERT INTO logging_1 SELECT '', 'ios', 'viki' FROM generate_series(1, 1000000);

CREATE INDEX ON logging_1(app_id);
CREATE INDEX ON logging_1(source);
CREATE INDEX ON logging_1(partner);


CREATE TABLE logging_2(
  app_id VARCHAR(15),
  source VARCHAR(50),
  partner VARCHAR(50)
);
INSERT INTO logging_2 SELECT '10004a', '', '' FROM generate_series(1, 1000000);
INSERT INTO logging_2 SELECT '', 'ios', 'viki' FROM generate_series(1, 1000000);

CREATE INDEX ON logging_2(app_id);
CREATE INDEX ON logging_2(source);
CREATE INDEX ON logging_2(partner);


-- 1 update
UPDATE logging_1
SET source = 'ios',
    partner = 'viki',
    app_id = '10004a'
WHERE app_id = '10004a'
  OR  (source = 'ios' AND partner = 'viki');

# UPDATE 2000000
# Time: 16790.319 ms


-- 2 updates
UPDATE logging_2
SET source = 'ios',
    partner = 'viki'
WHERE app_id = '10004a'
  AND (source != 'ios' OR partner != 'viki');

# UPDATE 1000000
# Time: 7591.592 ms


UPDATE logging_2
SET app_id = '10004a'
WHERE (source = 'ios' AND partner = 'viki')
  AND app_id != '10004a';

# UPDATE 1000000
# Time: 10357.224 ms