
HOW A 32-BIT INTEGER OVERFLOW CRASHED TERAMIND ON 1,500 USERS — AND HOW WE FIXED IT

TMU 878 failed to fix a production tmsrv crash loop affecting 1,500 users. We traced the root cause to a 32-bit integer overflow in mon_mail_attachment_id — here's the full incident breakdown and the SQL fix that actually worked.

Background

We manage Teramind deployments for enterprises across Turkey as an authorized Teramind partner. One of our largest on-site deployments runs approximately 1,500 active users on a dedicated Teramind server.

In late May 2026, this environment began exhibiting two simultaneous symptoms that quickly escalated to a critical incident.


The Symptoms

1. tmsrv Crash Loop

The Teramind server process (tmsrv) was crashing and restarting repeatedly. Each restart was preceded by a SIGABRT signal — meaning the process was deliberately aborting itself rather than hitting an external kill signal.

2. work_time Under-Reporting

Employee work time data in the dashboards was severely under-reported. Sessions that clearly showed activity were logging zero or near-zero productive time. At first glance this looked like a monitoring policy misconfiguration.

Both symptoms appeared together, which was the first signal that they shared a common cause.


Initial Response: TMU 878

We opened a case with Teramind support. Their recommendation was to apply TMU 878, a maintenance update that addressed several tmsrv stability issues.

We applied TMU 878. The crash loop continued.

The update did not change the behavior at all. This meant the root cause was something TMU 878 was not designed to address — and we needed to find it ourselves.


Tracing the Root Cause

We pulled a core dump from the crashed tmsrv process and analyzed the stacktrace.

The crash originated in BackgroundWorker::threadFunc(), which called flush_log_and_abort():

#4  flush_log_and_abort()
#5  teramind::server::BackgroundWorker::threadFunc()
#6  libboost_thread.so.1.74.0
#7  start_thread

flush_log_and_abort() is an internal Teramind function that flushes pending log/data writes and then calls abort(). It is triggered when the process encounters a state it considers unrecoverable — typically a database write failure or a constraint violation.

We turned our attention to the database.

The Integer Overflow

Inspecting the PostgreSQL schema, we checked the column types on the highest-traffic tables. The mon_mail_attachment table stores every email attachment event captured by Teramind agents.

SELECT column_name, data_type 
FROM information_schema.columns 
WHERE table_name = 'mon_mail_attachment' 
  AND column_name = 'mon_mail_attachment_id';

Result: integer — a signed 32-bit integer with a maximum value of 2,147,483,647.

We then checked the current sequence value:

SELECT last_value FROM mon_mail_attachment_mon_mail_attachment_id_seq;

The sequence had hit the ceiling. Every new INSERT into mon_mail_attachment was failing with an integer overflow, BackgroundWorker was catching the unrecoverable error and calling flush_log_and_abort(), and tmsrv was restarting — only to crash again on the next email attachment event.
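
The failure mode is easy to reproduce on any scratch PostgreSQL database. The sketch below uses a temporary table and sequence with made-up names (overflow_demo, overflow_demo_seq). It assumes nothing about Teramind's schema beyond an integer key fed by a sequence that can count past the 32-bit range.

```sql
-- Scratch reproduction only: temporary objects, hypothetical names.
CREATE TEMP TABLE overflow_demo (
    id      integer PRIMARY KEY,   -- signed 32-bit, max 2147483647
    payload text
);

-- A standalone sequence is bigint by default, so it can outrun the column.
CREATE TEMP SEQUENCE overflow_demo_seq START WITH 2147483647;

ALTER TABLE overflow_demo
    ALTER COLUMN id SET DEFAULT nextval('overflow_demo_seq');

-- The last ID that still fits in the column.
INSERT INTO overflow_demo (payload) VALUES ('fits');

-- The next value is 2147483648, which the integer column cannot hold.
INSERT INTO overflow_demo (payload) VALUES ('one too many');
-- ERROR:  integer out of range
```

Every insert after that point fails the same way until the column is widened, which matches the repeat-on-every-event behavior described above.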

The work_time under-reporting was a side effect: when the background worker responsible for data persistence crashed mid-cycle, it also dropped the buffered productivity metrics for that window.

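At 1,500 users with active email monitoring, this table accumulates roughly 2–5 million rows per month. Row count alone understates how quickly the ID space drains: PostgreSQL never reuses sequence values, so rolled-back transactions, failed inserts, and deleted rows all consume IDs permanently. On an append-only event table with a 32-bit key, the limit was always going to be hit. It just took long enough that no one had encountered it yet.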
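
One rough way to see how fast IDs are really being consumed is to compare the highest ID in the table with the number of rows it holds. This is a sketch, not a Teramind-provided check: it assumes IDs started at 1 and uses only the table and column named above. count(*) scans the whole table, so expect it to be slow on a large deployment.

```sql
-- Compare consumed IDs with stored rows.
-- A large ids_without_rows value means IDs were burned by deletes,
-- rollbacks, or failed inserts rather than by rows that still exist.
SELECT
    count(*)                                        AS row_count,
    max(mon_mail_attachment_id)                     AS highest_id,
    max(mon_mail_attachment_id)::bigint - count(*)  AS ids_without_rows
FROM mon_mail_attachment;
```

If highest_id is far larger than row_count, the sequence position, not the table size, is the number to watch.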


The Fix

TMU 878 did not alter the data type of mon_mail_attachment_id. The fix was a one-line schema migration:

ALTER TABLE mon_mail_attachment 
  ALTER COLUMN mon_mail_attachment_id TYPE bigint;

bigint is a signed 64-bit integer with a maximum value of 9,223,372,036,854,775,807 — effectively unlimited for any realistic Teramind deployment.
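
Widening the column only helps if the sequence feeding it can also count past 2,147,483,647. On PostgreSQL 10 and later, a sequence created by a serial column is declared AS integer, and it may not be widened automatically when the column type changes. Whether that applies to a given Teramind install is an assumption to verify; in the incident described here, the column change alone was sufficient. A hedged check, using the pg_sequences view (PostgreSQL 10+):

```sql
-- Inspect the sequence's own type and ceiling.
SELECT sequencename, data_type, max_value, last_value
FROM pg_sequences
WHERE sequencename = 'mon_mail_attachment_mon_mail_attachment_id_seq';

-- Only needed if data_type above is integer.
-- If max_value was the integer default, it is raised to the bigint default.
ALTER SEQUENCE mon_mail_attachment_mon_mail_attachment_id_seq AS bigint;
```

If data_type already reads bigint, the second statement is unnecessary.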

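We applied this on May 31, 2026 at approximately 22:00, after business hours, and no separate agent downtime had to be scheduled. One caution for anyone repeating this: an integer-to-bigint change is not metadata-only in PostgreSQL. The column widens from 4 to 8 bytes on disk, so the table and its indexes are rewritten while an ACCESS EXCLUSIVE lock is held. Run it in a quiet window.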

Immediate result:

  • tmsrv crash loop stopped
  • Agent connections stabilized within minutes
  • work_time data returned to normal on the next reporting cycle

Why No Downtime?

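The migration needed no extra maintenance window in our case because it ran after business hours, at a point when tmsrv was already failing on this table. It is worth being precise about the mechanics, though. PostgreSQL skips the table rewrite only when the old type is binary coercible to the new one (for example, varchar to text). integer and bigint have different on-disk widths, so ALTER COLUMN ... TYPE bigint rewrites the table and its indexes under an ACCESS EXCLUSIVE lock, and the run time grows with table size. On a healthy system with a large table, plan a maintenance window.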
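
Before running the type change on a large table, it helps to know how much data is involved and to cap how long the statement may wait for its lock. The following is a minimal sketch using standard PostgreSQL functions and settings; the 5-second timeout is an arbitrary illustrative value, not a recommendation from Teramind.

```sql
-- How much data is involved? (table plus its indexes)
SELECT pg_size_pretty(pg_total_relation_size('mon_mail_attachment'));

BEGIN;
-- Give up quickly if the lock cannot be acquired, instead of
-- queueing behind long-running transactions and blocking others.
SET LOCAL lock_timeout = '5s';

ALTER TABLE mon_mail_attachment
    ALTER COLUMN mon_mail_attachment_id TYPE bigint;
COMMIT;
```

If the ALTER fails with a lock timeout, the transaction is aborted and nothing changes; issue ROLLBACK and retry at a quieter moment.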


The Near-Miss

The fix was applied on Saturday, May 31. Monday, June 2 was the first business day after a public holiday.

Had we not caught and resolved this proactively, the customer would have opened their Teramind dashboards on Monday morning — the first working day after a long weekend — to find:

  • All agent data missing for the entire holiday period
  • Productivity reports showing zero values
  • Behavior alert triggers misfiring due to missing data

For a 1,500-user deployment with management relying on these dashboards, that would have been a serious escalation. The timing made it worse: a support ticket opened on Monday morning after a holiday would have taken hours to reach the right people.


Preventing the Same Issue on Other Deployments

If you run an on-site Teramind deployment with significant email monitoring volume, check whether your mon_mail_attachment_id column is still typed as integer:

SELECT 
  column_name,
  data_type,
  (SELECT last_value 
   FROM mon_mail_attachment_mon_mail_attachment_id_seq) AS current_seq,
  2147483647 AS int_max,
  ROUND(
    (SELECT last_value FROM mon_mail_attachment_mon_mail_attachment_id_seq)::numeric 
    / 2147483647 * 100, 2
  ) AS pct_used
FROM information_schema.columns
WHERE table_name = 'mon_mail_attachment' 
  AND column_name = 'mon_mail_attachment_id';

If data_type is integer and pct_used is above 70%, apply the migration before your next high-volume period, in an off-hours window because the table is locked while the statement runs:

ALTER TABLE mon_mail_attachment 
  ALTER COLUMN mon_mail_attachment_id TYPE bigint;

The same risk applies to any other high-volume event table whose ID column is a 32-bit integer. Tables worth auditing on busy deployments:

Table                  Risk Factor
mon_mail_attachment    High (every attachment = 1 row)
mon_web_file           Medium–High (file uploads/downloads)
mon_keystroke          High on typing-intensive deployments
mon_screen             Medium (screenshot events)
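
Rather than checking tables one at a time, the whole database can be surveyed in two passes. This is a generic PostgreSQL sketch (pg_sequences requires version 10 or later) and makes no assumption about which tables a particular Teramind version ships with.

```sql
-- Pass 1: every sequence, ranked by how close it is to the 32-bit ceiling.
SELECT schemaname,
       sequencename,
       data_type,
       last_value,
       ROUND(last_value::numeric / 2147483647 * 100, 2) AS pct_of_int32
FROM pg_sequences
ORDER BY last_value DESC NULLS LAST;

-- Pass 2: every integer column whose default draws from a sequence.
SELECT table_schema, table_name, column_name, column_default
FROM information_schema.columns
WHERE data_type = 'integer'
  AND column_default LIKE 'nextval(%'
ORDER BY table_schema, table_name;
```

A sequence near the top of the first list whose column appears in the second list is a candidate for the same migration.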

Data Growth Management

As a complementary measure, we deploy an automated cleanup daemon on large on-site installations. The daemon runs daily and removes event data older than 12 months using Teramind's built-in tm.pl utility:

/usr/local/teramind/scripts/tm.pl \
  -func remove_user_data_ex \
  -keep_months 12 \
  -no_disk_check

The -no_disk_check flag is important on Master servers that proxy to Terabi nodes — without it the script may abort with a disk space check error even when space is not the concern.

This does not eliminate the need for the bigint migration: deleting old rows does not rewind the sequence, so new IDs keep climbing toward the 32-bit limit no matter how much history is purged. What it does is prevent unbounded growth and keep database performance healthy long-term.
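
The reason cleanup cannot substitute for the migration is that a sequence only moves forward. A small scratch demonstration with a temporary table (retention_demo is a made-up name) shows the behavior:

```sql
-- Scratch demonstration only: temporary table, hypothetical name.
CREATE TEMP TABLE retention_demo (
    id      serial PRIMARY KEY,
    payload text
);

INSERT INTO retention_demo (payload) VALUES ('a'), ('b'), ('c');  -- ids 1, 2, 3

-- Purge everything, as a retention job would for old data.
DELETE FROM retention_demo;

-- The table is empty, but the sequence continues where it left off.
INSERT INTO retention_demo (payload) VALUES ('d') RETURNING id;   -- returns 4
```

Purging history frees disk space, but the next ID is still one higher than the last one ever issued.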


What We Reported to Teramind

Following the incident, we sent Teramind support a detailed write-up. The key points:

  1. TMU 878 did not address this issue. The integer → bigint migration for mon_mail_attachment_id was not part of the update.
  2. We recommend this migration be included in an official TMU release so deployments are patched through the standard update path.
  3. Existing on-site deployments should be audited proactively — any customer that has been running Teramind with heavy email monitoring for several years is potentially at risk.

Summary

Symptom:         tmsrv crash loop + work_time under-reporting
Root cause:      mon_mail_attachment_id hit 32-bit integer max (2,147,483,647)
Vendor fix:      TMU 878 (did not resolve the issue)
Actual fix:      ALTER COLUMN mon_mail_attachment_id TYPE bigint
Downtime:        None scheduled (run after hours; the type change rewrites the table under an exclusive lock)
Affected users:  ~1,500
Time to fix:     ~15 minutes after root cause identified

If you manage a Teramind on-site deployment and this looks familiar, run the audit query above. The migration is a single statement and permanent; just schedule it off-hours, because the table is rewritten and locked while it runs.

For Teramind deployments in Turkey — licensing, setup, or ongoing management — get in touch.