cherrypick: add lock wait time #24885

Open
daviszhen wants to merge 1 commit into matrixorigin:main from daviszhen:0608-pick3.0-to-main-58ad528
Conversation

@daviszhen

Contributor

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #24420

What this PR does / why we need it:

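1. Enable the session-level variable lock_wait_timeout, which MySQL also provides. Users can set this value. The default is `31536000 (1 year)`, which stays compatible with the current behavior. If the user sets a timeout, the user's value is used.
2. Add a lock RPC timeout. If lock_wait_timeout is set, the lock RPC timeout uses the lock_wait_timeout value. This avoids waiting on a lock for hours.
3. Extend lockoption with a lockwaittimeout option.
4. Give the RPC some slack: RPC timeout = lock_wait_timeout + 30s, so the server has enough time to return the timeout result.
5. Enforce the timeout on the server-side async path as well: previously the async path (the one remote locks take) never checked LockWaitTimeout; now waiterEvents.check() checks it periodically and notifies the waiter on timeout.
6. Translate deadline errors on the client: if the RPC deadline has expired but the caller context has not, the cause is a lock timeout rather than a connection problem, so the error is translated directly to ErrLockTimeout.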

Effect after the change:
session 1 begins a transaction and takes a row lock.
session 2 deletes the row, waits for the lock, and times out.

```
session1

MySQL [test]> select * from t1;
+------+
| a    |
+------+
|    2 |
|    3 |
|    4 |
|    1 |
+------+
4 rows in set (0.001 sec)

MySQL [test]>
MySQL [test]>
MySQL [test]> begin;
Query OK, 0 rows affected (0.000 sec)

MySQL [test]> select * from t1 where a = 1 for update;
+------+
| a    |
+------+
|    1 |
+------+
1 row in set (0.001 sec)

MySQL [test]>

```

```
session2

MySQL [test]> select @@session.lock_wait_timeout;
+---------------------+
| @@lock_wait_timeout |
+---------------------+
| 180                 |
+---------------------+
1 row in set (0.000 sec)

MySQL [test]> delete from t1 where a = 1;
ERROR 1105 (HY000): context deadline exceeded
MySQL [test]>

```

Approved by: @iamlinjunhong, @ouyuanning, @XuPeng-SH, @aunjgr, @fengttt
@aptend left a comment

Contributor

LGTM. Reviewed the lock_wait_timeout propagation from session/txn options into lock options, local and remote lock wait handling, async waiter timeout path, and related tests. No blocking issues found.

@XuPeng-SH left a comment

Contributor

I think there is still a real multi-CN/session-semantics hole here. lock_wait_timeout is supposed to behave like a session variable, but on the remote-CN lock path it can get stuck at the value captured when the transaction was created.

The flow is:

  • frontend copies lock_wait_timeout into TxnOptions only when the txn is created (pkg/frontend/txn.go:428-434);
  • lockop.lockWaitTimeout() prefers the current process resolver, but falls back to TxnOptions when that resolver is unavailable (pkg/sql/colexec/lockop/lock_op.go:789-809);
  • remote processes reconstructed from ProcessInfo do not carry a resolve-variable function or arbitrary session sysvars — only basic session info such as user/host/database/version/timezone is serialized (pkg/vm/process/process_codec.go:48-100).

So if a user does something like BEGIN; SET SESSION lock_wait_timeout = 1; ... and the later locking statement runs on a remote CN, the local path can see the new 1-second setting, but the remote path falls back to the stale txn-start value from TxnOptions (possibly the default 1 year). That means the same session-level change behaves differently depending on whether the lock is local or remote, which is exactly the kind of wait this PR is trying to fix.

I'd like to see the remote path get the current session value as well (or the feature/documentation explicitly narrowed to txn-start snapshot semantics), plus a test that changes lock_wait_timeout after BEGIN and then exercises a remote lock wait.

Labels

kind/bug (Something isn't working), kind/enhancement, size/L (Denotes a PR that changes [500,999] lines)
