1626 字
8 分钟
记一次 Bug 定位:Tailscale Derper 自签 IP 证书连接问题

症状#

搭建 Tailscale Derper 服务,客户端连接到 Derper 时能 ping 通(tailscale netcheck 能够显示出延迟),但无法建立实际的 Derper 连接。

tailscale derper debug 900 报错:certificate signed by unknown authority。

{
 "Info": [
  "Region 900 == \"wuh\""
 ],
 "Warnings": null,
 "Errors": [
  "Error upgrading connection to node \"10.1.2.3\" @ \"10.1.2.3:443\" to TLS over IPv4: tls: failed to verify certificate: x509: certificate signed by unknown authority"
 ]
}

Derper 启动指令:

derper --hostname="10.1.2.3" --certmode manual --certdir /etc/derper/certs --verify-clients

相关功能实现 Issues 和 PR:

初步分析#

和使用常规 CA 签署的域名证书不同,使用自签名证书验证需要提前将 Derper 输出的证书 sha256 hash 写入到 tailnet ACL policy 中,客户端在连接 Derper 时通过这个 hash 认证证书有效性。 例如:

"derpMap": {
  "Regions": {
   "900": {
    "RegionID":   900,
    "RegionCode": "wuh",
    "RegionName": "Wuhan",
    "Nodes": [
     {
      "Name":     "Aliyun",
      "RegionID": 900,
      "HostName": "10.1.2.3",
      "IPv4":     "10.1.2.3",
      "CertName": "sha256-raw:38cb38778e208445160bcb3edcacecff5310b6d77fdfcb71232904ac31b09821",
     },
    ],
   },
  },
 },

此处复用了 CertName 属性,正常情况下这个属性用于覆盖证书中的实体名字。在 Derp Client 建立连接时会进行特判,检查 CertName 是否由 sha256-raw 开头(源代码);如果满足这种情况,会调用 SetConfigExpectedCertHash 绕开 crypto/tls 的证书检查,转而执行以下几项检查:

  • 应只提供一个证书
if len(rawCerts) > 1 {
 return errors.New("unexpected multiple certs presented")
}
  • 证书 Hash 应当与预期 Hash 相符
if fmt.Sprintf("%02x", sha256.Sum256(rawCerts[0])) != wantFullCertSHA256Hex {
  return fmt.Errorf("cert hash does not match expected cert hash")
}
  • 证书的 ServerName 应当与预期的 Host(IP 地址)一致,且处于有效期内
if err := cert.VerifyHostname(c.ServerName); err != nil {
  return fmt.Errorf("cert does not match server name %q: %w", c.ServerName, err)
}
now := time.Now()
if now.After(cert.NotAfter) {
  return fmt.Errorf("cert expired %v", cert.NotAfter)
}
if now.Before(cert.NotBefore) {
  return fmt.Errorf("cert not yet valid until %v; is your clock correct?", cert.NotBefore)
}

Derper 在设定了手动证书模式--certmode manual并且选用 IP 作为 hostname 的情况下,如果没有找到证书,Derper 会帮我们生成一个:

if you run derper with --certmode=manual and --hostname=$IP_ADDRESS we previously permitted, but now we also:

  • auto-generate the self-signed cert for you if it doesn’t yet exist on disk
  • print out the derpmap configuration you need to use that self-signed cert

但 Derper 的 443 端口却发过来了两个 certificate:

 openssl s_client -connect 10.1.2.3:443 -showcerts
CONNECTED(00000003)
depth=0 CN = 10.1.2.3
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 CN = 10.1.2.3
verify error:num=21:unable to verify the first certificate
verify return:1
write W BLOCK
---
Certificate chain
 0 s:/CN=10.1.2.3
   i:/CN=10.1.2.3
-----BEGIN CERTIFICATE-----
MIIB...Kbv
-----END CERTIFICATE-----
 1 s:/CN=derpkey5294bb9649526c28a36c976721ff055b8b6a708f2aa58663449e63fb8470b454
   i:/CN=derpkey5294bb9649526c28a36c976721ff055b8b6a708f2aa58663449e63fb8470b454
-----BEGIN CERTIFICATE-----
MIIB...HEQU=
-----END CERTIFICATE-----
---
Server certificate
subject=/CN=10.1.2.3
issuer=/CN=10.1.2.3
---
No client certificate CA names sent
Server Temp Key: ECDH, X25519, 253 bits
---
SSL handshake has read 1093 bytes and written 351 bytes
---
New, TLSv1/SSLv3, Cipher is AEAD-CHACHA20-POLY1305-SHA256
Server public key is 256 bit
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
SSL-Session:
    Protocol  : TLSv1.3
    Cipher    : AEAD-CHACHA20-POLY1305-SHA256
    Session-ID: 
    Session-ID-ctx: 
    Master-Key: 
    Start Time: 1744094442
    Timeout   : 7200 (sec)
    Verify return code: 21 (unable to verify the first certificate)
---

Tailscale 控制台也给出了对应的警报,并且拒绝了连接:

引入一个 patch,在处理自签证书的位置尝试打印证书数量,证实了以上猜测。

diff --git a/net/tlsdial/tlsdial.go b/net/tlsdial/tlsdial.go
index 4d22383ef..1cef31e87 100644
--- a/net/tlsdial/tlsdial.go
+++ b/net/tlsdial/tlsdial.go
@@ -260,6 +260,7 @@ func SetConfigExpectedCertHash(c *tls.Config, wantFullCertSHA256Hex string) {
  c.InsecureSkipVerify = true
  c.VerifyConnection = nil
  c.VerifyPeerCertificate = func(rawCerts [][]byte, _ [][]*x509.Certificate) error {
+  fmt.Println("!!!!!!!!!!!!!! length of rawCerts", len(rawCerts))
   if len(rawCerts) == 0 {
    return errors.New("no certs presented")
   }
@@ -283,6 +284,7 @@ func SetConfigExpectedCertHash(c *tls.Config, wantFullCertSHA256Hex string) {
   if now.Before(cert.NotBefore) {
    return fmt.Errorf("cert not yet valid until %v; is your clock correct?", cert.NotBefore)
   }
+  fmt.Println("!!!!!!!!!!!!!! cert is valid")
   return nil
  }
 }
derphttp.Client.Recv: connecting to derp-900 (wuh)
!!!!!!!!!!!!!! length of rawCerts 2
magicsock: [0x14000116000] derp.Recv(derp-900): derphttp.Client.Recv connect to region 900 (wuh): unexpected multiple certs presented

证书数量之谜#

这里就出现了两个问题:

  1. 为什么 TLS 握手过程中 Derper 会发过来两个证书?
  2. 为什么tailscale debug derp 900的报错显示的是certificate signed by unknown authority,而不是unexpected multiple certs presented

第二个问题很好回答:因为debug命令是另一套代码,这套 代码 忽视了自签名 IP 证书这个 Feature,直接使用了tls.Client来验证证书(扶额)。

// Upgrade to TLS and verify that works properly.
tlsConn := tls.Client(conn, &tls.Config{
  ServerName: cmp.Or(derpNode.CertName, derpNode.HostName),
})
if err := tlsConn.HandshakeContext(ctx); err != nil {
  st.Errors = append(st.Errors, fmt.Sprintf("Error upgrading connection to node %q @ %q to TLS over IPv4: %v", derpNode.HostName, addr, err))
} else {
  hasIPv4 = true
}

而第一个问题相对复杂一些。检查发来的第二个证书,发现它的 CN 是一个很长的字符串,derpkey5294bb9649526c28a36c976721ff055b8b6a708f2aa58663449e63fb8470b454

在源码中搜索derpkey,发现一个名叫parseMetaCert方法

func parseMetaCert(certs []*x509.Certificate) (serverPub key.NodePublic, serverProtoVersion int) {
 for _, cert := range certs {
  // Look for derpkey prefix added by initMetacert() on the server side.
  if pubHex, ok := strings.CutPrefix(cert.Subject.CommonName, "derpkey"); ok {
   var err error
   serverPub, err = key.ParseNodePublicUntyped(mem.S(pubHex))
   if err == nil && cert.SerialNumber.BitLen() <= 8 { // supports up to version 255
    return serverPub, int(cert.SerialNumber.Int64())
   }
  }
 }
 return key.NodePublic{}, 0
}

这里提到了一个新概念:MetaCerttailscale.com/derp 文档中描述如下:

MetaCert returns the server metadata cert that can be sent by the TLS server to let the client skip a round trip during start-up.

检查 源码,在initMetaCert方法处找到了更为详细的注释。

// initMetacert initialized s.metaCert with a self-signed x509 cert
// encoding this server's public key and protocol version. cmd/derper
// then sends this after the Let's Encrypt leaf + intermediate certs
// after the ServerHello (encrypted in TLS 1.3, not that it matters
// much).
//
// Then the client can save a round trip getting that and can start
// speaking DERP right away. (We don't use ALPN because that's sent in
// the clear and we're being paranoid to not look too weird to any
// middleboxes, given that DERP is an ultimate fallback path). But
// since the post-ServerHello certs are encrypted we can have the
// client also use them as a signal to be able to start speaking DERP
// right away, starting with its identity proof, encrypted to the
// server's public key.
//
// This RTT optimization fails where there's a corp-mandated
// TLS proxy with corp-mandated root certs on employee machines and
// and TLS proxy cleans up unnecessary certs. In that case we just fall
// back to the extra RTT.
func (s *Server) initMetacert() {

正常情况下客户端首先与 Derp 服务器进行 TLS 握手,握手成功后还需要再交换一些 metadata(服务器公钥、协议版本等 Derp 协议需要用到的数据)。

由于在 TLS1.3 之后,ServerHello 之后的所有内容(包括证书)都会被加密,因此可以直接在服务器启动时本地自签一个证书,使用它的 CN 属性传递这些信息,这个证书只相当于信息载体,随着 TLS 握手过程一起发送给 Client,免去了一轮 RTT。

于是乎这里就出现了两个 cert,其中第一个是真正的 HTTPS Certificate,第二个一定是自签的信封~

解决方案#

两个位置都需要修改:

  1. Derp Client 在连接时需要允许最多出现 2 个证书。由于在设置验证规则时还没有真正建立连接,所以很不幸地不能用 TLS 版本做区分。
  2. debug derp代码需要改用 Derp Client 定制的 TLS Client,或者单独增加对自签 IP 证书的支持。我最后选择的实现方式是后者,前者耦合太多。

最终提交给 Tailscale 的 PR 在这里:net/tlsdial, ipn/localapi: fix self-signed IP cert with DERP

记一次 Bug 定位:Tailscale Derper 自签 IP 证书连接问题
https://noise.amono.me/posts/tailscale-mysterious-derper/
作者
Noa Virellia
发布于
2025-04-08
许可协议
CC BY-NC-SA 4.0