PHP 的字串比較

2022年10月17日 · 閱讀時間約 5 分鐘

Software Enineer, Backend

蛤？PHP 字串比較還要特別寫一篇文章嗎？

會開始研究這個問題，主要是因為在 Laravel Fortify 中使用 hash_equals() 這個函式比對字串。

PHP 的字串比對

PHP 開發者會使用 Equal Operator 執行比對作業：

'Hello' == 'World'; // false
'Hello' == 'Hello'; // true

在大多數情況下，更建議使用 Identical Operator 進行比對：

1 == '1'; // true
1 === '1'; // false
'1' == '1'; // true
'1' === '1'; // true

兩個 Operator 的差異在於它們是否會對資料型態自動轉換。為了聚焦主題，本篇文章僅討論 Identical Operator。

實作

註：本文以 PHP 8.1.11 作為其程式碼的研究範本

在 Zend/zend_operators.h 中定義了 is_identical_function：

ZEND_API zend_result ZEND_FASTCALL is_identical_function(zval *result, zval *op1, zval *op2) /* {{{ */
{
	ZVAL_BOOL(result, zend_is_identical(op1, op2));
	return SUCCESS;
}

其核心為 zend_is_identical（為了簡潔起見我們刪去不是字串比對的部份）：

ZEND_API bool ZEND_FASTCALL zend_is_identical(zval *op1, zval *op2) /* {{{ */
{
	if (Z_TYPE_P(op1) != Z_TYPE_P(op2)) {
		return 0;
	}
	switch (Z_TYPE_P(op1)) {
        // ...
		case IS_STRING:
			return zend_string_equals(Z_STR_P(op1), Z_STR_P(op2));
        // ...
	}
}

在 Zend/zend_string.h 中定義了 zend_string_equals()：

static zend_always_inline bool zend_string_equals(zend_string *s1, zend_string *s2)
{
	return s1 == s2 || zend_string_equal_content(s1, s2);
}

並在同一處定義了 zend_string_equal_content() 比對兩者的值是否相等：

static zend_always_inline bool zend_string_equal_content(zend_string *s1, zend_string *s2)
{
	return ZSTR_LEN(s1) == ZSTR_LEN(s2) && zend_string_equal_val(s1, s2);
}

對於 zend_string_equal_val() 的實作，如果不考慮 CPU 架構及編譯器的優化，採用的是以下程式：

static zend_always_inline bool zend_string_equal_val(zend_string *s1, zend_string *s2)
{
	return !memcmp(ZSTR_VAL(s1), ZSTR_VAL(s2), ZSTR_LEN(s1));
}

註：在 GNUC 且為 i386, x86_64 時會使用組合語言進行優化，這已經超出本篇的討論範圍

hash_equals 字串比對

即便結果是相同的，但在可能會面臨時序攻擊（Timing attack）時，需要使用 hash_equals() 進行字串比對。

這是因為 memcmp 比對的所需時間會取決於字串長度，攻擊者可以利用輸入不同長度的字串進行推斷原始字串的長度，所以在一些敏感的資料中應該使用其它方式進行字串比對（在 FreeBSD 中有 consttime_memequal，但這並非 POSIX 標準的一部份）。

實作

hash_equals 的實作位於 ext/hash/hash.c 中：

PHP_FUNCTION(hash_equals)
{
	zval *known_zval, *user_zval;
	char *known_str, *user_str;
	int result = 0;
	size_t j;

	if (zend_parse_parameters(ZEND_NUM_ARGS(), "zz", &known_zval, &user_zval) == FAILURE) {
		RETURN_THROWS();
	}

	/* We only allow comparing string to prevent unexpected results. */
	if (Z_TYPE_P(known_zval) != IS_STRING) {
		zend_argument_type_error(1, "must be of type string, %s given", zend_zval_type_name(known_zval));
		RETURN_THROWS();
	}

	if (Z_TYPE_P(user_zval) != IS_STRING) {
		zend_argument_type_error(2, "must be of type string, %s given", zend_zval_type_name(user_zval));
		RETURN_THROWS();
	}

	if (Z_STRLEN_P(known_zval) != Z_STRLEN_P(user_zval)) {
		RETURN_FALSE;
	}

	known_str = Z_STRVAL_P(known_zval);
	user_str = Z_STRVAL_P(user_zval);

	/* This is security sensitive code. Do not optimize this for speed. */
	for (j = 0; j < Z_STRLEN_P(known_zval); j++) {
		result |= known_str[j] ^ user_str[j];
	}

	RETURN_BOOL(0 == result);
}

這個實作中，有三點值得注意：

known_zval 與 user_zval 的資料型態都必須是 string
- 轉型的過程中會造成額外的 CPU 時間損耗
known_zval 與 user_zval 的字串長度必須相等
- 因為在比對的過程中是以 known_zval 的長度為準，如果此時長度不一會 Memory Leak
因為無論如何都會將每個字元比對完，所以不存在提前退出導致時間差的情況

補充

在 intel/linux-sgx 中，有針對 consttime_memequal 的實作：

#include <string.h>

int
consttime_memequal(const void *b1, const void *b2, size_t len)
{
	const unsigned char *c1 = b1, *c2 = b2;
	unsigned int res = 0;

	while (len--)
		res |= *c1++ ^ *c2++;

	/*
	 * Map 0 to 1 and [1, 256) to 0 using only constant-time
	 * arithmetic.
	 *
	 * This is not simply `!res' because although many CPUs support
	 * branchless conditional moves and many compilers will take
	 * advantage of them, certain compilers generate branches on
	 * certain CPUs for `!res'.
	 */
	return (1 & ((res - 1) >> 8));
}

其核心與 PHP source code 中並沒有多大差別，但對於是否使用 !res 或 0 == res，Intel 的工程師在註解中給出一個很不錯的解釋。

PHP 的字串比對​

實作​

hash_equals 字串比對​

實作​

補充​

PHP 的字串比對

實作

hash_equals 字串比對

實作

補充