Unicode | Learn Regex

Unicode: flag "u" and class \p{...}

JavaScript uses Unicode encoding for strings. Most characters are encoded with 2 bytes, which allows at most 65,536 distinct characters.

Some characters need 4 bytes, for example 𝒳 (mathematical X) or 😄 (a smile); see Emoji Full.

As a result, the length of such a character is reported as 2 instead of 1:

alert('😄'.length); // 2
alert('𝒳'.length); // 2
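
The result 2 appears because length counts 16-bit code units, not characters. A minimal sketch of counting by code point instead; the helper name countCodePoints is hypothetical and not part of the original text:

```javascript
// Hypothetical helper: count Unicode code points instead of 16-bit code units.
function countCodePoints(str) {
  // String iteration (used by the spread operator) walks code points,
  // so a 4-byte character is yielded as a single element.
  return [...str].length;
}

console.log('😄'.length);             // 2 (two 16-bit code units)
console.log(countCodePoints('😄'));   // 1
console.log(countCodePoints('a𝒳b'));  // 3
```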

A regular expression with the u (unicode) flag handles such 4-byte characters correctly, treating each one as a single character.
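
A small sketch of the difference the flag makes, using the dot to match exactly one character. The variable names are illustrative only:

```javascript
const smile = '😄';

// Without "u", the dot matches a single 16-bit code unit,
// so a 4-byte character does not fit into ^.$
const withoutFlag = /^.$/.test(smile);

// With "u", the dot matches a whole code point.
const withFlag = /^.$/u.test(smile);

console.log(withoutFlag); // false
console.log(withFlag);    // true
```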

Unicode properties \p{…}

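Every Unicode character has a number of properties, and \p{…} matches characters by property. For example, \p{Letter} (short form \p{L}) matches a letter in any language. Each category has several subcategories: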

Here are the main character categories and their subcategories:

Letter L:

lowercase Ll,
modifier Lm,
titlecase Lt,
uppercase Lu,
other Lo.

Number N:

decimal digit Nd,
letter number Nl,
other No.

Punctuation P:

connector Pc,
dash Pd,
initial quote Pi,
final quote Pf,
open Ps,
close Pe,
other Po.

Mark M (accents etc):

spacing combining Mc,
enclosing Me,
non-spacing Mn.

Symbol S:

currency Sc,
modifier Sk,
math Sm,
other So.

Separator Z:

line Zl,
paragraph Zp,
space Zs.

Other C:

control Cc,
format Cf,
not assigned Cn,
private use Co,
surrogate Cs.

let str = "A ბ ㄱ";

alert( str.match(/\p{L}/gu) ); // A,ბ,ㄱ
alert( str.match(/\p{L}/g) ); // null (no matches, \p doesn't work without the flag "u")
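
The subcategories listed above can be used the same way. A brief sketch, with a made-up sample string; it also uses \P{…} (capital P), the negated form, which is an addition not covered in the text above:

```javascript
const text = 'Hello Привет 123';

// \p{Lu}: uppercase letters in any script
const upper = text.match(/\p{Lu}/gu);

// \p{Nd}: decimal digits
const digits = text.match(/\p{Nd}/gu);

// \P{L} (capital P) is the negation: any character that is not a letter
const nonLetters = text.match(/\P{L}/gu);

console.log(upper);             // [ 'H', 'П' ]
console.log(digits);            // [ '1', '2', '3' ]
console.log(nonLetters.length); // 5 (two spaces and three digits)
```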

Reference:

List all properties of a given character: https://unicode.org/cldr/utility/character.jsp

Examples:

// hexadecimal number, e.g. xAF
let regexp = /x\p{Hex_Digit}\p{Hex_Digit}/u;

alert("number: xAF".match(regexp)); // xAF

// Chinese hieroglyphs

let regexp = /\p{sc=Han}/gu; // returns Chinese hieroglyphs

let str = `Hello Привет 你好 123_456`;

alert( str.match(regexp) ); // 你,好
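
The same approach should work for other scripts. A sketch on the same sample string, assuming the script name Cyrillic and the long form Script= in place of sc=; the variable names are illustrative:

```javascript
const sample = 'Hello Привет 你好 123_456';

// sc=Cyrillic selects characters whose Script property is Cyrillic;
// the + quantifier groups consecutive characters into one match.
const cyrillic = sample.match(/\p{sc=Cyrillic}+/gu);

// Script=Han is the long form of sc=Han.
const han = sample.match(/\p{Script=Han}+/gu);

console.log(cyrillic); // [ 'Привет' ]
console.log(han);      // [ '你好' ]
```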

// currency

let regexp = /\p{Sc}\d/gu;

let str = `Prices: $2, €1, ¥9`;

alert( str.match(regexp) ); // $2,€1,¥9
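
The pattern above matches a single digit after the currency sign. A possible variation for amounts with several digits, using a made-up price list:

```javascript
const prices = 'Prices: $25, €100, ¥9';

// \p{Sc} matches any currency symbol; \d+ takes all the digits that follow.
const amounts = prices.match(/\p{Sc}\d+/gu);

console.log(amounts); // [ '$25', '€100', '¥9' ]
```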