Fast UTF-8 validation in a single C header, conforming to the Unicode and ISO/IEC 10646:2011 specification.
#include "utf8_valid.h"
// Check if a string is valid UTF-8
bool ok = utf8_valid(src, len);
// Check and find the error position
size_t cursor;
if (!utf8_check(src, len, &cursor)) {
// cursor is the length of the maximal valid prefix
}
// Length of the maximal subpart of an ill-formed sequence
size_t n = utf8_maximal_subpart(src, len);bool utf8_valid(const char *src, size_t len);
bool utf8_valid_ascii(const char *src, size_t len);
bool utf8_check(const char *src, size_t len, size_t *cursor);
bool utf8_check_ascii(const char *src, size_t len, size_t *cursor);
size_t utf8_maximal_subpart(const char *src, size_t len);utf8_valid returns true if src[0..len) is valid UTF-8.
utf8_check returns true if valid. On failure, if cursor is
non-NULL, sets *cursor to the length of the maximal valid prefix —
i.e. the byte offset of the first invalid sequence.
utf8_maximal_subpart returns the length of the maximal subpart of
the ill-formed sequence starting at src, as defined by Unicode 6.3
Table 3-8. The return value is always at least 1. This is intended to
be called after utf8_check reports failure, pointing src at the
cursor position.
utf8_valid_ascii and utf8_check_ascii are variants with an
ASCII fast path that skips the DFA for 16-byte chunks containing only
ASCII bytes. Throughput relative to utf8_valid depends on content mix
and microarchitecture, see the Performance section for measured results.
Otherwise identical in behaviour to utf8_valid and utf8_check.
typedef struct { uint32_t state; } utf8_stream_t;
void utf8_stream_init(utf8_stream_t *s);
size_t utf8_stream_check(utf8_stream_t* s,
const char* src,
size_t len,
bool eof,
size_t* cursor);For validating input that arrives in chunks.
utf8_stream_init initializes the stream state. Must be called before the first utf8_stream_check.
utf8_stream_check validates the next chunk and returns the number of bytes forming complete, valid sequences. Any bytes beyond the returned count represent an incomplete sequence crossing the chunk boundary and must be prepended to the next chunk by the caller.
If eof is true and the stream does not end on a sequence boundary, the trailing bytes are treated as ill-formed.
On error, returns (size_t)-1 and sets *cursor (if non-NULL) to the byte offset of the invalid or truncated sequence within src. The stream is automatically reset to the initial state so the caller can resume from the next byte without reinitializing.
- C99 or later
utf8_valid.h implements a shift-based DFA originally described by
Per Vognsen,
with 32-bit row packing derived by
Dougall Johnson
using an SMT solver (see tool/smt_solver.py).
The DFA has 9 states. Each state is assigned a bit offset within a 32-bit integer. Each byte in the input maps to a row in a 256-entry lookup table; the row encodes all 9 state transitions packed into a single uint32_t. The inner loop is a single table lookup and shift:
state = (dfa[byte] >> state) & 31;The error state is fixed at bit offset 0. Since error transitions shift a zero value into the row, they contribute nothing to the row value. The SMT solver finds offsets for the remaining 8 states that fit without collision inside 32 bits.
The fast path scans 16 bytes at a time, skipping the DFA entirely for pure ASCII chunks.
utf8_valid64.h provides the same API using a 64-bit table. The error state is fixed at bit offset 0 and contributes nothing to row values. The remaining 8 state offsets are fixed multiples of 6 (6, 12, 18, ..., 48), using 54 bits of the 64-bit row with 10 bits of headroom for future state additions. No solver is needed. This variant is provided for reference and further development.
Documents are retrieved from Wikipedia and converted to plain text, available in benchmark/corpus/.
| File | Size | Code points | Distribution | Best utf8_valid |
Best utf8_valid_ascii |
|---|---|---|---|---|---|
| ar.txt | 25 KB | 14K | 19% ASCII, 81% 2-byte | 4110 MB/s | 5221 MB/s |
| el.txt | 102 KB | 59K | 23% ASCII, 77% 2-byte | 4108 MB/s | 5009 MB/s |
| en.txt | 80 KB | 82K | 99.9% ASCII | 4108 MB/s | 33142 MB/s |
| ja.txt | 176 KB | 65K | 11% ASCII, 89% 3-byte | 4111 MB/s | 5221 MB/s |
| lv.txt | 135 KB | 127K | 92% ASCII, 7% 2-byte | 4108 MB/s | 6164 MB/s |
| ru.txt | 148 KB | 85K | 23% ASCII, 77% 2-byte | 4110 MB/s | 4896 MB/s |
| sv.txt | 94 KB | 93K | 96% ASCII, 4% 2-byte | 4108 MB/s | 8776 MB/s |
Best numbers from Clang 20 -O3 -march=x86-64-v3 (Raptor Lake).
| Flags | utf8_valid |
utf8_valid_ascii |
Notes |
|---|---|---|---|
-O2 |
2379 MB/s | 2501 MB/s | ascii fast path competitive on multibyte |
-O2 -march=x86-64-v3 |
4101 MB/s | 3977 MB/s | BMI2 SHRX eliminates CL constraint |
-O3 -march=x86-64-v3 |
4105 MB/s | 5104 MB/s | fast path profitable on all content types |
-O3 -march=native |
4101 MB/s | 4953 MB/s | similar to x86-64-v3 |
Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii
reaches 29–33 GB/s at -O3.
| Flags | utf8_valid |
utf8_valid_ascii |
Notes |
|---|---|---|---|
-O2 |
2178 MB/s | 1628 MB/s | ascii fast path hurts on multibyte |
-O2 -march=x86-64-v3 |
2175 MB/s | 1667 MB/s | GCC not generating SHRX at -O2 |
-O3 -march=x86-64-v3 |
3679 MB/s | 3480 MB/s | O3 needed for SHRX |
-O3 -march=native |
4101 MB/s | 3950 MB/s | native closes gap with Clang |
Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii
reaches 29–33 GB/s at -O3.
| Flags | utf8_valid |
utf8_valid_ascii |
Notes |
|---|---|---|---|
-O2 |
2094 MB/s | 1670 MB/s | ascii fast path hurts on multibyte |
-O2 -march=x86-64-v3 |
3454 MB/s | 2569 MB/s | BMI2 SHRX eliminates CL constraint |
-O3 -march=x86-64-v3 |
3452 MB/s | 3162 MB/s | gap narrows at O3 |
Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii
reaches 14–19 GB/s at all optimization levels.
| Flags | utf8_valid |
utf8_valid_ascii |
Notes |
|---|---|---|---|
-O2 |
2592 MB/s | 2684 MB/s | essentially equal on multibyte |
-O2 -mtune=apple-m1 |
2779 MB/s | 2764 MB/s | essentially equal on multibyte |
-O3 -mtune=apple-m1 |
2787 MB/s | 4970 MB/s | O3 unlocks NEON fast path |
Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii
reaches ~21 GB/s at -O3.
| Flags | utf8_valid |
utf8_valid_ascii |
Notes |
|---|---|---|---|
-O2 |
1483 MB/s | 1096 MB/s | GCC significantly lags Clang |
-O2 -mtune=apple-m1 |
1485 MB/s | 1094 MB/s | mtune no effect |
-O3 -mtune=apple-m1 |
2771 MB/s | 2785 MB/s | O3 closes gap with Clang O2 |
Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii
reaches ~36 GB/s at -O3, though GCC's non-ASCII throughput lags Clang throughout.
utf8_validis the most consistent performer across content mixes and compilers in these measurements. Across all tested platforms,utf8_validprocesses approximately one byte per clock cycle at peak throughput, consistent with the unrolled DFA loop executing one table lookup and one shift per byte, pipelined by the out-of-order engine across the 16-byte unroll.- On x86,
-march=x86-64-v3or-march=nativeenables BMI2SHRX, which removes the variable-shift dependency onCLand roughly doubles throughput forutf8_valid. GCC requires-O3to take advantage of this; Clang does so at-O2. utf8_valid_asciiprofitability on multibyte content is microarchitecture dependent. On Haswell it is slower thanutf8_validon multibyte-heavy input at all tested flags. On Raptor Lake with Clang-O3 -march=x86-64-v3it is faster on all content types, reflecting the wider OOO window and stronger branch predictor on the newer microarchitecture.- On Apple M1 with Clang,
utf8_valid_asciiis essentially equal toutf8_validon multibyte content at-O2, and significantly faster at-O3where the NEON fast path vectorizes aggressively. No penalty is observed. - On Apple M1 with GCC 15, both implementations lag Clang at
-O2; at-O3,utf8_validreaches near-Clang throughput, whileutf8_valid_asciiremains much stronger on near-pure ASCII input.
The benchmark is in benchmark/bench.c. Compile and run with:
make bench
make bench BENCH_OPTFLAGS="-O3 -march=x86-64-v3"
./bench # benchmarks benchmark/corpus/
./bench -d <directory> # benchmarks all .txt files in directory
./bench -f <file> # benchmarks single file
./bench -s <MB> # resize input to <MB> before benchmarkingBenchmark mode (mutually exclusive):
| Flag | Description |
|---|---|
-t <secs> |
run each implementation for <secs> seconds (default: 20) |
-n <reps> |
run each implementation for <reps> repetitions |
-b <MB> |
run each implementation until <MB> total data processed |
Warmup (-w <n>): before timing, each implementation is run for n
iterations to warm up caches and branch predictors. By default the warmup
count is derived from input size — targeting approximately 256 MB of warmup
data, capped between 1 and 100 iterations. Use -w 0 to disable warmup
entirely.
Output format: For each file, the header line shows the filename,
byte size, code point count, and average bytes per code point
(units/point). The code point distribution breaks down the input by
Unicode range. Results are sorted slowest to fastest; the percentage
shows improvement over the slowest implementation.
sv.txt: 94 KB; 93K code points; 1.04 units/point
U+0000..U+007F 90K 96.4%
U+0080..U+07FF 3K 3.5%
U+0800..U+FFFF 171 0.2%
hoehrmann 422 MB/s
utf8_valid_old 1746 MB/s (+314%)
utf8_valid 2778 MB/s (+559%)
utf8_valid_ascii 10012 MB/s (+2274%)
units/point is a rough content-mix indicator: 1.00 is near-pure ASCII,
~1.7–1.9 is common for 2-byte-heavy text , and ~2.7–3.0 for CJK-heavy text.
The benchmark includes two reference implementations: hoehrmann a widely used
table-driven DFA implementation and utf8_valid_old (previous scalar decoder)
to track regression and quantify gains from the current DFA approach.
BSD 2-Clause License. Copyright (c) 2017–2026 Christian Hansen.
- Flexible and Economical UTF-8 Decoder by Björn Höhrmann
- Branchless UTF-8 Decoder by Chris Wellons
- simdutf Unicode validation and transcoding at SIMD speed