utf8_valid.h

Fast UTF-8 validation in a single C header, conforming to the Unicode and ISO/IEC 10646:2011 specification.

Usage

#include "utf8_valid.h"

// Check if a string is valid UTF-8
bool ok = utf8_valid(src, len);

// Check and find the error position
size_t cursor;
if (!utf8_check(src, len, &cursor)) {
  // cursor is the length of the maximal valid prefix
}

// Length of the maximal subpart of an ill-formed sequence
size_t n = utf8_maximal_subpart(src, len);

API

bool   utf8_valid(const char *src, size_t len);
bool   utf8_valid_ascii(const char *src, size_t len);
bool   utf8_check(const char *src, size_t len, size_t *cursor);
bool   utf8_check_ascii(const char *src, size_t len, size_t *cursor);
size_t utf8_maximal_subpart(const char *src, size_t len);

utf8_valid returns true if src[0..len) is valid UTF-8.

utf8_check returns true if valid. On failure, if cursor is non-NULL, sets *cursor to the length of the maximal valid prefix — i.e. the byte offset of the first invalid sequence.

utf8_maximal_subpart returns the length of the maximal subpart of the ill-formed sequence starting at src, as defined by Unicode 6.3 Table 3-8. The return value is always at least 1. This is intended to be called after utf8_check reports failure, pointing src at the cursor position.

utf8_valid_ascii and utf8_check_ascii are variants with an ASCII fast path that skips the DFA for 16-byte chunks containing only ASCII bytes. Throughput relative to utf8_valid depends on content mix and microarchitecture, see the Performance section for measured results. Otherwise identical in behaviour to utf8_valid and utf8_check.

Streaming API

typedef struct { uint32_t state; } utf8_stream_t;

void   utf8_stream_init(utf8_stream_t *s);
size_t utf8_stream_check(utf8_stream_t* s,
                         const char* src,
                         size_t len,
                         bool eof,
                         size_t* cursor);

For validating input that arrives in chunks.

utf8_stream_init initializes the stream state. Must be called before the first utf8_stream_check.

utf8_stream_check validates the next chunk and returns the number of bytes forming complete, valid sequences. Any bytes beyond the returned count represent an incomplete sequence crossing the chunk boundary and must be prepended to the next chunk by the caller.

If eof is true and the stream does not end on a sequence boundary, the trailing bytes are treated as ill-formed.

On error, returns (size_t)-1 and sets *cursor (if non-NULL) to the byte offset of the invalid or truncated sequence within src. The stream is automatically reset to the initial state so the caller can resume from the next byte without reinitializing.

Requirements

C99 or later

How it works

utf8_valid.h implements a shift-based DFA originally described by Per Vognsen, with 32-bit row packing derived by Dougall Johnson using an SMT solver (see tool/smt_solver.py).

The DFA has 9 states. Each state is assigned a bit offset within a 32-bit integer. Each byte in the input maps to a row in a 256-entry lookup table; the row encodes all 9 state transitions packed into a single uint32_t. The inner loop is a single table lookup and shift:

state = (dfa[byte] >> state) & 31;

The error state is fixed at bit offset 0. Since error transitions shift a zero value into the row, they contribute nothing to the row value. The SMT solver finds offsets for the remaining 8 states that fit without collision inside 32 bits.

The fast path scans 16 bytes at a time, skipping the DFA entirely for pure ASCII chunks.

64-bit variant

utf8_valid64.h provides the same API using a 64-bit table. The error state is fixed at bit offset 0 and contributes nothing to row values. The remaining 8 state offsets are fixed multiples of 6 (6, 12, 18, ..., 48), using 54 bits of the 64-bit row with 10 bits of headroom for future state additions. No solver is needed. This variant is provided for reference and further development.

Performance

Corpus

Documents are retrieved from Wikipedia and converted to plain text, available in benchmark/corpus/.

File	Size	Code points	Distribution	Best `utf8_valid`	Best `utf8_valid_ascii`
ar.txt	25 KB	14K	19% ASCII, 81% 2-byte	4110 MB/s	5221 MB/s
el.txt	102 KB	59K	23% ASCII, 77% 2-byte	4108 MB/s	5009 MB/s
en.txt	80 KB	82K	99.9% ASCII	4108 MB/s	33142 MB/s
ja.txt	176 KB	65K	11% ASCII, 89% 3-byte	4111 MB/s	5221 MB/s
lv.txt	135 KB	127K	92% ASCII, 7% 2-byte	4108 MB/s	6164 MB/s
ru.txt	148 KB	85K	23% ASCII, 77% 2-byte	4110 MB/s	4896 MB/s
sv.txt	94 KB	93K	96% ASCII, 4% 2-byte	4108 MB/s	8776 MB/s

Best numbers from Clang 20 -O3 -march=x86-64-v3 (Raptor Lake).

Raptor Lake (Clang 20, x86-64)

Flags	`utf8_valid`	`utf8_valid_ascii`	Notes
`-O2`	2379 MB/s	2501 MB/s	ascii fast path competitive on multibyte
`-O2 -march=x86-64-v3`	4101 MB/s	3977 MB/s	BMI2 `SHRX` eliminates `CL` constraint
`-O3 -march=x86-64-v3`	4105 MB/s	5104 MB/s	fast path profitable on all content types
`-O3 -march=native`	4101 MB/s	4953 MB/s	similar to x86-64-v3

Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii reaches 29–33 GB/s at -O3.

Raptor Lake (GCC 14, x86-64)

Flags	`utf8_valid`	`utf8_valid_ascii`	Notes
`-O2`	2178 MB/s	1628 MB/s	ascii fast path hurts on multibyte
`-O2 -march=x86-64-v3`	2175 MB/s	1667 MB/s	GCC not generating SHRX at `-O2`
`-O3 -march=x86-64-v3`	3679 MB/s	3480 MB/s	O3 needed for SHRX
`-O3 -march=native`	4101 MB/s	3950 MB/s	native closes gap with Clang

Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii reaches 29–33 GB/s at -O3.

Haswell (Clang 22, x86-64)

Flags	`utf8_valid`	`utf8_valid_ascii`	Notes
`-O2`	2094 MB/s	1670 MB/s	ascii fast path hurts on multibyte
`-O2 -march=x86-64-v3`	3454 MB/s	2569 MB/s	BMI2 `SHRX` eliminates `CL` constraint
`-O3 -march=x86-64-v3`	3452 MB/s	3162 MB/s	gap narrows at O3

Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii reaches 14–19 GB/s at all optimization levels.

Apple M1 (Clang 21, AArch64)

Flags	`utf8_valid`	`utf8_valid_ascii`	Notes
`-O2`	2592 MB/s	2684 MB/s	essentially equal on multibyte
`-O2 -mtune=apple-m1`	2779 MB/s	2764 MB/s	essentially equal on multibyte
`-O3 -mtune=apple-m1`	2787 MB/s	4970 MB/s	O3 unlocks NEON fast path

Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii reaches ~21 GB/s at -O3.

Apple M1 (GCC 15, AArch64)

Flags	`utf8_valid`	`utf8_valid_ascii`	Notes
`-O2`	1483 MB/s	1096 MB/s	GCC significantly lags Clang
`-O2 -mtune=apple-m1`	1485 MB/s	1094 MB/s	mtune no effect
`-O3 -mtune=apple-m1`	2771 MB/s	2785 MB/s	O3 closes gap with Clang O2

Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii reaches ~36 GB/s at -O3, though GCC's non-ASCII throughput lags Clang throughout.

Observations

utf8_valid is the most consistent performer across content mixes and compilers in these measurements. Across all tested platforms, utf8_valid processes approximately one byte per clock cycle at peak throughput, consistent with the unrolled DFA loop executing one table lookup and one shift per byte, pipelined by the out-of-order engine across the 16-byte unroll.
On x86, -march=x86-64-v3 or -march=native enables BMI2 SHRX, which removes the variable-shift dependency on CL and roughly doubles throughput for utf8_valid. GCC requires -O3 to take advantage of this; Clang does so at -O2.
utf8_valid_ascii profitability on multibyte content is microarchitecture dependent. On Haswell it is slower than utf8_valid on multibyte-heavy input at all tested flags. On Raptor Lake with Clang -O3 -march=x86-64-v3 it is faster on all content types, reflecting the wider OOO window and stronger branch predictor on the newer microarchitecture.
On Apple M1 with Clang, utf8_valid_ascii is essentially equal to utf8_valid on multibyte content at -O2, and significantly faster at -O3 where the NEON fast path vectorizes aggressively. No penalty is observed.
On Apple M1 with GCC 15, both implementations lag Clang at -O2; at -O3, utf8_valid reaches near-Clang throughput, while utf8_valid_ascii remains much stronger on near-pure ASCII input.

Benchmark

The benchmark is in benchmark/bench.c. Compile and run with:

make bench
make bench BENCH_OPTFLAGS="-O3 -march=x86-64-v3"
./bench                        # benchmarks benchmark/corpus/
./bench -d <directory>         # benchmarks all .txt files in directory
./bench -f <file>              # benchmarks single file
./bench -s <MB>                # resize input to <MB> before benchmarking

Benchmark mode (mutually exclusive):

Flag	Description
`-t <secs>`	run each implementation for `<secs>` seconds (default: 20)
`-n <reps>`	run each implementation for `<reps>` repetitions
`-b <MB>`	run each implementation until `<MB>` total data processed

Warmup (-w <n>): before timing, each implementation is run for n iterations to warm up caches and branch predictors. By default the warmup count is derived from input size — targeting approximately 256 MB of warmup data, capped between 1 and 100 iterations. Use -w 0 to disable warmup entirely.

Output format: For each file, the header line shows the filename, byte size, code point count, and average bytes per code point (units/point). The code point distribution breaks down the input by Unicode range. Results are sorted slowest to fastest; the percentage shows improvement over the slowest implementation.

sv.txt: 94 KB; 93K code points; 1.04 units/point
  U+0000..U+007F          90K  96.4%
  U+0080..U+07FF           3K   3.5%
  U+0800..U+FFFF          171   0.2%
  hoehrmann                    422 MB/s
  utf8_valid_old              1746 MB/s  (+314%)
  utf8_valid                  2778 MB/s  (+559%)
  utf8_valid_ascii           10012 MB/s  (+2274%)

units/point is a rough content-mix indicator: 1.00 is near-pure ASCII, ~1.7–1.9 is common for 2-byte-heavy text , and ~2.7–3.0 for CJK-heavy text.

The benchmark includes two reference implementations: hoehrmann a widely used table-driven DFA implementation and utf8_valid_old (previous scalar decoder) to track regression and quantify gains from the current DFA approach.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

utf8_valid.h

Usage

API

Streaming API

Requirements

How it works

64-bit variant

Performance

Corpus

Raptor Lake (Clang 20, x86-64)

Raptor Lake (GCC 14, x86-64)

Haswell (Clang 22, x86-64)

Apple M1 (Clang 21, AArch64)

Apple M1 (GCC 15, AArch64)

Observations

Benchmark

License

See Also

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 35 Commits
benchmark		benchmark
test		test
tool		tool
Makefile		Makefile
README.md		README.md
utf8_valid.h		utf8_valid.h
utf8_valid64.h		utf8_valid64.h

Folders and files

Latest commit

History

Repository files navigation

utf8_valid.h

Usage

API

Streaming API

Requirements

How it works

64-bit variant

Performance

Corpus

Raptor Lake (Clang 20, x86-64)

Raptor Lake (GCC 14, x86-64)

Haswell (Clang 22, x86-64)

Apple M1 (Clang 21, AArch64)

Apple M1 (GCC 15, AArch64)

Observations

Benchmark

License

See Also

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages