Skip to content

chansen/c-utf8-valid

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

35 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

utf8_valid.h

Fast UTF-8 validation in a single C header, conforming to the Unicode and ISO/IEC 10646:2011 specification.

Usage

#include "utf8_valid.h"

// Check if a string is valid UTF-8
bool ok = utf8_valid(src, len);

// Check and find the error position
size_t cursor;
if (!utf8_check(src, len, &cursor)) {
  // cursor is the length of the maximal valid prefix
}

// Length of the maximal subpart of an ill-formed sequence
size_t n = utf8_maximal_subpart(src, len);

API

bool   utf8_valid(const char *src, size_t len);
bool   utf8_valid_ascii(const char *src, size_t len);
bool   utf8_check(const char *src, size_t len, size_t *cursor);
bool   utf8_check_ascii(const char *src, size_t len, size_t *cursor);
size_t utf8_maximal_subpart(const char *src, size_t len);

utf8_valid returns true if src[0..len) is valid UTF-8.

utf8_check returns true if valid. On failure, if cursor is non-NULL, sets *cursor to the length of the maximal valid prefix — i.e. the byte offset of the first invalid sequence.

utf8_maximal_subpart returns the length of the maximal subpart of the ill-formed sequence starting at src, as defined by Unicode 6.3 Table 3-8. The return value is always at least 1. This is intended to be called after utf8_check reports failure, pointing src at the cursor position.

utf8_valid_ascii and utf8_check_ascii are variants with an ASCII fast path that skips the DFA for 16-byte chunks containing only ASCII bytes. Throughput relative to utf8_valid depends on content mix and microarchitecture, see the Performance section for measured results. Otherwise identical in behaviour to utf8_valid and utf8_check.

Streaming API

typedef struct { uint32_t state; } utf8_stream_t;

void   utf8_stream_init(utf8_stream_t *s);
size_t utf8_stream_check(utf8_stream_t* s,
                         const char* src,
                         size_t len,
                         bool eof,
                         size_t* cursor);

For validating input that arrives in chunks.

utf8_stream_init initializes the stream state. Must be called before the first utf8_stream_check.

utf8_stream_check validates the next chunk and returns the number of bytes forming complete, valid sequences. Any bytes beyond the returned count represent an incomplete sequence crossing the chunk boundary and must be prepended to the next chunk by the caller.

If eof is true and the stream does not end on a sequence boundary, the trailing bytes are treated as ill-formed.

On error, returns (size_t)-1 and sets *cursor (if non-NULL) to the byte offset of the invalid or truncated sequence within src. The stream is automatically reset to the initial state so the caller can resume from the next byte without reinitializing.

Requirements

  • C99 or later

How it works

utf8_valid.h implements a shift-based DFA originally described by Per Vognsen, with 32-bit row packing derived by Dougall Johnson using an SMT solver (see tool/smt_solver.py).

The DFA has 9 states. Each state is assigned a bit offset within a 32-bit integer. Each byte in the input maps to a row in a 256-entry lookup table; the row encodes all 9 state transitions packed into a single uint32_t. The inner loop is a single table lookup and shift:

state = (dfa[byte] >> state) & 31;

The error state is fixed at bit offset 0. Since error transitions shift a zero value into the row, they contribute nothing to the row value. The SMT solver finds offsets for the remaining 8 states that fit without collision inside 32 bits.

The fast path scans 16 bytes at a time, skipping the DFA entirely for pure ASCII chunks.

64-bit variant

utf8_valid64.h provides the same API using a 64-bit table. The error state is fixed at bit offset 0 and contributes nothing to row values. The remaining 8 state offsets are fixed multiples of 6 (6, 12, 18, ..., 48), using 54 bits of the 64-bit row with 10 bits of headroom for future state additions. No solver is needed. This variant is provided for reference and further development.

Performance

Corpus

Documents are retrieved from Wikipedia and converted to plain text, available in benchmark/corpus/.

File Size Code points Distribution Best utf8_valid Best utf8_valid_ascii
ar.txt 25 KB 14K 19% ASCII, 81% 2-byte 4110 MB/s 5221 MB/s
el.txt 102 KB 59K 23% ASCII, 77% 2-byte 4108 MB/s 5009 MB/s
en.txt 80 KB 82K 99.9% ASCII 4108 MB/s 33142 MB/s
ja.txt 176 KB 65K 11% ASCII, 89% 3-byte 4111 MB/s 5221 MB/s
lv.txt 135 KB 127K 92% ASCII, 7% 2-byte 4108 MB/s 6164 MB/s
ru.txt 148 KB 85K 23% ASCII, 77% 2-byte 4110 MB/s 4896 MB/s
sv.txt 94 KB 93K 96% ASCII, 4% 2-byte 4108 MB/s 8776 MB/s

Best numbers from Clang 20 -O3 -march=x86-64-v3 (Raptor Lake).


Raptor Lake (Clang 20, x86-64)

Flags utf8_valid utf8_valid_ascii Notes
-O2 2379 MB/s 2501 MB/s ascii fast path competitive on multibyte
-O2 -march=x86-64-v3 4101 MB/s 3977 MB/s BMI2 SHRX eliminates CL constraint
-O3 -march=x86-64-v3 4105 MB/s 5104 MB/s fast path profitable on all content types
-O3 -march=native 4101 MB/s 4953 MB/s similar to x86-64-v3

Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii reaches 29–33 GB/s at -O3.

Raptor Lake (GCC 14, x86-64)

Flags utf8_valid utf8_valid_ascii Notes
-O2 2178 MB/s 1628 MB/s ascii fast path hurts on multibyte
-O2 -march=x86-64-v3 2175 MB/s 1667 MB/s GCC not generating SHRX at -O2
-O3 -march=x86-64-v3 3679 MB/s 3480 MB/s O3 needed for SHRX
-O3 -march=native 4101 MB/s 3950 MB/s native closes gap with Clang

Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii reaches 29–33 GB/s at -O3.

Haswell (Clang 22, x86-64)

Flags utf8_valid utf8_valid_ascii Notes
-O2 2094 MB/s 1670 MB/s ascii fast path hurts on multibyte
-O2 -march=x86-64-v3 3454 MB/s 2569 MB/s BMI2 SHRX eliminates CL constraint
-O3 -march=x86-64-v3 3452 MB/s 3162 MB/s gap narrows at O3

Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii reaches 14–19 GB/s at all optimization levels.

Apple M1 (Clang 21, AArch64)

Flags utf8_valid utf8_valid_ascii Notes
-O2 2592 MB/s 2684 MB/s essentially equal on multibyte
-O2 -mtune=apple-m1 2779 MB/s 2764 MB/s essentially equal on multibyte
-O3 -mtune=apple-m1 2787 MB/s 4970 MB/s O3 unlocks NEON fast path

Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii reaches ~21 GB/s at -O3.

Apple M1 (GCC 15, AArch64)

Flags utf8_valid utf8_valid_ascii Notes
-O2 1483 MB/s 1096 MB/s GCC significantly lags Clang
-O2 -mtune=apple-m1 1485 MB/s 1094 MB/s mtune no effect
-O3 -mtune=apple-m1 2771 MB/s 2785 MB/s O3 closes gap with Clang O2

Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii reaches ~36 GB/s at -O3, though GCC's non-ASCII throughput lags Clang throughout.

Observations

  • utf8_valid is the most consistent performer across content mixes and compilers in these measurements. Across all tested platforms, utf8_valid processes approximately one byte per clock cycle at peak throughput, consistent with the unrolled DFA loop executing one table lookup and one shift per byte, pipelined by the out-of-order engine across the 16-byte unroll.
  • On x86, -march=x86-64-v3 or -march=native enables BMI2 SHRX, which removes the variable-shift dependency on CL and roughly doubles throughput for utf8_valid. GCC requires -O3 to take advantage of this; Clang does so at -O2.
  • utf8_valid_ascii profitability on multibyte content is microarchitecture dependent. On Haswell it is slower than utf8_valid on multibyte-heavy input at all tested flags. On Raptor Lake with Clang -O3 -march=x86-64-v3 it is faster on all content types, reflecting the wider OOO window and stronger branch predictor on the newer microarchitecture.
  • On Apple M1 with Clang, utf8_valid_ascii is essentially equal to utf8_valid on multibyte content at -O2, and significantly faster at -O3 where the NEON fast path vectorizes aggressively. No penalty is observed.
  • On Apple M1 with GCC 15, both implementations lag Clang at -O2; at -O3, utf8_valid reaches near-Clang throughput, while utf8_valid_ascii remains much stronger on near-pure ASCII input.

Benchmark

The benchmark is in benchmark/bench.c. Compile and run with:

make bench
make bench BENCH_OPTFLAGS="-O3 -march=x86-64-v3"
./bench                        # benchmarks benchmark/corpus/
./bench -d <directory>         # benchmarks all .txt files in directory
./bench -f <file>              # benchmarks single file
./bench -s <MB>                # resize input to <MB> before benchmarking

Benchmark mode (mutually exclusive):

Flag Description
-t <secs> run each implementation for <secs> seconds (default: 20)
-n <reps> run each implementation for <reps> repetitions
-b <MB> run each implementation until <MB> total data processed

Warmup (-w <n>): before timing, each implementation is run for n iterations to warm up caches and branch predictors. By default the warmup count is derived from input size — targeting approximately 256 MB of warmup data, capped between 1 and 100 iterations. Use -w 0 to disable warmup entirely.

Output format: For each file, the header line shows the filename, byte size, code point count, and average bytes per code point (units/point). The code point distribution breaks down the input by Unicode range. Results are sorted slowest to fastest; the percentage shows improvement over the slowest implementation.

sv.txt: 94 KB; 93K code points; 1.04 units/point
  U+0000..U+007F          90K  96.4%
  U+0080..U+07FF           3K   3.5%
  U+0800..U+FFFF          171   0.2%
  hoehrmann                    422 MB/s
  utf8_valid_old              1746 MB/s  (+314%)
  utf8_valid                  2778 MB/s  (+559%)
  utf8_valid_ascii           10012 MB/s  (+2274%)

units/point is a rough content-mix indicator: 1.00 is near-pure ASCII, ~1.7–1.9 is common for 2-byte-heavy text , and ~2.7–3.0 for CJK-heavy text.

The benchmark includes two reference implementations: hoehrmann a widely used table-driven DFA implementation and utf8_valid_old (previous scalar decoder) to track regression and quantify gains from the current DFA approach.

License

BSD 2-Clause License. Copyright (c) 2017–2026 Christian Hansen.

See Also

About

Fast UTF-8 validation in a single C header

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors