#unicode #grapheme #unicode-text #word #boundary #character #text

unic-segment

UNIC — Unicode Text Segmentation Algorithms

3 releases (breaking)

0.9.0 Mar 3, 2019
0.8.0 Jan 2, 2019
0.7.0 Feb 7, 2018

#334 in Internationalization (i18n)

Download history 76408/week @ 2024-01-29 79081/week @ 2024-02-05 81196/week @ 2024-02-12 74427/week @ 2024-02-19 83337/week @ 2024-02-26 81809/week @ 2024-03-04 76464/week @ 2024-03-11 81015/week @ 2024-03-18 87525/week @ 2024-03-25 88417/week @ 2024-04-01 76676/week @ 2024-04-08 79081/week @ 2024-04-15 75859/week @ 2024-04-22 82640/week @ 2024-04-29 78791/week @ 2024-05-06 83969/week @ 2024-05-13

326,239 downloads per month
Used in 720 crates (8 directly)

MIT/Apache

110KB
1.5K SLoC

UNIC — Unicode Text Segmentation Algorithms

Crates.io Documentation

This UNIC component implements algorithms from Unicode® Standard Annex #29 - Unicode Text Segmentation, used for detecting boundaries of text element boundaries, such as user-perceived characters (a.k.a. Grapheme Clusters), Words, and Sentences.

Notes

Initial code for this component is based on unicode-segmentation.


lib.rs:

UNIC — Unicode Text Segmentation Algorithms

A component of unic: Unicode and Internationalization Crates for Rust.

This UNIC component implements algorithms from Unicode® Standard Annex #29 - Unicode Text Segmentation, used for detecting boundaries of text element boundaries, such as user-perceived characters (a.k.a. Grapheme Clusters), Words, and Sentences (last one not implemented yet).

Examples

assert_eq!(
    Graphemes::new("a\u{310}e\u{301}o\u{308}\u{332}").collect::<Vec<&str>>(),
    &["a\u{310}", "e\u{301}", "o\u{308}\u{332}"]
);

assert_eq!(
    Graphemes::new("a\r\nb🇺🇳🇮🇨").collect::<Vec<&str>>(),
    &["a", "\r\n", "b", "🇺🇳", "🇮🇨"]
);

assert_eq!(
    GraphemeIndices::new("a̐éö̲\r\n").collect::<Vec<(usize, &str)>>(),
    &[(0, ""), (3, ""), (6, "ö̲"), (11, "\r\n")]
);

fn has_alphanumeric(s: &&str) -> bool {
    s.chars().any(|ch| ch.is_alphanumeric())
}

assert_eq!(
    Words::new(
        "The quick (\"brown\") fox can't jump 32.3 feet, right?",
        has_alphanumeric,
    ).collect::<Vec<&str>>(),
    &["The", "quick", "brown", "fox", "can't", "jump", "32.3", "feet", "right"]
);

assert_eq!(
    WordBounds::new("The quick (\"brown\")  fox").collect::<Vec<&str>>(),
    &["The", " ", "quick", " ", "(", "\"", "brown", "\"", ")", " ", " ", "fox"]
);

assert_eq!(
    WordBoundIndices::new("Brr, it's 29.3°F!").collect::<Vec<(usize, &str)>>(),
    &[
        (0, "Brr"),
        (3, ","),
        (4, " "),
        (5, "it's"),
        (9, " "),
        (10, "29.3"),
        (14, "°"),
        (16, "F"),
        (17, "!")
    ]
);

Dependencies