Алексей Суриков
and 1 contributor

Changes for version 0.06

  • improved Unicode handling: normalization, Unicode character classes in regular expressions
  • speed optimizations
  • synced algorithm with current PHP version
  • changed tests to use an empirically determined threshold
  • data update


download newer data for tokenizer


tokenizer for OpenCorpora project
download newer data for tokenizer
represents a file with vectors