http://blog.lingr.com/2007/05/a_new_plugin.html for detail).
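Many Rails fans have probably heard of Ferret, a full-text search engine developed on the basis of Lucene. Once you install the "become a ferret" plugin (acts_as_ferret) [...] the mysterious one-liner that lovers of lightweight tools adore [...] O'Reilly's Ferret [...]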
GENERIC_ANALYSIS_REGEX = /([a-zA-Z]|[\xc0-\xdf][\x80-\xbf])+|[0-9]+|[\xe0-\xef][\x80-\xbf][\x80-\xbf]/
GENERIC_ANALYZER = Ferret::Analysis::RegExpAnalyzer.new(GENERIC_ANALYSIS_REGEX, true)
Then, in the model you want to make searchable, add:
acts_as_ferret({:fields => [ FIELDS_YOU_WANT_TO_INDEX ] }, { :analyzer => GENERIC_ANALYZER })
Model.find_by_contents("hola")
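The byte ranges in GENERIC_ANALYSIS_REGEX line up with how UTF-8 lays out its bytes: a lead byte in 0xC0..0xDF starts a two-byte sequence (U+0080 to U+07FF), a lead byte in 0xE0..0xEF starts a three-byte sequence (U+0800 to U+FFFF), and 0x80..0xBF are continuation bytes. Here is a minimal plain-Ruby sketch of that mapping. It does not use Ferret at all, and `utf8_class` is a hypothetical helper name chosen only for this illustration.

```ruby
# Hypothetical helper (not part of Ferret or acts_as_ferret):
# classify a single character by the lead byte of its UTF-8 encoding.
def utf8_class(char)
  lead = char.encode("UTF-8").bytes.first
  case lead
  when 0x00..0x7f then :ascii       # one byte,    U+0000..U+007F
  when 0xc0..0xdf then :two_byte    # two bytes,   U+0080..U+07FF
  when 0xe0..0xef then :three_byte  # three bytes, U+0800..U+FFFF
  else :other                       # four-byte leads, stray bytes, empty input
  end
end

# Latin letters with diacritics fall in the two-byte class,
# while CJK ideographs and kana fall in the three-byte class.
["a", "\u00e9", "\u662f"].each do |c|
  puts "#{c}: #{utf8_class(c)}"
end
```

This is why the pattern can group two-byte characters together with ASCII letters into words (so "Café" stays one token) while emitting each three-byte character as its own token.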
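the regex that jcode.rb uses to handle UTF-8 (in other words, it exploits the way UTF-8 is structured) to pick out the characters that actually fall in U+80 ~ U+7FF and in U+800 ~ U+FFFF. Of course, [...]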
def test_token_stream(token_stream)
  puts "Start | End | PosInc | Text"
  while t = token_stream.next
    puts "%5d |%4d |%5d | %s" % [t.start, t.end, t.pos_inc, t.text]
  end
end
Then, in irb:
str = "Café Österreich 是一間開在仮想現実空間(サイバースペース)裡的咖啡店"
test_token_stream(Ferret::Analysis::RegExpTokenizer.new(str, GENERIC_ANALYSIS_REGEX))
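If Ferret is not at hand, the splitting behaviour of the pattern can be approximated in plain Ruby. The sketch below is only an approximation under a few assumptions: `BYTE_PATTERN` and `byte_tokens` are hypothetical names, the `/n` flag and `String#b` are used so that current Ruby versions treat both the pattern and the text as raw bytes, and the offsets it reports are byte offsets. It does not reproduce Ferret's position increments and may differ from what RegExpTokenizer actually prints.

```ruby
# Byte-oriented copy of the pattern; /n marks the regex as binary (ASCII-8BIT).
BYTE_PATTERN = /([a-zA-Z]|[\xc0-\xdf][\x80-\xbf])+|[0-9]+|[\xe0-\xef][\x80-\xbf][\x80-\xbf]/n

# Hypothetical helper: return [start_byte, end_byte, text] for every match.
def byte_tokens(text)
  bytes = text.b            # binary copy, so character indexes are byte indexes
  tokens = []
  pos = 0
  while (m = BYTE_PATTERN.match(bytes, pos))
    # Re-tag the matched bytes as UTF-8 so they print as readable text.
    tokens << [m.begin(0), m.end(0), m[0].force_encoding("UTF-8")]
    pos = m.end(0)          # every alternative consumes at least one byte
  end
  tokens
end

puts "Start | End | Text"
byte_tokens("Caf\u00e9 \u662f 42").each do |s, e, t|
  puts "%5d |%4d | %s" % [s, e, t]
end
```

Runs of Latin letters (including two-byte accented ones) and runs of digits come out as single tokens, and each three-byte character becomes a token of its own, which is what makes CJK text searchable character by character.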
lukhnos :: May.17.2007 ::
tekhnologia (technology or art) ::