http://blog.lingr.com/2007/05/a_new_plugin.html for detail).
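I'm sure many Rails fans have heard of Ferret. Ferret is a full-text search engine based on Lucene. With "becoming a ferret" (acts_as_ferret) installed, the mysterious one-liner that lightweight folks love most, O'Reilly's Ferret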
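# A token is a run of ASCII letters and 2-byte UTF-8 sequences, a run of digits,
# or one single 3-byte UTF-8 sequence.
GENERIC_ANALYSIS_REGEX = /([a-zA-Z]|[\xc0-\xdf][\x80-\xbf])+|[0-9]+|[\xe0-\xef][\x80-\xbf][\x80-\xbf]/
GENERIC_ANALYZER = Ferret::Analysis::RegExpAnalyzer.new(GENERIC_ANALYSIS_REGEX, true)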
Then, in the model you want to make searchable, add:
acts_as_ferret({:fields => [ FIELDS_YOU_WANT_TO_INDEX ] }, { :analyzer => GENERIC_ANALYZER })
Model.find_by_contents("hola")
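the regex that jcode.rb uses to handle UTF-8 (that is, it takes advantage of how UTF-8 is encoded) to pick out the characters that actually fall in U+80 to U+7FF and in U+800 to U+FFFF. Of course,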
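To make the two ranges concrete, here is a small sketch in plain Ruby (no Ferret needed) that reports how a character is laid out in UTF-8. The helper name `utf8_shape` is made up for this illustration, and the snippet assumes a Ruby version with built-in string encodings. Code points from U+80 to U+7FF should come out as two bytes with a lead byte inside 0xC0..0xDF, and code points from U+800 to U+FFFF as three bytes with a lead byte inside 0xE0..0xEF, which are the byte classes the regex relies on.

```ruby
# Hypothetical helper for illustration only; not part of Ferret or jcode.rb.
# Reports the code point, the UTF-8 byte length, and the lead byte of one character.
def utf8_shape(char)
  bytes = char.encode(Encoding::UTF_8).bytes
  { codepoint: char.ord, length: bytes.length, lead: bytes.first }
end

# A few characters taken from the test string used later in this post.
["C", "é", "Ö", "是", "サ"].each do |ch|
  shape = utf8_shape(ch)
  puts format("U+%04X  %d byte(s)  lead byte 0x%02X",
              shape[:codepoint], shape[:length], shape[:lead])
end
```

Note that the range 0xC0..0xDF in the regex is slightly wider than what valid UTF-8 ever produces as a lead byte, so it is a convenient superset rather than a validator.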
def test_token_stream(token_stream)
  puts "Start | End | PosInc | Text"
  while t = token_stream.next
    puts "%5d |%4d |%5d | %s" % [t.start, t.end, t.pos_inc, t.text]
  end
end
Then, in irb:
str = "Café Österreich 是一間開在仮想現実空間(サイバースペース)裡的咖啡店"
test_token_stream(Ferret::Analysis::RegExpTokenizer.new(str, GENERIC_ANALYSIS_REGEX))
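If Ferret is not at hand, the splitting behaviour of the pattern can be approximated in plain Ruby. The sketch below is an assumption-laden stand-in, not Ferret's tokenizer: `BYTE_REGEX` and `byte_tokens` are names invented here, the `/n` flag and `String#b` are used so that the pattern matches raw bytes on a Ruby with string encodings, and the offsets it reports are byte offsets that may or may not line up with the numbers Ferret prints.

```ruby
# Same pattern as GENERIC_ANALYSIS_REGEX, compiled as a binary (byte-wise) regex.
BYTE_REGEX = /([a-zA-Z]|[\xc0-\xdf][\x80-\xbf])+|[0-9]+|[\xe0-\xef][\x80-\xbf][\x80-\xbf]/n

# Hypothetical helper: returns [start_byte, end_byte, text] for every match.
def byte_tokens(str)
  tokens = []
  # String#b gives a binary copy, so the regex sees bytes rather than characters.
  str.b.scan(BYTE_REGEX) do
    m = Regexp.last_match
    # Re-tag the matched bytes as UTF-8 so the token prints as readable text.
    tokens << [m.begin(0), m.end(0), m[0].force_encoding(Encoding::UTF_8)]
  end
  tokens
end

puts "Start | End | Text"
byte_tokens("Café 是 42").each do |start, finish, text|
  puts "%5d |%4d | %s" % [start, finish, text]
end
```

Latin words with accented letters stay together as one token, while each three-byte character (most CJK text) becomes a token of its own.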
lukhnos :: May.17.2007 ::
tekhnologia 技術或者藝術 ::