http://blog.lingr.com/2007/05/a_new_plugin.html for detail).
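Many Rails fans have probably heard of Ferret, a full-text search engine based on Lucene. Once acts_as_ferret ("acts as a ferret") is installed ... the mysterious one-liner beloved by fans of lightweight code ... O'Reilly's Ferret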
GENERIC_ANALYSIS_REGEX = /([a-zA-Z]|[\xc0-\xdf][\x80-\xbf])+|[0-9]+|[\xe0-\xef][\x80-\xbf][\x80-\xbf]/
GENERIC_ANALYZER = Ferret::Analysis::RegExpAnalyzer.new(GENERIC_ANALYSIS_REGEX, true)
Then, in the model you want to make searchable, add:
acts_as_ferret({:fields => [ FIELDS_YOU_WANT_TO_INDEX ] }, { :analyzer => GENERIC_ANALYZER })
Model.find_by_contents("hola")
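the UTF-8 regex from jcode.rb (in other words, it exploits the byte structure of UTF-8) to pick out the characters that are actually in the ranges U+0080 to U+07FF and U+0800 to U+FFFF. Of course,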
def test_token_stream(token_stream)
  puts "Start | End | PosInc | Text"
  while t = token_stream.next
    puts "%5d |%4d |%5d | %s" % [t.start, t.end, t.pos_inc, t.text]
  end
end
Then, in irb:
str = "Café Österreich 是一間開在仮想現実空間(サイバースペース)裡的咖啡店"
test_token_stream(Ferret::Analysis::RegExpTokenizer.new(str, GENERIC_ANALYSIS_REGEX))
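To see what the pattern matches without Ferret installed, here is a minimal sketch in plain Ruby. It assumes Ruby 2.0 or later, where byte escapes such as \xc0 are only accepted in a binary regex, hence the /n flag and the conversion of the input to a binary string. SKETCH_REGEX and sketch_tokens are hypothetical names made up for this illustration and are not part of Ferret; the group is written as non-capturing so that String#scan returns whole matches.

```ruby
# Byte-level pattern in the spirit of GENERIC_ANALYSIS_REGEX.
# - ASCII letters and 2-byte UTF-8 sequences (U+0080..U+07FF) are glued
#   together into one word token
# - runs of ASCII digits form one token
# - every 3-byte UTF-8 sequence (U+0800..U+FFFF) is a token of its own
# The /n flag makes the regex binary (assumption: Ruby 2.0 or later).
SKETCH_REGEX = /(?:[a-zA-Z]|[\xc0-\xdf][\x80-\xbf])+|[0-9]+|[\xe0-\xef][\x80-\xbf][\x80-\xbf]/n

# Hypothetical helper, not a Ferret API: returns the matched tokens
# as UTF-8 strings.
def sketch_tokens(str)
  # Match on raw bytes, then relabel each match as UTF-8 again.
  str.b.scan(SKETCH_REGEX).map { |t| t.force_encoding(Encoding::UTF_8) }
end

p sketch_tokens("Café 2007 咖啡店")
```

For this input the helper returns "Café" and "2007" as single tokens and then each of the three Chinese characters as a separate token, which is the behaviour one would expect the Ferret tokenizer above to show for CJK text.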
lukhnos :: May.17.2007 ::
tekhnologia (technology or art) ::