Category Archives: Java


Updating Parts of Documents (Partial Updates to Solr Index Documents in Practice)

Once you have indexed the content you need in your Solr index, you will want to start thinking about your strategy for dealing with changes to those documents. Solr supports two approaches to updating documents that have only partially changed.

The first is atomic updates. This approach allows changing only one or more fields of a document without having to re-index the entire document.

The second approach is known as optimistic concurrency or optimistic locking. It is a feature of many NoSQL databases, and allows conditionally updating a document based on its version. This approach includes semantics and rules for how to deal with version matches or mismatches.

Atomic Updates and Optimistic Concurrency may be used as independent strategies for managing changes to documents, or they may be combined: you can use optimistic concurrency to conditionally apply an atomic update.

Atomic Updates

Solr supports several modifiers that atomically update values of a document. This allows updating only specific fields, which can help speed indexing processes in an environment where speed of index additions is critical to the application.

To use atomic updates, add a modifier to the field that needs to be updated. The content can be updated, added to, or, if the field is numeric, incremented.



  • set: Sets or replaces the field value(s) with the specified value(s), or removes the values if ‘null’ or an empty list is specified as the new value. May be specified as a single value, or as a list for multivalued fields.
  • add: Adds the specified values to a multivalued field. May be specified as a single value, or as a list.
  • remove: Removes (all occurrences of) the specified values from a multivalued field. May be specified as a single value, or as a list.
  • removeregex: Removes all occurrences of the specified regex from a multivalued field. May be specified as a single value, or as a list.
  • inc: Increments a numeric value by a specific amount. Must be specified as a single numeric value.


All original source fields must be stored for field modifiers to work correctly, which is the Solr default.
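As a rough illustration of what these modifiers do, the sketch below models a document as a plain Java map and applies set, add, remove, and inc to it. It is an in-memory model of the semantics listed above, not Solr code: the class AtomicUpdateSketch, its methods, and the field names are all hypothetical, and treating a missing field as 0 for inc is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative in-memory model of the modifier semantics; not Solr code.
class AtomicUpdateSketch {

    // Applies one modifier to one field of the (hypothetical) document map.
    static void apply(Map<String, Object> doc, String field, String modifier, Object value) {
        switch (modifier) {
            case "set":
                // null removes the field, anything else replaces it
                if (value == null) {
                    doc.remove(field);
                } else {
                    doc.put(field, value);
                }
                break;
            case "add": {
                List<Object> values = toList(doc.get(field));
                values.addAll(toList(value));
                doc.put(field, values);
                break;
            }
            case "remove": {
                List<Object> values = toList(doc.get(field));
                values.removeAll(toList(value)); // drops every occurrence
                doc.put(field, values);
                break;
            }
            case "inc": {
                // Assumption: a missing field counts as 0.
                Number current = (Number) doc.get(field);
                long base = (current == null) ? 0L : current.longValue();
                doc.put(field, base + ((Number) value).longValue());
                break;
            }
            default:
                throw new IllegalArgumentException("unknown modifier: " + modifier);
        }
    }

    // Turns null, a single value, or a list into a fresh mutable list.
    private static List<Object> toList(Object v) {
        List<Object> out = new ArrayList<>();
        if (v instanceof List) {
            out.addAll((List<?>) v);
        } else if (v != null) {
            out.add(v);
        }
        return out;
    }

    // Builds a made-up document and applies one update of each kind.
    static Map<String, Object> demo() {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", "mydoc");
        doc.put("price", 10);
        doc.put("popularity", 42);
        doc.put("categories", Arrays.asList("kids"));
        doc.put("tags", Arrays.asList("free_to_try", "buy_now", "on_sale"));

        apply(doc, "price", "set", 99);
        apply(doc, "popularity", "inc", 20);
        apply(doc, "categories", "add", Arrays.asList("toys", "games"));
        apply(doc, "tags", "remove", "on_sale");
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Only the fields named in the update change; the id field is carried over untouched, which is the point of an atomic update.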

For example, if the following document exists in our collection:


And we apply the following update command:


The resulting document in our collection will be:


Optimistic Concurrency

Optimistic Concurrency is a feature of Solr that can be used by client applications which update/replace documents to ensure that the document they are replacing/updating has not been concurrently modified by another client application. This feature works by requiring a _version_ field on all documents in the index, and comparing that to a _version_ specified as part of the update command. By default, Solr’s schema.xml includes a _version_ field, and this field is automatically added to each new document.

In general, using optimistic concurrency involves the following workflow:

  1. A client reads a document. In Solr, one might retrieve the document with the /get handler to be sure to have the latest version.
  2. A client changes the document locally.
  3. The client resubmits the changed document to Solr, for example with the /update handler.
  4. If there is a version conflict (HTTP error code 409), the client starts the process over.
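The four steps can be sketched as a retry loop. The example below runs against a small in-memory store rather than Solr, so everything in it (OptimisticRetrySketch, Store, ConflictException, the counter used as a version) is a hypothetical stand-in; the exception plays the role of the HTTP 409 response, and the loop assumes the document already exists.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Illustrative only: an in-memory stand-in for a versioned document store.
class OptimisticRetrySketch {

    static final class Versioned {
        final long version;
        final String body;

        Versioned(long version, String body) {
            this.version = version;
            this.body = body;
        }
    }

    // Plays the role of the HTTP 409 response.
    static final class ConflictException extends RuntimeException { }

    static final class Store {
        private final Map<String, Versioned> docs = new HashMap<>();
        private long counter = 100; // assumption: any increasing number works as a version

        synchronized Versioned get(String id) {
            return docs.get(id);
        }

        // Unconditional write, like an update sent without a version.
        synchronized void overwrite(String id, String body) {
            docs.put(id, new Versioned(++counter, body));
        }

        // Conditional write: succeeds only if the stored version is the expected one.
        synchronized void update(String id, String body, long expectedVersion) {
            Versioned current = docs.get(id);
            if (current == null || current.version != expectedVersion) {
                throw new ConflictException();
            }
            docs.put(id, new Versioned(++counter, body));
        }
    }

    // Runs read / change / resubmit until it succeeds; returns the attempts used.
    static int updateWithRetry(Store store, String id, UnaryOperator<String> change, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Versioned read = store.get(id);              // step 1: read
            String changed = change.apply(read.body);    // step 2: change locally
            try {
                store.update(id, changed, read.version); // step 3: resubmit with the version
                return attempt;
            } catch (ConflictException e) {
                // step 4: version conflict, start over with a fresh read
            }
        }
        throw new IllegalStateException("gave up after " + maxAttempts + " attempts");
    }

    // Simulates another client writing in between on the first attempt.
    static int demo() {
        Store store = new Store();
        store.overwrite("doc1", "a");
        int[] calls = {0};
        return updateWithRetry(store, "doc1", body -> {
            if (calls[0]++ == 0) {
                store.overwrite("doc1", "b");
            }
            return body + "!";
        }, 3);
    }

    public static void main(String[] args) {
        System.out.println("attempts: " + demo());
    }
}
```

Capping the number of attempts is a design choice of this sketch; a real client would pick its own retry policy.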

When the client resubmits a changed document to Solr, the _version_ can be included with the update to invoke optimistic concurrency control. Specific semantics are used to define when the document should be updated or when to report a conflict.

  • If the content in the _version_ field is greater than ‘1’ (e.g., ‘12345’), then the _version_ in the document must match the _version_ in the index.
  • If the content in the _version_ field is equal to ‘1’, then the document must simply exist. In this case, no version matching occurs, but if the document does not exist, the update will be rejected.
  • If the content in the _version_ field is less than ‘0’ (e.g., ‘-1’), then the document must not exist. In this case, no version matching occurs, but if the document exists, the update will be rejected.
  • If the content in the _version_ field is equal to ‘0’, then it doesn’t matter if the versions match or if the document exists or not. If it exists, it will be overwritten; if it does not exist, it will be added.

If the document being updated does not include the _version_ field, and atomic updates are not being used, the document will be treated by normal Solr rules, which usually means discarding the previously indexed version.
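The four rules above reduce to a small decision function. The following is a minimal sketch of that decision as described in this section, not Solr's implementation; the class VersionCheckSketch, the Outcome enum, and the use of null for "no such document" are assumptions made for illustration.

```java
// Illustrative model of the _version_ rules described above; not Solr code.
class VersionCheckSketch {

    enum Outcome { APPLY, CONFLICT }

    // supplied: the _version_ sent with the update.
    // existing: the _version_ of the indexed document, or null if there is none.
    static Outcome check(long supplied, Long existing) {
        if (supplied > 1) {
            // versions must match exactly
            boolean matches = existing != null && existing.longValue() == supplied;
            return matches ? Outcome.APPLY : Outcome.CONFLICT;
        }
        if (supplied == 1) {
            // the document must simply exist
            return existing != null ? Outcome.APPLY : Outcome.CONFLICT;
        }
        if (supplied < 0) {
            // the document must not exist
            return existing == null ? Outcome.APPLY : Outcome.CONFLICT;
        }
        // supplied == 0: no constraint at all
        return Outcome.APPLY;
    }

    public static void main(String[] args) {
        System.out.println(check(12345L, 12345L)); // versions match
        System.out.println(check(12345L, 999L));   // versions differ
        System.out.println(check(-1L, null));      // create only if absent
    }
}
```

A CONFLICT here corresponds to the HTTP 409 response mentioned in the workflow.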

For more information, please also see Yonik Seeley’s presentation on NoSQL features in Solr 4 from Apache Lucene EuroCon 2012.

Power Tip


The _version_ field is by default stored in the inverted index (indexed="true"). However, for some systems with a very large number of documents, the increase in FieldCache memory requirements may be too costly. A solution can be to declare the _version_ field as DocValues:

Sample field definition
<field name="_version_" type="long" indexed="false" stored="true" required="true" docValues="true"/>


Document Centric Versioning Constraints

Optimistic Concurrency is extremely powerful, and works very efficiently because it uses internally assigned, globally unique values for the _version_ field. However, in some situations users may want to configure their own document-specific version field, where the version values are assigned on a per-document basis by an external system, and have Solr reject updates that attempt to replace a document with an “older” version. In situations like this the DocBasedVersionConstraintsProcessorFactory can be useful.

The basic usage of DocBasedVersionConstraintsProcessorFactory is to configure it in solrconfig.xml as part of the UpdateRequestProcessorChain and specify the name of the versionField in your schema that should be checked when validating updates:

<processor class="solr.DocBasedVersionConstraintsProcessorFactory">
  <str name="versionField">my_version_l</str>
</processor>

Once configured, this update processor will reject (HTTP error code 409) any attempt to update an existing document where the value of the my_version_l field in the “new” document is not greater than the value of that field in the existing document.

DocBasedVersionConstraintsProcessorFactory supports two additional configuration params which are optional:

  • ignoreOldUpdates – A boolean option which defaults to false. If set to true then instead of rejecting updates where the versionField is too low, the update will be silently ignored (and return a status 200 to the client).
  • deleteVersionParam – A String parameter that can be specified to indicate that this processor should also inspect Delete By Id commands. The value of this configuration option should be the name of a request parameter that the processor will now consider mandatory for all attempts to Delete By Id, and must be used by clients to specify a value for the versionField which is greater than the existing value of the document to be deleted. When using this request param, any Delete By Id command with a high enough document version number to succeed will be internally converted into an Add Document command that replaces the existing document with a new one which is empty except for the Unique Key and versionField, keeping a record of the deleted version so that future Add Document commands will fail if their “new” version is not high enough.
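To make the accept / reject / ignore behaviour concrete, here is a minimal sketch of the comparison described above. It only models the rule as stated in this section; DocVersionConstraintSketch, its Result enum, and the use of null for "no existing document" are hypothetical, and the real processor is configured in solrconfig.xml rather than called like this.

```java
// Illustrative model of the document-based version rule; not the Solr processor itself.
class DocVersionConstraintSketch {

    enum Result { ACCEPTED, REJECTED_409, IGNORED_200 }

    // existingVersion: versionField value of the indexed document, or null if none exists.
    // newVersion: versionField value carried by the incoming document.
    static Result decide(Long existingVersion, long newVersion, boolean ignoreOldUpdates) {
        if (existingVersion == null || newVersion > existingVersion.longValue()) {
            return Result.ACCEPTED;
        }
        // The new version is not greater than the existing one.
        return ignoreOldUpdates ? Result.IGNORED_200 : Result.REJECTED_409;
    }

    public static void main(String[] args) {
        System.out.println(decide(5L, 6L, false)); // newer version
        System.out.println(decide(5L, 5L, false)); // same version, default behaviour
        System.out.println(decide(5L, 4L, true));  // older version with ignoreOldUpdates=true
    }
}
```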

Please consult the processor javadocs and test configs for additional information and example usages.

Sharing the Slides from My Tech Talk at Sina: Object-Oriented Programming and Design Patterns



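For a top-10 company, adopting mature, stable technology is still the best choice. When it comes to new technology, we need to stay rational and level-headed.
In a July 9 post on its official engineering blog, “Cassandra at Twitter Today”, Twitter said it was shelving its plan to replace MySQL with Cassandra for feed storage. This is a significant change in Twitter's architecture strategy, because until then Twitter had been the industry leader in Cassandra adoption.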

For now, we’re not working on using Cassandra as a store for Tweets. This is a change in strategy. Instead we’re going to continue to maintain our existing Mysql-based storage. We believe that this isn’t the time to make large scale migration to a new technology. We will focus our Cassandra work on new projects that we wouldn’t be able to ship without a large-scale data store.


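1. Cassandra still lacks case studies and operational experience with highly concurrent access to massive data sets. Cassandra originated at Facebook, but inside Facebook it is currently used only for the Inbox Search product, with a capacity of roughly 100-200 TB. Inbox Search is not a core application in Facebook's infrastructure either, and there have been quite a few rumors that Facebook has already abandoned Cassandra.

2. A new product needs time to stabilize. Cassandra's code may still have a fair number of problems, and for Twitter, investing heavily in improving Cassandra looks like a poor trade compared with the cost of optimizing MySQL. At QCon Beijing, @nk also mentioned that Cassandra had exposed quite a few serious problems in Twitter's internal testing.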


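I covered this topic at QCon Beijing 2010 and also discussed it at the first Guangzhou tech salon last year. The main reasons a site like Twitter turns to Cassandra are:
1. Data growth requires continually adding servers. With traditional sharding schemes, adding or removing hardware has to be handled by hand; as the data grows faster and the business does not allow downtime for maintenance, the rising cost of manual maintenance overwhelms operations.
2. Growth in request volume cannot be solved by simply adding servers; it requires careful planning by a data architect.
3. Every new feature requires re-evaluating data partitioning and access optimization, so architects spend a great deal of effort reviewing nearly identical business scenarios.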


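Is Twitter's change of strategy a setback for the NoSQL movement, or just a minor episode along the way? We will have to wait and see. For now Digg, another Web 2.0 giant, is still using Cassandra.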


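    1. Heap size settings
      The maximum heap size of a JVM is limited by three things: the data model of the operating system (32-bit or 64-bit), the available virtual memory, and the available physical memory. On 32-bit systems the limit is generally 1.5G to 2G; 64-bit operating systems impose no practical limit on memory. In my test on Windows Server 2003 with 3.5G of physical memory and JDK 5.0, the largest value I could set was 1478m.
    • java -Xmx3550m -Xms3550m -Xmn2g -Xss128k
      -Xmx3550m: sets the maximum memory available to the JVM to 3550M.
      -Xmn2g: sets the young generation size to 2G. The heap sized by -Xmx is divided between the young and old generations (the permanent generation, usually a fixed 64m, is sized separately), so enlarging the young generation shrinks the old generation. This value has a large impact on system performance; Sun officially recommends 3/8 of the whole heap.
    • java -Xmx3550m -Xms3550m -Xss128k -XX:NewRatio=4 -XX:SurvivorRatio=4 -XX:MaxPermSize=16m -XX:MaxTenuringThreshold=0
      -XX:MaxPermSize=16m: sets the permanent generation size to 16m.
      -XX:MaxTenuringThreshold=0: sets the maximum tenuring age. If set to 0, young-generation objects skip the Survivor spaces and go straight into the old generation, which can improve efficiency for applications with many long-lived objects. If set to a larger value, young-generation objects are copied between the Survivor spaces several times, which lengthens their stay in the young generation and raises the probability that they are collected there.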
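    2. Choosing a collector
      The JVM offers three choices: the serial collector, the parallel collector, and the concurrent collector. The serial collector is only suitable for small amounts of data, so the choice here is mainly between the parallel and concurrent collectors. By default, JDK versions before 5.0 use the serial collector, and using another collector requires adding the corresponding parameters at startup. From JDK 5.0 on, the JVM decides based on the current system configuration.
    • Throughput first: the parallel collector

      • java -Xmx3800m -Xms3800m -Xmn2g -Xss128k -XX:+UseParallelGC -XX:ParallelGCThreads=20
        -XX:+UseParallelGC: selects the parallel collector. This setting only affects the young generation: with the configuration above, the young generation is collected in parallel while the old generation is still collected serially.
        -XX:ParallelGCThreads=20: sets the number of threads for the parallel collector, i.e. how many threads collect garbage at the same time. This value is best set equal to the number of processors.
      • java -Xmx3550m -Xms3550m -Xmn2g -Xss128k -XX:+UseParallelGC -XX:ParallelGCThreads=20 -XX:+UseParallelOldGC
        -XX:+UseParallelOldGC: sets the old generation to parallel collection. JDK 6.0 supports parallel collection of the old generation.
      • java -Xmx3550m -Xms3550m -Xmn2g -Xss128k -XX:+UseParallelGC -XX:MaxGCPauseMillis=100
        -XX:MaxGCPauseMillis=100: sets the maximum time for each young-generation collection; if this cannot be met, the JVM automatically adjusts the young generation size to meet it.
      • java -Xmx3550m -Xms3550m -Xmn2g -Xss128k -XX:+UseParallelGC -XX:MaxGCPauseMillis=100 -XX:+UseAdaptiveSizePolicy
    • Response time first: the concurrent collector

      • java -Xmx3550m -Xms3550m -Xmn2g -Xss128k -XX:ParallelGCThreads=20 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
        -XX:+UseConcMarkSweepGC: sets the old generation to concurrent collection. In testing, once this was configured, -XX:NewRatio=4 stopped taking effect, for reasons unknown, so in this case it is best to set the young generation size with -Xmn.
        -XX:+UseParNewGC: sets the young generation to parallel collection. Can be used together with CMS collection. On JDK 5.0 and above the JVM sets this itself based on the system configuration, so it no longer needs to be set.
      • java -Xmx3550m -Xms3550m -Xmn2g -Xss128k -XX:+UseConcMarkSweepGC -XX:CMSFullGCsBeforeCompaction=5 -XX:+UseCMSCompactAtFullCollection
        -XX:CMSFullGCsBeforeCompaction: because the concurrent collector does not compact or defragment memory, “fragmentation” builds up after running for a while and reduces efficiency. This value sets how many GCs run before memory is compacted.
        -XX:+UseCMSCompactAtFullCollection: enables compaction of the old generation. This may affect performance but eliminates fragmentation.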
    3. Auxiliary information
    • -XX:+PrintGC
      Output format: [GC 118250K->113543K(130112K), 0.0094143 secs] [Full GC 121376K->10414K(130112K), 0.0650971 secs]
    • -XX:+PrintGCDetails
      Output format: [GC [DefNew: 8614K->781K(9088K), 0.0123035 secs] 118250K->113543K(130112K), 0.0124633 secs] [GC [DefNew: 8614K->8614K(9088K), 0.0000665 secs][Tenured: 112761K->10414K(121024K), 0.0433488 secs] 121376K->10414K(130112K), 0.0436268 secs]
    • -XX:+PrintGCTimeStamps -XX:+PrintGC: PrintGCTimeStamps can be combined with the two options above
      Output format: 11.851: [GC 98328K->93620K(130112K), 0.0082960 secs]
    • -XX:+PrintGCApplicationConcurrentTime: prints how long the application ran uninterrupted before each garbage collection. Can be combined with the options above
      Output format: Application time: 0.5291524 seconds
    • -XX:+PrintGCApplicationStoppedTime: prints how long the application was paused during garbage collection. Can be combined with the options above
      Output format: Total time for which application threads were stopped: 0.0468229 seconds
    • -XX:+PrintHeapAtGC: prints detailed heap information before and after each GC
      34.702: [GC {Heap before gc invocations=7:
      def new generation   total 55296K, used 52568K [0x1ebd0000, 0x227d0000, 0x227d0000)
      eden space 49152K,  99% used [0x1ebd0000, 0x21bce430, 0x21bd0000)
      from space 6144K,  55% used [0x221d0000, 0x22527e10, 0x227d0000)
      to   space 6144K,   0% used [0x21bd0000, 0x21bd0000, 0x221d0000)
      tenured generation   total 69632K, used 2696K [0x227d0000, 0x26bd0000, 0x26bd0000)
      the space 69632K,   3% used [0x227d0000, 0x22a720f8, 0x22a72200, 0x26bd0000)
      compacting perm gen  total 8192K, used 2898K [0x26bd0000, 0x273d0000, 0x2abd0000)
      the space 8192K,  35% used [0x26bd0000, 0x26ea4ba8, 0x26ea4c00, 0x273d0000)
      ro space 8192K,  66% used [0x2abd0000, 0x2b12bcc0, 0x2b12be00, 0x2b3d0000)
      rw space 12288K,  46% used [0x2b3d0000, 0x2b972060, 0x2b972200, 0x2bfd0000)
      34.735: [DefNew: 52568K->3433K(55296K), 0.0072126 secs] 55264K->6615K(124928K)Heap after gc invocations=8:
      def new generation   total 55296K, used 3433K [0x1ebd0000, 0x227d0000, 0x227d0000)
      eden space 49152K,   0% used [0x1ebd0000, 0x1ebd0000, 0x21bd0000)
      from space 6144K,  55% used [0x21bd0000, 0x21f2a5e8, 0x221d0000)
      to   space 6144K,   0% used [0x221d0000, 0x221d0000, 0x227d0000)
      tenured generation   total 69632K, used 3182K [0x227d0000, 0x26bd0000, 0x26bd0000)
      the space 69632K,   4% used [0x227d0000, 0x22aeb958, 0x22aeba00, 0x26bd0000)
      compacting perm gen  total 8192K, used 2898K [0x26bd0000, 0x273d0000, 0x2abd0000)
      the space 8192K,  35% used [0x26bd0000, 0x26ea4ba8, 0x26ea4c00, 0x273d0000)
      ro space 8192K,  66% used [0x2abd0000, 0x2b12bcc0, 0x2b12be00, 0x2b3d0000)
      rw space 12288K,  46% used [0x2b3d0000, 0x2b972060, 0x2b972200, 0x2bfd0000)
      , 0.0757599 secs]
    • -Xloggc:filename: used together with the options above to write the log information to a file for later analysis.
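    Once the log is in a file, the simple -XX:+PrintGC lines are easy to analyse with a few lines of code. The sketch below is one possible way to pull the numbers out of the line format shown above; the class GcLogLineSketch and its method names are made up for illustration, and the pattern only covers the plain PrintGC format (optionally prefixed by a timestamp), not the PrintGCDetails format.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative parser for the simple -XX:+PrintGC line format; names are hypothetical.
class GcLogLineSketch {

    // Groups: 1 = kind, 2 = heap used before (K), 3 = heap used after (K),
    //         4 = heap capacity (K), 5 = pause in seconds
    private static final Pattern LINE = Pattern.compile(
            "\\[(Full GC|GC) (\\d+)K->(\\d+)K\\((\\d+)K\\), ([0-9.]+) secs\\]");

    private static Matcher match(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("not a PrintGC line: " + line);
        }
        return m;
    }

    // True for "Full GC" entries, false for ordinary "GC" entries.
    static boolean isFullGc(String line) {
        return match(line).group(1).equals("Full GC");
    }

    // Heap used before the collection minus heap used after it, in K.
    static long freedKb(String line) {
        Matcher m = match(line);
        return Long.parseLong(m.group(2)) - Long.parseLong(m.group(3));
    }

    // Duration reported for the collection, in seconds.
    static double pauseSeconds(String line) {
        return Double.parseDouble(match(line).group(5));
    }

    public static void main(String[] args) {
        String line = "11.851: [GC 98328K->93620K(130112K), 0.0082960 secs]";
        System.out.println("full GC: " + isFullGc(line));
        System.out.println("freed K: " + freedKb(line));
        System.out.println("pause s: " + pauseSeconds(line));
    }
}
```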
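    4. Summary of common settings
    • Heap settings
      • -Xms: initial heap size
      • -Xmx: maximum heap size
      • -XX:NewSize=n: sets the young generation size
      • -XX:NewRatio=n: sets the ratio of the young generation to the old generation. For example, 3 means young:old is 1:3, so the young generation takes 1/4 of the combined young and old generations
      • -XX:SurvivorRatio=n: the ratio of Eden to the two Survivor spaces in the young generation. Note that there are two Survivor spaces. For example, 3 means Eden:Survivor = 3:2, so one Survivor space takes 1/5 of the young generation
      • -XX:MaxPermSize=n: sets the permanent generation size
    • Collector settings
      • -XX:+UseSerialGC: selects the serial collector
      • -XX:+UseParallelGC: selects the parallel collector
      • -XX:+UseParallelOldGC: selects the parallel old-generation collector
      • -XX:+UseConcMarkSweepGC: selects the concurrent collector
    • Garbage collection statistics
      • -XX:+PrintGC
      • -XX:+PrintGCDetails
      • -XX:+PrintGCTimeStamps
      • -Xloggc:filename
    • Parallel collector settings
      • -XX:ParallelGCThreads=n: sets the number of CPUs used by the parallel collector, i.e. the number of parallel collection threads.
      • -XX:MaxGCPauseMillis=n: sets the maximum pause time for parallel collection
      • -XX:GCTimeRatio=n: sets the percentage of program run time spent on garbage collection. The formula is 1/(1+n)
    • Concurrent collector settings
      • -XX:+CMSIncrementalMode: enables incremental mode. Suitable for single-CPU machines.
      • -XX:ParallelGCThreads=n: sets the number of CPUs used when the concurrent collector's young generation is collected in parallel, i.e. the number of parallel collection threads.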
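    The ratio flags in this summary are easy to misread, so here is the arithmetic written out as a minimal sketch. The class HeapRatioSketch and its helper names are made up for illustration; the formulas are the ones given in the list above (young:old = 1:n, Eden:Survivor = n:2, GC time share = 1/(1+n)).

```java
// Illustrative arithmetic for the ratio flags; the helper names are hypothetical.
class HeapRatioSketch {

    // -XX:NewRatio=n means young:old = 1:n, so young is 1/(n+1) of young+old.
    static double youngFraction(int newRatio) {
        return 1.0 / (newRatio + 1);
    }

    // -XX:SurvivorRatio=n means Eden:(two Survivors) = n:2,
    // so a single Survivor space is 1/(n+2) of the young generation.
    static double oneSurvivorFraction(int survivorRatio) {
        return 1.0 / (survivorRatio + 2);
    }

    // -XX:GCTimeRatio=n targets a GC time share of 1/(1+n).
    static double gcTimeFraction(int gcTimeRatio) {
        return 1.0 / (1 + gcTimeRatio);
    }

    public static void main(String[] args) {
        // Values taken from the example command line with -Xmx3550m,
        // -XX:NewRatio=4 and -XX:SurvivorRatio=4.
        double heapMb = 3550;
        double youngMb = heapMb * youngFraction(4);
        double survivorMb = youngMb * oneSurvivorFraction(4);
        System.out.printf("young: %.0f MB, one survivor: %.1f MB%n", youngMb, survivorMb);
        System.out.printf("GCTimeRatio=19 -> %.0f%% of time in GC%n", 100 * gcTimeFraction(19));
    }
}
```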

    Tomcat Vs Resin


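    Resin 3.0.9 vs Tomcat 5.0, on JDK 5.0, with AppFuse as the deployed application

    Resin startup time: 25 s; Tomcat startup time: 14 s

    Running “ant test-canoo” to execute Canoo’s WebTest (server already started), with the compiled JSP files cleared beforehand:

    • Resin: first run (JSPs not yet compiled): 53 s; second run (JSPs compiled): 24 s.
    • Tomcat: first run (JSPs not yet compiled): 49 s; second run (JSPs compiled): 14 s.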