
Testing Lucene and Solr with various JVMs: Bugs, Bugs, Bugs
Uwe Schindler
Apache Lucene Committer & PMC [email protected]
http://www.thetaphi.de, http://[email protected]
SD DataSolutions GmbH, Wätjenstr. 49, 28213 Bremen, Germany
Tel: +49 421 40889785-0, http://www.sd-datasolutions.de

My Background
Committer and PMC member of Apache Lucene and Solr; main focus is on development of Lucene Java.
Implemented fast numerical search and maintains the new attribute-based text analysis API. Well known as the Generics and Sophisticated Backwards Compatibility Policeman.
Working as a consultant and software architect for SD DataSolutions GmbH in Bremen, Germany. The main task is maintaining PANGAEA (Publishing Network for Geoscientific & Environmental Data), where I implemented the portal's geo-spatial retrieval functions with Apache Lucene Core.
Talks about Lucene at various international conferences like the previous Berlin Buzzwords, ApacheCon EU/NA, Lucene Eurocon, Lucene Revolution, and various local meetups.

Agenda
Some history
The famous bugs
How to debug HotSpot problems
Setting up Jenkins to test your software with lots of virtual machine vendors
Bugs, Bugs, Bugs

SOME HISTORY: What happened?

Chronology
Java 7 Release Candidate released July 6, 2011 as build 147 (compiled and signed on June 27, 2011, also the release date of OpenJDK 7b147).
Saturday, July 23, 2011:
  – Downloaded it to do some testing with Lucene trunk; core tests ran fine on my Windows 7 x64 box
  – Installed the FreeBSD package on Apache's Jenkins “Lucene” slave, and heavy testing started: various crashes/failures

Issues found
Jenkins revealed a SIGSEGV bug in the Porter stemmer (found when the number of iterations was raised) [LUCENE-3335]
A new Lucene 3.4 faceting test sometimes produced corrupt indexes [LUCENE-3346]

WARNING!!!
Java 6 was also affected (some time after the only stable version, 1.6.0_18)!
The optimizations are disabled by default there, so: don't use -XX:+AggressiveOpts if you want your loops to behave correctly!
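A test harness can also refuse to run silently with this flag. A minimal sketch, assuming the JVM's own command-line options are read through `RuntimeMXBean.getInputArguments()` (a standard `java.lang.management` API); the class `AggressiveOptsGuard` is hypothetical and not part of Lucene.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical guard (not Lucene API): warns when the JVM was started with
// -XX:+AggressiveOpts, which switched on the broken loop optimizations.
final class AggressiveOptsGuard {

    // Returns true if any of the given JVM arguments enables the flag.
    static boolean enablesAggressiveOpts(List<String> jvmArgs) {
        for (String arg : jvmArgs) {
            if (arg.equals("-XX:+AggressiveOpts")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Options passed to the JVM itself, not the program arguments.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        if (enablesAggressiveOpts(jvmArgs)) {
            System.err.println("WARNING: -XX:+AggressiveOpts is set; loops may be miscompiled.");
        } else {
            System.out.println("OK: -XX:+AggressiveOpts is not set.");
        }
    }
}
```

Newer JDKs deprecated and later dropped -XX:+AggressiveOpts, so such a check mainly matters for Java 6/7-era installations.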

Chronology
Thursday, July 28, 2011:
  – Oracle released JDK 7 to the public
  – Package was identical to the release candidate (Windows EXE signature dated June 27, 2011)
Apache Lucene PMC decided to warn users on the web page [email protected] mailing list

Chronology: Friday, July 29, 2011

Further analysis the week after

THE PORTER STEMMER SIGSEGV BUG: Java 7 Crashes Eclipse

What’s wrong with these methods?

Conclusion: Porter Stemmer Bug
Less serious bug, as your virtual machine simply crashes: you won't use it!
Oracle rated the bug report “serious”, as it affects their own software and is reproducible by everyone.
Can be prevented by the JVM option: -XX:-UseLoopPredicate
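When options pass through several build layers, it helps to confirm at runtime that the workaround actually reached the JVM. HotSpot exposes its -XX flags through `com.sun.management.HotSpotDiagnosticMXBean`. A minimal sketch, assuming a HotSpot-based JVM; `LoopPredicateCheck` is a hypothetical helper, and on other VMs it simply reports the flag as unavailable.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.VMOption;
import java.lang.management.ManagementFactory;

// Hypothetical helper (not Lucene API): reads a HotSpot -XX flag at runtime so
// a test run can log whether -XX:-UseLoopPredicate is in effect.
final class LoopPredicateCheck {

    // Returns the flag's current value ("true"/"false" for boolean flags),
    // or null if this JVM has no HotSpot diagnostics bean or no such flag.
    static String hotspotFlag(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (bean == null) {
                return null; // management interface not implemented by this VM
            }
            VMOption option = bean.getVMOption(name);
            return option.getValue();
        } catch (IllegalArgumentException unavailable) {
            return null; // unknown flag, or bean not supported on this VM
        }
    }

    public static void main(String[] args) {
        String value = hotspotFlag("UseLoopPredicate");
        System.out.println("UseLoopPredicate = " + value);
        if ("false".equals(value)) {
            System.out.println("Porter stemmer workaround is active.");
        }
    }
}
```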

THE VINT BUG: Loop Unwinding

What’s wrong with this method?
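The slide's code did not survive the transcription. As a hedged illustration only (not the exact Lucene source), this is the general shape of a VInt codec: 7 payload bits per byte, low-order group first, high bit set while more bytes follow. A tight shift-and-or loop like `readVInt` below is the kind of loop the broken optimizations silently miscompiled. `VIntLoop` is a hypothetical class name.

```java
import java.util.Arrays;

// Hypothetical sketch of the VInt format: each byte carries 7 bits of the
// value (lowest bits first); a set high bit means "another byte follows".
final class VIntLoop {

    // Encodes any int (negative values take the full 5 bytes).
    static byte[] writeVInt(int value) {
        byte[] buf = new byte[5];
        int len = 0;
        while ((value & ~0x7F) != 0) {
            buf[len++] = (byte) ((value & 0x7F) | 0x80); // 7 bits + continuation bit
            value >>>= 7;                                 // unsigned shift
        }
        buf[len++] = (byte) value;                        // last byte, high bit clear
        return Arrays.copyOf(buf, len);
    }

    // Decodes a VInt starting at index 0 with a plain loop.
    static int readVInt(byte[] bytes) {
        int pos = 0;
        byte b = bytes[pos++];
        int i = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = bytes[pos++];
            i |= (b & 0x7F) << shift;
        }
        return i;
    }

    public static void main(String[] args) {
        byte[] encoded = writeVInt(300);
        System.out.println(encoded.length + " bytes, decoded " + readVInt(encoded));
    }
}
```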

Conclusion: VInt Bug
Serious data corruption: some methods using loops silently return wrong results!
The bug already existed in Java 6:
  – appeared some time after 1.6.0_18, enabled by default
  – is prevented since Lucene 3.1 by manual loop unwinding (helps only in Java 6)
Cannot easily be reproduced; Oracle assigned “medium” bug priority, and it was never fixed in Java 6.
Problems got worse with Java 7: the only safe way to prevent them is to disable loop unwinding completely, but that makes Lucene very slow.
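Manual loop unwinding means writing the decode loop out byte by byte, so the JIT has no loop left to mis-optimize. A sketch in the spirit of the Lucene 3.1 workaround, assuming the same VInt format (at most 5 bytes for a 32-bit int); `VIntUnrolled` is a hypothetical name, not the actual Lucene class.

```java
// Hypothetical sketch: VInt decoding with the loop fully unrolled.
// A byte >= 0 has its high bit clear, which marks the last byte.
final class VIntUnrolled {

    static int readVInt(byte[] bytes) {
        int pos = 0;
        byte b = bytes[pos++];
        if (b >= 0) return b;              // 1-byte value
        int i = b & 0x7F;
        b = bytes[pos++];
        i |= (b & 0x7F) << 7;
        if (b >= 0) return i;              // 2-byte value
        b = bytes[pos++];
        i |= (b & 0x7F) << 14;
        if (b >= 0) return i;              // 3-byte value
        b = bytes[pos++];
        i |= (b & 0x7F) << 21;
        if (b >= 0) return i;              // 4-byte value
        b = bytes[pos];
        // The fifth byte may only carry the top 4 bits of a 32-bit int.
        i |= (b & 0x0F) << 28;
        if ((b & 0xF0) == 0) return i;
        throw new IllegalArgumentException("Invalid VInt: too many bits");
    }

    public static void main(String[] args) {
        System.out.println(readVInt(new byte[] {(byte) 0xAC, 0x02}));
    }
}
```

The trade-off is more code per call site in exchange for straight-line code; per the slide, this only helped on Java 6.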

HOW TO DEBUG HOTSPOT PROBLEMS: Hands-On

First
Fetch some beer!
Tell your girlfriend that you will not come to bed!
Forget about Eclipse & Co! We need a command line and our source code.

Hardcore: Debugging without Debugger
Open the hs_err file and look for the stack trace (if your JVM crashed, like with the Porter stemmer).
Otherwise: disable HotSpot to verify that it's not a logic error! (-Xint runs interpreted only; -Xbatch compiles in the foreground)
Start to dig around by adding System.out.println, assertions, ...
Please note: you cannot use a debugger!!!
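One way to apply the -Xint check systematically is a differential test: call the suspect method often enough for HotSpot to compile it, and compare every result against a trivially correct reference. A minimal sketch; `loopSum` is a hypothetical stand-in for the method under suspicion, and the iteration count is an arbitrary assumption, not a documented JIT threshold.

```java
import java.util.function.IntUnaryOperator;

// Hypothetical differential check: runs a suspect method many times (so the
// JIT compiles it) and compares each result with a reference implementation.
final class JitDifferentialCheck {

    // Returns the first input where suspect and reference disagree, or -1.
    static int firstMismatch(IntUnaryOperator suspect, IntUnaryOperator reference, int iterations) {
        for (int n = 0; n < iterations; n++) {
            int expected = reference.applyAsInt(n);
            int actual = suspect.applyAsInt(n);
            if (actual != expected) {
                System.out.println("Mismatch for input " + n + ": expected " + expected + ", got " + actual);
                return n;
            }
        }
        return -1;
    }

    // Stand-in for the suspect method: sums 0..limit-1 with a loop.
    static int loopSum(int n) {
        int limit = n % 1000;
        int sum = 0;
        for (int k = 0; k < limit; k++) {
            sum += k;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Closed-form reference: 0 + 1 + ... + (m - 1) = m * (m - 1) / 2
        IntUnaryOperator closedForm = n -> { int m = n % 1000; return m * (m - 1) / 2; };
        int bad = firstMismatch(JitDifferentialCheck::loopSum, closedForm, 200_000);
        System.out.println(bad < 0 ? "No mismatch found." : "First mismatch at input " + bad);
    }
}
```

Run it once normally and once with -Xint: a mismatch that disappears under -Xint points at the JIT rather than at your own logic.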

Digging
If you found a method that works incorrectly, disable HotSpot optimizations for only that one:
-XX:CompileCommand=exclude,your/package/Class,method
  – If the program works now, you found a workaround!
  – But this may not be the root cause, so it does not help at all!
Step down the call hierarchy and replace the exclusion with the methods called from this one.
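Stepping down the call hierarchy is easier with a small launcher that re-runs a test in a fresh JVM with one exclusion at a time. A minimal sketch, assuming the child uses the same `java.home` and class path as the launcher; `ExcludeLauncher` and its argument order are hypothetical.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical launcher (not a JDK or Lucene tool): re-runs a main class in a
// fresh JVM with one method excluded from JIT compilation.
final class ExcludeLauncher {

    // Builds the exclusion flag; '.' in the class name becomes '/'.
    static String excludeFlag(String className, String method) {
        return "-XX:CompileCommand=exclude," + className.replace('.', '/') + "," + method;
    }

    // Child command line: <javaHome>/bin/java <flags...> -cp <classPath> <mainClass>
    static List<String> buildCommand(String javaHome, List<String> jvmFlags,
                                     String classPath, String mainClass) {
        List<String> cmd = new ArrayList<>();
        cmd.add(javaHome + File.separator + "bin" + File.separator + "java");
        cmd.addAll(jvmFlags);
        cmd.add("-cp");
        cmd.add(classPath);
        cmd.add(mainClass);
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.err.println("usage: ExcludeLauncher <mainClass> <excludedClass> <method>");
            return;
        }
        List<String> flags = Arrays.asList(excludeFlag(args[1], args[2]));
        List<String> cmd = buildCommand(System.getProperty("java.home"), flags,
            System.getProperty("java.class.path"), args[0]);
        // inheritIO() forwards the child's output to this console.
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        System.exit(p.waitFor());
    }
}
```

For example, `java ExcludeLauncher my.pkg.MyTest my.pkg.Suspect decode` (hypothetical names) runs `MyTest` with `-XX:CompileCommand=exclude,my/pkg/Suspect,decode`.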

Take action! Open a bug report on the mailing list.

TESTING SOFTWARE ON VARIOUS JVM VENDORS: Setting up Jenkins

Randomization everywhere
Apache Lucene & Solr use randomization while testing:
  – Random codec settings
  – Random Lucene directory implementation
  – Random locales, default charsets, ...
  – Random indexing data
Reproducible:
  – Every test gets an initial random seed
  – Printed on test execution & included in stack traces
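The seed mechanism can be sketched in a few lines: every random choice is derived from one master seed, and the seed is printed so a failing run can be replayed. This is a hypothetical illustration, not Lucene's test framework; the directory names are only labels here.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Random;

// Hypothetical sketch: all random test settings derive from one master seed.
final class SeededTestSetup {

    final long seed;
    final Locale locale;
    final String charset;
    final String directoryImpl;

    SeededTestSetup(long seed) {
        this.seed = seed;
        Random random = new Random(seed);
        Locale[] locales = Locale.getAvailableLocales();
        this.locale = locales[random.nextInt(locales.length)];
        // availableCharsets() is a sorted map, so the order is stable.
        List<String> charsets = new ArrayList<>(Charset.availableCharsets().keySet());
        this.charset = charsets.get(random.nextInt(charsets.size()));
        // Illustrative labels standing in for Lucene Directory implementations.
        String[] dirs = {"RAMDirectory", "SimpleFSDirectory", "NIOFSDirectory", "MMapDirectory"};
        this.directoryImpl = dirs[random.nextInt(dirs.length)];
    }

    @Override
    public String toString() {
        return String.format("seed=%X locale=%s charset=%s dir=%s", seed, locale, charset, directoryImpl);
    }

    public static void main(String[] args) {
        // Replay a failure by passing its printed hex seed back in.
        long seed = args.length > 0 ? Long.parseUnsignedLong(args[0], 16) : new Random().nextLong();
        System.out.println(new SeededTestSetup(seed));
    }
}
```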

Missing parts
JVM randomization:
  – Oracle JDK 6 / 7
  – IBM J9 6 / 7
  – Oracle JRockit 6
JVM settings randomization:
  – Garbage collector
  – Bitness: 32 / 64 bits
  – Server / Client VM
  – Compressed OOPs (ordinary object pointers)
Platform:
  – Linux, Windows, MacOS X, FreeBSD, ...
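A sketch of the JVM settings randomization, assuming bitness is chosen by picking a 32-bit or 64-bit JDK (it is not a runtime flag) while the other choices become HotSpot options. `RandomJvmArgs` is hypothetical; the flags themselves are real HotSpot options of the Java 6/7/8 era.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: pick a random combination of HotSpot settings.
final class RandomJvmArgs {

    static final String[] GARBAGE_COLLECTORS = {
        "-XX:+UseSerialGC", "-XX:+UseParallelGC", "-XX:+UseConcMarkSweepGC", "-XX:+UseG1GC"
    };

    // is64Bit reflects the JDK that was chosen, since bitness comes from the install.
    static List<String> pick(Random random, boolean is64Bit) {
        List<String> args = new ArrayList<>();
        args.add(random.nextBoolean() ? "-server" : "-client");
        args.add(GARBAGE_COLLECTORS[random.nextInt(GARBAGE_COLLECTORS.length)]);
        // Compressed OOPs only exist on 64-bit JVMs.
        if (is64Bit) {
            args.add(random.nextBoolean() ? "-XX:+UseCompressedOops" : "-XX:-UseCompressedOops");
        }
        return args;
    }

    public static void main(String[] args) {
        long seed = new Random().nextLong();
        System.out.println("seed=" + seed + " args=" + pick(new Random(seed), true));
    }
}
```

Not every JVM accepts every flag: 64-bit HotSpot ignores -client, CMS was removed in later JDKs, and J9 or JRockit use their own options, so a real script needs a per-vendor list.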

Possibilities
Define each Jenkins job with a different JVM:
  – Duplicates
  – Hard to maintain
  – Multiplied by additional JVM settings like GC, server/client, or OOP size
Make the Jenkins server set build / environment variables with a (pseudo-)randomization script:
  – JAVA_HOME passed to Apache Ant
  – TEST_JVM_ARGS passed to the test runner

Plugins needed
Environment Injector Plugin:
  – Executes a Groovy script to do the actual work
  – Sets some build environment variables: JAVA_HOME, TEST_JVM_ARGS, JAVA_DESC
Jenkins Description Setter Plugin / Jenkins Email Extension Plugin:
  – Add JVM details / settings to build description and e-mails
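The Groovy script itself is not shown. The following Java sketch only illustrates the data it hands to the Environment Injector Plugin, assuming the plugin takes a name-to-value map of variables from the script; `EnvInjectVars` and the example JDK path are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical Java rendering of the variables the randomization script exports.
final class EnvInjectVars {

    static Map<String, String> build(String javaHome, List<String> jvmArgs, String description) {
        Map<String, String> env = new LinkedHashMap<>();       // keeps insertion order
        env.put("JAVA_HOME", javaHome);                        // picked up by Apache Ant
        env.put("TEST_JVM_ARGS", String.join(" ", jvmArgs));   // passed to the test runner
        env.put("JAVA_DESC", description);                     // for build description / e-mails
        return env;
    }

    public static void main(String[] args) {
        Map<String, String> env = build("/opt/jdks/jdk1.7.0-64bit",
            Arrays.asList("-server", "-XX:+UseG1GC"), "64bit/jdk1.7.0 -server -XX:+UseG1GC");
        // One KEY=value line per variable, properties-file style.
        env.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```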

Global Jenkins settings
Extra JDK config in Jenkins (called “random”):
  – pointing to a dummy directory (we can use the base directory containing all our JDKs)
  – assigned to every job that needs a randomly chosen virtual machine
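With the “random” JDK pointing at the base directory, the script has to find the installed JDKs below it. A minimal sketch, assuming every subdirectory that contains bin/java (or bin/java.exe) is a JDK; `RandomJdkPicker` is hypothetical.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: find JDKs below a base directory and pick one at random.
final class RandomJdkPicker {

    static List<File> listJdks(File baseDir) {
        List<File> jdks = new ArrayList<>();
        File[] children = baseDir.listFiles();
        if (children == null) {
            return jdks; // not a directory, or not readable
        }
        Arrays.sort(children); // stable order, so the same seed picks the same JDK
        for (File dir : children) {
            File bin = new File(dir, "bin");
            if (new File(bin, "java").isFile() || new File(bin, "java.exe").isFile()) {
                jdks.add(dir);
            }
        }
        return jdks;
    }

    static File pick(File baseDir, Random random) {
        List<File> jdks = listJdks(baseDir);
        if (jdks.isEmpty()) {
            throw new IllegalStateException("No JDKs found in " + baseDir);
        }
        return jdks.get(random.nextInt(jdks.size()));
    }

    public static void main(String[] args) {
        File base = new File(args.length > 0 ? args[0] : ".");
        System.out.println(pick(base, new Random()));
    }
}
```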

The warning displayed by Jenkins doesn’t matter!

Job Config
Standard free-style build with plugins activated:
  – Calls a Groovy script file with the main logic (sets JAVA_HOME randomly, ...)
  – List of JVM options as a "config file"
  – Job's JDK version set to "random"
  – Apache Ant configuration automatically gets JAVA_HOME, and the test runner gets extra options via build properties
Should work with Maven builds, too!

BUGS FOUND: Results

Oracle (HotSpot) JVM
Various issues with JIT compilation across all OpenJDK / Oracle JDK versions:
  – Miscompiled loops
  – Segmentation faults
  – System.nanoTime() brokenness on MacOS X
  – Double free()
Lucene bugs with memory allocations if compressed oops are disabled on 64-bit JVMs:
  – happens only with large heaps (> 32 GB)

Java 8 prereleases
G1 garbage collector deadlock due to marking stack overflow (fixed)
Compile failures with -source 1.7 related to default interface methods (“isAnnotationPresent”) (fixed)
Javadoc bugs:
  – new doclint feature did not work (fixed)
  – doc-files folders were not copied (fixed)
Solr test bugs with the cool new Nashorn JavaScript engine (fixed in Solr tests)

Oracle JRockit
TestPostingsOffsets#testBackwardsOffsets fails with an assertion in core Lucene code:
  – JVM “ignores” an if-statement
  – IndexWriter later hits the assertion
No fix available from Oracle:
  – Impossible to open a bug report without a support contract!
  – JRockit seems unsupported
  – No Java 7 version available anymore (discontinued)
Workaround: -XnoOpt
  – Slowdown, so better use the supported Oracle Java 7
Don’t use JRockit or WebLogic App Server

IBM J9
GrowableWriter#ensureCapacity() fails with an assertion in core Lucene code:
  – FST#pack() passes a wrong argument
Cause completely unknown! Hard to debug:
  – Happens with JIT, AOT and without any optimizer
  – Only happens if the test is executed in the whole test suite
Workaround: -Xjit:exclude he/lucene/util/fst/FST;}
Don’t use IBM J9 (Warning: installed on SUSE Enterprise Linux by default)

How about OpenJDK?
Version numbers are inconsistent with official Oracle Java!
Ubuntu 12 still installs OpenJDK 7b147, but patched!
OpenJDK 6 is very different from Oracle JDK 6:
  – Forked from early Java 7!
  – Not all patches applied: e.g., ReferenceQueue#poll() does not use double-checked locking
You may use OpenJDK 7 (if you understand version numbers and their relation to Oracle’s update packages)
Don’t use OpenJDK 6

Inform yourself about further bugs: http://wiki.apache.org/lucene-java/JavaBugs