by Dan Chudnov
by Stephanie Collett and Martin Haye
Stephanie Collett, California Digital Library, email@example.com
Martin Haye, California Digital Library, firstname.lastname@example.org
Within a relatively short time since their introduction, distributed version control systems (DVCS) like Git and Mercurial have enjoyed widespread adoption for versioning code. It didn’t take long for the library development community to start discussing the potential for using DVCS within our applications and repositories to version data. After all, many of the features that have made these systems popular for versioning code in the open source community (e.g. lightweight, file-based, compressed, reliable) also make them compelling options for versioning data. And why write an entire versioning system from scratch if a DVCS can be dropped in instead? At the California Digital Library (CDL) we’ve started using Git and Mercurial in some of our applications to version data. This has proven effective in some situations and unworkable in others. This presentation will be a practical case study of CDL’s experiences with using DVCS to version data. We will explain how we’re incorporating Git and Mercurial in our applications, describe our successes and failures, and consider the issues involved in repurposing these systems for data versioning.
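To make the basic mechanism concrete, here is a minimal sketch (not CDL's actual implementation) of versioning a data file by driving Git from Python; the repository path and record name are placeholders:

```python
import subprocess
from pathlib import Path

def git(*args, cwd):
    """Run a git command in the given repository and fail loudly on error."""
    subprocess.run(["git", *args], cwd=cwd, check=True)

def save_version(repo_dir, filename, content, message):
    """Write one data file and record it as a new Git commit (one version)."""
    repo = Path(repo_dir)
    repo.mkdir(parents=True, exist_ok=True)
    if not (repo / ".git").exists():
        git("init", cwd=repo)
    (repo / filename).write_text(content)
    git("add", filename, cwd=repo)
    git("commit", "-m", message, cwd=repo)

# Hypothetical usage: each call leaves one retrievable version in the repository.
save_version("/tmp/data-repo", "record-001.xml",
             "<record><title>First draft</title></record>",
             "Initial version of record-001")
```

Each subsequent call with changed content adds another commit, so the full history of the record can be recovered with ordinary Git tooling.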
by Jennifer Bowen
Jennifer Bowen, University of Rochester River Campus Libraries, email@example.com
Linked data is poised to replace MARC as the basis for the new library bibliographic framework. For libraries to benefit from linked data, they must learn about it, experiment with it, demonstrate its usefulness, and take a leadership role in its deployment.
The eXtensible Catalog Organization (XCO) offers open-source software for libraries that is “linked-data-ready.” XC software prepares MARC and Dublin Core metadata for exposure to the semantic web, incorporating FRBR Group 1 entities and registered vocabularies for RDA elements and roles. This presentation will include a software demonstration, proposed software architecture for creation and management of linked data, a vision for how libraries can migrate from MARC to linked data, and an update on XCO progress toward linked data goals.
by Tom Johnson
Tom Johnson, Oregon State University Libraries, firstname.lastname@example.org
Linked Library Data activity over the last year has seen bibliographic data sets and vocabularies proliferating from traditional library sources. We've reached a point where regular libraries don't have to go it alone to be on the Semantic Web. There is a quickly growing pool of things we can actually link to, and everyone's existing data can be immediately enriched by participating.
This is a quick and dirty road to getting your catalog onto the Linked Data web. The talk will take you from start to finish, using Free Software tools to establish a namespace, put up a SPARQL endpoint, make a simple data model, convert MARC records to RDF, and link the results to major existing data sets (skipping conveniently over pesky processing time). A small amount of "why linked data?" content will be covered, but the primary goal is to leave you able to reproduce the process and start linking your catalog into the web of data. Appropriate documentation will be on the web.
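A rough sketch of just the MARC-to-RDF step, assuming the pymarc and rdflib libraries (the catalog namespace and input filename below are placeholders, not necessarily the tools or URIs used in the talk):

```python
from pymarc import MARCReader
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DC, DCTERMS, RDF

# Hypothetical namespace for minted record URIs; substitute your own domain.
CAT = Namespace("http://example.org/catalog/")

graph = Graph()
graph.bind("dc", DC)

with open("records.mrc", "rb") as data:
    for i, record in enumerate(MARCReader(data)):
        uri = CAT[f"record/{i}"]
        graph.add((uri, RDF.type, DCTERMS.BibliographicResource))
        title = record["245"]
        if title is not None and title["a"]:
            graph.add((uri, DC.title, Literal(title["a"])))
        for author in record.get_fields("100", "700"):
            if author["a"]:
                graph.add((uri, DC.creator, Literal(author["a"])))

# Serialize to Turtle; the result can be loaded into a triple store behind a SPARQL endpoint.
graph.serialize(destination="records.ttl", format="turtle")
```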
Jason Ronallo, North Carolina State University Libraries, email@example.com
When the big search engines announced support for HTML5 microdata and the schema.org vocabularies, the balance of power for semantic markup in HTML shifted.
What is microdata?
Where does microdata fit with regards to other approaches like RDFa and microformats?
Where do libraries stand in the worldview of Schema.org and what can they do about it?
How can implementing microdata and schema.org optimize your sites for search engines?
What tools are available?
Declan Fleming, University of California, San Diego, dfleming AT ucsd DING edu
What's the right metadata standard to use for a digital repository? There isn't just one standard that fits documents, videos, newspapers, audio files, local data, etc. And there is no standard to rule them all. So what do you do? At UC San Diego Libraries, we went down a conceptual level and attempted to hold every piece of metadata and give each holding place some context, hopefully in a common namespace. RDF has proven to be the ideal solution, and allows us to work with MODS, PREMIS, MIX, and just about anything else we've tried. It also opens up the potential for data re-use and authority control as other metadata owners start thinking about and expressing their data in the same way. I'll talk about our workflow, which takes metadata from a stew of various sources (CSV dumps, spreadsheet data of varying richness, MARC data, and MODS data), normalizes it into METS according to an assembly plan created by our Metadata Specialists, and then ingests it into our digital asset management system. The result is a beautiful graph of RDF triples, with metadata poised to be expressed as HTML, RSS, METS, or XML, and with linked data possibilities that we are just starting to explore.
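A tiny sketch of the underlying idea, using rdflib (the object identifier and predicates are illustrative, not UCSD's actual model): statements extracted from very different sources can land in one graph about the same object, as long as they share namespaces.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC

# Hypothetical object identifier and MODS RDF namespace, for illustration only.
OBJ = URIRef("http://example.org/ark:/99999/example123")
MODS = Namespace("http://www.loc.gov/mods/rdf/v1#")

graph = Graph()

# One statement derived from a spreadsheet/CSV dump...
graph.add((OBJ, DC.title, Literal("Scripps Pier at sunset")))

# ...and one derived from a MODS record, side by side in the same graph.
graph.add((OBJ, MODS.dateIssued, Literal("1968")))

print(graph.serialize(format="turtle"))
```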
by Tom Burton-West
Tom Burton-West, DLPS, University of Michigan Library, tburtonw AT umich edu
HathiTrust Large-Scale search provides full-text search services over nearly 10 million full-text books using Solr for the back-end. Our index is around 5-6 TB in size and each shard contains over 3 billion unique terms due to content in over 400 languages and dirty OCR.
Searching the full-text of 10 million books often results in very large result sets. By conference time a number of features designed to help users narrow down large result sets and to do exploratory searching will either be in production or in preparation for release. There are often trade-offs between implementing desirable user features and keeping response time reasonable, in addition to the traditional search trade-offs of precision versus recall.
We will discuss various scalability and usability issues including:
Trade-offs between desirable user features and keeping response time reasonable and scalable
Our solution to providing the ability to search within the 10 million books and also search within each book
Migrating the personal collection builder application from a separate Solr instance to an app which uses the same back-end as full-text search.
Design of a scalable multilingual spelling suggester
Providing advanced search features combining MARC metadata with OCR
The dismax mm and tie parameters (see the sketch following this list)
Weighting issues and tuning relevance ranking
Displaying only the most "relevant" facets
Dirty OCR issues
CJK tokenizing and other multilingual issues.
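For readers unfamiliar with the dismax knobs mentioned above, here is a minimal, hypothetical query sketch (the endpoint URL and field names are made up, not HathiTrust's) showing where mm (minimum should match) and tie (the tiebreaker) fit:

```python
import requests

# Hypothetical Solr endpoint and fields, for illustration only.
SOLR_SELECT = "http://localhost:8983/solr/books/select"

params = {
    "q": "history of printing",
    "defType": "dismax",
    "qf": "ocr_text^1.0 title^5.0 author^2.0",  # query fields with boosts
    "mm": "2<-1 5<80%",  # minimum-should-match: relax as queries get longer
    "tie": "0.1",        # let non-max fields contribute a little to the score
    "fl": "id,title,score",
    "wt": "json",
    "rows": 10,
}

response = requests.get(SOLR_SELECT, params=params)
for doc in response.json()["response"]["docs"]:
    print(doc["score"], doc.get("title"))
```

Raising tie toward 1.0 moves dismax from "take the best field" toward summing field scores, while mm controls how many query terms must match before a document is considered at all.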
by Tamar Sadeh
Tamar Sadeh, PhD, Ex Libris Group, firstname.lastname@example.org
The greatest challenge for discovery systems is how to provide users with the most relevant search results, given the immense landscape of available content. In a manner that is similar to human interaction between two parties, in which each person adjusts to the other in tone, language, and subject matter, discovery systems would ideally be sophisticated and flexible enough to adjust their algorithms to individual users and each user’s information needs.
When evaluating the relevance of an item to a specific user in a specific context, relevance-ranking algorithms need to take into account, in addition to the degree to which the item matches the query, information that is not embodied in the item itself. Such information, which includes the item’s scholarly value, the type of search that the user is conducting (e.g., an exploratory search or a known-item search), and other factors, enables a discovery system to fulfill user expectations that have been shaped by experience with Web search engines.
The session will focus on the challenges of developing and evaluating relevance-ranking algorithms for the scholarly domain. Examples will be drawn mainly from the relevance-ranking technology deployed by the Ex Libris Primo discovery solution.
Jørn Thøgersen, Statsbiblioteket/State and University Library, Aarhus, Denmark. email@example.com
Michael Poltorak Nielsen, Statsbiblioteket/State and University Library, Aarhus, Denmark. firstname.lastname@example.org, (aka the Danes - some of them).
Web-based library search engines are traditionally operated using keys, input fields, buttons, and links. Being equipped with touch screens, accelerometers, GPSs, and cameras, smartphones and tablets offer a whole new range of input options.
In this talk we'll demonstrate some of our ideas for how to utilise these new input options when interacting with a search engine. The basic idea is to have no traditional GUI input elements, but only use touch interactions (pinch, zoom, swipe, long-press, etc.) and gestures (shake, tilt, turn, etc.). Using these interactions, we’ll demonstrate how to:
toggle search result views
request materials, add to favourites
interact with your stuff, renew items
We'll also show you some (conceptual) ideas about using the device camera for locating and checking out materials.
On a general level, what we are trying to achieve is a move away from a web-based paradigm and towards new ways of interaction better suited to the new devices on their own terms. The demonstration will feature working mobile prototypes, including both native apps (iPhone) and web apps. In both cases they will run on live data from our OPAC at www.statsbiblioteket.dk/search/
This talk is actually also a continuation of our Code4Lib 2010 talk called "Kill The Search Button" (http://code4lib.org/conference/2...), which we unfortunately never got around to doing, due to a Danish blizzard.
by Lisa Kurt
Lisa Kurt, University of Nevada, Reno, email@example.com
Users expect good design. This talk will delve into what makes really great design, what to look for, and how to do it. Learn the principles of great design to take your applications, user interfaces, and projects to a higher level. With years of experience in graphic design and illustration, Lisa will discuss design principles, trends, process, tools, and development. Design examples will be from her own projects as well as a variety from industry. You’ll walk away with design knowledge that you can apply immediately to a variety of applications and a number of top notch go-to resources to get you up and running.
by Robin Chandler, Susan Chesley Perry and Kevin S. Clarke
Robin Chandler, University of California (Santa Cruz), chandler [at] ucsc [dot] edu
Susan Chesley Perry, University of California (Santa Cruz), chesley [at] ucsc [dot] edu
Kevin S. Clarke, University of California (Santa Cruz), ksclarke [at] ucsc [dot] edu
The Grateful Dead Archive at the University of California (Santa Cruz) is a collection of over 600 linear feet of material, including: business records, photographs, posters, fan envelopes, tickets, video, audio (oral histories, interviews and music) and 3-d objects such as stage props and band merchandise. In addition, with the release of the Grateful Dead Archive Online website in 2012, the Archive will start actively collecting artifacts from an enthusiastic community of Grateful Dead fans.
This talk will discuss the challenges of merging a traditional archive with a socially constructed one. We will also present the first round of development and explain how we're using tools like Omeka, ContentDM, UC3 Merritt, djatoka, Kaltura, Google Maps, and Solr to lay the foundation for a robust and engaging site. Future directions, like the integration/development of better curation tools and what we hope to learn from opening the archive to contributions from a large community of fans, will also be discussed.
Kirk Hess, Digital Humanities Specialist, University of Illinois Urbana-Champaign, firstname.lastname@example.org
The presentation will review tracking search queries, adding events for actions such as clicking external links or downloading files, and using custom variables to capture user behavior that is normally difficult to track. We'll also discuss using jQuery scripts to add tracking code to the page without having to modify the underlying web application. Once you've collected data, you may use the Google Analytics API to extract it and integrate it with data from your digital repository to show granular data about individual items in your digital library. Finally, we'll discuss how this information allows you to improve the user experience, and summarize some of the research we are doing with our digital repository and the data gathered from Google Analytics.
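As a rough sketch of the extraction step (assuming the v3 Core Reporting API with the google-api-python-client and oauth2client libraries; the credential file name, profile ID, and event dimensions below are placeholders), pulling event counts back out of Google Analytics might look like this:

```python
import httplib2
from apiclient.discovery import build
from oauth2client.file import Storage

# Assumes OAuth credentials were stored earlier via a one-time authorization flow;
# "analytics.dat" and the profile (view) ID are placeholders.
credentials = Storage("analytics.dat").get()
http = credentials.authorize(httplib2.Http())
service = build("analytics", "v3", http=http)

result = service.data().ga().get(
    ids="ga:12345678",
    start_date="2012-01-01",
    end_date="2012-02-01",
    metrics="ga:totalEvents",
    dimensions="ga:eventCategory,ga:eventAction,ga:eventLabel",
    sort="-ga:totalEvents",
).execute()

# Each row is [category, action, label, count]; join these back to repository items
# by whatever identifier you put in the event label.
for category, action, label, count in result.get("rows", []):
    print(category, action, label, count)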
by Cory Lown
Cory Lown, North Carolina State University Libraries, email@example.com
Searching the library is complex. There's the catalog, article databases, journal title and database title look-ups, the library website, finding aids, knowledge bases, etc. How would users search if they could get to all of these resources from a single search box? I'll share what we've learned about single search at NCSU Libraries by tracking use of QuickSearch (http://www.lib.ncsu.edu/search/i...), our home-grown unified search application. As part of this talk I will suggest low-cost ways to collect real world use data that can be applied to improve search. I will try to convince you that data collection must be carefully planned and designed to be an effective tool to help you understand what your users are telling you through their behavior. I will talk about how the fragmented library resource environment challenges us to provide useful and understandable search environments. Finally, I will share findings from analyzing millions of user transactions about how people search the library from a production single search box at a large university library.
William Gunn, Mendeley firstname.lastname@example.org (@mrgunn)
This is partly a tool talk and partly a big idea one.
Mendeley has built the world's largest open database of research and we've now begun to collect some interesting social metadata around the document metadata. I would like to share with the Code4Lib attendees information about using this resource to do things within your application that have previously been impossible for the library community, or in some cases impossible without expensive database subscriptions. One thing that's now possible is to augment catalog search by surfacing information about content usage, allowing people to not only find things matching a query, but popular things or things read by their colleagues. In addition to augmenting search, you can also use this information to augment discovery. Imagine an online exhibit of artifacts from a newly discovered dig not just linking to papers which discuss the artifact, but linking to really good interesting papers about the place and the people who made the artifacts. So the big idea is, "How will looking at the literature from a broader perspective than simple citation analysis change how research is done and communicated? How can we build tools that make this process easier and faster?" I can show some examples of applications that have been built using the Mendeley and PLoS APIs to begin to address this question, and I can also present results from Mendeley's developer challenge, which shows what kinds of applications researchers are looking for, what kinds of applications people are building, and illustrates some interesting places where the two don't overlap.
by Annie Cain
Annie Cain, Harvard Library Innovation Lab, email@example.com
In an effort to recreate and build upon the traditional method of browsing a physical library, we used catalog data, including dimensions and page count, to create a virtual shelf.
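A minimal sketch of the idea (the field names and scaling factors are invented for illustration, not taken from the Harvard implementation): map each item's physical dimensions and page count to the drawn size of its spine, then lay the spines out in call number order.

```python
from dataclasses import dataclass

@dataclass
class Book:
    call_number: str
    title: str
    height_cm: float  # from the catalog's physical description
    pages: int

def spine_box(book, px_per_cm=6, px_per_page=0.05):
    """Scale catalog measurements into a drawable spine (height x thickness)."""
    height_px = round(book.height_cm * px_per_cm)
    thickness_px = max(8, round(book.pages * px_per_page))
    return {"title": book.title, "height": height_px, "width": thickness_px}

def virtual_shelf(books):
    """Order books as they would sit on a shelf (naive sort; real LC call
    number ordering needs a normalizer)."""
    return [spine_box(b) for b in sorted(books, key=lambda b: b.call_number)]

shelf = virtual_shelf([
    Book("QA76.9.D3 D38", "Database Design", 24.0, 412),
    Book("QA76.76.H94 H86", "HTML and Friends", 23.0, 250),
])
for spine in shelf:
    print(spine)
```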
by Jeremy Nelson
Jeremy Nelson, Colorado College, firstname.lastname@example.org
In October, the Library of Congress issued a news release, "A Bibliographic Framework for the Digital Age" outlining a list of requirements for a New Bibliographic Framework Environment. Responding to this challenge, this talk will demonstrate a Redis (http://redis.io) FRBR datastore proof-of-concept that, with a lightweight python-based interface, can meet these requirements.
Because FRBR is an entity-relationship model, it is easily implemented as key-value pairs within the primitive data structures provided by Redis. Redis' flexibility makes it easy to associate arbitrary metadata and vocabularies, like MARC, METS, VRA or MODS, with FRBR entities and to interoperate with legacy and emerging standards and practices like RDA Vocabularies and Linked Data.
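A minimal sketch of what such a key-value mapping could look like with the redis-py client (the "frbr:work:..." key scheme is invented for illustration and is not necessarily the proof-of-concept's actual schema):

```python
import redis

r = redis.StrictRedis(host="localhost", port=6379, db=0)

# A FRBR Work stored as a Redis hash.
work_id = r.incr("frbr:work:counter")
work_key = f"frbr:work:{work_id}"
r.hset(work_key, "title", "Moby Dick")
r.hset(work_key, "creator", "Melville, Herman")

# An Expression of that Work, linked in both directions.
expr_id = r.incr("frbr:expression:counter")
expr_key = f"frbr:expression:{expr_id}"
r.hset(expr_key, "language", "eng")
r.hset(expr_key, "realization_of", work_key)
r.sadd(f"{work_key}:expressions", expr_key)

# Arbitrary legacy metadata (e.g. a raw MARC field) can hang off the same entity.
r.hset(work_key, "marc:245a", "Moby Dick, or, The whale /")

print(r.hgetall(work_key))
print(r.smembers(f"{work_key}:expressions"))
```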
by Carmen Mitchell
Carmen Mitchell, carmenmitchell at gmail (@carmendarlene)
code4lib 2012, Wednesday, February 8 2012, 11:15-12:00
a.k.a. "Human Search Engine". A chance for you to ask a roomful of code4libbers anything that's on your mind: questions seeking answers (short or long), requests for things (hardware, software, skills, or help), or offers of things. We'll keep the pace fast, and the answers faster. Come with questions and line up at the start of the session and we'll go through as many as we can; sometimes we'll stop at finding the right person or people to answer a query and it'll be up to you to find each other after the session. Third time at code4libcon! (Thanks to Ka-Ping Yee and Dan Chudnov for the inspiration/explanation, reused here in part.)
by Scott Fisher and Erik Hetzner
Scott Fisher, California Digital Library, scott.fisher AT ucop BORK edu
Erik Hetzner, California Digital Library, erik.hetzner AT ucop BORK edu
The Web Archiving Service at the California Digital Library has crawled a large amount of data, in every format found on the web: 30 TB, comprising about 600 million fetched URLs. In this talk we will discuss how we parsed this data using Tika and map-reduce, and how we indexed this data with Solr, tweaked the relevance ranking, and were able to provide our users with a better search experience.
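As a toy illustration of the parse-and-index step (using the tika and pysolr Python packages against a made-up Solr core and schema, rather than the actual Hadoop-based workflow described in the talk):

```python
import pysolr
from tika import parser  # uses a local Tika server for text extraction

# Hypothetical Solr core; the field names assume the core's schema defines them.
solr = pysolr.Solr("http://localhost:8983/solr/webarchive", timeout=30)

def index_capture(url, local_path):
    """Extract text and metadata from one fetched resource and index it."""
    parsed = parser.from_file(local_path)
    doc = {
        "id": url,
        "url": url,
        "content": parsed.get("content") or "",
        "content_type": (parsed.get("metadata") or {}).get("Content-Type"),
    }
    solr.add([doc])

index_capture("http://example.org/page.html", "/tmp/page.html")
solr.commit()
```

At web-archive scale the same extract-then-index logic runs inside map-reduce tasks, but the per-document work is essentially this.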
by Jason Casden
Jason Casden, North Carolina State University Libraries, email@example.com
When it comes to storing data in web browsers on a semi-persistent basis, there are several partially-adopted, semi-deprecated, product-specific, or even universally accepted options. These include models such as key-value stores, relational databases, and object stores. I will present some of these options and discuss possible applications of these technologies in library services. In addition to quoting heavily from Mark Pilgrim's excellent chapter on this topic, I will weave in my own experience utilizing in-browser data storage in an iPad-based data collection tool to successfully improve performance and data stability while reducing network dependence. See also: HTML5.
by James Stuart
James Stuart, Columbia University, firstname.lastname@example.org
We've all heard about that one study that showed that Pair Programming was 20% more efficient than working alone. Or maybe you saw on a blog that study that showed that programmers who write fewer lines of code per day are more efficient...or was it less efficient? And of course, we all know that programmers who work in (Ruby|Python|Java|C|Erlang) have been shown to be more efficient.
A quick examination of some of the research surrounding programming efficiency and methodology, with a focus on personal productivity, and how to incorporate the more believable research into your own team's workflow.
by Naomi Dushay
Naomi Dushay, Stanford University Libraries, email@example.com
code4lib 2012, Wednesday, February 8 2012, 14:00-14:20
Agile development techniques can be difficult to adopt in the context of library software development. Maybe your shop has only one or two developers, or you always have too many simultaneous projects. Maybe your new projects can’t be started until 27 librarians reach consensus on the specifications.
This talk will present successful Agile- and Silicon-Valley-inspired practices we’ve adopted at Stanford and/or in the Blacklight and Hydra projects. We’ve targeted developer happiness as well as improved productivity with our recent changes. User stories, dead week, sight lines … it’ll be a grab bag of goodies to bring back to your institution, including some ideas on how to adopt these practices without overt management buy-in.
by Robin Schaaf
Robin Schaaf, University of Notre Dame, firstname.lastname@example.org
UI development is hard and too often ends up as an afterthought for computer programmers - if you were a CS major in college I'll bet you didn't have many, if any, design courses. I'll talk about how to involve users up front with design and some common pitfalls of this approach. I'll also make a case for why you should do the screen design before a single line of code is written. And I'll throw in some ideas for increasing the usability and attractiveness of your web applications. I'd like to present the UI development of our open source ERMS as a case study.
Shaun Ellis, Princeton University Libraries, email@example.com
"The code itself is unimportant; a project is only as useful as people actually find it." - Linus Torvalds 
Usability has been a buzzword for some time now, but what is the process for making the transition toward a better user experience, and hence, better designed library sites? I will discuss one facet of the process my team is using to redesign the Finding Aids site for Princeton University Libraries (still in development). The approach involves the use of rapid prototyping, with Bootstrap, to make sure we are on track with what users and stakeholders expect up front, and throughout the development process.
Mike Schultz, formerly Summon Search Architect, firstname.lastname@example.org
Solr/Lucene provides a lot of flexibility for adjusting relevancy scoring and improving search results. Roughly speaking, there are two areas of concern: first, a 'dynamic rank' calculation that is a function of the user query and document text fields; and second, a 'static rank' that is independent of the query and is generally a function of non-text document metadata. In this talk I will outline an easily understood, hand-tunable static rank system with a minimal number of parameters.
The obvious major feature of a search engine is to return results relevant to a user query. Perhaps less obvious is the huge role query independent document features play in achieving that. Google's PageRank is an example of a static ranking of web pages based on links and other secret sauce. In the Summon service, our 800 million documents have features like publication date, document type, citation count and Boolean features like the-article-is-peer-reviewed. These fields aren't textual and remain 'static' from query to query, but need to influence a document's relevancy score. In our search results, with all query related features being equal, we'd rather have more recent documents above older ones, Journals above Newspapers, and articles that are peer reviewed above those that are not. The static rank system I will describe achieves this and has the following features:
Query-time only calculation - nothing is baked into the index - with parameters adjustable at query time
The system is based on a signal metaphor where components are 'wired' together. System components allow multiplexing, amplifying, summing, tunable band-pass filtering, string-to-value-mapping all with a bare minimum of parameters.
An intuitive approach for mixing dynamic and static rank that is more effective than simple adding or multiplying.
A way of equating disparate static metadata types that leads to understandable results ordering.
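To make the signal metaphor a bit more concrete, here is a deliberately simplified, hypothetical sketch (not the Summon implementation; the feature names, weights, and curves are invented) of wiring a few static document features into one query-independent score and then blending it naively with a dynamic score:

```python
import math

def recency_signal(pub_year, current_year=2012, half_life=10.0):
    """Newer documents emit a stronger signal; decays exponentially with age."""
    age = max(0, current_year - pub_year)
    return math.exp(-age / half_life)

def doc_type_signal(doc_type):
    """A string-to-value-mapping component for a categorical field."""
    return {"journal_article": 1.0, "newspaper_article": 0.4}.get(doc_type, 0.6)

def citation_signal(citations, midpoint=20.0):
    """Saturating response so a handful of highly cited papers don't dominate."""
    return citations / (citations + midpoint)

def static_rank(doc, weights=(0.5, 0.3, 0.2)):
    """Sum weighted component signals into one query-independent score in [0, 1]."""
    w_rec, w_type, w_cite = weights
    score = (w_rec * recency_signal(doc["pub_year"])
             + w_type * doc_type_signal(doc["doc_type"])
             + w_cite * citation_signal(doc["citation_count"]))
    if doc.get("peer_reviewed"):
        score = min(1.0, score * 1.1)  # small bump for peer-reviewed items
    return score

def blended_score(dynamic, static, alpha=0.8):
    """Naive linear mix of dynamic and static rank (the talk argues for a more
    effective combination than simple adding or multiplying)."""
    return alpha * dynamic + (1 - alpha) * static

doc = {"pub_year": 2009, "doc_type": "journal_article",
       "citation_count": 57, "peer_reviewed": True}
print(static_rank(doc), blended_score(dynamic=4.2, static=static_rank(doc)))
```

Because nothing here is baked into the index, the weights and curve parameters can be adjusted at query time, which is the property the talk emphasizes.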
6th–9th February 2012