Migrating from a multi-user blog to a personal blog

post_author == 2 && is_single()) {
    header('Location: ' . str_replace('http://authoritativeopinion.com/', 'http://cbeer.info/', $post->guid ));
    die();
  }
}
?>

Posted in Uncategorized.


Digital Asset Management for Public Broadcasting: Interlude

Just a quick update on my progress developing a shareable prototype. The basic integration work is functional. I’ve ripped out the previously-mentioned Camel workflow components in favor of ruote (which is so much easier to wrap my mind around; I’ve pushed the skeleton code for this out as a separate package called fedora-workflow), and I’ve started doing some very basic datastream display work.

After this work is complete, I think a first-round alpha will be ready to publish within the next couple weeks.

Posted in Repository, TODO.


Digital Asset Management for Public Broadcasting: Blacklight (Part 3 of ??)

In the previous parts, I wrote about two “back-office” open source applications (and tangentially discussed a few others) that are well-established in their communities and can support a wide variety of repository services. While it may be philosophically important that these are open source applications, I would argue that the next parts, which cover the services and applications built on top of the repository infrastructure, are more crucial. These layers benefit tremendously from being open source, because anyone with a fairly broad skill-set can create and customize interfaces for specific use cases to whatever extent is necessary.

Blacklight grew out of a next-generation library catalog interface, and while it still has very firm roots in the library world, it is also being used for archives, digital collections, and institutional repository interfaces. It is also an open source application, based on the Ruby on Rails framework.

Out of the box, it is a fairly generic interface to a Solr index (with a little sprinkling of optional MARC data) and some relatively benign application features (users, bookmarks, saved searches). Connecting it to our existing Solr index is fairly trivial and requires only a few small configuration changes:

config[:index_fields] = {
  :field_names => [
    "dc.description",
    "dc.creator",
    "dc.publisher",
    "dc.subject",
    "dc.date",
    "dc.format"
  ],
  :labels => {
    "dc.description" => "Description:",
    "dc.creator"     => "Creator:",
    "dc.publisher"   => "Publisher:",
    "dc.subject"     => "Subject:",
    "dc.date"        => "Date:",
    "dc.format"      => "Format:"
  }
}

This gives you a very basic discovery interface into your collection.

Extending Blacklight to work with Fedora is also easy: in fewer than 50 lines of code, I had full access to the Fedora web services APIs and SPARQL interface. Adding management interfaces was also simple using normal Ruby on Rails techniques; with fewer than 500 lines of code, a passable repository manager interface was available and I could import assets and metadata.

Adding a security layer on top of the repository content is also easy, thanks to the work the UPEI team put into the DrupalServletFilter, which allows Fedora to authenticate users against any SQL database. Because of this, we can use the XACML policy language built into Fedora to do record-level security (I confess I don’t entirely understand XACML, but it is an enormously powerful and expressive language if you like XML verbiage). For storing re-use rights, I am very intrigued by the Open Digital Rights Language, which can integrate with Fedora and Blacklight to express non-object-security rights (re-use, segmentation, etc) using my proof-of-concept ruby-odrl.

With these fundamentals in place (ingest services, security policies, and resource discovery), one can build more advanced services on top of the repository, like collections, batch and on-demand conversion/transcode services, and export/transfer services (one-click “export to PBS COVE”?). Because these can be built as Rails plug-ins, they are readily sharable outside of this single application and provide templates for others to continue to develop and extend similar services to evolving platforms.

Because setting up a Blacklight application is so painless, it would be easy for public broadcasting institutions to create custom-made (yet shareable) modules and views for specific purposes (news, productions, archiving, etc) that all share the same back-end infrastructure yet offer users an easy way to interact with their data in a way that makes sense for their work. As I mentioned in my Fedora article, you aren’t limited to data you control and have locally, but can bring in data from external sources (say, pulling in metadata from the NPR API or an RSS feed from a stock footage house) and present it both coherently and cohesively.

I’m looking for a good source of freely available test data, and I would rather not invest too much time building a corpus of archival assets if something suitable already exists. The biggest challenge is finding comprehensive metadata. The closest I’ve come are some podcast feeds from sources like Democracy Now!, but they don’t capture the breadth of materials I’d like to demonstrate.

Finally, a couple requisite screen-shots now that there is something visual to work with, using the default Blacklight theme with some quick interface hacks.

Posted in Repository, TODO.



Digital Asset Management for Public Broadcasting: Solr (Part 2 of ??)

The Lucene-based Apache Solr is an incredible platform for building decent search experiences, especially compared to the “more traditional” database-driven approach, where queries involve so many SQL JOINs that it becomes difficult to efficiently add search features like stemming, ASCII-folding, term highlighting, facets, and synonyms. I would argue these are essential parts of the discovery experience, and you essentially get them for free with Solr. Another benefit Solr provides is a foundation for many light-weight interfaces on top of a single index (or across multiple indexes, because Solr enforces some decent scalability principles that make expanding to task-based indexes easier).

For a DAM project, each asset should appear in the search index with the basic layer of contributed metadata, relationships, metadata extracted from the assets, as well as the administrative metadata managed by Fedora. I would align the fields with the Dublin Core (and DCTerms) elements (which is probably all you can get users to contribute in any case). At this point, because legacy systems lack authority control, linked data, and the like, existing metadata is sparse, inaccurate, or limited. That sets the entry-level bar pretty low, so ease-of-use and metadata collection are the priorities. Eliding a lot of detail, here’s the skeleton schema:

  <field name="id" type="string" indexed="true" stored="true" required="true" />
  <field name="title" type="string" indexed="true" stored="true" multiValued="true"/>
  <field name="description" type="string" indexed="true" stored="true"/>

  <dynamicField name="dc.*" type="string" indexed="true" stored="true" multiValued="true"/>
  <dynamicField name="dcterms.*" type="string" indexed="true" stored="true" multiValued="true"/>
  <dynamicField name="rdf.*" type="string" indexed="true" stored="true" multiValued="true"/>
  <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
  <field name="payloads" type="payloads" indexed="true" stored="true"/>
  <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>

  <copyField source="title" dest="title_t" />
  <copyField source="subject" dest="dc.subject" />
  <copyField source="description" dest="description_t" />
  <copyField source="comments" dest="text" />
  <copyField source="dc.creator" dest="author" />
  <copyField source="dc.*" dest="text" />
  <copyField source="text" dest="text_rev" />
  <copyField source="payloads" dest="text" />

  <copyField source="dc.title" dest="dc.title_t" />
  <copyField source="dc.description" dest="dc.description_t" />
  <copyField source="dc.coverage" dest="dc.coverage_t" />
  <copyField source="dc.contributor" dest="dc.contributor_t" />
  <copyField source="dc.subject" dest="dc.subject_t" />
  <copyField source="dc.contributor" dest="names_t" />
  <copyField source="dc.coverage" dest="names_t" />

The new edismax query parser provides such a good balance of flexibility, advanced query features, and ease-of-use that it seems like an obvious choice here.
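To make that concrete, here is a rough sketch of what a request using edismax might look like. The Solr URL, the field boosts, and the choice of facet and highlight fields are my assumptions for illustration; the parameter names (defType, qf, facet.field, hl.fl) are standard Solr request parameters.

```ruby
require 'uri'

# Hypothetical Solr location; adjust to your own install.
SOLR_SELECT = 'http://localhost:8983/solr/select'

# Build a select URL that hands the user's query to the edismax parser.
def edismax_url(user_query, base: SOLR_SELECT)
  params = {
    'defType'     => 'edismax',
    'q'           => user_query,
    # Field names follow the skeleton schema; the boosts are guesses.
    'qf'          => 'title_t^5 dc.subject_t^2 text',
    'facet'       => 'true',
    'facet.field' => 'dc.subject',
    'hl'          => 'true',
    'hl.fl'       => 'description'
  }
  "#{base}?#{URI.encode_www_form(params)}"
end

puts edismax_url('civil rights "march on washington"')
```

The user's query string goes in untouched, which is most of the appeal: quoted phrases and the like are handled by the parser rather than by application code.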

The only penalty you pay by using Solr is having to keep the Solr index synchronized with your data sources. For synchronizing data from Fedora, there is now a proliferation of options, ranging from task-specific Java plugins like GSearch and Shelver to the more generic (ESBs and all that), like Apache Camel or the Ruote-based Fedora Workflow component. Because DAM likely involves many different workflows, I lean towards the more generic solutions. Lately, I’ve given Camel a try, and after a couple days of Java-dependency-induced head pounding, I have something that works.
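Whatever tool does the plumbing, the core of the synchronization step is small. Below is a minimal sketch, not the Camel route I actually use, that turns a Dublin Core record into a JSON body of the kind recent Solr versions accept on their update handler. The method name and the input shape (element name mapped to one or more values) are hypothetical.

```ruby
require 'json'

# Turn a Dublin Core record (element name => value or list of values)
# into a JSON array containing one Solr document. Field names use the
# "dc." prefix so they match the dynamicField in the skeleton schema.
def solr_add_body(pid, dc_record)
  doc = { 'id' => pid }
  dc_record.each do |element, values|
    values = Array(values).map(&:to_s).reject(&:empty?)
    doc["dc.#{element}"] = values unless values.empty?
  end
  JSON.generate([doc])
end

puts solr_add_body('demo:1',
                   'title'   => ['Evening News'],
                   'creator' => 'WGBH',
                   'subject' => [])
```

Empty elements are dropped rather than indexed as blank strings, which keeps facets clean when the source metadata is sparse.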

On Twitter, John Tynan requested a virtual machine image to encourage others to begin playing with this software, so I’ve actually begun building some of these pieces. Currently, I have Fedora/Camel/Solr/Blacklight installed and functional, but before I try to package it up, I feel like I should add an easy-to-use ingest system to get data in.

Posted in Repository, TODO.



Digital Asset Management for Public Broadcasting: Fedora Commons Repository (Part 1 of ??)

In my previous post, I provided a broad overview of the challenges and opportunities for developing an open source digital asset management system within the public broadcasting community, and described some fundamental technology that is already being developed and deployed within institutions. In this post, I want to look specifically at the role the Fedora Commons repository architecture can play in this environment. Additional reading is available from the Fedora Commons wiki, especially the Getting Started with Fedora article, which articulates some of the strengths of their approach in the abstract.

The Fedora Commons data model is built on top of the Kahn/Wilensky Architecture, which describes a data structure for primary digital objects (irrespective of the data or formats contained within). Already, this is an improvement over some systems, which differentiate between content types, relegating some content formats to second-class citizenship. By providing a single, fundamental data type, one can build consistent user experiences on top of the discoverable components and interact with the digital objects to GET THINGS DONE.

Within digital objects are datastreams, which may include both data and metadata about the object, and are treated equally (more or less…). Datastreams can carry revision information, integrity checks, and other provenance information. By not distinguishing between “digital” assets (for which the data, e.g. the media files, are available electronically) and other kinds of assets (physical tapes, abstract entities, etc), an asset management system can encompass the full range of materials within an active media archive.

Digital objects can be assigned content model types, which stipulate the required (and optional) component datastreams, as well as define the services that operate on objects of that type. These content types are simply structured digital objects within the repository, allowing repository managers (and content creators, given a sufficient interface) to define the structure of their content rather than structuring their content to meet the needs of the digital asset management system.

Types of datastreams natively supported include Inline XML datastreams, Managed Content, Externally Referenced Content, and Redirects. The datastream types do not speak to the format of content stored within them (except for inline XML), which allows content creators to easily provide content to the repository without first worrying about transcoding materials or other barriers to accessioning content (which is certainly not to say that standardizing content types archived within the repository is problematic — just that it shouldn’t interfere with getting the materials in the first place). This variety of types allows content to be stored and managed in the most appropriate places, rather than arbitrarily requiring centralization or “physical” ownership of content. Within a distributed organization like public broadcasting, this could be a powerful concept that allows content creators to control and manage their content at various stages of distribution (and, while this could be accomplished within traditional database driven systems, it would require custom application logic to do, which is likely not scalable across a wide variety of applications, frameworks, and languages).
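To illustrate the idea, here is a toy model of a digital object and its datastreams. It is only a sketch for thinking about the structure, not Fedora's implementation: the single-letter codes are the control-group codes Fedora uses for the four types, while the class names, the sample object, and the example.org URL are made up.

```ruby
# Control-group codes for the four datastream types.
CONTROL_GROUPS = {
  'X' => 'Inline XML',
  'M' => 'Managed Content',
  'E' => 'Externally Referenced Content',
  'R' => 'Redirect'
}.freeze

Datastream = Struct.new(:id, :control_group, :mime_type, :location) do
  def type_name
    CONTROL_GROUPS.fetch(control_group)
  end

  # Only inline XML and managed content live inside the repository.
  def stored_locally?
    %w[X M].include?(control_group)
  end
end

DigitalObject = Struct.new(:pid, :datastreams) do
  # Datastreams whose bytes stay under someone else's control.
  def external_datastreams
    datastreams.reject(&:stored_locally?)
  end
end

tape = DigitalObject.new('demo:tape-42', [
  Datastream.new('DC', 'X', 'text/xml', nil),
  Datastream.new('PROXY', 'M', 'video/mp4', nil),
  Datastream.new('MASTER', 'E', 'video/quicktime',
                 'http://media.example.org/masters/42.mov')
])

tape.datastreams.each { |ds| puts "#{ds.id}: #{ds.type_name}" }
puts tape.external_datastreams.map(&:id).inspect
```

Note that nothing in the model cares what format the MASTER file is in; the type only records where the content lives.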

While all datastreams are equal, there are four (or more?) that are more equal than others:

- AUDIT, which stores the history of the digital object as it is modified.

- DC, a Qualified Dublin Core datastream, that provides a minimal level of interoperability for the most generic of repository management interfaces. This is also the only fundamentally required datastream (without specifying required elements within it), and really is the bare minimum of information necessary to assert the existence of an object (if it doesn’t have a title, identifier, or description, what is it we’re talking about exactly?)

- RELS-EXT (and INT), an RDF-XML datastream in which one can assert relationships to other digital objects (which may exist within the repository, but may also exist (or not exist) elsewhere). These relationships can be from any vocabulary and reference any type of object, which is handy when you are dealing with complex relationships between media archive assets. This datastream is also generally indexed in an RDF triple-store to provide relationship querying.

- POLICY, which stores XACML security policies for the digital object, which can be used to restrict access to the datastreams, services, or the object based on whatever the security needs are. Within the digital asset management context, this could also be used to restrict access to only media files, while still providing the metadata (so one could assert and describe the existence of an object without actually sharing it for whatever reason, which seems atypical for some commercial solutions).
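As an illustration of the RELS-EXT datastream described above, here is a small sketch that builds the RDF/XML by hand. The namespace URIs and the isMemberOfCollection and isDerivationOf predicates come from Fedora's relationship ontology as I understand it; the helper names, the PIDs, and the example.org target are hypothetical, and a real application would more likely use an RDF library.

```ruby
RDF_NS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
REL_NS = 'info:fedora/fedora-system:def/relations-external#'

# Escape the characters that are unsafe inside an XML attribute.
def xml_attr(value)
  value.gsub('&', '&amp;').gsub('<', '&lt;').gsub('"', '&quot;')
end

# Build RDF/XML asserting relationships from one object to others.
# `relations` is a list of [predicate, target URI] pairs; targets may
# point inside or outside the repository.
def rels_ext(pid, relations)
  statements = relations.map do |predicate, target|
    %(    <rel:#{predicate} rdf:resource="#{xml_attr(target)}"/>)
  end
  [
    %(<rdf:RDF xmlns:rdf="#{RDF_NS}" xmlns:rel="#{REL_NS}">),
    %(  <rdf:Description rdf:about="info:fedora/#{xml_attr(pid)}">),
    *statements,
    '  </rdf:Description>',
    '</rdf:RDF>'
  ].join("\n")
end

puts rels_ext('demo:1', [
  ['isMemberOfCollection', 'info:fedora/demo:news'],
  ['isDerivationOf', 'http://example.org/tapes/42']
])
```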

By default, these datastreams (and the digital object wrapper) are stored on the file system in relatively comprehensible ways, which is a bonus to implementors who can set up underlying hardware or other technology in traditional ways and just begin to use the software without too much fuss. There is ongoing development to build in support for additional and evolving standards around digital object storage, serialization, access, and other services which should only help with making the process as transparent as possible.

All of this technology and flexibility comes “free” with the repository architecture and doesn’t try to interfere with actually making use of the assets (except as restricted by security policies, of course), which allows different use cases to be expressed in the most logical and straightforward way (rather than trying to bend the use cases or system in an attempt to mimic some of the elements the user needs). As a starting point for developing a digital asset management solution for media, I believe it offers a good balance of flexibility and requirements that can ensure user needs are met without sacrificing durability.

So, how can Fedora be applied in a digital asset management context for public broadcasting? First and foremost, Fedora provides a trusted platform for managing and maintaining content for many different contexts (production, long-term archiving, etc) on top of a variety of hardware and standards. By managing metadata and data together, physical and digital assets can be revealed in a common interface (when appropriate) to meet the needs of researchers and scholars (for whom the knowledge of the existence of the asset is more essential than on-demand access). Finally, by offering a stable API to a variety of resources, use-case driven interfaces can be developed, shared, and maintained to meet different needs sensibly.

Posted in Repository, TODO.



Digital Asset Management for Public Broadcasting (Part 0 of ?)

Digital asset management is hard. Many people have solved many parts of the problem, but for a reasonably complex use-case, many of the existing solutions just aren’t there yet. This is especially true in a vendor-driven world serving a niche market within a niche market: public broadcasting is concerned with all levels and life-cycles of an asset (from production, to reuse, to archiving and back again), and given public broadcasting budgets it is almost certainly not a profitable market. I believe this is an ideal area for the development of open source solutions based on some existing works of open source software.

The “easy” part in the DAM ecosystem, I would argue, is archiving the material and ensuring its long-term preservation (and accessibility!). I’ve done a couple projects and prototypes now based on the Fedora Commons repository architecture, and it seems to be a promising platform for this kind of development. Objects and datastreams are stored on the file-system, which IT staff are traditionally prepared to manage (vs some unique database structure almost certainly obfuscated in layers of (de-)normalization). Fedora will happily manage security policies, object relationships, data transformation services, and (shortly) more advanced file system interactions, while exposing a (relatively) consistent HTTP interface.

Discovery interfaces are probably the next easiest piece, having been examined and developed out of the information sciences communities. Using a combination like Solr and Blacklight (deployed successfully for WGBH’s Open Vault website), one can rapidly create interfaces to the underlying content that satisfy the many use cases. With Solr, you get a bunch of discovery mechanisms and options, including relevancy, term highlighting, faceting, etc.

From here, we start getting into the hard parts. Ingest and metadata editing are difficult to solve well in a content- and use-case-agnostic way, which is the approach most systems seem to take. While the need for a generic asset management view is important (and solved!), if the collection of services fails to meet the needs of the users, encouraging adoption (nicely) is problematic. By using infrastructure elements with open and well-documented APIs, developers can extend and customize the user experiences to match the underlying data and processes. This is an area in which the adoption and support of open source projects can encourage sustainable development of these interfaces.

It seems like, after clearing these obstacles, many systems fail to account for the use and re-use of these objects within the media communities. Few systems account for batch encoding video and audio for web distribution, one-click publishing systems to blogs, social networking sites, or video portals, integration into broadcasting chains, etc — for very good reasons, there simply isn’t the incentive when faced with large upfront development costs for unique development. Given an open source platform, however, that supports (and encourages) sharable development of solutions, maybe we could start finding answers to these persistent problems (without re-inventing the wheel!).

I believe most of the core infrastructure pieces are there:
- Fedora, as I mentioned, which provides preservation and management services;
- Solr, which provides a discovery framework (and associated metadata extraction utilities like Tika);
- Blacklight, which provides discovery and access services;
- ESB or other workflow solutions like Camel, Ruote, or otherwise;
- Generic metadata editing options, like XForms, Django, etc;
- Open standards that allow for publishing and reuse (Atom, MediaRSS, RDF, ???);
- FFMPEG, which offers encoding and transcode services.

It isn’t an extensive development problem. These are well-established communities in their fields; it’s a simple matter of getting initial momentum in tying the complex pieces together and creating interesting and useful services on top.

So, why aren’t we doing this? Money, time, lack of a collaborative/communicative culture, and apathy toward (and acceptance of) second-rate, buggy commercial solutions that fail to address all aspects of a media object’s life-cycle as it goes from the rapid iterations in production to many different distribution channels back to relative obscurity in an archival context (until a new production pulls it out again). Without full support, no step in the process can realize the potential of the content and have the incentive to put in the hard work to ingest and describe the asset.

Posted in Repository, TODO.



Linked data and public broadcasting

Lately, I’ve been talking up linked data and the semantic web to some of my colleagues in US-based public broadcasting, which is heavily fragmented (by design) and operates on a number of levels (producers, distributors, and broadcasters at both local and national levels) with many competing interests, funding models, and missions. Linked data seems to offer a common framework to disseminate, describe, and aggregate information, beyond one-way APIs, custom solutions, and one-size-fits-all software. It seems elegant to pair the organizational models with a data model that already deals with issues of authority, distributed information, and relationships between objects. Further, the BBC have done or enabled some exciting linked-data based projects that expose the programme catalog, mash-up BBC content with user-generated content, and contextualize BBC content within the wider web in a way that makes it useful and discoverable outside of a walled garden.

Getting started seems easy enough, and at least a few of us on the inside are making some quiet progress. Glenn Clatworthy at PBS has done some very early RDF experiments with the PBS catalog, which could unlock a valuable resource with the potential to tie together program assets, extra production material, and all manner of external resources.

So, why should public broadcasting begin this process now?
- it frees and decentralizes information, making it available for new applications and better resource discovery (especially within news and public affairs programming, which has many different outlets gathering different pieces and angles on a story);
- legacy content is already being moved into new content management and asset management systems, so additional overhead is minimal;
- it can begin at any level of effort and still produce valuable results, and it can begin as unilateral collaboration, without the need for extensive oversight, project planning, or finalized use-cases.

Posted in Uncategorized.



NPR API + Solr = ?

Adapted from an email to the pubforge list.

Solr is a great application, and its out-of-the-box features still amaze me. With the newer versions, it’s incredibly easy to hook Solr up to any data source (using the Solr Data Import Handler) and just let it do its thing.

I don’t have any thoughts about communication, but one of the tenets of the code4lib community is “less talk, more code”. Public media spends a lot of time planning collaborations or trying to find funding (or worse, talking about doing those things) instead of actually doing it. I’d love to see more prototyping, iterative development, and open sharing and discussion about what new and interesting services we can provide.

On an earlier post to the list, John Tynan suggested the potential of providing a “More Like This” service for NPR News data, and in the interest of just getting something out there, I spent a little bit of time hooking everything together. To give it a pretty front-end, I also hacked in a Solr AJAX interface.
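For a sense of how little is involved, here is a sketch of the kind of request a “More Like This” lookup boils down to. It assumes a MoreLikeThis request handler has been registered at /mlt in solrconfig.xml, and the field names (title, teaser, text) are placeholders rather than the demonstrator's actual schema; the mlt.* parameter names are standard Solr parameters.

```ruby
require 'uri'

# Build a request asking Solr for stories similar to the given one.
def more_like_this_url(story_id, base: 'http://localhost:8983/solr/mlt')
  params = {
    'q'         => "id:#{story_id}", # the story to find neighbors for
    'mlt.fl'    => 'title,teaser,text', # fields compared for similarity
    'mlt.mintf' => 1, # ignore terms rarer than this within the story
    'mlt.mindf' => 2, # ignore terms found in fewer documents than this
    'rows'      => 5
  }
  "#{base}?#{URI.encode_www_form(params)}"
end

puts more_like_this_url(12345)
```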

The NPR/Solr demonstrator uses this Solr endpoint. I’ve locked down the indexes, but left everything else open so you can see how the pieces fit together. If there is enough interest in this application, I would be willing to develop it further if you provide ideas, use-cases, etc in the comments.

The source code is available from the github project npr-solr.

None of this took very long to develop. The most time-consuming part was importing from the paginated NPR API (with its absurdly low 20 records-per-request maximum).
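The import loop itself is simple. Here is a generic sketch of paging through a source 20 records at a time; it is not the code from npr-solr, and the block stands in for whatever actually calls the API (here it just slices an in-memory array so the example runs on its own).

```ruby
PAGE_SIZE = 20 # the per-request maximum mentioned above

# Collect every record from a paginated source. The block receives a
# zero-based offset and a page size and returns an array of records;
# a page shorter than page_size (including an empty one) ends the loop.
def fetch_all(page_size: PAGE_SIZE)
  records = []
  offset = 0
  loop do
    page = yield(offset, page_size)
    records.concat(page)
    break if page.size < page_size
    offset += page_size
  end
  records
end

# Stand-in for the real API: 45 fake stories served from memory.
STORIES = (1..45).map { |n| "story-#{n}" }
requests = 0
all = fetch_all do |offset, count|
  requests += 1
  STORIES[offset, count] || []
end
puts "#{all.size} records in #{requests} requests" # prints "45 records in 3 requests"
```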

Posted in Uncategorized.



Public media links for the week of 3/6

Some thoughts on curation – adding context and telling stories

Just over two years ago I wrote a post about the importance of the resource and the URL — and I still stand by what I said there: the core of a website should be the resource and its URL. And if those resources describe real world things and they are linked together in the way people think about the world then you can navigate the site by hopping from resource to resource in an intuitive fashion. But I think I missed something important in that post — the role of curation, the role of storytelling.

Tom Scott’s article is particularly interesting as public broadcasting begins to transform from distribution into conversation. It’s great to see some thinking about the interaction of user generated content and programming.


Consolidation: CPB renews its economy push for shared master control facilities

CPB has come up with another incentive and a new demonstration project in its long and sporadic campaign for the cost savings of shared technical facilities and staff. Under a new rule adopted by the CPB Board, public TV stations won’t be eligible for master control equipment funding unless they share the facility with one or more other stations, according to Mark Erstling, senior v.p. for system development and media strategy.

I’m probably biased because I started in public media as a master control op, but consolidation worries me. One of the great things about public media has been its incredible local efforts. While it seems like public media drifted away from that, local master control should encourage better programming for local communities.


This is a little older, but the announced cuts to BBC Online (and the responses to it) are interesting, and I’d love to see a discussion in US public media about digital vision going forward:

The BBC: still no digital vision

I’ve been meaning for a while to write about a growing sense of frustration with the BBC (and, for that matter Channel 4) for their continuing failure to establish a strategy repositioning them in a way that makes sense for a public service media organisation in the emerging digital ecology. I drafted this before Mark Thompson’s recent announcement of cuts in BBC Online; the decisions he has announced recently only confirm a view that the BBC has yet to find a direction in the new media landscape.

What is the BBC?

Well, so what? What’s so special about the BBC that we should have a right to public money? Well, we have no intrinsic right to this money in the same way that, say, the police and fire departments don’t have an intrinsic right to public money. However, like any public good, society cooperates to share certain resources for public gain.

Posted in Uncategorized.


Fedora and Microservices

In this post, I want to discuss repository architecture philosophies. Although I will focus primarily on Fedora and California Digital Library microservices, there are some generalizations one can pull out of this. It would also be interesting to pull in some very different repository models, like iRODS or a triple-store-backed system, but that’s outside of my expertise.

The basics

This is not a section I really want to write, but I don’t know of a high-level answer to “when we say repository, this is what we mean”. I spent a little time looking around for a summary, but more often than not I found more questions (or, perhaps more useful yet inappropriate for my purposes, technology-based answers rather than use-driven), so I’ve taken a stab at addressing what I believe are some key issues:

Repositories are a collection of services, with well-defined interfaces, for storing and managing data (both content and metadata) in a format-neutral, display-independent manner. Repositories can be used as preservation repositories, as access repositories, as centralized aggregations of far-flung data, etc and operate on any scale for any audience. Furthermore, there are existing standards and agreements about what it means to be a certain type of repository (TDR, OAIS, etc). All of these repositories, however, share some common services, whether implemented as software, external processes, or manual processes.

Some essential repository services are:

  • Identifier services, which may include assignment + registration
  • Storage services (although the content stored may be only pointers to the “actual” content)
  • Content identification, matching identifiers to content items
  • Ingest workflows
  • Access mechanisms

Without these services in place, a repository system would face some difficult obstacles in creating and providing value-added services. Repositories may provide multiple flavors of these services, some of which may be defined in generally accepted standards, models, and specifications.

Other basic services, which operate on top of the above services and are fairly common in most well-developed repository frameworks, include:

  • Dissemination services, to transform repository data into other forms + formats
  • Authorization services

More advanced services may include:

  • preservation services, including checksum (generation + verification), file format migration, support for models like LOCKSS
  • relationship services, using an RDF triplestore or similar, offering SPARQL endpoints, inferencing, etc
  • discovery services, using Lucene/Solr/etc, to provide relevancy, optimized user experience, drill-down faceting

These more advanced services are likely separate applications in the repository ecosystem and are generally useful utilities independent of any repository system. Repositories generally integrate with these external applications in a modular, mix-and-match manner using well-defined interfaces.
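As one small example of such a utility, checksum generation and verification needs nothing repository-specific at all. This is a minimal sketch using SHA-256 from Ruby's standard library; the method names are mine, and the choice of SHA-256 is an assumption rather than something any particular repository mandates.

```ruby
require 'digest/sha2'
require 'tempfile'

# Generate a checksum for the file at `path`.
def checksum(path)
  Digest::SHA256.file(path).hexdigest
end

# Verify a file against the checksum recorded when it was ingested.
def fixity_ok?(path, recorded)
  checksum(path) == recorded
end

Tempfile.create('asset') do |file|
  file.write('original content')
  file.flush
  recorded = checksum(file.path)
  puts fixity_ok?(file.path, recorded) # file unchanged since recording

  file.write(' plus silent corruption')
  file.flush
  puts fixity_ok?(file.path, recorded) # file no longer matches
end
```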

Fedora

One approach to repository services is the “repository-in-a-box” model, where you can install and configure a base set of services provided by a single application. Within this group of services, Fedora provides a very basic implementation of the core repository services (vs a full-stack application like DSpace, which provides production-ready user interfaces). Fedora bills itself as a Flexible, Extensible Digital Object Repository Architecture.

  • Identifier services, through PIDGen which provides sequential identifiers per-namespace
  • Content identification, mapping HTTP URIs to dereferenceable URIs to files
  • REST + SOAP APIs for Ingest + Delivery
  • Dissemination services using WSDL
  • Authorization using XACML (and authentication using a number of plugins)
  • Integrates with the Mulgara triplestore and a Lucene index (by default)
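The identifier service in the list above is the easiest of these to picture. The sketch below shows the general idea of sequential identifiers per namespace; it is a toy and not how PIDGen is implemented (a real service has to persist its counters and cope with concurrent requests), and the class name is hypothetical.

```ruby
# Mint sequential identifiers per namespace: "demo:1", "demo:2", ...
class SequentialMinter
  def initialize
    @counters = Hash.new(0) # namespace => last number handed out
  end

  def mint(namespace)
    @counters[namespace] += 1
    "#{namespace}:#{@counters[namespace]}"
  end
end

minter = SequentialMinter.new
puts minter.mint('demo')
puts minter.mint('demo')
puts minter.mint('wgbh')
```

Each namespace counts independently, so two collections can mint identifiers without coordinating with each other.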

Fedora provides many opportunities for customization and enhancements through custom development:

As services go beyond the basic, common applications present in institutional repositories, enhanced repository services require custom development or supplemental services outside of the repository services. For most, this includes integration with a more advanced search provider (like Solr). At some point, additional services can blur the lines between the repository services and front-end user interfaces (which have to respond to local customization to meet user needs).

Repository-independent services, or third-party services, require some wrapper to make them interoperable with the Fedora APIs, which makes integration with existing technology more difficult. Even Duraspace’s Duracloud offering is (currently) built as separate services with some possibility of storage-level integration. Preservation support services will bypass the repository APIs and provide those services against the file system instead.

Considering the services Fedora doesn’t provide or the obstacles Fedora creates in integration, many ask why they should start using Fedora anyway. The strongest response to this, I believe, is that it provides a common structure to basic repository services, while at the same time not creating major obstacles to future expansion or migration outside Fedora. Out of the box, Fedora provides a set of “training wheels” (ht Mike Giarlo <http://lackoftalent.org/michael/blog/>) for repository services development that can be removed when unnecessary, but in the meantime offers structure for the creation of new repositories and support for repository services as needed.

CDL Microservices

Another approach to repository services is “microservices”, like those designed by the California Digital Library (CDL), which provide standards and specifications for individual repository services. Together these form a structure for standardized, mix-and-match repository services that can integrate, interoperate and take advantage of existing technology independent of a repository application like Fedora. This, conceivably, allows all domain developers to take advantage of these common projects without using a specific technology. CDL provides microservices specifications for:

  • identifier assignment + registration, using NOID, which can act as a CLI tool or a CGI service
  • file-system structures, using the Pairtree convention
  • data exchange and verification, using BagIt
  • access standards, using the ARK URL format
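
The Pairtree convention is simple enough to sketch in a few lines of Python. This is my own minimal reading of the specification's identifier-to-path mapping (hex-encode the troublesome characters, swap three common ones, then split into two-character directory names), not CDL's reference implementation. The function names are hypothetical.

```python
# Visible ASCII characters that get hex-encoded as ^hh.
HEX_ENCODED = set('"*+,<=>?\\^|')
# Single-character substitutions applied to the remaining characters.
SINGLE_CHAR = {"/": "=", ":": "+", ".": ","}


def clean_identifier(identifier):
    """Make an identifier file-system safe following the Pairtree cleaning rules."""
    cleaned = []
    for octet in identifier.encode("utf-8"):
        char = chr(octet)
        # Octets outside visible ASCII (0x21-0x7e) are hex-encoded too.
        if octet < 0x21 or octet > 0x7E or char in HEX_ENCODED:
            cleaned.append(f"^{octet:02x}")
        else:
            cleaned.append(SINGLE_CHAR.get(char, char))
    return "".join(cleaned)


def pairtree_path(identifier, root="pairtree_root"):
    """Split the cleaned identifier into two-character directory segments."""
    cleaned = clean_identifier(identifier)
    shorties = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return "/".join([root, *shorties])


print(pairtree_path("ark:/13030/xt12t3"))
# pairtree_root/ar/k+/=1/30/30/=x/t1/2t/3
```

Because the mapping is a pure function of the identifier, any tool that knows the convention can find an object's directory without asking a repository application or consulting a database.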

The standards are developed in line with the “UNIX philosophy”:

Write programs that do one thing and do it well. Write programs to work together. — Doug McIlroy

These basic services can be organized and crafted using the existing capabilities in web servers, file systems, etc. More advanced services can act within this structure, using individual standards when needed. While significant development and customization may be required to get a microservices architecture to a usable state, the end result is more flexible and targeted to an institution’s needs.

Flexing Fedora

These two approaches are certainly not incompatible, and Fedora is quite capable of using some of these micro-services standards under the hood (replacing custom developed approaches to these basic services). By taking this approach, Fedora could act as a management application on top of generic repository data, allow both Fedora-based and microservices-based services to operate on the data, and make it easier to reach around Fedora when necessary (or, go so far as to remove it entirely).

What follows is a short summary of on-going work in this area, which mostly focuses on removing the Fedora-centric definitions of /how/ or /where/ services act. The majority of these ideas build on new developments and best practices (since Fedora was initially created) in the repository community as a result of increased adoption or awareness of issues. Where available, I’ve included links to projects in-the-works.

Some of this work is quite easy to do:

Other projects are more involved, and require more work than just creating new modules for Fedora:

More advanced microservices integration is highly involved and would require a major re-work of the application:

  • Two-way messaging queues (or file alteration monitors, or database update hooks) to allow Fedora to receive updates
  • decreased reliance on self-generated registries; I think the situation is getting better, but I’m not sure it’s fully there
  • pluggable storage modules with intelligent filtering, routing, multiplexing, and rules mechanisms — the Akubra project may be doing (part of?) this <http://www.fedora-commons.org/confluence/display/AKUBRA/Akubra+Project>
  • workflow support hooks, to allow integration and automation of workflow tools (possibly a result of Hydra?)

Posted in Repository.
