MALLET topic analysis of JCDL + Open Video tweets

I’m working towards some interesting visualizations of the Twitter streams from a number of conferences (starting with JCDL and Open Video this last week). I’m using Judith Bush’s very cool gawk script to parse the raw Atom files. My first step was to get topics for the corpus as a whole:

/Applications/mallet/bin/mallet train-topics --input data.mallet --num-topics 10 --output-state topic-stat.gz --output-doc-topics doc-topics --output-topic-keys doc-keys --num-iterations 2000 --optimize-interval 2500

JCDL

0	5	http bit ly org interesting marshall analysis wolf week existing pizza people
1	5	jcdl books data don works problem target foundation facilitate creating
2	5	jcdl libraries evaluation future discussion day multiple public lots univ
3	5	jcdl paper lightweight music back issues funny build dog
4	5	session user talk search talking papers documents great collection type tatted
5	5	conference library good mentors content students focus run building pints
6	5	jcdlgoogle www law participation dl dchud online nice bats duck
7	5	jcdl austin poster google tomorrow small librarian tonight nice
8	5	jcdl digital tags question collections social wikipedia war
9	5	workshop people time quality study alan live archive idea lots

Open Video

0	5	video conference open source net making metadata mozilla developers adobe brokep learned system presentation long openvideo ly app msf
1	5	openvideo ovc tv time gd week vlc html stuff folks nyc platform google meet checking slides startrek kdnlf ll
2	5	media goodman amy watch good great mainstream idea war days im tr flash tpb change put films class devine
3	5	openvideo youtube rt videos session world xenijardin system room art doesn show iran channel film audio totally activism presentation
4	5	openvideo content pirate public live sunde peter cc jardin project creative keynote speaker ogg sweden twitpic licensed seminar fisl
5	5	openvideo people internet talk access day tinyurl conf vid online storytelling awesome working hack digital miro final evolution similar
6	5	openvideo de free en la xeni el years amazing copyright film blog education works closed msurman tk iranian tagged
7	5	openvideo amp ted don great work culture fair back editing question technology site cable id lecture wiki form youtube
8	5	http bit ly check interviews wrap royblumenthal creativecommons based casts ll website footage archives ogg rad blogposts
9	5	openvideo org www openvideoconference make http web watching foss roflmemes put hope sessions online cool launches marketing rest rt
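For the follow-up analysis, the topic listings above have to be read back into a program. The sketch below is in Python and assumes each line is tab-separated: a topic index, a per-topic weight, then the top words. The `Topic` type and the `parse_topic_keys` function are names made up for this illustration; they are not part of MALLET.

```python
from typing import List, NamedTuple


class Topic(NamedTuple):
    index: int
    weight: float
    words: List[str]


def parse_topic_keys(text: str) -> List[Topic]:
    """Parse topic-key lines of the form: index <TAB> weight <TAB> space-separated words."""
    topics = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines between listings
        index, weight, words = line.split("\t", 2)
        topics.append(Topic(int(index), float(weight), words.split()))
    return topics


# Two lines shortened from the JCDL listing above.
sample = "0\t5\thttp bit ly org interesting\n1\t5\tjcdl books data don works\n"
topics = parse_topic_keys(sample)
```

From here, each `Topic.words` list can be fed to whatever visualization comes next.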

Future work will include temporal analysis and “speaker” analysis.

Posted in Uncategorized.


The Problem With Zend Framework

So there I was, well into my second hour of attempting to implement a simple authentication scheme the “correct” way using the Zend Framework, and I had not written any code yet. After looking at the Zend_Auth documentation I felt pretty good about it. It seemed that all I needed to do was pick the right adapter to interact with a database, and one was conveniently listed: Zend_Auth_Adapter_DbTable. Now, I didn’t want to overlook security entirely, so I used an MD5 hash, which was sufficient for my purposes. Using a parameter in the constructor, I can specify that the adapter should use the SQL function MD5() to apply the hash in a transparent manner. But then it dawned on me (the first in a sequence of revelations) that the entire supposed beauty and simplicity of this abstract interface, which claims to allow for modularity and portability, increase code reuse and avoid copy/paste coding, and allow me to swap underlying schemes without rewriting my code, is a lie. If I’m required to insert specific, database-centric code directly into the topmost layer of my application code, then the portability and flexibility are immediately shattered, as I must update all of my code if I subsequently choose to switch to a database where the MD5 function is not “MD5()” but rather “MD5HASH().”
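One way around the portability complaint is to compute the hash in application code and hand the database a plain value, so no vendor-specific SQL function appears in the query. The sketch below is in Python rather than PHP to stay framework-free. The `users`, `user` and `pass` names are placeholders, and `build_login_query` is a hypothetical helper, not anything Zend provides. MD5 appears only because it is the hash this post uses.

```python
import hashlib


def md5_hex(password: str) -> str:
    """Hash in application code, so the SQL never names a database-specific MD5 function."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()


def build_login_query(username: str, password: str):
    """Return a parameterized query plus its bound values (hypothetical helper).

    The table and column names are placeholders for illustration.
    """
    sql = "SELECT 1 FROM users WHERE user = ? AND pass = ?"
    return sql, (username, md5_hex(password))


sql, params = build_login_query("alice", "password")
```

Switching databases then leaves this code untouched, because the only SQL involved is a plain comparison.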

In my brief experience attempting to create a new website with it, the Zend Framework has been a nightmare to use. I’m not a professional, and it has been a few years since I’ve done any serious programming with PHP, but that shouldn’t impact my ability to effectively use something that should reduce the headache associated with solving common problems. I’ve spent a lot of time, over the past two years, doing microcontroller programming in C, and it has given me a new perspective on some things, which complements my intuition about this. There is a fundamental flaw with the design and implementation approach used in Java, there is a fundamental flaw with the design and implementation approach used with the Zend Framework, and there is a fundamental flaw with the concept of design patterns as a go-to solution for any non-trivial problem. I’ll address the other two later, right now I’m just going after ZF, because it happens to be the cause of my concern tonight.

ZF suffers from acute over-engineering, with not enough practical implementation to shape the structure of the design decisions. Like most libraries that I hate using, it seems straightforward in the unrealistic examples posted with the documentation, but when one goes to implement something real, the result is either not as simple or self-contradicting. In this case, let’s continue further with my problem: using Zend_Auth along with a MySQL database to implement user authentication. In order to construct an instance of Zend_Auth_Adapter_DbTable, I need a subclass of Zend_Db_Adapter_Abstract to pass to the constructor for Zend_Auth_Adapter_DbTable, so that it has a connection through which to send queries. Ok, that seems reasonable. But where do I define this instance? Most likely, I’ll use a database connection throughout the application, so the correct place is in my bootstrap class. Now, here’s the first problem that actually using ZF in a real application would uncover: I can’t automatically have Zend_Config pick up my database parameters and throw them into Zend_Db if necessary. Instead, I have to create my own instance of Zend_Config or specify the parameters myself in the source. In the spirit of avoiding more bloated code and wasted memory space, I’ll specify the parameters myself. Also, I’ll “do the right thing” and register the database as a resource available to the application through the bootstrap class, in order to avoid polluting the global space with an evil global variable:

protected function _initDatabase() {
	$db = Zend_Db::factory('Mysqli', array(
				'host' => 'localhost',
				'username' => 'article',
				'password' => 'example',
				'dbname' => 'article_example'));
	return $db;
}

Now, by calling getResource('database') on the Bootstrap class instance, I can access my database singleton. But there’s another problem: the bootstrap class isn’t a globally accessible variable (although the end programmer could make it one when creating it during application initialization), so we have to find a way to access it in order to access the database. Inside a controller, getInvokeArg('bootstrap') does the job. Now that that is taken care of, let’s get back to using the authentication adapter. I can now use my Auth Adapter in a manner similar to this, within the login action for the main controller for the site:

public function loginAction() {
	$bootstrap = $this->getInvokeArg('bootstrap');
	$db = $bootstrap->getResource('database');
	$auth_adapter = new Zend_Auth_Adapter_DbTable($db, 'users', 'user', 'pass', 'MD5(?)');
	$auth_adapter->setIdentity($_POST['username'])->setCredential($_POST['password']);
	$result = $auth_adapter->authenticate();
	if (!$result->isValid()) {
		$this->content = 'Login failed.';
		return;
	}
	// Authenticated user code here ...
}

Note that I’m not doing any input validation or anything fancy and we’re already at 5 lines of code to call a single class method that should authenticate my user, along with 5 programmer-initiated function calls (line 5 has two on it) just to set up the necessary glue code. Then there’s another function call just to read off whether the authentication was a success, instead of returning a straightforward boolean value, but I’ll afford them that one call, because most likely a better solution to this problem would also involve one authentication function call. In reality, this would be wrapped around Zend_Form, further complicating matters, but that would cloud my example.

This is totally unacceptable. Some of you may know, however, that I could use Zend_Registry to track the Zend_Db_Adapter instance, so let me address that really quickly before explaining why the above code is not what I, as a programmer, want to see or write, and why nobody should be willing to accept it. Zend_Registry is actually a pretty neat class, and I wouldn’t mind using it on a regular basis, or using a concept similar to it. However, it is effectively equivalent to using global variables, something that most programmers who scoff at the notion of global variables are unwilling to admit to themselves. It has the one advantage of being able to partition the global namespace and have multiple values stored for a particular key. However, using any Zend_Registry instance other than the currently selected static instance would require either recursively storing Zend_Registry objects inside the global Zend_Registry (negating the benefit of moving the database adapter to the Registry, as two function calls are required to access it again), or additional global variables to store multiple registry instances, which is self-defeating: if global objects were allowed, Zend_Registry wouldn’t be necessary at all.

Back to the matter at hand: why is the above code bad? The entire purpose of frameworks and libraries like ZF is to prevent programmers from repeating code…but, at the same time, we’ve already found a situation where I am forced to repeat code every single time that I want to use it. Namely, in each method where I want to make a database call, or in every class I define which stores a pointer to the database object, I have to obtain a pointer to the Zend_Db_Adapter_Abstract subclass. If I’m in the controller object, this is as simple as copying and pasting 2 lines of code, and if I’m anywhere else, I simply require that a reference to a subclass of Zend_Db_Adapter_Abstract be passed in as a parameter. Now, many people who work in languages where this doesn’t really matter may not consider the stack as a resource but it is ultimately still a fundamental feature of modern programming languages, and requiring that I pass around a pointer to a singleton which should be a global variable or a statically accessible variable (through Zend_Db::getInstance() or similar) is a waste of space and processing time (one PHP opcode when calling the function, another opcode when the function is entered, to load the data from the stack, plus the memory used to pass around a copy of a copy of a copy … of a pointer). So here I am, as a programmer, having been sucked in by the promises of never having to copy and paste code again and increasing my productivity by reusing well-engineered solutions to common problems, and all I see is that an enormous amount of overhead has been introduced to my program and I am still copying and pasting code (or just rewriting it over and over), except now it is glue code instead of feature-related code.
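The “statically accessible” alternative mentioned above can be sketched in a few lines. This is a framework-neutral illustration in Python, not ZF code: the `Database` class and its `configure` and `get_instance` methods are invented names, and no real connection is opened.

```python
class Database:
    """Hypothetical statically accessible connection holder (illustration only)."""

    _instance = None

    def __init__(self, host, username, dbname):
        # A real class would open a connection here; the sketch only stores the settings.
        self.host = host
        self.username = username
        self.dbname = dbname

    @classmethod
    def configure(cls, **params):
        """Called once during application start-up."""
        cls._instance = cls(**params)

    @classmethod
    def get_instance(cls):
        """Called wherever a query is needed; nothing has to be passed around."""
        if cls._instance is None:
            raise RuntimeError("Database.configure() has not been called")
        return cls._instance


Database.configure(host="localhost", username="article", dbname="article_example")
db = Database.get_instance()
```

The trade-off is the usual one for globals: less glue code at every call site, at the cost of a hidden dependency.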

To that end, I have a message for the designers of the Zend Framework, and of the libraries everywhere that I often hate using. First, thank you for working on this. I’m serious. If it weren’t for people like you, we wouldn’t have libraries at all, and despite the fact that I’m criticizing what you’re doing right now, you have done a lot of things correctly, even in the libraries I hate, and even if just that one feature is right, that makes it worth something. Second, you’re going about your libraries the wrong way, and this can be fixed. I see the mess of design patterns and abstraction layers as a simple manifestation of Creeping Featurism: not as a specific view towards an unrealistic cadre of features, but rather as an over-optimistic desire to remain open to completely redesigning the application at any stage in development, an ability that must be sacrificed to produce quality code that is not bloated. To help counter this, here are some principles to consider when building a library, some of which are often respected, and others which are consciously violated:

  1. The goal is to make the programmer’s life easier in every way.
  2. Unless providing complex functionality, the library itself should not be complicated internally. Ideally it would primarily consist of canned solutions to common problems, solved in straightforward ways.
  3. The library should minimize useless abstraction and provide concrete implementations. Useless abstraction leads to bloated code mixing generic solutions into concrete problems that have precisely one answer.
  4. An absolute minimum of code should be required to deploy a basic “do I like it” version of the library/feature in question.
  5. The pre-packaged code should not introduce large amounts of overhead unless it simultaneously introduces a lot of commonly used and readily usable functionality.
  6. The simple mode of operation should be the default and it should be fast. If more complicated solutions are required, they should work correctly, but they should not preclude the efficiency of the simple case.
  7. Everything should come with reasonable defaults, except where defaults are unreasonable (e.g. database username/password) or dangerous.
  8. Corollary: Subclasses or static configuration data should be encouraged to change default behavior.
  9. The library (even a loosely coupled one) should have basic core components that are used internally and automatically, to provide a more fluent interface. These core components should be required.
  10. Unacceptably implemented or undesirable components that are not part of the core should not be required for use. Effectively, libraries should be as loosely coupled as possible without sacrificing reasonable internal interoperability.

Some of these are self-evident. For instance, #1. I decided to throw a freebie on there so that most libraries would have at least one positive comment for them. And the ones that aren’t designed with that basic idea in mind aren’t intended for production use; they’re intended humorously or as an example of what is possible if one really tries. For the rest of the list, I’ll explain why I think so and discuss how ZF meets or does not meet these criteria.

#2. Occam’s Razor. KISS. This is not a novel idea. It doesn’t make sense to apply a complicated solution to a simple problem. Problems like user authentication, database abstraction, and other common web programming scenarios are well understood, and almost everyone has written a personal solution to them. A framework or library should provide a familiar solution, plugging up common holes and pitfalls to ensure reliability, but it should not burden the implementation with features that are not reasonable. This ties in to #3, because the inclusion of or provision for “unreasonable” features often comes in the form of what I call “useless abstraction,” when an abstraction is artificially introduced in order to create a perceived increase in potential functionality. Zend_Auth is an excellent example of this. Look again at the example code I posted above, on using Zend_Auth_Adapter_DbTable to authenticate a user. For all this code, and all of the 480 lines of code in the class definition, this class does not: persistently store the user identity (this must be done manually) or automatically retrieve authorization information. It doesn’t even have hooks to integrate with the authorization mechanisms. By contrast, I can create my own User class which handles authentication; persistent, session-based storage; and authorization all as a single, easy-to-use solution. It may lack a certain “elegance” that the ZF solution purports to have, but it is more readily accessible, is more cleanly integrated with the rest of my code, and does not waste a lot of time dealing with abstraction mechanisms that should have been omitted.

This brings us to #4. ZF loses here again, most specifically if someone is trying to deploy an application using the MVC implementation it contains. Why? Because the amount of nonsense code and setup involved in simply trying to use the most basic of controller implementations (an empty one, by the way) is so daunting that Zend_Tool was created to automate the process of creating the behemoth directory structure and stubbing out the required files. However, here, for the first time, ZF also does a few things correctly. A few of the components, like Zend_Acl, Zend_Registry, and Zend_Session, require only a short development time before they are usable. The necessity and utility of Zend_Session are dubious at best, though, which brings us to #5, the problem of overhead in library code.

A class like Zend_Session does nothing but wrap the already excellent set of primitives supporting session data in PHP. There might be an argument for only using OO code in an application that is designed with OO in mind, but if you believe it, you’re probably wrong. PHP is an interpreted language. That means it is slow. And, to combat that, PHP ships with a really strong suite of functions and language features implemented in C. If you pull up the code for Zend_Session’s start() method, you will see that it consists of a bunch of consistency checks followed by starting the session using the session_start() primitive function. It does some fancy error handler juggling to ensure that the appropriate messages are generated, but ultimately it provides no extra functionality, yet it has turned a single line consisting of a native function call into a static method call implemented in PHP, which includes the native function call. Beyond this increased overhead associated with starting the session, Zend_Session provides no benefits. Through the use of Zend_Session_Namespace objects it supposedly partitions the $_SESSION array to avoid conflicts. The problem, however, is that one could simply use $_SESSION['namespace']['variable'] and achieve exactly the same result, but with significantly less overhead in terms of both execution time and memory usage.
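The nested-array partitioning described above needs no wrapper class. In this Python sketch a plain dictionary stands in for PHP’s `$_SESSION`, and `namespace` is a made-up helper name that mirrors `$_SESSION['namespace']['variable']`.

```python
session = {}  # stand-in for PHP's $_SESSION superglobal


def namespace(store, name):
    """Return the sub-dictionary for one namespace, creating it on first use."""
    return store.setdefault(name, {})


auth = namespace(session, "auth")
auth["user"] = "alice"

cart = namespace(session, "cart")
cart["user"] = "guest-42"  # same key, different namespace, no conflict
```

Both values live side by side in `session` under their own namespace keys.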

This overhead problem is ubiquitous throughout the ZF, and not just a problem with Zend_Session, but it’s easiest to spot there. Most of the time, the overhead is a result of a combination of over-zealous “architect-driven” design and the desire to over-abstract things on the assumption that it could potentially be useful, with no clear indication as to how, when, or why. In discussing problems #5 and #6 (the simple case should be default), it is easiest to compare performance to microprocessors. Prior to the creation of RISC microprocessors in the early ’80s, mainstream processors tended to have monumental instruction sets, implementing many complex operations as single instructions. This is comparable to the design philosophy used by ZF and other web-based libraries: they try to be everything to everyone, and as a result people use only a small percentage of their available functionality in any particular application, but are forced to suffer the consequences of the overhead incurred by accounting for the unused possibilities. By contrast, RISC designs simplified every aspect of the processor, removing complicated instructions because they were easily emulated by a short sequence of simpler ones, which directly led to increased performance. The evidence was so significant that today most high-performance processors (even those supporting x86) use a RISC-like core internally, with additional decoding stages to automatically convert the more complicated instructions into several simpler ones. An important principle in this design is that the simple case (the one implemented by RISC) should be very fast. The complicated case, which happens far less frequently, should be correctly implemented, but it should not interfere with the design of the simple case.

#7 and #8 are again complaints I have about ZF, stemming mostly from using Zend_Auth. In order to create an instance of Zend_Auth_Adapter_DbTable, I have to specify all of the information about the database as arguments to the constructor (or I can set that information later with function calls). Why is the information not hard coded into the class itself or a configuration file? Hard coding things is bad you might say, but it can easily be used very effectively: simply have the parent class use a parameterized approach, and dictate that a subclass must be created which stores the specific configuration data. Then it is possible to simply create an instance of MyAuthClass which extends Zend_Auth_Adapter_DbTable and supplies the correct information so that I, as the programmer, don’t have to remember what the arguments are, everywhere that I need to use Zend_Auth. Granted, this is possible with the current implementation of Zend Framework (which is a very good thing), but it is not encouraged by the online documentation, which means that most users would not do it. There is a good reason for this: they don’t want to encourage you to create more glue code—that’s what a subclass that declares three parameters and does nothing else is—because it makes them look bad if they need more and more support code just to make the basic features work.
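The “parameterized parent, configured subclass” idea can be sketched as follows. This is a Python illustration, not the Zend_Auth API: `DbTableAuth` and `MyAuth` are invented names, and a list of dictionaries stands in for the database table.

```python
class DbTableAuth:
    """Parameterized parent: knows how to check credentials, not where they live."""

    table = None
    identity_column = None
    credential_column = None

    def __init__(self, rows):
        missing = [name for name in ("table", "identity_column", "credential_column")
                   if getattr(self, name) is None]
        if missing:
            raise TypeError("subclass must define: " + ", ".join(missing))
        self.rows = rows  # in-memory stand-in for the rows of self.table

    def authenticate(self, identity, credential):
        return any(row[self.identity_column] == identity
                   and row[self.credential_column] == credential
                   for row in self.rows)


class MyAuth(DbTableAuth):
    """The application's one place that remembers table and column names."""

    table = "users"
    identity_column = "user"
    credential_column = "pass"


auth = MyAuth([{"user": "alice", "pass": "s3cret"}])
ok = auth.authenticate("alice", "s3cret")
```

Every call site then writes `MyAuth(...)` and never repeats the three names.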

An alternative solution, which is much more elegant, to the problem above, would be addressed by paying attention to principle #9: increased coupling for core library components. If the Zend_Config class was used implicitly (ultimately the main application’s configuration file should be the “global” configuration, and additional instances can be created for smaller .ini files as necessary), then it would be possible to specify more configuration information (which tables to use, how passwords and users are stored) in one place, which makes it easier to reconfigure an application without having to sort through the functionality, which is not changing. Before I go on to #10, I have another important comment about the combination of #7 (sensible defaults), #1 (ease of use) and #4 (easy to deploy). The automatically generated index.php which drives the entire application is a disaster. In order for configuration file processing to happen correctly, the environment must be set up. The automatically generated code uses this line:

defined('APPLICATION_ENV') || define('APPLICATION_ENV', (getenv('APPLICATION_ENV') ? getenv('APPLICATION_ENV') : 'production'));

Prior to automatic generation, this code had to be written out by hand. It is completely unacceptable to require someone to write this line of code when it is clearly the same from program to program, and can therefore be included within one of the internal classes as a sensible default (because defining a value for APPLICATION_ENV will overwrite the default), thereby simplifying the end programmer’s interactions with the library. Perhaps I sound like I’m overly lazy, but for the rapid application development paradigm being encouraged by next generation web development, it is imperative to eliminate all glue code and it is imperative to simplify all of the decisions that the programmer must make to get a basic application online, because his efforts should be focused on the application itself, not the library on which he is building it.
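Folding that default into the library could look roughly like this. The sketch is in Python and `application_env` is a hypothetical library-internal helper; only the `APPLICATION_ENV` name and the `production` fallback come from the generated line above.

```python
import os

DEFAULT_ENV = "production"


def application_env(environ=None):
    """Return the configured environment name, falling back to a sensible default.

    An unset or empty APPLICATION_ENV yields the default, which matches the
    behaviour of the generated PHP ternary.
    """
    environ = os.environ if environ is None else environ
    return environ.get("APPLICATION_ENV") or DEFAULT_ENV


env = application_env({"APPLICATION_ENV": "development"})
```

The application programmer only sets the variable when the default is wrong, and writes no glue code otherwise.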

At last, we come to #10: libraries should be loosely coupled. This is something that ZF actually does a very good job of. Almost all of the components can be used without mandating the use of anything else. If every component relied on every other component being properly configured, and configuration continued to be the nightmare that it is (no reasonable defaults plus a multitude of nested classes and abstraction layers), it would be completely impossible to use. Loose coupling also decreases overhead while allowing for increases in functionality and modularity, because I can use Zend_Acl by itself to implement the authorization layer of my application, and then if a new version of Zend_Auth is released that I am satisfied with, I can easily replace my custom User class, without any major changes to the existing code using Zend_Acl (I merely replace the appropriate variables pertaining to the user account name).

Finally, a quick summary of the important things I want anyone who is thinking of designing an application framework or library to come away with. The design of a library should be driven by practical use. Don’t sit down and draw out everything you want your library to do, or you will be stuck with a mess of abstraction mechanisms that could have been omitted. Of course, some abstraction is very important, and foresight is necessary, but you should choose the lowest common denominator of all of the features that you want as the basis for abstraction. Do not bend any definitions or force anything simply to increase the utility of an abstract class, because when (if) you ever define the necessary concrete classes to implement real functionality, you will find that what was originally a little bit of twisting words in the definition will become a special case that must be explicitly supported with multiple versions of the code separated by if constructs. This causes everyone to pay the price in terms of memory footprint (all of this code must be loaded into memory, even if it isn’t used) and extra overhead for checking for the special cases. Focus on being lightweight but complete. Do not go out of your way to do anything that isn’t immediately useful to most people. If you satisfy most people most of the time with your features, then the additional ones can easily be added through external mechanisms that wrap existing behavior. And, finally, be sure to implement projects as large as you can with your library, perhaps in parallel with the design phase, because it will give you a much better idea of what works, what doesn’t, and what will annoy people before you’ve cemented it into the library.

Posted in PHP.


TMP: Master Plan

I spent a lot of time debating which ISA to base this on. MIPS has the advantages of being easy to explain, being a good example of RISC, and I’ve implemented large sections of it before for class. However, that last bit makes me wary, because it means that more students in a similar situation will need to implement a MIPS-like processor. I don’t want to just give away final projects here, so anything resembling MIPS enough to allow people to literally copy large sections of the code I release is undesirable. The other option that comes to mind is something based on ARM’s ISA. ARM is well-established and is a very common platform for embedded systems, but at the same time. So I’ll use a handful of the really interesting features available in ARM’s ISA, such as conditional execution of instructions. Finally, it’s never fun to blindly implement someone else’s ISA, it’s much more interesting to create your own design, so ultimately I will draw on both MIPS and ARM, as well as my own intuition, and create something that incorporates what I think are important features from both, while at the same time lending itself well to explanation of key features.

The Master Plan, then, loosely resembles the following, with many stages happening in parallel:

  • Design Instruction Set Architecture (ISA)
    • Should be Load-Store and adhere to basic RISC principles.
    • Should make it easy to explain and demonstrate basic design principles, while allowing for powerful features and extensibility.
  • Create a Simulator
    • Written in C
    • Will require an additional, simple assembler to convert instruction mnemonics to a binary execution image
    • Initially command line based (or ncurses frontend), with the possibility of adding a GUI if it seems worthwhile
  • Design ALU
    • This part, and everything leading up to the actual processor implementation connecting everything together, can happen in parallel with simulator programming
  • Create Register File
  • Miscellaneous Processor Structures
    • Memory interface (initially, an ideal memory, so this will just decode/translate requests)
    • Forwarding/Hazard Detection and Handling
    • Pipeline Registers
  • Assemble the Microprocessor
    • This includes pipelining
    • Support for multiply and divide instructions
      • Requires significant pipeline extensions and an additional parallel processing unit
    • Support for interrupts
      • This is necessary for implementing any reasonable, albeit basic, operating system
      • Also necessary for I/O, which is quite reasonable for anyone using integrated FPGA development boards (which often have a lot of I/O and the necessary support circuitry already implemented)
      • Not covered/explained often in microprocessor courses, yet integral to the functionality of all modern computers (interrupts are used to make system calls in both Windows and Linux).
    • A basic OS
      • Flat memory model, without memory protection (dangerous, but simple)
      • Cooperative Multithreading (because it’s easy to implement)
      • Implements basic system calls to provide common sets of functionality for user programs

With this list of central goals complete, a few additional features seem quite reasonable and would be important in understanding the functionality of a much more advanced processor. Therefore, these are secondary targets, to be completed once the basic design is done:

  • Instruction and Data Cache
  • Branch Predictor

I’ve even got some genuinely crazy ideas, but I’ll keep them to myself for now (most fall into the realm of “not existent on 99% of available commercial microprocessors/microcontrollers”). If this can be established as a platform on which I can test less mainstream concepts and designs, then this will be a success for me.

All that said, I may have missed a few features that won’t be apparent until the time comes, but I think this is a good starting point. Next time (hopefully in a week), the ISA will be unveiled, or at least the first revision of it. Design flaws will always become apparent through actually working with it (something a lot of software library designers haven’t realized, but I digress), and so things will be amended as time goes on.
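To give a feel for the simulator stage and for ARM-style conditional execution, here is a toy interpreter. It is sketched in Python for brevity even though the plan calls for C. The instruction names, the tuple encoding and the `AL`/`EQ`/`NE` condition codes are all invented for this illustration and say nothing about the ISA that will actually be unveiled.

```python
MASK32 = 0xFFFFFFFF  # keep register values within 32 bits


def run(program, memory, num_regs=8):
    """Execute a straight-line program on a toy load-store machine.

    Each instruction is (condition, opcode, operands...). Only LOAD and STORE
    touch memory; arithmetic works on registers. An instruction whose
    condition does not match the zero flag is skipped.
    """
    regs = [0] * num_regs
    zero = False  # set by CMP when its two operands are equal
    for cond, op, *args in program:
        if cond == "EQ" and not zero:
            continue
        if cond == "NE" and zero:
            continue
        if op == "LOAD":
            rd, addr = args
            regs[rd] = memory[addr]
        elif op == "STORE":
            rs, addr = args
            memory[addr] = regs[rs]
        elif op == "ADD":
            rd, ra, rb = args
            regs[rd] = (regs[ra] + regs[rb]) & MASK32
        elif op == "SUB":
            rd, ra, rb = args
            regs[rd] = (regs[ra] - regs[rb]) & MASK32
        elif op == "CMP":
            ra, rb = args
            zero = regs[ra] == regs[rb]
        else:
            raise ValueError("unknown opcode: " + op)
    return regs


program = [
    ("AL", "LOAD", 0, 0),    # r0 = mem[0]
    ("AL", "LOAD", 1, 1),    # r1 = mem[1]
    ("AL", "CMP", 0, 1),     # zero flag = (r0 == r1)
    ("EQ", "ADD", 2, 0, 1),  # runs only if the operands were equal
    ("NE", "SUB", 2, 0, 1),  # runs only if they differed
    ("AL", "STORE", 2, 2),   # mem[2] = r2
]
memory = [7, 7, 0]
run(program, memory)
```

The point of the condition field is that the two-way choice between ADD and SUB needs no branch instruction at all.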

Posted in The Microprocessor Project.


Managing WordPress plugins with svn:externals

I’m using Subversion to make the process of updating WordPress easy (installing an upgrade is a simple `svn update`). To extend that simplicity to plugin management, I’m using the svn:externals property to automatically update my plugins.

Installing it is simple: `svn propset svn:externals -F externals .`

My externals file looks like:

advertising-manager http://svn.wp-plugins.org/advertising-manager/trunk/
akismet http://plugins.svn.wordpress.org/akismet/trunk/
amazon-showcase-wordpress-widget http://svn.wp-plugins.org/amazon-showcase-wordpress-widget/trunk/
google-analyticator http://svn.wp-plugins.org/google-analyticator/trunk/
google-sitemap-generator http://svn.wp-plugins.org/google-sitemap-generator/trunk/
google-syntax-highlighter http://svn.wp-plugins.org/google-syntax-highlighter/trunk/
latex http://svn.wp-plugins.org/latex/trunk/
smart-archives-reloaded http://svn.wp-plugins.org/smart-archives-reloaded/trunk/
stats http://svn.wp-plugins.org/stats/trunk/
wp-rdfa http://svn.wp-plugins.org/wp-rdfa/trunk/
xrds-simple http://svn.wp-plugins.org/xrds-simple/trunk/

Posted in Uncategorized.


Debian: Installed packages

After a week of recovery, my VPS is finally back where I want it. First, I’d like to applaud the work of vaserv/fsckvps for getting things back up; they had a rough start, but once they started working and communicating, I was impressed by their dedication.

Because my VPS suffered data loss, I had the (mis-)fortune of re-installing packages, which gives me the opportunity to share my list of necessary Debian 5 setup steps:

  • useradd -m -G sudo (username) && passwd (username)
  • hostname (hostname)
  • apt-get install aptitude
  • aptitude install apache2 php5 mysql-server php5-mysql openssl libsasl2 libsasl2-modules sasl2-bin postfix dovecot-pop3d
  • aptitude install subversion bzr bison
  • aptitude install autoconf automake libtool gcc g++ gperf sun-java5-jdk
  • a2enmod userdir && a2enmod rewrite

After that, it’s a matter of editing configuration files to add postfix+sasl, dovecot+ssl, and apache2 vhosts.

Finally, restore user-land data from backups (you made backups, right?). This includes re-establishing ssh authorized_keys, htdocs, and the mysql databases.

Posted in Uncategorized.


The Microprocessor Project: Overview

This summer, I need something to do. I’ve got a few ideas, but one desire I’ve had since I was a young teenager is to design my own computer, which I’ve now learned enough to understand means “to design my own microprocessor.” Software for it can come later, if I still like the idea after designing the hardware (or, if I’m really clever, I can try to port Linux to it, and not have to write my own software at all). However, I see now that the reason I didn’t do this sooner, and my new reason for doing this project, is that there are virtually no in-depth, beginner-level resources available on the Internet (if anyone wants to look at OpenSPARC and, without any understanding of what it’s based on, deduce how to build a basic microprocessor, please, be my guest).

To that end, today I begin my next big project: designing, simulating, testing, and maybe putting on an FPGA a real, functional, 32-bit microprocessor. Then I’ll explain it in detail, put my results on the Internet, and release it under the GNU GPL, so that there exists a simple platform for people to play around with microprocessors, extending it with additional features or modules as they see fit.

I don’t intend to create something remotely commercially viable, because I will, initially, purposely leave out features found on Intel chips even in the early 90s, in order to simplify the design process and help explain the basic concepts. If I like what I’ve done, when the time comes, maybe I’ll add more realistic features, such as multiple pipelines (superscalar) and out-of-order execution, or I’ll design support for more esoteric features (like barrel processing), to see how readily (if at all) they can be implemented and to provide some context about where microprocessors can go from the basic 5-stage pipeline.

The real goal here is educational in nature. Microprocessors are incredibly complicated devices, and so most online descriptions are terse at best. Despite that, I think people should be able to learn, for free, exactly how a microprocessor works. However, I can’t explain how an x86 chip works, for a number of reasons. First, I don’t actually know. I have a pretty good understanding of the elements from which it is built, but I’ve spent almost no time studying the ISA or reading papers on how Intel has actually implemented things. Second, I’d hate to speculate and be totally wrong, and then have people read it and think that was how it works. In contrast, I do understand the principles upon which RISC microprocessors (like MIPS and ARM) are based, so I can readily make up an instruction set that is similar to MIPS, implement it, and then talk at great length about how it works. Hopefully, then, a younger version of me can find this online, look through it, and then be able to answer all those questions I had until college about how computers actually do things. If software and programming can be well documented and available for free to learn from, then hardware should be too.

If you would like to contribute descriptions, Verilog, or simulation code (which will mostly be written in C), please let me know by posting in the comments.

Coming soon: The Master Plan. I’ll lay out precisely what the objectives are and give an overview of how the project will progress as a whole.

Posted in The Microprocessor Project.


Archive temporarily unavailable

This site was affected by the great hypervm/fsckvps calamity over the weekend. We’ll be slowly restoring posts and content; however, this seems like an appropriate time for a little housecleaning.

If you’ve arrived here looking for specific content, please leave a comment or send us an email and we can pull it out of our archives.

Posted in Uncategorized.


Multiplexers and Decoders

Here’s something that wasn’t emphasized in my digital design courses, which every amateur digital designer should be aware of: decoders and three-state buffers use fewer logic gates than large multiplexers. Obviously, when one uses commercial devices with an output enable, the decoder functionality is implicit, but when it comes to incorporating selection logic into our own designs acting on our own data, the more natural choice is a conventional multiplexer. Both approaches have benefits and drawbacks, but which is “better”?

First, a quick look at the differences between the two. A multiplexer uses a combination of AND and OR gates to select between a number of inputs and always drives the output accordingly. A decoder/three-state selection device (which is really also called a multiplexer, but I’ll try to avoid some confusion) uses only AND gates and single transistors to drive the output with just one of the inputs when it is selected. The key aspect here is that each buffer’s output is not always 0 or 1: a third option, high impedance (or Hi-Z), is available, corresponding to “no value.” This happens because each combination of the select pins controls whether a given input drives the output at all, rather than every input driving something regardless of utility.

The differences can be seen by implementing a 1-bit, 8:1 multiplexer both ways. Without three-state buses, this requires 24 two-input AND gates and 7 two-input OR gates. The schematic shows that there are six logic gates between the input and the output of the multiplexer:

8 to 1 mux


By comparison, the same functionality can be realized using only 16 two-input AND gates and 8 three-state output buffer elements. The schematic is also much simpler:

8 to 1 decoder


This reduces the gate delay between input and output to two gates plus a transistor. This would make it seem like three-state buffers and decoders are the obvious choice for every situation, because they have a shorter gate delay and use fewer transistors, which has many benefits. However, there are hidden problems associated with this design, making it unsuitable for some situations.
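
The gate counts above can be reproduced with a little arithmetic. Here is a rough Python sketch; the names `Cost`, `and_or_mux`, and `decoder_tristate_mux` are mine, and it assumes each wide AND is built as a chain of two-input gates, the OR stage is a balanced tree, and inverters on the select lines are not counted (which is what matches the numbers quoted in this post).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cost:
    and2: int        # two-input AND gates
    or2: int         # two-input OR gates
    tristate: int    # three-state output buffers
    gate_depth: int  # two-input gates on the longest select-to-output path


def and_or_mux(select_bits: int) -> Cost:
    """Cost of a 2**select_bits : 1 multiplexer built from AND and OR gates."""
    inputs = 2 ** select_bits
    # Each data input is ANDed with every select line: a (select_bits + 1)-input
    # AND, assumed to be a chain of select_bits two-input gates.
    and2 = inputs * select_bits
    # Merging `inputs` product terms takes inputs - 1 two-input ORs, assumed to
    # form a balanced tree of depth select_bits.
    or2 = inputs - 1
    return Cost(and2, or2, 0, select_bits + select_bits)


def decoder_tristate_mux(select_bits: int) -> Cost:
    """Cost of the same function using a decoder plus three-state buffers."""
    inputs = 2 ** select_bits
    # Each decoder output is a select_bits-input AND: select_bits - 1 gates.
    and2 = inputs * (select_bits - 1)
    # One buffer per data input; the buffer itself is not counted as a gate.
    return Cost(and2, 0, inputs, select_bits - 1)


for name, cost in (("AND/OR", and_or_mux(3)),
                   ("decoder + three-state", decoder_tristate_mux(3))):
    print(name, cost)
```

For three select bits this gives 24 ANDs, 7 ORs, and a depth of six gates for the AND/OR version, against 16 ANDs, 8 buffers, and a depth of two gates (plus the buffer) for the decoder version.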

First, if the design is being developed for an FPGA, the AND/OR approach is more versatile. Why? Because FPGAs do not implement traditional combinational circuits. Often, combinational blocks are instead broken into units that can be implemented by 4-input Look-Up Tables (LUTs), which means that any function that can be described by such a lookup table has the same speed cost, regardless of the number of gates that would be required to implement it otherwise. Next, three-state elements are not readily available within most of the logic blocks (although an FPGA could be built to not have this restriction), meaning that more interconnect must be used to route signals through three-state elements, although it is more common for the compiler to simply recognize the situation and convert it into a multiplexer that can be implemented easily with the LUTs.
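
To make the LUT point concrete, here is a small Python sketch of a 4-input lookup table. It is a model of the idea only, not of any particular FPGA; `make_lut`, `mux2`, and `parity4` are names I made up for illustration.

```python
from itertools import product


def make_lut(func, num_inputs=4):
    """Tabulate func over every input combination, like configuring a LUT."""
    return {bits: func(*bits) & 1 for bits in product((0, 1), repeat=num_inputs)}


def mux2(a, b, sel, unused):
    """2:1 multiplexer; the fourth LUT input is simply left unused."""
    return b if sel else a


def parity4(a, b, c, d):
    """4-input parity, which would need several XOR gates in discrete logic."""
    return a ^ b ^ c ^ d


mux_lut = make_lut(mux2)
parity_lut = make_lut(parity4)

# Both functions occupy the same 16-entry table, and evaluating either one is
# a single lookup, however many gates the function would otherwise need.
print(len(mux_lut), len(parity_lut))  # 16 16
print(mux_lut[(0, 1, 1, 0)])          # sel = 1 selects b, so this prints 1
print(parity_lut[(1, 1, 1, 0)])       # three ones, so this prints 1
```

Note that a 4:1 multiplexer already has six inputs (four data, two select), so it does not fit in a single 4-input LUT and has to be split across several.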

Second, when working with CMOS, each transistor has a capacitive property. Therefore, as more transistors are tied to the same bus, even in Hi-Z mode, the added capacitance will cause the rise and fall times to increase, in a manner similar to having an increased fanout. This can create problems with setup time in very high speed situations, whereas the rise and fall time of an AND/OR multiplexer output will not vary with the number of inputs (although its total propagation delay will; the two must be balanced, so it is possible that the longer rise time is still shorter than the longer propagation delay).
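
As a back-of-the-envelope illustration of the capacitance effect, the bus can be treated as a first-order RC circuit, where the 10% to 90% rise time is ln(9) times R times C. The Python sketch below does just that; the function name and every component value in it are hypothetical, chosen only to show the trend, and real parts will differ.

```python
from math import log


def rise_time_10_90(r_driver_ohms, c_wire_farads, c_per_buffer_farads, num_buffers):
    """10%-90% rise time of a first-order RC model of a shared three-state bus."""
    # Every buffer tied to the bus adds its output capacitance, even in Hi-Z.
    c_total = c_wire_farads + num_buffers * c_per_buffer_farads
    return log(9) * r_driver_ohms * c_total


# Hypothetical values, for illustration only.
R_DRIVER = 1e3     # ohms, on-resistance of the one active driver
C_WIRE = 10e-15    # farads, wiring capacitance of the bus
C_BUFFER = 2e-15   # farads, output capacitance added per attached buffer

for n in (2, 8, 32):
    t = rise_time_10_90(R_DRIVER, C_WIRE, C_BUFFER, n)
    print(f"{n:2d} buffers: {t * 1e12:.1f} ps")
```

In this simple model the rise time grows linearly with the number of buffers on the bus, which is the fanout-like effect described above.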

Third, although this isn’t really a big problem, during switching transients it’s possible for two three-state elements to have their enables asserted at the same time, potentially causing the bus to be driven both low and high at once. This lasts for an extremely short period of time, as the switching time is usually less than the delay of a single inverter, so I think the additional power dissipation caused by this situation is negligible compared to the sum of leakage current and normal switching power consumption.

But, these three considerations aside, if you’re designing an ASIC or wiring together something with discrete components (a rarity in itself), it’s possible that replacing AND/OR multiplexers with decoders and a three-state buffer could increase the throughput of the digital circuit you’re designing, or just make it easier for you to wire, with fewer things to keep track of. It’s hard to say which approach is better, but it’s certainly important to understand how to use both effectively, because many situations will call for one or the other for optimal performance.

Finally, a quick Verilog reference (because the Internet is thoroughly lacking in this department): I find the easiest way to implement a three-state bus is with structural Verilog, defining a module to handle the three-state output (although sometimes it can be incorporated into a larger module):

// Three-state buffer: drives O with A while EN is high, otherwise leaves O at Hi-Z.
module en_buffer(A, EN, O);
	input A, EN;
	output O;

	assign O = (EN ? A : 1'bz);
endmodule

Then simply create instances of the module with the same wire tied to all of the output pins:

...
wire [2:0] A;
wire [2:0] EN;
output out;
genvar i;
...
generate
	for (i = 0; i < 3; i = i + 1) begin: threestate_output
		en_buffer out_buf(.A(A[i]), .EN(EN[i]), .O(out));
	end
endgenerate
...
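
To see what the shared `out` wire ends up carrying, here is a small Python model of how a Verilog net resolves several equal-strength drivers. It is only a sketch: `resolve_bus` is a name I made up, and it assumes every data and enable bit is a clean 0 or 1.

```python
def resolve_bus(values, enables):
    """Resolve a shared wire driven by three-state buffers.

    Returns '0' or '1' when the enabled drivers agree, 'z' when no buffer is
    enabled, and 'x' when enabled buffers fight over the value.
    """
    driven = {value for value, enable in zip(values, enables) if enable}
    if not driven:
        return "z"   # nobody drives the bus: high impedance
    if len(driven) == 1:
        return str(driven.pop())
    return "x"       # contention: one driver pulls low, another pulls high


A = [1, 0, 1]
print(resolve_bus(A, [0, 1, 0]))  # only A[1] drives the bus: 0
print(resolve_bus(A, [0, 0, 0]))  # no enables asserted: z
print(resolve_bus(A, [1, 1, 0]))  # A[0] and A[1] disagree: x
print(resolve_bus(A, [1, 0, 1]))  # A[0] and A[2] agree: 1
```

The third case is the transient contention described earlier: two enables asserted at once with opposing data. With a one-hot decoder driving the enables, it should only ever occur briefly while the select lines change.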

Posted in Hardware.



Open Repositories ‘09, May 18-21

I attended the Open Repositories conference, May 18-21 in Atlanta, which “attempts to create an opportunity to explore the challenges faced by user communities and others in today’s world”. In general, the OR community is very relevant to our work with repositories (for Mellon/OpenVault, Teachers’ Domain, the DAM system, etc), and so many people are facing the same problems with cataloging, preservation, and dissemination.

The California Digital Library had a presentation that provided a connection between curation and preservation goals (which I think is something we’re very interested in), saying: Lots of [copies, description, services, uses] keeps stuff [safe, meaningful, useful, valuable].

John Wilbanks, VP of Science at Creative Commons, gave the keynote, “Locks and Gears: Digital Repositories and the Digital Commons”, stressing the importance of Open Data, which helps bring together isolated knowledge pools. The ultimate goal is to turn databases into the web, to allow useful “stuff” to happen rather than locking it away. Making this information available, linked, and shared could help solve existing problems that lack funding (a cure for Huntington’s Disease was one example). The HD Foundation is funding some of the Science Commons’ work opening up genetic databases and creating semantic web endpoints (with SPARQL) for that data to make it more accessible; Wilbanks drew an analogy between the ability to easily edit an HTML page and the ability to easily edit a SPARQL query, which allows for more “hackability” (potentially in the face of copyright or IPR).

The OR community is finally starting to think about video material, which makes our appearance very timely, and allowed us to make several excellent connections, both on a technical level (Glasgow’s Spoken Word project; U. of Alberta’s digitization, encoding, and cataloging workflows; Rutgers’ work with NJVid + RUcore to form a state-wide educational video delivery network; etc.) and around content and preservation (the educational TV collection of Indiana University and a collection at Northwestern).

We also connected with a community group interested in creating repository-backed tools for scholarly research. The group is trying to provide solutions and tools that support scholars and make repositories useful and exciting new mediums, and to do so in an open manner to “cross-pollinate” across disparate groups, which can lead to previously unrealized benefits.

There was also a lot of interest around creating multiple, light-weight interfaces to collections to meet the needs of a group of users, rather than “building the death star”. This community seemed split between people using existing applications (Drupal, at UPEI among others) on top of Fedora and people building front-ends on top of a framework (PHP/Zend Framework (WGBH, NASA), Ruby on Rails (MediaShelf, Hydra), Django). On the other end, there was interest around Sun’s OpenStorage platform (which apparently will still have life inside Oracle), the iRODS distributed storage repository, and DuraCloud, a cloud/distributed storage abstraction layer for repositories.

Tony Hey, VP Microsoft External Research, convinced me that MS isn’t wholly evil, and is trying to do the right thing among scholarly communities by embracing open standards and interoperability (obviously, when it suits them, but still an improvement). They’ve done some great work with MS Office add-ins to connect the suite with institutional repositories. Finally, on Wednesday, MS launched Zentity, their new repository offering built on the MS stack (IIS, MS SQL, etc.), perhaps useful for institutions to get up and running with a repository; everyone recognizes this is not a new product line, but a research project, and MS is trying to break into a monopolized market.

Our presentation was well received, and our poster won the poster session (out of 30+ posters; note: in future posters, specify the pantone/CMYK/etc colors + don’t be afraid of obvious branding).

Posted in Uncategorized.


inbflat mixer

I’ve quickly hacked together a mixer-style interface (using jQuery + the YouTube chromeless player) for the inbflat YouTube mashup media project. If I get really motivated, there are a couple of features that would be nice:

  • Solo-mode
  • Media scrub bar — I’m not sure what the appropriate interface would be for this
  • Add/remove video clips
  • Port this to the HTML5 canvas/audio system for fun

Posted in Uncategorized.