Matt Jibson's Blog

Careers Localization, part 3: Extraction with Roslyn and Uglify

| Comments

Previously we have discussed our reasoning and API for localization. Here, we would like to continue with the next topic: text extraction from the source code. That is, after we have indicated the text we would like translated, how we extract that text out of our codebase for translation. (Note: we used Acclaro for the actual translation, and have been very happy with them. Recommended.)

The difficulty is to extract all of the strings so that they can be sent to translators. Compounding this difficulty, we also must make sure to send the correct number of plural forms to the translators. Look at the example below. __count does not appear in the translated string, only in the values object. This means we must, at extraction time, understand these value objects and be able to reason about them—we must know their types.

1
_s("There is a unicorn", new { __count = 3 }) // "There are some unicorns"

There are a few bad ways to do this:

  • build a parser by hand to deal with this (possible, but fragile, error-prone, and silently oblivious to things it does not support)
  • search or grep
  • regex

Our first solution was the hand-built parser. It worked reasonbly well, but there were known, unfixable bugs. We needed a solution that allowed us to have full syntatic and semantic understanding of source code files. Syntax for knowing the tree of the source code: function calls, argument lists, object access. Semantic for seeing into object types and values.

C# and Razor extraction with Roslyn

The C# team at Microsoft has been working on a project called Roslyn. It is a programmatic way to access the syntax tree and semantic model of C# files. That means that we can hand it a C# file and then search through it looking for certain kinds of things, and act on them. Roslyn comes with a SyntaxWalker class that walks over each node. You can override the one you want. In our case, we want any invocation (function call) named _s or _m:

C#
1
2
3
4
5
6
7
8
9
10
11
12
private static readonly IEnumerable<string> SINGLE_LINE = new[] { "_s", "_m" };
public override void VisitInvocationExpression(InvocationExpressionSyntax node)
{
  base.VisitInvocationExpression(node);
  var callname = node.Expression.GetText().ToString();
  if (SINGLE_LINE.Any(callname.EndsWith))
  {
      var stringInfo = StringInfo.Create(node.ArgumentList, model);
      if (stringInfo != null)
          // save it somewhere
  }
}

Now we’re trying to create a new StringInfo (just a container class). We need to examine our node.ArgumentList and see if the first argument is a string. We use C#’s dynamic feature coupled with various ways to call StringInfo.Create to easily support different types of arguments (noted in comments above the functions):

C#
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
// _s("string", object) (object may contain __count)
public static StringInfo Create(ArgumentListSyntax args, ISemanticModel model)
{
  var first = args.Arguments.FirstOrDefault();
  if (first == null || first.Expression.Kind != SyntaxKind.StringLiteralExpression)
      return null;

  var stringInfo = new StringInfo {
      Text = first.Expression.GetFirstToken().ValueText,
  };

  var second = args.Arguments.Skip(1).FirstOrDefault();
  if (second != null)
      stringInfo.Count = HasCount((dynamic)second.Expression, model);

  return stringInfo;
}

// _s("string only") (no object, no __count)
public static StringInfo Create(LiteralExpressionSyntax literal, ISemanticModel model)
{
  if (literal == null)
      return null;

  return new StringInfo {
      Text = literal.Token.ValueText,
      Count = false,
  };
}

Ok, now we’ve got our string. Next we need to figure out if __count is part of the object. We’ve implemented various ways to determine HasCount(), as referenced above on line 14. Again, C#’s dynamic proves useful.

C#
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
private const string COUNT = "__count";

// _s("string", new { __count = 1 })
public static bool HasCount(AnonymousObjectCreationExpressionSyntax arg, ISemanticModel model)
{
  var result = arg.Initializers.Any(x =>
      {
          if(x.NameEquals != null)
              return x.NameEquals.Name.Identifier.ValueText == COUNT; // see if __count is part of the object
          return HasCount((dynamic)x.Expression, model); // otherwise throw it back
      });
  return result;
}

// _s("string", obj)
public static bool HasCount(IdentifierNameSyntax arg, ISemanticModel model)
{
  // we already have an object, just see if it has __count
  var symbols = model.LookupSymbols(arg.Span.Start, name: arg.Identifier.ValueText);
  var first = symbols.First();
  TypeSymbol type = ((dynamic)first).Type;
  return type.GetMembers(COUNT).Any();
}

// _s("string", obj.member)
public static bool HasCount(MemberAccessExpressionSyntax arg, ISemanticModel model)
{
  // extract out member...
  var name = arg.Expression as IdentifierNameSyntax;
  if (name != null)
      // ...and try some of our above functions
      return HasCount(name, model);
  return false;
}

// anything else
public static bool HasCount(ExpressionSyntax arg, ISemanticModel model)
{
  return false;
}

The above tries all variants (that were in our code) that might contain __count somehow. The use of dynamic allows us to not care about the kind of expression, which is determined during runtime, and then the appropriate version of HasCount() is called. Here’s a gist of it all (it’s got some other features not listed here).

The above covers our C# controllers. For the Razor views, we cheat a bit. The ASP.NET compiler is invoked which converts the views into C# files, then the same extraction logic is run on those. This is much easier than trying to write a full razor parser.

JavaScript extraction with UglifyJS

For JavaScript, we use a similar technique. UglifyJS is a JavaScript-based JavaScript parser. It also can return a syntax tree, which we can examine for uses of _s, from which we can extract strings and objects. It’s a pretty hideous function and not worth pasting, but here it is.

Both of these extractors yield JSON, which is processed in node and sent off to our translation service. We also have a dashboard (built with AngularJS) to view all translations, fix them, and reprocess all the files.

Conclusion

Having two full language parsers allows us to be flexible. Both our C# and JavaScript implementations are equally functional at all levels—API, extraction, display—allowing us to have one unified API in our codebase.

Translation and localization are hard. We spent months refactoring and extending our codebase to support it. But now, we can change, add, and translate strings with little effort, because our tools handle all the hard stuff. If you care about having quality translations (supporting dynamic text insertion and pluralization), whichever solution you use (resource files or gettext-style) must be able to fully understand your code so that it can generate high-quality output to send to translators.

Careers Localization, part 2: The API

| Comments

Previously we discussed our design requirements and solution for localization. Here we will continue with a description of the API. This means the way in which we would craft strings to be localized.

Markup

We chose to use gettext-style translation strings (i.e., not resource files where strings are identified somehow, instead we just indicate the English string right where it will be shown). We chose a syntax that allows dynamic variable replacement and easy pluralization, and can work similarly in JavaScript and C#. Here’s what we came up with:

C#
1
2
3
4
5
6
7
8
_s("Unicorns prance") // "Unicorns prance"
_s("$name$ have horns", new { name = "Unicorns" }) // "Unicorns have horns"
_s("There are $__count$ unicorns", new { __count = 1 }) // "There is 1 unicorn"
_s("There is a unicorn", new { __count = 3 }) // "There are some unicorns"
_s(@"
  Multi
  line
  supported.") // "Multi line supported." (whitespace collapsed)
JavaScript
1
2
3
4
_s('Unicorns prance') // "Unicorns prance" (single or double quotes accepted - any valid JavaScript string)
_s("$name$ have horns", { name: "Unicorns" }) // "Unicorns have horns"
_s('There are $__count$ unicorns', { __count: 1 }) // "There is 1 unicorn"
_s("There is a unicorn", { __count: 3 }) // "There are some unicorns"

We chose a function named _s that takes a string and an optional values object used for variable replacement. There is a special __count member of the values object which, if it exists, indicates to our translation engine that this is a pluralizeable string. Calls to _s could be in our C# files anywhere in the tree. They could also be in our razor views:

Razor
1
2
3
<div>
  @_s("translate this")
</div>

Finally, it also supports existing objects:

C#
1
2
3
4
5
6
class User {
  public string Name { get; set; }
}

var user = new User { Name = "Jimmy the Unicorn" };
_s("$Name$ is here!", user);

Markdown

Requirement 5 from part 1 is to support markdown formatting. The purpose of this requirement was so that we would never send HTML to translators. We do not trust someone else writing our HTML. Thus, we needed support for simplified markdown (only bold, italics, and links). Out of this grew the _m function (m for markdown, in contrast to s for string in _s). This puts certain tweets in perspective.

And so begins the drive to redefine “S&M” in the company lexicon.

With _m, we can do things like:

1
_m("Hi, **we hate fun** here at [stack overflow](stackoverflow.com).")

Limitations

The above API has some problems, or doesn’t support the following cases:

Best-practice DOM handling

Due to the use of markdown instead of HTML, there are some annoyances about how we must treat DOM elements. Let’s say that, before translation, we had something like this:

1
2
3
4
<div>Hello, <span id="click">click here</span> for magic.</div>
<script>
  $("#click").click(cb);
</script>

In (incorrect) translated form, we would have this:

1
<div>@_s("Hello,") <span id="click">@_s("click here")</span> @_s("for magic.")</div>

This is incorrect because the sentence has now been fragmented into three. Translators will then get these three fragments and must translate each on its own, with no power to reorder them if needed in the target language. But, how can we retain the JavaScript action? The <span> in there is not supported since we don’t allow HTML in translated strings (or, more accurately, we encode all output strings so they are HTML-safe). The solution is to use our markdown formatting and do funky stuff with the DOM. In markdown, we have access to <strong> and <em>, thus:

1
2
3
4
<div id="click">@_m("Hello, **click here** for magic.")</div>
<script>
  $("#click strong").click(cb);
</script>

The **click here** will render as <strong>click here</strong>, which we can use in a selector. You must then also apply CSS to disable the normal bolding. Annoying, but it works.

Gender variants

Some languages vary articles by the gender (or pluralization) of its object. This is difficult to do because sometimes we don’t know the gender, and so it’s impossible to tell the translation engine which gender to use. Consider:

1
_s("I'm going to $location$.", new { location = "Brazil" })

In Portuguese, “to” varies based on $location$. We need to know the gender and pluralization of $location$ (for example, the United States is masculine plural, and so “to” would change to “aos”). But we allow any location to be entered, so we would have to have that data for each of our target languages for all (or at least many) possible places. We’re not sure where to get that data, so we’re not supporting this yet.

Multiple pluralized words

We support exactly one __count instance, which means you can’t have two pluralized words in a sentence. Let’s consider the following case, assuming we wish to support any number of pluralizable words:

1
_s("I have $count1$ rainbows and $count2$ pandas.", new { count1 = 2, count2 = 1 })

English, German, French (and many others) have two plural forms (singular and other). This means we would have to generate 2 ^ 2 = 4 combinations of sentences to be pluralized:

1
2
3
4
I have 1 rainbow and 1 panda.
I have 1 rainbow and N pandas.
I have N rainbows and 1 panda.
I have N rainbows and N pandas.

There are languages that have up to six plural forms: zero, one, two, few, many, other. So, for the above example string, we would have to generate 6 ^ 2 = 36 combinations:

1
2
3
4
5
6
7
8
I have 0 rainbows and 0 pandas.
I have 1 rainbows and 0 pandas.
I have 2 rainbows and 0 pandas.
I have few rainbows and 0 pandas.
I have many rainbows and 0 pandas.
I have other rainbows and 0 pandas.
// Etc., with all combinations:
I have [0, 1, 2, few, many, other] rainbows and [0, 1, 2, few, many, other] pandas.

These must all be generated for translation because it is possible that in the target language, the surrounding words could change. For instance, maybe “have” changes based on the object, instead of the subject, as in English. Due to this complication, we capped pluralization at one item per string.

Other things we don’t know about

There are many languages, and we’re sure some of them have constructs that the above API doesn’t support at all.

Conclusion

The API allows us to do almost everything we need. The limitations ended up being truly limiting in only a few cases, and those could be worked around be restructuring the sentence. The next problem: how do we actually find the usages of _s and send them off to translators?

Part 3: Extraction

Careers Localization, part 1: Why Roll Our Own?

| Comments

Localization is a difficult feature to add to a website. A few months ago, we at Stack Overflow Careers localized into German. This took months of effort from much of the team. I would like to describe our design and implementation over a few blog posts.

This post will discuss our design choices.

Requirements

Localization consists of

  1. indicating text that should be localized
  2. translating it
  3. showing the correct translation to the user

There are various existing solutions in most programming languages that can do this. We looked at some, but none were able to meet all of our design and support requirements:

  1. C#
  2. JavaScript
  3. Razor views
  4. data attributes for localized error messages on form validation
  5. markdown formatting (so that no HTML is ever sent to translators)
  6. dynamic text replacement
  7. pluralization-aware
  8. gettext-style (English text in a function - no resource files and identifiers)

We discussed some of these requirements with others who had previously localized their sites. Some had decided to forfeit features due to difficulty of implementation (pluralization, dynamic text). Supporting each of these requirements is difficult, so we understood the decision to omit them. However, we were not willing to do the same.

Our Solution

The other solutions missed some our of requirements, or functioned in a way we didn’t like. We were, however, heavily influenced by existing work, like that of Daniel Crenna. We have a long history of rolling our own software when the 90% solution isn’t enough, and that tradition was continued here. We ended up writing the entire implementation from scratch. This includes designing an API, extraction process (to determine what strings the translators should translate), and localization engine to support numbers, dates, and translated strings. This allowed us to have a unified pipeline for text extraction and processing (i.e., the C# and JavaScript implementations could be identical at each step of the process), and give us the power and speed we wanted in the implementation.

As will be described in future posts, our implementation met all of our requirements, but with some limitations. Localization is hard, and doing everything is near impossible. If we wanted something better than the 90% solution, then what we have now is maybe in the 95% range: there are still missing features that we haven’t figured out yet.

Part 2: The API

Appstats for Go

| Comments

I would like to announce the release of appstats for go. Installation instructions are available on godoc.org. I’d like to thank Dave Symonds for his help on this project.

A demo site is available.

Appstats is an incredibly useful library for the python (and java) runtimes. The go runtime has had no similar library, adding to the difficulty of developing significant apps. I’d like to describe a bit about how appstats is implemented (applies to python, as well) and where I think the go runtime is today.

implementation

Appstats for go was implemented by copying the python HTML templates, examining the source, and attempting to make something work in go.

intercepting the data

In order for appstats to work pleasantly, it needs to automatically populate data from the HTTP request and intercept all RPC calls to app engine’s services (datastore, memcache, etc.). This is done by a wrapper that provides an appengine.Context instance to a handler function. Both the http.ResponseWriter and appengine.Context variables are actually appstats structs that forward importart calls and record timings and data. Go routines are fully supported: all of the internals are thread-safe, and appstats waits for all RPC calls to complete before serializing and saving its data.

persist to memcache

I had always wondered how python appstats stored its data. It was obviously in memcache, but it was interesting how it was able to so quickly store so many requests and fetch them all. How did that work? The timestamp on a request (well, just time.Now() when appstats starts) is used. The current count of microseconds is the memcache key to a request. If it overwrites another request, that’s ok. For example, a request that happened at 04:10:40.40567 gets the key 40500. The final two digits are converted to 0 so that we can limit the number of keys to 1000. Then, to fetch them all, appstats generates all 1000 possible keys and requests them from memcache. Only existing keys are returned. Each request stores a partial (for the index dashboard) and full (for the details page) item in memcache, with the type (“part” or “full”) appended to the key name.

things it doesn’t have

There are a few things that python appstats does that this implementation does not (yet). Some may not be possible ever, and some may require Google to help us out a bit:

  • full RPC cost information (only writes cost, currently; reads and small ops not yet implemented)
  • stack traces with variables and values at each stack frame
  • protobuf examination for better details (in the “Response” and “Request” lines of each RPC call on the details page)

today’s app engine go runtime

The two killer features of the python runtime on app engine are NDB and appstats. I have been refusing to make a serious app with the go runtime because of this lack. To address those concerns, I have created goon (an NDB-like library for go) and appstats. With the addition of these two libraries, I believe the go runtime can now compete with python.

Job offers are not demands

| Comments

I am contacted around once per week by people interested in hiring me. Usually email, sometimes on Careers, even phone calls. I don’t think I’ve ever responded to these. This is not because I’m against this kind of recruiting, it’s because I haven’t been persuaded in the least by these offers. My careers profile even lists me as “Passive candidate interested in full-time position.”. That means I have a job, but if you have a better one, maybe I’ll take it. No one has even piqued my interest, so far. The job offers here (well, not offers, but I’m not sure of a better term) tell me all about the company and the job, but nothing about how my life will be better there.

Let’s make one up and look it it:

Software Developer at Fun Startup!

We’re a great funded startup with a small and growing team looking to hire the best. We have fun, work on hard problems, and love exciting technologies. You’ll own large portions of the code and have the freedom to work how you want.

Skills

Experience in MVC-based technologies and databases …And many other things

About Us

We are fun people and awesome. We offer good salaries and benefits.

Attraction

The above is about 90% of all the job offers I see, and I will never respond to them. Nowhere in it do they entice me to work for them. So many places already offer the same things: startup, small, growing, fun technologies, good salary. I want to see something better than those. What’s worse, the only list there is a list of demands about what I must have. Where is the list of things you’ll give me? There’s multiple descriptions of skills I must have, and none about why I should provide those skills to you. I don’t like job offers that sound like they’re scrutinizing me. I want something that dazzles me.

For example, a few months ago I was contacted by a company on Careers. I responded not interested. I received a message back asking why. I said: “I love my current job. Your message didn’t convince me that I’d be happier at your company than I am at mine”. The response was so interesting: “can you truly get the sense of what a company is all about from an e-mail message?” Fixing that idea is the purpose of this blog post. Here was my final message:


I think that is possible to a large degree. For example, my company is also hiring. Here’s our job post.

The benefits list is of such high quality that unless I see such a list on another job post, I’m not interested. I assume that if they had benefits this good, they would list them, since so few places do. The paragraph before the benefits list discusses the quality of the leadership and that they care deeply about making an awesome place to work. I read those two paragraphs alone and I don’t care about the rest of the job post, because I get a sense that they’ll treat me like a king, which they do.

The work descriptions of your message were great. I’m sure the work would be fun and satisfying, and that I would like working with the team. But without advertising the actual working conditions and how I’ll be treated beyond salary, it’s hard to switch.

Conclusion

If companies want to attract the best developers, they must publicly advertise why their job is better. Without that, no amount of “we’re awesome” talk is going to get anywhere.

go-dsp FFT performance with go routines

| Comments

go-dsp has been around for almost a year now. Recently I have been working on performance improvements. The low-hanging fruit was easy (removing duplicate calculations, smarter array copying, efficient data reordering with bit reversal). The difficult part was getting go routines to work as intended. That is, to have them improve the performance of go-dsp. This turned out to be more difficult than I expected.

Why the fast Fourier Transform is parallelizable

The FFT is paralellizable because of how the “fast” part is implemented. Given an input of length L, if there exist integers M and N such that L = M * N, then the original problem (one transform of size L) can be restated as M problems of size N. These M problems can be run in parallel. For example, say I have an input of size 8. I can reform this as 2 inputs of size 4. These 2 inputs can be run in parallel.

Attempts

Go’s easy support of go routines was the obvious solution here. I went through a few solutions until I found one that was not slower. What I discovered was that, although the FFT is highly parallelizable, setting up parallelization can easily take more time than it saves. The actual unit of work that is done is just multiplying two pairs of numbers and saving the result.

My first idea was to use wait groups. This involves a synchronized counter. One go routine is spun up per block (there are M blocks, as described above) and the counter incremented. The go routine decrements the counter when it is done. We wait until the counter is back to zero. Since the number of blocks varies from 2 to L / 2, this means that for about half of the time, so many go routines are spun up to do a very tiny amount of work that overall runtime increases. Ok, so, only do the wait group solution if it’ll actually be faster. I ran some tests and found out that (on my machine), if the block size is over 128, it’s worthwhile to spin up a new go routine. Remember that this solution was always using a single go routine when we were on smaller block sizes, ignoring any potential multicore speedup.

The second idea was to use worker pools. Since the main problem is the creation (not use) of go routines, spin up as many as we will need up front and then send work off to them. From testing I found that putting the number of workers at the value of GOMAXPROCS worked out well. Each worker’s job is to multiply the number pair described above. So, for each of the n log n iterations, we are sending off L / 2 jobs. This ended up performing almost exactly as good as the above solution for larger block sizes, but had the added benefit of working with smaller block sizes, too. I guessed the reason for the lack of more speedup was the communication overhead. The jobs were sent using channels, which aren’t free.

The final solution addressed that problem by changing the number of jobs from L / 2 to the block size (which, as a reminder, goes from 2 to L / 2). So only during one iteration are we sending the same number of jobs over. Almost always are we sending less. Previously, the jobs were specified by an index. This new solution instead specified a min and max index. The subsequent indicies are calculated in the worker itself. This results in much less channel overhead and distributes some work (index calculation) out to workers.

Results

The original, single-thread solution contains no multicore logic. It benchmarked at around 505ms. The graphs below show performance at GOMAXPROCS = 6 and a FFT size of 2 ^ 20 = 1048576.

Above are results for the first two attempts. The blue line is the original, single-thread control. The green line shows the very poor performance at small block sizes. The red line shows similar performance as green for large block sizes but that it was able to handle smaller block sizes better (but still not great). Minimum runtime was 1.7x faster (267ms minimum).

Here we see the final solution with indexed worker groups. Minimum runtime was 252ms (2.0x speedup). Not the 6x increase I wanted, but it’s not bad.

The final code (using the indexed worker group solution) is now available.

Great Piano Pieces

| Comments

I have compiled some of my favorite piano recordings on spotify. (Sorry rdio folks, I tried to make an equivalent playlist, but rdio doesn’t have some of the recordings I wanted and I couldn’t find suitable replacements.) Below are some short program notes.

Ravel Piano Concerto 2

One of the best overall. If I had to choose, the second movement is my favorite piano piece. I play it often, when I can. It is soft and meandering. The middle section’s intensity is a thrilling lift. Goose bumps come regurlarly while listening.

Stravinsky Three Movements of Petrushka

Originally a ballet for orchestra, he arranged it for piano by request. I first heard this in 2004 at CSU for the opening of their new music hall. The excellent Yakov Kasman performed and taught a master class. It was the first time I saw someone stand up during a piano solo due to his own intensity and interest in the piece. One of the most exciting and helpful performances I ever saw. I found a copy of the sheet music and was unable to play certain parts. (Being an orchestra original, it is for many instruments. The piano reduction is reduced to sometimes three staves. Having only two hands presents a problem. Somehow Mr. Kasman gets around that limitation.)

John Adams Phrygian Gates

A half-hour piece in a minimalist style. It starts simply enough with short phrases repeated many times over. The complexity and beauty increase throughout. Some people do not like minimalism. I feel about it the same as I feel about all music genres: the genre is unimportant, the quality of the sound is what makes it good. Phrygian Gates has many new ideas and sounds, many pleasing. This is why I enjoy it. But for those unfamiliar with minimalism: it has no melody and little harmony. The joy felt here is from understanding what is happening and the genuine feeling of pleasure when hearing something new and original (to me, this is the mark of a great composer).

Prokofiev Piano Concerti 1, 2, 3

The Russians are my favorite composers (and often performers; see above). Continuing on my last point about enjoying new sounds, Prokofiev was one of the most inventive composers. Although only a handful of years behind Rachmaninoff, Prokofiev was a century ahead in terms of originality. His concerti are my favorite compositions of his. Unlike Chopin (who was a poor arranger for orchestra, and thus produced piano concerti that are little more than excellent piano over orchestral chord changes), Prokofiev excelled at all voices of the orchestra. His concerti are thus more like duets between piano and orchestra, where each has an important part in the music. I have heard these concerti so many times they no longer sound dissonant. Instead I love hearing the chords because they are so new. My ears don’t hear things like this. The second concerto, although the least popular of the three, has some of the best rhythms and chords I’ve heard.

Success by selective education

| Comments

Or: how to win by not studying

I once took a junior-level course in college about Metal-Oxide-Semiconductor Field-Effect Transistors. As part of the electrical engineering cirriculum, it was an important course. I had no interest whatsoever in the course, and to this day I’m still not sure what it was about. For example, I once showed up to class (I suspect my attendance was around or below 50%) to turn in a homework assignment and discovered that a midterm was scheduled for that day. I used the homework as the single sheet of notes we were allowed. I similarly neglected many of the courses I took.

I would like to describe here why I did that, my expectations about the future (due to that behavior), and what the future actually brought.

Succumbing to laziness

Some facts: In eight years of college, I ended up with two undergraduate degrees (music and computer engineering) and one Master’s degree (electrical engineering). In the six years of undergrad, I took some 220 credits (averaging 18 per semester). My GPA was around a 3.2 the entire time. During this time, I also wrote a lot of code. I worked on a game, a startup, and consulted, some OpenBSD ports.

I found that my grades were not related to the number of credits I was taking. Study time was not the limiting factor, interest was. What I wanted to do was program, and studying was a distraction. Programming is what I wanted to do, so I did it. Uninteresting classes just got in the way of writing software. I rarely studied for those classes (I don’t recall ever once reading any textbook unless it was to solve a specific homework problem), and passed with usually a C. Interesting classes were fun, so I worked on them (the fun kind of work) and generally got As or Bs. This was my standard: I’ll only work if I like it.

Impressions

At the time I felt that I was making mistakes and slacking off. I thought: instead of programming I should be studying for those other classes. I felt like I was messing myself up as an engineer since I was completely ignoring half of my major courses. For example, after 6 years of electrical engineering courses, I still didn’t know what an op-amp was, even though I had studied them in at least 4 courses. I was convinced that I was making bad choices. I was paying to go to school and not study. I was spending my time doing something that I didn’t plan on pursuing as a career (I wanted to design processors for Intel or AMD, not program full-time).

But even given this internal doubt, I still couldn’t bring myself to change. I just didn’t care about some of my courses, and I enjoyed making software so much that I wasn’t going to stop working on it to study.

Results

While finishing up my Master’s thesis, I had a great time writing the software to collect and process my lab data, and was totally bored by the actual thesis. It was sometime around there that I realized: I’m a software engineer, not an electrical engineer. All of my degrees pointed to EE, but my background, skills, and experience were strongly in the software camp.

Today I have a job at Stack Overflow. Working here has been a dream for around 10 years. I consider myself supremely lucky to have met this major, personal goal. Now, I believe that I could not have gotten this job unless I pursued programming skills to the detriment of college achievement. If I had spent my time studying and working on those classes that I didn’t enjoy, I would not have had the time to work on software. I do not think I would have my current job if I had studied more instead of programming. I’m convinced that I’m happier and in a better job because of poor college performance than if I had excelled at college. I suspect in that case I would have a less interesting job doing something I’m less excited about.

For others

This is what worked for me. I wrote this blog post not to encourage students to slack off their course work, but to offer a real story. I was disappointed in myself for many years until my current job. I think there are some current students who have a similar outlook on their college career, and I wanted to offer this as a potential future.

NYC Now

| Comments

This weekend I went to hackn’jill. I made NYC Now, a web app to find something to do right now. It shows a map with events around it. It can detect your location or you can set one by clicking. Ajax updates via json. Events and map markers highlight or animate on hover of the corresponding item. It is quickly filterable (easy to remove things you don’t want to see), so something can be found fast. Code is available on github. BSD license.

Although the idea was one of the simpler at the hackathon (aggregation of some events APIs + a map), I think the implementation and interface are quick and pleasant enough to make it a pleasure to use. It is fast and unobtrusive. I found myself browsing around it tonight, and found quite a few things I will actually go do at some point (which I would have not found before).

Things I learned:

  • async app engine url fetch
  • html5 url rewriting
  • html5 location fetching
  • more jquery
  • I still think app engine is the best web platform around

It is powered by:

It has no:

  • social media connections
  • users or registration
  • database
  • cache

re2net - part 2: DFA

| Comments

I have added a DFA state machine to the re2net library, as described here. This method computes the DFA states on demand, which makes subsequent matches with the same instance faster. The crude benchmark below shows run times for the NFA (as in part 1), DFA’s first run, DFA’s second run, and C#’s Regex library as a comparison.

The results below show that the DFA is about an order of magnitude slower than the NFA on the first run (as expected, since the cache is being computed), but an order of magnitude faster on the second run (since the cache is being used).

Until now, this has been an academic exercise to learn C# and build a simple regex parser. With that done, the next step is to support the full regex syntax. The RE2 library is this project, but I’m going to use Go’s regexp package since I think the code will be easier to read, and it’s the same implementation. This is all assuming I maintain interest.

Column headers, all times in seconds.

n, NFA, DFA, DFA2, C# Regex:

01: 0.0046062, 0.0025143, 0.0000043, 00.0000278
02: 0.0000148, 0.0000108, 0.0000024, 00.0000040
03: 0.0000222, 0.0000231, 0.0000034, 00.0000034
04: 0.0000145, 0.0000216, 0.0000077, 00.0000021
05: 0.0000170, 0.0000355, 0.0000049, 00.0000030
06: 0.0000213, 0.0000476, 0.0000102, 00.0000052
07: 0.0000334, 0.0000742, 0.0000064, 00.0000092
08: 0.0000395, 0.0000971, 0.0000077, 00.0000176
09: 0.0000504, 0.0001212, 0.0000080, 00.0000346
10: 0.0000621, 0.0001552, 0.0000086, 00.0000702
11: 0.0000766, 0.0001821, 0.0000086, 00.0001184
12: 0.0000899, 0.0008207, 0.0000092, 00.0002622
13: 0.0001011, 0.0002600, 0.0000092, 00.0004787
14: 0.0001178, 0.0003105, 0.0000111, 00.0010533
15: 0.0001252, 0.0003729, 0.0000120, 00.0020185
16: 0.0001759, 0.0004468, 0.0000160, 00.0041630
17: 0.0001945, 0.0004914, 0.0000120, 00.0088549
18: 0.0002310, 0.0005789, 0.0000157, 00.0160122
19: 0.0002409, 0.0007215, 0.0000139, 00.0318414
20: 0.0003145, 0.0008251, 0.0000170, 00.0634503
21: 0.0003732, 0.0009216, 0.0000148, 00.1416363
22: 0.0004308, 0.0009741, 0.0000185, 00.2809800
23: 0.0004156, 0.0010700, 0.0000160, 00.5715502
24: 0.0004849, 0.0011817, 0.0000204, 01.1391405
25: 0.0005489, 0.0017962, 0.0000216, 02.2826699
26: 0.0005520, 0.0014535, 0.0000228, 04.5484486
27: 0.0010332, 0.0016941, 0.0000213, 09.0551722
28: 0.0010230, 0.0018286, 0.0000216, 18.4857312
29: 0.0010944, 0.0019780, 0.0000207, 37.1557184