Entity Resolution, Mobile Apps, Thinking, and Language History
- End-to-end Entity Resolution for Big Data — Introduction to the entity resolution pipeline and the algorithms at the different stages. Includes a summary of open source tools and their features. (via Adrian Colyer)
- 33 Engineering Challenges of Building Mobile Apps at Scale — Part 1, covering the first 10, is up. They are: 1. State management; 2. Mistakes are hard to revert; 3. The long tail of old app versions; 4. Deeplinks; 5. Push and background notifications; 6. App crashes; 7. Offline support; 8. Accessibility; 9. CI/CD and the build train; 10. Device & OS fragmentation.
- Cognitive Effort vs Physical Pain — We found that cognitive effort can be traded off for physical pain and that people generally avoid exerting high levels of cognitive effort. This explains why more people don’t use (your favourite editor).
- If-Then-Else Had to be Invented — The history of where “else” came from; it’s a fascinating archaeological romp through the ages of programming. E.g., Flow-Matic, Grace Murray Hopper’s predecessor to COBOL, made the three-way if a little easier to think about by talking about comparing two numbers instead of about the signs of numbers. It introduced the name “otherwise” for the case where the comparison wasn’t what you were looking for.
NLP Text Attack Framework, Developer Metrics, Legacy Code, and Distributed Systems Reading List
- TextAttack — Framework for generating adversarial examples for NLP models. (Paper) (via The Data Exchange)
- Measuring Developer Productivity — There is no useful measure that operates at a finer grain than “tasks multiplied by complexity.” Measuring commits, lines of code, or hours spent coding, as some tools do, is no more useful at a team scale than it is at an individual scale. There simply is no relation between the number of code artifacts a team produces, or the amount of time they spend on them, and the value of their contributions. When engineering managers gather in the hotel bar after the conference day ends, this is one of the subjects they will debate endlessly.
- Legacy Code — All the things I wish I’d known twenty years ago. The top-level bullet-points: (1) Writing code isn’t the limiting factor; (2) Start with “why”; (3) Reduce the feedback loop; (4) Make people collaborate; (5) Different strategies to approach Legacy Code.
- Distributed Systems Reading List — I often argue that the toughest thing about distributed systems is changing the way you think. Here is a collection of material I’ve found useful for motivating these changes.
NAND Game, Game AI, In-Database Machine Learning, and Datastores at Scale
- NAND Game — You start with a single component, the nand gate. Using this as the fundamental building block, you will build all other components necessary. (See also NAND to Tetris)
- Facebook’s Game AI — today we are unveiling Recursive Belief-based Learning (ReBeL), a general RL+Search algorithm that can work in all two-player zero-sum games, including imperfect-information games. ReBeL builds on the RL+Search algorithms like AlphaZero that have proved successful in perfect-information games. Unlike those previous AIs, however, ReBeL makes decisions by factoring in the probability distribution of different beliefs each player might have about the current state of the game, which we call a public belief state (PBS). In other words, ReBeL can assess the chances that its poker opponent thinks it has, for example, a pair of aces.
- In-Database Machine Learning — We demonstrate our claim by implementing tensor algebra and stochastic gradient descent using lambda expressions for loss functions as a pipelined operator in a main memory database system. Our approach enables common machine learning tasks to be performed faster than by extended disk-based database systems or as well as dedicated tools by eliminating the time needed for data extraction. This work aims to incorporate gradient descent and tensor data types into database systems, allowing them to handle a wider range of computational tasks.
- Scaling Datastores at Slack with Vitess — Vitess is YouTube’s MySQL horizontal-scaling solution. This article is a really good write-up of what they were doing, why it didn’t work, how they tested the waters with Vitess, and how it’s working for them so far.
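The NAND Game entry above rests on the fact that every other gate can be built from NAND alone. A minimal Python sketch of the first few steps, assuming signals are the integers 0 and 1; the helper names (`not_`, `and_`, `or_`, `xor`, `half_adder`) are illustrative and are not taken from the game itself:

```python
def nand(a: int, b: int) -> int:
    """The only primitive: output is 0 only when both inputs are 1."""
    return 0 if (a == 1 and b == 1) else 1


def not_(a: int) -> int:
    # Feeding the same signal to both inputs inverts it.
    return nand(a, a)


def and_(a: int, b: int) -> int:
    # AND is NAND followed by an inverter.
    return not_(nand(a, b))


def or_(a: int, b: int) -> int:
    # De Morgan: a OR b == NOT(NOT a AND NOT b).
    return nand(not_(a), not_(b))


def xor(a: int, b: int) -> int:
    # Classic four-NAND construction of exclusive-or.
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))


def half_adder(a: int, b: int) -> tuple[int, int]:
    """Return (sum, carry) for two one-bit inputs."""
    return xor(a, b), and_(a, b)
```

From here the same approach extends to full adders and wider components, with each new part defined only in terms of parts already built.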
AlphaFold, Purpose-First Programming, 2000 to 2020, and 3D in DNA
- AlphaFold — This is astonishing: protein-folding solved by Google’s DeepMind. Figuring out what shapes proteins fold into is known as the “protein folding problem”, and has stood as a grand challenge in biology for the past 50 years. In a major scientific advance, the latest version of our AI system AlphaFold has been recognised as a solution to this grand challenge by the organisers of the biennial Critical Assessment of protein Structure Prediction (CASP). And from Science: The organizers even worried DeepMind may have been cheating somehow. So Lupas set a special challenge: a membrane protein from a species of archaea, an ancient group of microbes. For 10 years, his research team tried every trick in the book to get an x-ray crystal structure of the protein. “We couldn’t solve it.” But AlphaFold had no trouble. It returned a detailed image of a three-part protein with two long helical arms in the middle. The model enabled Lupas and his colleagues to make sense of their x-ray data; within half an hour, they had fit their experimental results to AlphaFold’s predicted structure. “It’s almost perfect,” Lupas says. “They could not possibly have cheated on this. I don’t know how they do it.” Far more useful (and to me, more impressive) than AlphaGo.
- Purpose-First Programming — Some students resist the cognitively-heavy tasks of simulating program execution. The secret to teaching those folks to program may be “purpose-first programming”: She used Github repositories and expert interviews to identify a few programming plans (just like Elliot Soloway and Jim Spohrer studied years ago) that were in common use in a domain that her participants cared about. She then taught those plans. Students modified and combined the plans to create programs that the students found useful. Rather than start with syntax or semantics, she started with the program’s purpose. Very reminiscent of the late 90s Perl and PHP copy-and-change coding boom that got orders of magnitude more people programming than were coming through CS courses at the time.
- Conversations with The Year 2000 — Paul Ford is a genius.
’00: How does HTML work now?
’20: It’s pretty simple, you define app logic as unidirectional dataflow, then fake up pseudo-HTML components that mirror state, and a controller mounts fake-page deltas onto the browser surface.
’00: How do you change the title?
’20: You can’t.
- cube3d.dna — A raytracer implemented in DNA. How to deploy: (1) Synthesize the oligonucleotides from the cube3d.dna file. (2) Arrange the test tubes as shown in the diagram below. (3) Don’t forget to provide the initial concentrations according to the table below. (4) Use a pipette to encode the position (row and column) of each tube to start the computation.
Unix History, Robot OS Talks, Learning with Language, and Easy Theory
- Brian Kernighan Interviews Ken Thompson — From a fun interview: McIlroy keeps coming up. He’s the smartest of all of us and the least remembered (or written down)… McIlroy sat there and wrote —on a piece of paper, now, not on a computer— TMG [a proprietary yacc-like program] written in TMG… And then! He now has TMG written in TMG, he decided to give his piece of paper to his piece of paper and write down what came out (the code). Which he did. And then he came over to my editor and he typed in his code, assembled it, and (I won’t say without error, but with so few errors you’d be astonished) he came up with a TMG compiler, on the PDP-7, written in TMG. And it’s the most basic, bare, impressive self-compilation I’ve ever seen in my life. (via Hacker News)
- ROS World 2020 Videos — all of the ROS World videos, including all the lightning talks. ROS = Robot Operating System.
- Learning from Language — we propose a simple approach called Language Shaped Learning (LSL): if we have access to explanations at training time, we encourage the model to learn representations that are not only helpful for classification, but are predictive of the language explanations. (Paper)
- Easy Theory — YouTube lectures on computer science theory. Mondays: Algorithms; Wednesdays: Theory of Computation; Fridays: Theory of Computation; Sundays: Livestream/bonus.
OpenStreetMap Numbers, Drone Warfare, Pattern-Aware Graph Mining, and Declining Researcher Productivity
- OpenStreetMap is Having a Moment — Apple was responsible for more edits in 2019 than Mapbox accounted for in its entire corporate history. See also the 2020: Curious Cases of Corporations in OpenStreetMap talk from State of the Map. (via Simon Willison)
- Drone Warfare — The second point, “SkyNet”, is the interesting bit. Azerbaijan and Armenia fought a war and drones enabled some very asymmetric outcomes. Quoting a Washington Post story, Azerbaijan, frustrated at a peace process that it felt delivered nothing, used its Caspian Sea oil wealth to buy arms, including a fleet of Turkish Bayraktar TB2 drones and Israeli kamikaze drones (also called loitering munitions, designed to hover in an area before diving on a target). […] Azerbaijan used surveillance drones to spot targets and sent armed drones or kamikaze drones to destroy them, analysts said. […] Their tally, which logs confirmed losses with photographs or videos, listed Armenian losses at 185 T-72 tanks; 90 armored fighting vehicles; 182 artillery pieces; 73 multiple rocket launchers; 26 surface-to-air missile systems, including a Tor system and five S-300s; 14 radars or jammers; one SU-25 war plane; four drones and 451 military vehicles. (via John Birmingham)
- Peregrine — an efficient, single-machine system for performing data mining tasks on large graphs. Some graph mining applications include: Finding frequent subgraphs; Generating the motif/graphlet distribution; Finding all occurrences of a subgraph. Peregrine is highly programmable, so you can easily develop your own graph mining applications using its novel, declarative, graph-pattern-centric API. To write a Peregrine program, you describe which graph patterns you are interested in mining, and what to do with each occurrence of those patterns. You provide the what and the runtime handles the how.
- Declining Marginal Returns of Researchers — (Tamay Besiroglu) I found that the marginal returns of researchers are rapidly declining. There is what’s called a “standing on toes” effect: researcher productivity declines as the field grows. Because ML has recently grown very quickly, this makes better ML models much harder to find. (Dissertation)
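The Peregrine entry above describes a pattern-centric style: you state which graph pattern you care about and what to do with each occurrence. The sketch below illustrates that idea only. It is not Peregrine's API, and it uses brute-force enumeration where a real system would be far more efficient. The function name `find_matches` and its parameters are hypothetical.

```python
from itertools import permutations


def find_matches(graph_edges, pattern_edges, on_match):
    """Call on_match(mapping) once per distinct occurrence of the pattern.

    Brute force: tries every assignment of graph vertices to pattern
    vertices, so it is only suitable for tiny graphs and patterns.
    Occurrences are non-induced: extra edges between matched vertices
    are allowed.
    """
    edge_set = {frozenset(e) for e in graph_edges}
    vertices = sorted({v for e in graph_edges for v in e})
    pattern_vertices = sorted({v for e in pattern_edges for v in e})
    seen = set()
    for candidate in permutations(vertices, len(pattern_vertices)):
        mapping = dict(zip(pattern_vertices, candidate))
        matched = [frozenset((mapping[u], mapping[v])) for u, v in pattern_edges]
        if all(e in edge_set for e in matched):
            key = frozenset(matched)  # same edges == same occurrence
            if key not in seen:
                seen.add(key)
                on_match(mapping)


# The "what": a triangle and a two-edge path (wedge).
triangle = [("a", "b"), ("b", "c"), ("a", "c")]
wedge = [("a", "b"), ("b", "c")]

graph = [(1, 2), (2, 3), (1, 3), (3, 4)]

triangles = []
find_matches(graph, triangle, triangles.append)

wedges = []
find_matches(graph, wedge, wedges.append)
```

The caller supplies only the pattern and the callback; how matches are enumerated is hidden inside `find_matches`, which is the division of labour the entry describes.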
CLI ePub Reader, Biology, Technical Debt, and APIs for Databases
- epr — Terminal/CLI Epub reader.
- I Should Have Loved Biology — Conveys well the magic of the field. Notable also for the reference to A Computer Scientist’s Guide to Cell Biology, which I didn’t realise existed.
- Ur-Technical Debt — Reviving Ward Cunningham’s take on technical debt. Simply put, ur-technical debt arises when my ideas diverge from my code. That divergence is inevitable with an iterative process. […] “[I]f you develop a program for a long period of time by only adding features and never reorganizing it to reflect your understanding of those features, then eventually that program simply does not contain any understanding and all efforts to work on it take longer and longer.”
- Directus — a real-time [REST and GraphQL] API and App dashboard for managing SQL database content.
Security Papers, Security in Syntax, Mistakes, and GraphQL Editor
- NDSS Symposium 2020 Papers — Large pile of security research from the 2020 Network and Distributed System Security Symposium, including papers on topics as wide-reaching as hypervisor fuzzing and The Attack of the Clones Against Proof-of-Authority which sounds like a very niche Star Wars sequel indeed.
- Liquid Information Flow Control — We present Lifty, a domain-specific language for data-centric applications that manipulate sensitive data. A Lifty programmer annotates the sources of sensitive data with declarative security policies, and the language statically and automatically verifies that the application handles the data according to the policies. Moreover, if verification fails, Lifty suggests a provably correct repair, thereby easing the programmer burden of implementing policy enforcing code throughout the application.
- So You’ve Made a Mistake and It’s Public — Wikipedians’ excellent advice for what to do when you’ve been busted making a mistake.
- GraphQL Editor — Create a schema using a system of visual blocks. GraphQL Editor will transform them into code.
SoC Lecture Notes, Flix, Megatrends, and Credential Management
- Advanced System on a Chip Lecture Notes (2016) — Topics: 1. Basic Processor & Memory hierarchy; 2. Advanced Out-of-Order Processor; 3. Data-parallel processors; 4. Micro-controller introduction; 5. Multicore; 6. RISC-V core; 7. Advanced Multicore; 8. Multicore programming; 9. Graphics Processing Unit (GPU); 10. Heterogeneous SoC; 11. GPU Programming; 12. Application-Specific Instruction-Set Processor (ASIP); 13. PULP: Parallel Ultra-Low-Power Computing; 14. Architecture in the Future – Wrap-up (via Hacker News).
- Flix — Next-generation reliable, safe, concise, and functional-first programming language.
Flix is a principled and flexible functional-, logic-, and imperative- programming language that takes inspiration from F#, Go, OCaml, Haskell, Rust, and Scala. Flix looks like Scala, but its type system is closer to that of OCaml and Haskell. Its concurrency model is inspired by Go-style processes and channels. Flix compiles to JVM bytecode, runs on the Java Virtual Machine, and supports full tail call elimination. It also supports first-class Datalog constraints enriched with lattice semantics.
- 20 Megatrends for the 2020s — Abundance, connectivity, healthspan, capital, AR and Spatial Web, smart devices, human-level AI, AI-Human collaboration, software shells, renewable energy, insurance industry switches to prevention, autonomous vehicles and flying cars, on-demand production and delivery, knowledge, advertising, cellular agriculture, brain-computer interfaces, VR, sustainability/environment, and CRISPR. Even if you don’t believe these are the trends of the future, it’s worth knowing what your customers/partners are being told.
- Credential Management — Level -2: No Authentication; Level -1: All Passwords = “password”; Level 0: Hardcode Everywhere; Level +1: Move Secrets into a Config File; Level +2: Encrypt the Config File; Level +3: Use a Secret Manager; Level +4: Dynamic Ephemeral Credentials.
- Hypothesis as Liability — Would the mental focus on a specific hypothesis prevent us from making a discovery? To test this, we made up a dataset and asked students to analyze it. […] The most notable “discovery” in the dataset was that if you simply plotted the number of steps versus the BMI, you would see an image of a gorilla waving at you (Fig. 1b).
- Tesla Engineering Inside Goss — Lots and lots of inside engineering horror stories (2 years old by now). my issue was the fact that the systems doing the flashing were running the yocto images and perl and the guy writing the perl was also responsible for writing the thing that actually updates the car. that thing (the car-side updater) is about ~100k lines of C in a single file. code reviews were always a laugh riot.
- Teach Testing First — An extremely good idea. Testers and security specialists have a different mindset to regular programmers: they look to pervert and break the software, not simply to find the golden path whereby it produces the right behaviour for the right inputs. Perhaps if more people learned testing first, we’d end up with more secure software.
- Realistic and Interactive Robotic Gaze — Astonishingly creepy prototype with astonishingly life-like eyeballs. Great work from Disney Research. (Paper)
Hardware Security, Ubooquity, Noisepage, and Technical Debt
- Dealing with Security Holes in Chips — system security starts at the hardware layer.
- Ubooquity — free home server for your comics and ebooks library. “Like plex for books.”
- Noisepage — a relational database management system developed by the Carnegie Mellon Database Group. The research goal of the NoisePage project is to develop high-performance system components that support autonomous operation and optimization as a first-class design principle. Also interesting in databases this week: a rundown on Procella, YouTube’s analytical database.
- Technical Debt — Where I first found this excellent description of technical debt, by Ward Cunningham: “If you develop a program for a long period of time by only adding features but never reorganizing it to reflect your understanding of those features, then eventually that program simply does not contain any understanding and all efforts to work on it take longer and longer.”
Bald AI, Dates and Times, UX, and Tools
- The AI Who Mistook a Bald Head for a Football — Second-tier Scottish football club Inverness Caledonian Thistle doesn’t have a camera operator for matches at their stadium so the club uses an AI-controlled camera that’s programmed to follow the ball for their broadcasts. But in a recent match against Ayr United, the AI controller kept moving the camera off the ball to focus on the bald head of the linesman, making the match all but unwatchable. No fans allowed in the stadium either, so the broadcast was the only way to watch. Watch the video, it is hilarious and tragic. I’m sure there’s a serious lesson to be drawn from this, but I’m too busy snickering to draw it.
- Why Is Subtracting These Two Times (in 1927) Giving a Strange Result? — You already knew timezones are a hellmouth, but now you have another example of how deep the hellmouth goes. Basically at midnight at the end of 1927, the clocks went back 5 minutes and 52 seconds. (via Jarkko Hietaniemi)
- Average UX Improvements Are Shrinking Over Time — On average, UX improvements have substantially decreased since 2006–2008: from 247% to 75% (a 69% decrease). This difference is statistically significant (p = 0.01) — we can be quite confident that average improvement scores are lower now than they were 12–14 years ago.
- CS294: Building User-Centred Programming Tools — This hands-on course explores a selection of techniques from Programming Languages and Human-Computer Interaction that can help us create useful, usable programming languages and programming tools. We will cover strategies for designing programming systems—e.g., need finding, formative studies, user-centered design broadly. We will also cover tools and techniques that help us build user-friendly programming systems—e.g., program synthesis, structure editors, abstraction design, program slicing. For the final project, individuals or teams will develop a usable abstraction, language, or programming tool of their own design. What looks like an awesome course at Berkeley. The readings alone are excellent.
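The 1927 entry above is easier to see with numbers. This sketch assumes, following the entry's description, that local time ran 8:05:52 ahead of UTC before the change and exactly 8:00:00 ahead afterwards. Both offsets are hard-coded assumptions here; a real tz database may record different values depending on its version.

```python
from datetime import datetime, timedelta, timezone

# Assumed offsets, taken from the entry's description, not from a tz database.
BEFORE = timezone(timedelta(hours=8, minutes=5, seconds=52))
AFTER = timezone(timedelta(hours=8))

# Two wall-clock readings that look one second apart.
t1 = datetime(1927, 12, 31, 23, 54, 7, tzinfo=BEFORE)
t2 = datetime(1927, 12, 31, 23, 54, 8, tzinfo=AFTER)

# Ignoring offsets: the difference of the wall-clock readings.
naive_gap = t2.replace(tzinfo=None) - t1.replace(tzinfo=None)

# Respecting offsets: aware datetimes are compared on the UTC timeline.
elapsed_gap = t2 - t1

print(naive_gap.total_seconds())    # 1.0
print(elapsed_gap.total_seconds())  # 353.0
```

The extra 352 seconds are exactly the 5 minutes and 52 seconds by which the clocks were set back, which is why subtracting two innocent-looking timestamps gives a surprising answer.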
Mutation Testing, Causal Reasoning, Pointing and Calling, and Software Architecture
- Mutation Testing — in this paper, we semi-automatically learn error-inducing patterns from a corpus of common Java coding errors and from changes that caused operational anomalies at Facebook specifically. We combine the mutations with instrumentation that measures which tests exactly visited the mutated piece of code. Results on more than 15,000 generated mutants show that more than half of the generated mutants survive Facebook’s rigorous test suite of unit, integration, and system tests.
- Causal Reasoning in Probability Trees — A Colab notebook tutorial that is the companion tutorial for the paper “Algorithms for Causal Reasoning in Probability Trees” by Genewein T. et al. (2020). Probability trees are one of the simplest models of causal generative processes. They possess clean semantics and are strictly more general than causal Bayesian networks, being able to e.g. represent causal relations that causal Bayesian networks can’t. Even so, they have received little attention from the AI and ML community. In this tutorial we present new algorithms for causal reasoning in discrete probability trees that cover the entire causal hierarchy (association, intervention, and counterfactuals), operating on arbitrary logical and causal events.
- Pointing and Calling — a method in occupational safety for avoiding mistakes by pointing at important indicators and calling out the status. The name of the technique that prevents mistakes.
- Grand Unified Theory of Software Architecture — reduce the imperative shell and move code into the functional core.
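One way to picture the functional core and imperative shell split is the small Python sketch below. The order/receipt domain, the 10% member discount, and all names (`Order`, `total_cents`, `format_receipt`, `main`) are invented for illustration and do not come from the linked article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    subtotal_cents: int
    is_member: bool


def total_cents(order: Order) -> int:
    """Functional core: pure calculation, no I/O, easy to test."""
    discount = order.subtotal_cents // 10 if order.is_member else 0
    return order.subtotal_cents - discount


def format_receipt(order: Order) -> str:
    """Still part of the core: data in, string out."""
    return f"Total: {total_cents(order) / 100:.2f}"


def main(argv: list[str], out=print) -> None:
    """Imperative shell: parse input, call the core, emit output."""
    order = Order(subtotal_cents=int(argv[0]), is_member=argv[1] == "member")
    out(format_receipt(order))


# Usage: the shell stays thin, so tests can target the core directly.
lines: list[str] = []
main(["1000", "member"], out=lines.append)
```

All decisions live in `total_cents` and `format_receipt`; the shell only moves data in and out, so shrinking it means there is less code that needs I/O to exercise.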
Split-Second Phantom Attacks, Deep Fakery, Serverless, and RE McDonalds API
- Phantom of the ADAS — In this paper, we investigate “split-second phantom attacks,” a scientific gap that causes two commercial advanced driver-assistance systems (ADASs), Tesla Model X (HW 2.5 and HW 3) and Mobileye 630, to treat a depthless object that appears for a few milliseconds as a real obstacle/object. Turns out neural networks can have subliminal message problems. (via Gradient Flow)
- Photoshop Adds Deep Fakery — Everyone’s going to be doing it. Photoshop adds controls to let you change facial expressions in a photo, among other things. Digital images weren’t real, but somehow they feel less real now.
- Serverless & Distributed Systems Productivity — 1. Prefer running your system in the cloud over local emulation. 2. CI/CD Pipelines are not enough, local deployment automation is crucial. 3. For AWS Serverless & “Function-as-a-Service” – monolithic functions are OK. 4. Consider adopting a monorepo, with tooling appropriate for monorepos. 5. Implement all three pillars of observability.
- Reverse-Engineering McDonalds’ Internal API — I reverse engineered mcdonald’s internal api and I’m currently placing an order worth $18,752 every minute at every mcdonald’s in the US to figure out which locations have a broken ice cream machine.
Antitrust, Differential Dataflow, Multilanguage Translation, and Time
- Justice Department Antitrust Filing Against Google — Claims advertising was an artificially limited market because of Google’s exclusivity deals, including the Android anti-forking arrangements and browser paid-default-search deals. The New York Times has a write-up. This may be more important, historically, than the election results.
- Differential Dataflow — Differential dataflow programs look like many standard “big data” computations, borrowing idioms from frameworks like MapReduce and SQL. However, once you write and run your program, you can change the data inputs to the computation, and differential dataflow will promptly show you the corresponding changes in its output. Promptly meaning in as little as milliseconds. A book from Microsoft Research, with open source code.
- M2M-100 (https://github.com/pytorch/fairseq/tree/master/examples/m2m_100) — In this work, we create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages. Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively with the best single systems of WMT. Open source from Facebook.
- Falsehoods Programmers Believe About Time — I love these ‘falsehoods’ articles. We model the world in our code, and we find it easy to forget how complicated the world is. (Especially with artificial constructs like calendars and clocks)
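The Differential Dataflow entry above describes changing a computation's inputs and getting back the corresponding changes in its output. The toy Python class below illustrates that contract for a single operator, a per-key count. It is not the differential dataflow library or its API; the class name `IncrementalCount` and the `(record, diff)` tuple shape are assumptions made for this sketch.

```python
from collections import Counter


class IncrementalCount:
    """Maintains per-key counts and reports only what changed."""

    def __init__(self):
        self.counts = Counter()

    def update(self, changes):
        """Apply input changes [(key, +n or -n)].

        Returns output changes as [((key, count), diff)], where diff is
        -1 for a retracted output record and +1 for a newly asserted one.
        """
        before = {}
        for key, diff in changes:
            before.setdefault(key, self.counts[key])  # remember the old count
            self.counts[key] += diff
        output = []
        for key, old in before.items():
            new = self.counts[key]
            if new == old:
                continue  # net effect was zero: nothing to report
            if old:
                output.append(((key, old), -1))
            if new:
                output.append(((key, new), +1))
        return output


counter = IncrementalCount()
first = counter.update([("a", +1), ("b", +1), ("a", +1)])
second = counter.update([("a", -1)])
```

After the second call only the record for `"a"` is touched: the old output `("a", 2)` is retracted and `("a", 1)` is asserted, while `"b"` is never recomputed.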
Data in Spreadsheets, Implementing Math, Skin-Printable Sensors, and Hype Cycle History
- Data Organization in Spreadsheets — Focusing on the data entry and storage aspects, this article offers practical recommendations for organizing spreadsheet data to reduce errors and ease later analyses. The basic principles are: be consistent, write dates like YYYY-MM-DD, do not leave any cells empty, put just one thing in a cell, organize the data as a single rectangle (with subjects as rows and variables as columns, and with a single header row), create a data dictionary, do not include calculations in the raw data files, do not use font color or highlighting as data, choose good names for things, make backups, use data validation to avoid data entry errors, and save the data in plain text files. A “must-read” for anyone who works with data. (via Thomas Lumley)
- Toward an API for the Real Numbers — To our knowledge, this is the first exploration of a practical general purpose real number type that both reflects the mathematical laws of the real numbers, and also supports exact comparisons in situations in which that’s normally expected. (via Morning Paper) (via Tim Bray)
- Sensors Printed Directly Onto Skin — Here, we report a universal fabrication scheme to enable printing and room-temperature sintering of the metal nanoparticle on paper/fabric for FPCBs and directly on the human skin for on-body sensors with a novel sintering aid layer. Consisting of polyvinyl alcohol (PVA) paste and nanoadditives in the water, the sintering aid layer reduces the sintering temperature. Together with the significantly decreased surface roughness, it allows for the integration of a submicron-thick conductive pattern with enhanced electromechanical performance. Various on-body sensors integrated with an FPCB to detect health conditions illustrate a system-level example. (paywalled paper)
- A Quarter Century of Hype – 25 Years of the Gartner Hype Cycle — A presentation of several novel ways to visualize 25 years of the Gartner Hype Cycle. The good stuff starts about 1m40s in.
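The real-numbers entry above is about number types that obey the usual mathematical laws and support exact comparison. The short Python sketch below shows the problem with binary floating point, using `fractions.Fraction` as a stand-in for an exact type. Rationals cover only part of the reals, so this illustrates the motivation and is not the approach taken in the paper.

```python
from fractions import Fraction

# Binary floating point: addition is not associative, and "obvious"
# equalities fail because 0.1, 0.2 and 0.3 are not exactly representable.
float_left = (0.1 + 0.2) + 0.3
float_right = 0.1 + (0.2 + 0.3)
floats_associative = float_left == float_right
floats_equal = (0.1 + 0.2) == 0.3

# Exact rationals: the same expressions obey the expected laws.
a, b, c = Fraction(1, 10), Fraction(2, 10), Fraction(3, 10)
exact_associative = (a + b) + c == a + (b + c)
exact_equal = (a + b) == c

print(floats_associative, floats_equal)  # False False
print(exact_associative, exact_equal)    # True True
```

A calculator built on floats has to hide these discrepancies with rounding; one built on an exact type can compare results directly.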
Stored Procedures in SQLite, Healthy Tech, Social Retro Gaming, and Glitch Chasm
- T-SQL in SQLite — CG/SQL is a code generation system for the popular SQLite library that allows developers to write stored procedures in a variant of Transact-SQL (T-SQL) and compile them into C code that uses SQLite’s C API to do the coded operations. CG/SQL enables engineers to create highly complex stored procedures with very large queries, without the manual code checking that existing methods require. (Open Source from Facebook)
- Four Myths of Healthy Tech — (1) Social media is addictive, and we are powerless to resist it. The concept of addiction does not encompass the full range of pleasures, risks, and uses that people create with technology. (2) Technology companies can fix the problems they create with better technology. Some technology cannot be fixed by more design, and some technology should not be built at all. (3) Growth and engagement metrics are the best drivers of decision making at tech companies. Many of the most important parts of digital well-being cannot be captured by quantitative metrics. (4) Our health and well-being depend on spending less time with screens and social media platforms. Health and well-being cannot be reduced to the single variable of screen time. There’s detail to the alternatives presented to the myths, and it forms an interesting framework for thinking about “harmful social media”.
- Telemelt — a web-based multi-emulator (RetroArch/libretro) designed to recreate the experience of playing console games with a single controller in a room full of friends.
- Glitch Chasm — After the 212-story skyscraper in Melbourne, there’s a chasm to an airfield in the centre of the earth. Presumably some bad data for the airfield’s elevation.
Algorithmic Collusion, Hedgemony, Git Exercises, and Android in a Box
- Algorithms Can Collude — To analyze the possible consequences, we study experimentally the behavior of algorithms powered by Artificial Intelligence (Q-learning) in a workhorse oligopoly model of repeated price competition. We find that the algorithms consistently learn to charge supracompetitive prices, without communicating with one another. The high prices are sustained by collusive strategies with a finite phase of punishment followed by a gradual return to cooperation. This finding is robust to asymmetries in cost or demand, changes in the number of players, and various forms of uncertainty. (via Marginal Revolution)
- Hedgemony — Regular readers know I am fond of instructive games. RAND researchers developed Hedgemony, a wargame designed to teach U.S. defense professionals how different strategies could affect key planning factors in the trade space at the intersection of force development, force management, force posture, and force employment. The game presents players, representing the United States and its key strategic partners and competitors, with a global situation, competing national incentives, constraints, and objectives; a set of military forces with defined capacities and capabilities; and a pool of periodically renewable resources. The players are asked to outline their strategies and are then challenged to make difficult choices by managing the allocation of resources and forces in alignment with their strategies to accomplish their objectives within resource and time constraints.
- Git Exercises — A repo that is its own set of git exercises.
- Android in a Box — Run Android applications on any GNU/Linux operating system.