Mon blog-notes à moi que j'ai

Personal blog of a sysadmin with a hacker bent

Twitter & RSS watch roundup #2015-22

The harvest of links for the week of June 1 to 5, 2015. Most of them were published on my Twitter account. Here they are, gathered for those who missed them.

Happy reading

Software Engineering

Deploying branches to GitHub.com
At GitHub, we use a variant of the Flow pattern to deploy changes: new code is always deployed from a pull request branch, and merged only once it has been confirmed in production. master is our stable release branch, so everything on master is considered production-ready code. If a branch deploy ships bad code containing a bug or performance regression, it is rolled back by deploying the latest master to production.
Data Privacy and Blackfire
Very early on, we decided that Blackfire would be a SaaS offering. It gives us a lot of flexibility, and thanks to our continuous integration workflow, we can deploy very frequently; at least once a week and sometimes several times a day. That’s the power and one of the promises of Software as a Service.
Natural Language Toolkit (NLTK) Sample and Tutorial: Part 1
Natural Language Toolkit (NLTK) is a leading platform for building Python programs to work with human language data (Natural Language Processing). It is accompanied by a book that explains the underlying concepts behind the language processing tasks supported by the toolkit. NLTK is intended to support research and teaching in NLP or closely related areas, including empirical linguistics, cognitive science, artificial intelligence, information retrieval, and machine learning.
Using Perl and MySQL to Automatically Respond to Retweets on Twitter
In my previous post, I showed you a way to store tweets in MySQL, and then use Perl to automatically publish them on Twitter.
In this post, we will look at automatically sending a « thank you » to people who retweet your tweets — and we will be using Perl and MySQL again.

Mobile

7 Tips for Mobile App Project Management
It’s safe to say that mobile app development is just as complex a process as any other IT project. That’s why strategies used in projects aimed at developing mobile apps are to a large extent similar to those used by managers in other fields. Here are practical tips for making sure you pass the 7 following steps of app development smoothly and efficiently.
Swift Meetup Paris recap
After frequent exchanges with the Swift Meetup Paris organisers, BlaBlaCar finally hosted one of their meetups on Tuesday 26 May. An interesting topic emerged from discussions with another speaker of this meetup: unit tests on Swift.

UX

Key Points Developers Should Focus for Excellent UX while Developing Business Apps
User interface (UI) and User experience (UX) have become especially important in the context of enterprise applications. In case of UX, its significance multiplies since users are constantly switching between the oh-so-sleek apps on their smart-phones to enterprise software at their workplaces. A swipe here, a tap there and you get your work done.
In business, UX defines everything that a product or service has to offer to its users. By ‘everything’, we mean those attributes that a user values or finds meaningful. This may be as basic as the appearance of the app or as functional as the way it is used.

Localization

Localization Technologies at Netflix
The localization program at Netflix is centered around linguistic excellence, a great team environment, and cutting-edge technology. The program is only 4 years old, which for a company our size is unusual to find. We’ve built a team and toolset representative of the scope and scale that a localization team needs to operate at in 2015, not one that is bogged down with years of legacy process and technology, as is often the case.

System Engineering

Understanding Official Repos on Docker Hub
Official Repositories (« Repos ») are a curated set of image repositories that contain content packaged and maintained directly by Docker, our upstream partners, and the broader community. The repository itself contains the same software you can get directly from the upstream project, but has been packaged as a Docker repository for distribution on Docker Hub. Currently, there are 74 Official Repos on Docker Hub, and these images have been pulled over 53 million times to build their applications.
Where are the self-tuning systems?
Computer science has gone a long way. And machine learning has transformed things that looked like sci-fi a couple years ago into a reality.
However, there is a major thing that totally sucks in 2015, to the point of still being virtually nonexistent: self-tuning systems.
Nice looking Docker containers management platform
I have just discovered LastBackend which is a nice platform to manage your Docker containers with a nice UI and drag & drop principle.
Finding Bottlenecks in Your Computing System
Read data in, write data out. In their purest form, this is what computers accomplish. Building a high performance data processing system requires accounting for how much data must move, to where, and the computational tasks needed. The trick is to establish the size and heft of your data, and focus on its flow. Identifying and correcting bottlenecks in the flow will help you build a low latency system that scales over time.
A Toolkit to Measure Basic System Performance and OS Jitter
To complement the great information I got on the « Systematic Way to Find Linux Jitter », I have created a toolkit that I now use to evaluate current and future trading platforms.
In case this can be useful, I have listed these tools, as well as the URLs to get the source code and a description of their usage. I am learning a lot by reading the source code, and the blog entry associated.

DevOps

Quick Tip: Stubbing Library Helpers in ChefSpec
I’m currently updating my vagrant cookbook, and adding ChefSpec coverage. Each of the different platform recipes results in slightly different resources to download the package file and install it. To support this, I have helper methods that calculate the download URI, the package name, and the SHA256 checksum based on the version of Vagrant (node['vagrant']['version']), and the platform (node['os'], node['platform_family']).
Integrating etckeeper with Logentries & Chef
When working within a team to maintain system infrastructure, properly documenting and communicating changes made to configuration files within /etc is fundamental to preventing knowledge gaps throughout your team.
While version control tools like git are helpful in tracking standard changes to a code base, git doesn’t capture metadata important to /etc like permissions of /etc/shadow. To address this need, we’ve been exploring etckeeper – a small version control application developed by Joey Hess (of Debian fame) for recording packaging installed or removed from /etc. While working with etckeeper, it became apparent that tracking changes over time in context of other events occurring within our systems would be useful and easily accomplished with Logentries.

Monitoring

Metrics, metrics everywhere (but where the heck do you start?)
I just had the privilege of speaking with Cliff Crocker at Velocity Santa Clara. Our talk centered around web performance metrics — what they are, what they mean, how to use them, and how to find the right one for your purposes.
It was a 90-minute talk, so we covered a huge swath of ground. Based on people’s reactions on Twitter, these were a few of the slides and takeaways that resonated most. (I’ve also embedded our slide deck at the bottom of this post, if you’d like to check out the entire thing.)
Monitoring Business Metrics and Refining Outage Response
Whether your server’s CPU is pegged at 100% or someone is chopping down your rainforest, PagerDuty has no opinions on how you use our platform to trigger a response from your on-call team. But here’s one area where we do have a strong opinion: alerting on business metrics. You should do it.

Log management

Migrating to Micro-services with CoreOS & Logentries
It’s no secret that container-based architectures have become an extremely popular choice amongst organizations looking to simplify and automate their deployments. One of the leaders of this movement, along with Docker, is CoreOS, which provides teams with the ability to manage their services through neatly packaged Linux containers.

DataOps

Flying faster with Twitter Heron
We process billions of events on Twitter every day. As you might guess, analyzing these events in real time presents a massive challenge. Our main system for such analysis has been Storm, a distributed stream computation system we’ve open-sourced. But as the scale and diversity of Twitter data has increased, our requirements have evolved. So we’ve designed a new system, Heron — a real-time analytics platform that is fully API-compatible with Storm. We introduced it yesterday at SIGMOD 2015.
Closing the loop : why DSS is really production ready after a data analysis?
At Dataiku we want to shorten the time from the analysis of the data to the effective deployment of predictive applications.
We are therefore incentivized to develop a better understanding of data scientists' daily tasks (data exploration, modeling, and deployment of useful algorithms across organizations) and of where the frictions occur that keep relevant analytics from being delivered to production smoothly.
Logstash - Quirky ‘multiline’
Ever since I decided to use the ELK stack for parsing application logs (for example a Java stack trace), I have faced many hurdles making it work. After multiple iterations and explorations, I believe I have found the proper method to use the ‘multiline’ feature of Logstash. This article is not a claim of original invention. It is an attempt to document a widely used and queried feature of the Logstash tool. My initial experience with ‘multiline’ led me to stop trying to make it work. Instead, I ended up treating each line of input separately, an experience that I described in the article titled ‘Using Multiple Grok Statements to Parse a Java Stack Trace’ (http://java.dzone.com/articles/using-multiple-grok-statements).
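To make the idea concrete, here is a minimal Python sketch of what multiline handling has to do with a Java stack trace: fold every line that does not look like the start of a new event into the previous one. This is not Logstash code. The date-prefixed log format, the `EVENT_START` pattern, and the `merge_multiline` helper are all assumptions made up for the illustration.

```python
import re

# Assumption: every new log event starts with a date such as "2015-06-01 ".
# Real log formats differ; adjust the pattern to your own logs.
EVENT_START = re.compile(r"^\d{4}-\d{2}-\d{2} ")


def merge_multiline(lines):
    """Group raw log lines into events.

    A line matching EVENT_START opens a new event; any other line
    (exception message, "at ..." frame, "Caused by: ...") is appended
    to the event that precedes it.
    """
    events = []
    for line in lines:
        if EVENT_START.match(line) or not events:
            events.append(line)
        else:
            events[-1] += "\n" + line
    return events


# Hypothetical sample: one error with a stack trace, then one info line.
SAMPLE = [
    "2015-06-01 12:00:00 ERROR Request failed",
    "java.lang.IllegalStateException: boom",
    "\tat com.example.Foo.bar(Foo.java:42)",
    "Caused by: java.io.IOException: disk",
    "2015-06-01 12:00:01 INFO Recovered",
]

if __name__ == "__main__":
    for event in merge_multiline(SAMPLE):
        print(repr(event))
```

Keying on "what starts a new event" rather than "what a continuation looks like" tends to be more robust for stack traces, since the exception message line itself carries no leading whitespace.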
Enabling DataOps with Easy Log Analytics
DataOps is becoming an important consideration for organizations. Why? Well, DataOps is about making sure data is collected, analyzed, and available across the company – i.e. Ops insight for your decision-making systems like Hubspot, Tableau, Salesforce and more. Such systems are key to day-to-day operations and in many cases are as important as keeping your customer facing systems up and running.

Architecture

Paper: Heracles: Improving Resource Efficiency at Scale
Underutilization and segregation are the classic strategies for ensuring resources are available when work absolutely must get done. Keep a database on its own server so when the load spikes another VM or high priority thread can’t interfere with RAM, power, disk, or CPU access. And when you really need fast and reliable networking you can’t rely on QOS, you keep a dedicated line.
Google’s Infrastructure Chief Talks SDN
Urs Hölzle knows a thing or two about software-defined networking. As senior VP for technical infrastructure at Google, he oversaw the Web giant’s foray into SDN a few years ago with its B4 project, a private WAN connecting the company’s global data centers. B4 uses OpenFlow on Google’s custom switches with merchant silicon. At Interop Las Vegas, Network Computing sat down with Hölzle after his keynote to discuss B4, SDN benefits and challenges, and the future of networking.
Side by Side with Elasticsearch and Solr: Performance and Scalability
Back by popular demand! Sematext engineers Radu Gheorghe and Rafal Kuc returned to Berlin Buzzwords on Tuesday, June 2, with the second installment of their « Side by Side with Elasticsearch and Solr » talk. (You can check out Part 1 here.)

Web Performance

Optimizing a Complex Site for Pagespeed
No matter how complex the website, it can be speed optimized. In this article we show how to boost the Pagespeed scores of mymanhattancosmeticdentist.com with mod_pagespeed, some manual tweaks, and the WPOptimize Speed Plugin.

Hadoop & Big Data

Aerosolve: Machine learning for humans
In this dynamic pricing feature, we show hosts the probability of getting a booking (green for a higher chance, red for a lower chance), or predicted demand, and allow them to easily price their listings dynamically with a click of a button.
Many features go into predicting the demand for a listing, among them seasonality, unique features of a listing, and price. These features interact in complex ways and can result in machine learning models that are difficult to interpret. So we went about building a package to produce machine learning models that facilitate interpretation and understanding. This is useful for us, developers, and also for our users; the interpretations map to explanations we provide to our hosts on why the demand they face may be higher or lower than they expect.
Airflow: a workflow management platform
Airbnb is a fast growing, data informed company. Our data teams and data volume are growing quickly, and accordingly, so does the complexity of the challenges we take on. Our growing workforce of data engineers, data scientists and analysts are using Airflow, a platform we built to allow us to move fast, keep our momentum as we author, monitor and retrofit data pipelines.
Today, we are proud to announce that we are open sourcing and sharing Airflow, our workflow management platform.
New in CDH 5.4: Sensitive Data Redaction
Have you ever wondered what sort of « sensitive » information might wind up in Apache Hadoop log files? For example, if you’re storing credit card numbers inside HDFS, might they ever « leak » into a log file outside of HDFS? What about SQL queries? If you have a query like select * from table where creditcard = ‘1234-5678-9012-3456’, where is that query information ultimately stored?
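The question in this excerpt can be illustrated with a small Python sketch that masks anything shaped like a card number before a line reaches a log file. This is not how CDH implements redaction. The `CARD_PATTERN` regex, the mask string, and the `redact` helper are hypothetical, and the pattern only covers 16-digit numbers written contiguously or in groups of four.

```python
import re

# Assumption: card numbers are 16 digits, optionally grouped by four
# with dashes or spaces. Other card formats are not covered.
CARD_PATTERN = re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b")


def redact(line, mask="XXXX-XXXX-XXXX-XXXX"):
    """Return the log line with every card-like number replaced by the mask."""
    return CARD_PATTERN.sub(mask, line)


if __name__ == "__main__":
    query = "select * from table where creditcard = '1234-5678-9012-3456'"
    print(redact(query))
```

A regex like this is deliberately blunt: it can mask 16-digit identifiers that are not card numbers, which is usually the safer failure mode for logs.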
Ecosystem of Hadoop Animal Zoo
Hadoop is best known for MapReduce and its Distributed File System (HDFS). More recently, other productivity tools built on top of these have come to form a complete Hadoop ecosystem. Most of the projects are hosted under the Apache Software Foundation. Hadoop Ecosystem projects are listed below.
Architectural Patterns for Near Real-Time Data Processing with Apache Hadoop
The Apache Hadoop ecosystem has become a preferred platform for enterprises seeking to process and understand large-scale data in real time. Technologies like Apache Kafka, Apache Flume, Apache Spark, Apache Storm, and Apache Samza are increasingly pushing the envelope on what is possible. It is often tempting to bucket large-scale streaming use cases together but in reality they tend to break down into a few different architectural patterns, with different components of the ecosystem better suited for different problems.

Databases

InfluxDB

InfluxDB Clustering Design - neither CP or AP
A few weeks ago I hinted at some big changes coming in the InfluxDB clustering design. These changes came about because of the testing we’ve done over the last 3 months with the clustering design that was going to be shipped in 0.9.0. In short, the approach we were going to take wasn’t working. It wasn’t reliable or scalable, and made sacrifices for guarantees we didn’t need to provide for our use case.

Redis

10 quick tips for Redis
Redis is hot in the tech community right now. It’s come a long way from being a small personal project from Antirez, to being an industry standard for in memory data storage. With that comes a set of best practices that most people can agree upon for using Redis properly. Below we’ll explore 10 quick tips on using Redis correctly.

MySQL & MariaDB

The InnoDB Change Buffer
One of the challenges in storage engine design is random I/O during a write operation. In InnoDB, a table will have one clustered index and zero or more secondary indexes. Each of these indexes is a B-tree. When a record is inserted into a table, the record is first inserted into the clustered index and then into each of the secondary indexes. So, the resulting I/O operation will be randomly distributed across the disk. The I/O pattern is similarly random for update and delete operations. To mitigate this problem, the InnoDB storage engine uses a special data structure called the change buffer (previously known as the insert buffer, which is why you will see ibuf and IBUF used for various internal names).
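The principle behind the change buffer can be sketched with a toy Python model: when a secondary index page is not in memory, the change is parked in a buffer instead of triggering a random read, and it is merged the next time that page is loaded. This is a deliberately simplified illustration, not InnoDB's actual data structure. The `ToySecondaryIndex` class, its page numbering, and the `disk_reads` counter are all invented for the example.

```python
from collections import defaultdict


class ToySecondaryIndex:
    """Toy model of a secondary index whose pages are only partly cached."""

    def __init__(self, cached_pages):
        self.pages = defaultdict(list)          # page_id -> keys stored in the page
        self.cached = set(cached_pages)         # pages currently held in memory
        self.change_buffer = defaultdict(list)  # page_id -> inserts not yet applied
        self.disk_reads = 0                     # random reads we had to perform

    def insert(self, page_id, key):
        if page_id in self.cached:
            # Page already in memory: apply the change directly.
            self.pages[page_id].append(key)
        else:
            # Page not in memory: defer the change, no disk read needed now.
            self.change_buffer[page_id].append(key)

    def read_page(self, page_id):
        if page_id not in self.cached:
            self.disk_reads += 1
            self.cached.add(page_id)
        # Merge any deferred changes now that the page is in memory.
        self.pages[page_id].extend(self.change_buffer.pop(page_id, []))
        return sorted(self.pages[page_id])


if __name__ == "__main__":
    index = ToySecondaryIndex(cached_pages={1})
    index.insert(1, "a")   # applied directly
    index.insert(2, "c")   # buffered
    index.insert(2, "b")   # buffered
    print(index.disk_reads, index.read_page(2), index.disk_reads)
```

In this model, two inserts to an uncached page cost a single read at merge time instead of one read each, which is the saving the excerpt describes.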
Protect Your Data: Row-level Security in MariaDB 10.0
Most MariaDB users are probably aware of the privilege system available in MariaDB and MySQL. Privileges control what databases, tables, columns, procedures, and functions a particular user account can access. For example, if an application stored credit card data in the database, this data should probably be protected from most users. To make that happen, the DBA might disallow access to the table or column storing this sensitive data.
Is 80% of RAM how you should tune your innodb_buffer_pool_size?
It seems these days if anyone knows anything about tuning InnoDB, it’s that you MUST tune your innodb_buffer_pool_size to 80% of your physical memory. This is such prolific tuning advice, it seems engrained in many a DBA’s mind. The MySQL manual to this day refers to this rule, so who can blame the DBA? The question is: does it make sense?
JSON Support in PostgreSQL, MySQL, MongoDB, and SQL Server
Unless you’ve been hiding under a rock for a few years, you probably know that JSON is quickly gaining support in major database servers. Due to its use in the web front-end, JSON has overtaken XML in APIs, and it’s spreading through all the layers in the stack one step at a time.
Most major databases have supported XML in some fashion for a while, too, but developer uptake wasn’t universal. JSON adoption amongst developers is nearly universal today, however. (The king is dead, long live the king!) But how good is JSON support in the databases we know and love? We’ll do a comparison in this blog post.
New PERFORMANCE_SCHEMA defaults in MySQL 5.7.7
I thought it was worth a moment to reiterate on the new Performance Schema related defaults that MySQL 5.7.7 brings to the table, for various reasons.
For one, most of you might have noticed that profiling was marked as deprecated in MySQL 5.6.7. So it is expected that you invest into learning more about Performance Schema (and Mark’s sys schema!).

Management & organization

Managing Change with DevOps
A couple of years ago I wrote a blog post titled « The Many Layers of DevOps. » At the time, I wanted the post to speak to all of the different aspects of managing change through a DevOps delivery process. It was important from my perspective to acknowledge that there was more to DevOps than simply infrastructure as code. I went on to explain that we still need to manage the application layer. Application deployments, database updates, static content deployments, application configuration all play a vital part in DevOps. Lately I have been asked to revise this post and expand upon other areas of change.