Mon blog-notes à moi que j'ai

Personal blog of a sysadmin with a hacker bent

Twitter & RSS watch roundup #2016-33

Here is the harvest of links for the week of August 15 to 19, 2016. Most of them were posted on my Twitter account. They are gathered here for anyone who missed them.

Happy reading

Security & Privacy

Using security features to do bad things
We have quite a few security features at our disposal to help us better protect our websites and our visitors. I talk about them a lot on my blog and a few of them, mainly security headers, get a lot of coverage. Is it possible to use these security features for bad things?
The road to TOR
Happy Independence Day to all. I had been looking forward to this day so I can use it to share with my brothers and sisters what little I know about TOR. Independence means so many things to many people. For me, it means having freedom, valuing it and using it to benefit not just ourselves but people at large. And for that to happen, at least on the web, it has to rise above censorship if we are to get there at all. I am 40 years old, and if I can’t read whatever I want to read without asking the state-military-corporate trinity, then be damned with that. Debconf was instrumental as I was able to understand and share many of the privacy concerns that we all have. This blog post is partly a tribute to being part of a community and being part of Debconf16.

System Engineering

Don’t Let Linux Control Groups Uncontrolled
The Linux kernel feature of cgroups (Control Groups) is being increasingly adopted for running applications in multi-tenanted environments. Many projects (e.g., Docker and CoreOS) rely on cgroups to limit resources such as CPU and memory. Ensuring the high performance of the applications running in cgroups is very important for business-critical computing environments.
Building a Remote Caching System
Last month, our engineering team shipped a large update to how Docker images are cached and stored when using Jet, our Docker platform. In this post, we’ll discuss the motivation for the update, the design and implementation of the feature, and cover some of the tricky engineering problems we faced.
Autoscaling: Its Purpose and Strategies
In my previous article, "Autoscaling on Complex Telemetry", we discussed a method for determining effective autoscaled cluster size from internal application metrics. That article assumed you wanted to autoscale and so did not discuss under what conditions you might choose to, let alone what your choices for scaling are. Let’s go ahead and delve into all that here.
Just Enough Redis for Logstash
Message queues are used in Logstash deployments to control a surge in events which may lead to slowdown in Elasticsearch or other downstream components. Redis is one of the technologies in use as a message queue. Due to its speed, ease of use, and low resource requirements, it is also one of the most popular.
Distributed delay queues based on Dynomite
Netflix’s Content Platform Engineering runs a number of business processes which are driven by asynchronous orchestration of micro-services based tasks, and queues form an integral part of the orchestration layer amongst these services.
How PayPal Scaled to Billions of Transactions Daily Using Just 8 VMs
How did PayPal take a billion-hits-a-day system that might traditionally run on 100s of VMs and shrink it down to run on 8 VMs, stay responsive even at 90% CPU, at transaction densities PayPal has never seen before, with jobs that take 1/10th the time, while reducing costs and allowing for much better organizational growth without growing the compute infrastructure accordingly?

Software Engineering

Revisiting the Anatomy of HTTP: Part I
One of the key driving factors behind the various web/mobile performance initiatives is the fact that end-users’ tolerance for latency has nose-dived. Several studies have been published demonstrating that poor performance routinely impacts the bottom line, viz., # users, # transactions, etc.
Fast Document Rectification and Enhancement
Dropbox’s document scanner lets users capture a photo of a document with their phone and convert it into a clean, rectangular PDF. It works even if the input is rotated, slightly crumpled, or partially in shadow—but how?
In our previous blog post, we explained how we detect the boundaries of the document. In this post, we cover the next parts of the pipeline: rectifying the document (turning it from a general quadrilateral to a rectangle) and enhancing it to make it evenly illuminated with high contrast. In a traditional flatbed scanner, you get all of these for free, since the document capture environment is tightly controlled: the document is firmly pressed against a brightly-lit rectangular screen. However, when the camera and document can both move freely, this becomes a much tougher problem.
Search Relevance Infrastructure at Twitter
Millions of people all over the world search on Twitter every day to see what’s happening. During major events such as the recent Euro 2016 final, we observe record traffic spikes as people turn to Twitter to find timely information and perspectives, and overall traffic volume has been steadily increasing over time. The Search Quality team at Twitter works on returning the best quality results for our users.

Database Engineering

MySQL & MariaDB

How Apache Spark makes your slow MySQL queries 10x faster (or more)
In my previous blog post, I wrote about using Apache Spark with MySQL for data analysis and showed how to transform and analyze a large volume of data (text files) with Apache Spark. Vadim also performed a benchmark comparing performance of MySQL and Spark with Parquet columnar format (using Air traffic performance data). That works great, but what if we don’t want to move our data from MySQL to another storage (i.e., columnar format), and instead want to use "ad hoc" queries on top of an existing MySQL server? Apache Spark can help here as well.
Context aware MySQL pools via HAProxy
At GitHub we use MySQL as our main datastore. While repository data lies in git, metadata is stored in MySQL. This includes Issues, Pull Requests, Comments etc. We also auth against MySQL via a custom git proxy (babeld). To be able to serve under the high load GitHub operates at, we use MySQL replication to scale out read load.
Tracker: Ingesting MySQL data at scale - Part 2
In Part 1 we discussed our existing architecture for ingesting MySQL called Tracker, including its wins, challenges and an outline of the new architecture with a focus on the Hadoop side. Here we’ll focus on the implementation details on the MySQL side. The uploader of data to S3 has been open-sourced as part of the Pinterest MySQL Utils.

Network Engineering

Bandwidth Costs Around the World
CloudFlare protects over 4 million websites using our global network which spans 86 cities across 45 countries. Running this network gives us a unique vantage point to track the evolving cost of bandwidth around the world.
Two years ago, we previewed the relative cost of bandwidth that we see in different parts of the world. Bandwidth is the largest recurring cost of providing our service. Compared with Europe and North America, there were considerably higher Internet costs in Australia, Asia and Latin America. Even while bandwidth costs tend to trend down over time, driven by competition and decreases in the costs of underlying hardware, we thought it might be interesting to provide an update.
BGP Routing Tutorial Series, Part 4
In Part 3 of this BGP routing tutorial, we looked at how to establish peering sessions with neighbor networks. This time we’ll take a look at the impact of using BGP with upstream service providers, whether you have only one (single-homed) or several (multi-homed).