lae's notebook

Monappy Reopening Announcement (Translation)

The following is a translation of IndieSquare's announcement post.

To begin with, we'd like to extend our apologies to the users of this service and related persons affected by this incident. We are profoundly sorry that it took so long to resolve this incident and reach the point of resuming service, and that we kept you in the dark for so long in the process.

To briefly recap: on 1 September 2018, an unauthorized party was able to access the hot wallet in use by Monappy and drained the wallet's Monacoin balance. In response, to prevent further damage to customer assets and to perform a root cause analysis, the service was taken offline. Since then, we've completed improvements to both the system's security and the administrative side of the service, and are pleased to announce that we are making final preparations to reopen the service on 7 October, 2019.

Going forward, we will be periodically auditing the service and making security improvements wherever possible, and hope to regain your trust in the process.

As the service reopens, affected users will be fully reimbursed for damages. Details are below:

Reimbursement amount: 93078.7316 mona
Reimbursed persons: 7735 people
Reimbursement procedure: Once the service reopens, affected persons may log in to their accounts and withdraw their Monacoin balance as normal.

Service Reopening Details

In order to provide the safest possible environment for affected Monappy users to withdraw their funds, only a subset of users will be able to access Monappy at first. We will be increasing this number regularly, so we hope for your cooperation during this process.

On 7 October, 2019 at 10AM JST, an email from an @monappy.jp email address will be sent to all existing Monappy users regarding the reopening.

  • Upon logging in to Monappy, you will be strongly urged to withdraw your funds to a safer wallet (tl note: an offline wallet on your PC, for example), leaving only the bare minimum you need in your account (tl note: should you want to keep using the service).
  • Please note that if you make a mistake with the recipient address during the withdrawal process, due to the nature of blockchain we will be unable to assist you. Please double-check the address before clicking withdraw.
  • Should there be a rapid increase of traffic or some other unforeseen problem, we may temporarily suspend the service without notice.
  • You may not be able to log in to your account if accessing from a different IP address than before. If this turns out to be the case, please email support@monappy.jp.
  • For those accounts with significantly large balances, identity verification may be required.
  • We expect all users to be able to start using Monappy in approximately 1-2 weeks.

*Caution*

  • We will never ask you for your email address or password in connection with this incident via any communication method, including email, phone, and postal mail.
  • We will never send any email with an attachment regarding this incident. Please do not open any attachments should you receive such an email.
  • We will never ask you to input your Monappy account's email address or password on a domain other than monappy.jp. Please take due caution if you receive a suspicious email requesting you to do so.
  • Should we discover an issue not currently known, we may postpone reopening of the service to a later date without prior notice.

Events Leading Up to Reopening

In an effort to restore service, IndieSquare has added a new member to the team with experience in cybersecurity and in supporting financial systems. During this process we revised the application's architecture in an endeavor to improve security, including changes to the monitoring system. We intend to introduce further changes as needed along the way to keep the service secure.

(tl note: I'm leaving this part untranslated as it's just review of what architectural changes they made, much of which should be matter-of-fact knowledge to system administrators. main thing for non-nerds is that there's more anomaly detection. also, I'm tired.)

Future Changes

In accordance with amendments to the Payment Services Act that go into effect in April 2020, we intend to remove the service's dependency on a hot wallet. We're also looking into developing integrations between Monappy and dApps, so that users can eventually have full control over their private keys.

(tl note: standard apology and forward-looking statement here)

The translator of this article can be found on Twitter at @sleepingkyoto. Please send any corrections that way if needed. The translator is not affiliated with IndieSquare.

Notes from Zaif Attack

The following is primarily a translation of this blog post.

On September 20, 2018, Tech Bureau sent out a notice that they suspended deposits and withdrawals for three currencies (BTC, MONA, BCH) on the Zaif cryptocurrency exchange due to unauthorized access to its systems. This post is an aggregation of the details of that event.

Press Releases

Tech Bureau

Incident Timeline

Time | Event
2018.09.14, between 17:00-19:00 | Approximately 6.7 billion JPY worth of assets were withdrawn without authorization.
2018.09.17 | Tech Bureau detected an anomaly within the environment.
- evening | Tech Bureau suspended withdrawals/deposits for the 3 currencies on Zaif.
2018.09.18 | Tech Bureau identified that they had suffered a hacking incident.
- same day | Tech Bureau reported the incident to the local finance bureau and started filing papers with the authorities.
- same day | The official Zaif Twitter account tweeted that customer financial assets are safe.
- same day | In accordance with the Payment Services Act, the FSA issued a Request for Report to Tech Bureau.
Post-identification | Tech Bureau entered into a contract with Fisco for financial support.
Post-identification | Tech Bureau entered into a contract with CAICA for assistance in improving security.
2018.09.20, ~2am | Tech Bureau issued a press release declaring that deposits/withdrawals were suspended due to a hacking incident.
- same day | The Japan Cryptocurrency Business Association appealed to its member companies to perform an emergency inspection.
- same day | The FSA sent an on-site inspection crew to Tech Bureau.
2018.09.21 | ETA for the FSA to issue a report on its investigation into the status of customer assets to the cryptocurrency exchange's traders.

Damage

  • Approximately 6.7 billion JPY worth of 3 different currencies was withdrawn externally without authorization.
  • Withdrawals and deposits for the 3 affected currencies have been suspended since the evening of 17 September.

Itemization of damages

Tech Bureau's own assets | ~2.2 billion JPY
Customer assets | ~4.5 billion JPY
  • Tech Bureau has shown that they can cover the 4.5b loss of customer assets through financial assistance from the FDAG subsidiary.

Information around the Zaif hack itself

  • Funds were withdrawn from the server managing the Zaif hot wallet.
  • Tech Bureau is still investigating the exact method of intrusion, but it doesn't look like they'll publicly announce it as a protective measure.

Details on the unauthorized transactions

Total (estimated) damages on the 3 currencies

Currency | Amount transferred | JPY conversion | USD conversion
Bitcoin | 5,966 BTC | 4.295 billion JPY | 38.207 million USD
Monacoin | Under investigation; sources estimate 6,236,810 MONA | 650 million JPY | 5.782 million USD
Bitcoin Cash | Under investigation; sources estimate 42,327 BCH | 2.019 billion JPY | 17.954 million USD

Assumed recipient addresses of the hack

Currency | Address | Time of transaction
Bitcoin | 1FmwHh6pgkf4meCMoqo8fHH3GNRF571f9w | 2018.09.14, between 17:33:27 and 18:42:30
Bitcoin Cash | qrn0jwaq3at5hhxsne8gmg5uemudl57r05pdzu2nyd | 2018.09.14, between 17:33:15 and 17:51:24
Monacoin | MBEYH8JuAHynTA7unLjon7p7im2U9JbitV | 2018.09.14, between 17:39:01 and 18:54:10

work in progress



Disclaimer: I make no guarantees of the accuracy of the above article.
Please refer to the official press releases and/or Zaif's PR department. I am also not affiliated with Zaif or any of the companies mentioned in this article.

A Practical Behind the Scenes, Running Mastodon at Scale (Translation)

The following is a translation of this pixiv inside article.

Good morning! I'm harukasan, the technical lead for ImageFlux. Three days ago, on April 14, we at Pixiv decided to do a spontaneous launch of Pawoo, and since then I've found myself constantly logged into Pawoo's server environment. Our infrastructure engineers have already configured our monitoring environment to cover Pawoo and prepared runbooks for alert handling. As expected, we started receiving alerts over the two days following launch and, despite it being the weekend, found ourselves working off-hours to keep the service healthy. After all, no matter the environment, it's the job of infrastructure engineers to react to and resolve problems!

pawoo.net Architecture

Let's take a look at the architecture behind Pawoo. If you perform a dig, you'll find that it's hosted on AWS. While we do operate a couple hundred physical servers here at Pixiv, it's not really possible to procure and set up new ones that quickly. This is where cloud services shine. nojio, an infrastructure engineer who joined us this April, and konoiz, a recent graduate with 2 years of experience, prepared the following architecture diagram pretty quickly.
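
As a quick illustration (the IP address and hostname below are placeholders, not Pawoo's actual records), a lookup like the following shows the EC2-style reverse DNS names that give the hosting away:

dig +short pawoo.net
# e.g. 203.0.113.10 (placeholder address)
dig +short -x 203.0.113.10
# e.g. ec2-203-0-113-10.ap-northeast-1.compute.amazonaws.com. (an EC2-style reverse record)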

Pawoo Architecture Diagram

Using as many of the services provided by AWS as we could, we were able to bring up this environment in about 5 hours and launch the service later that day.

Dropping Docker

One can pretty easily bring up Mastodon using Docker containers via docker-compose, but we decided not to use Docker in order to separate services and deploy to multiple instances. It's a lot of extra effort to deal with volumes and cgroups, among other things, when working with Docker containers - it's not hard to find yourself in sticky situations, like "Oh no, I accidentally deleted the volume container!" Mastodon also provides a Production Guide for deploying without Docker.

So, after removing Docker from the picture, we decided to let systemd handle services. For example, the systemd unit file for the web application looks like the following:

[Unit]
Description=mastodon-web
After=network.target

[Service]
Type=simple
User=mastodon
WorkingDirectory=/home/mastodon/live
Environment="RAILS_ENV=production"
Environment="PORT=3000"
Environment="WEB_CONCURRENCY=8"
ExecStart=/usr/local/rbenv/shims/bundle exec puma -C config/puma.rb
ExecReload=/bin/kill -USR1 $MAINPID
TimeoutSec=15
Restart=always

[Install]
WantedBy=multi-user.target
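
A minimal sketch of bringing such a unit up, assuming it is saved as /etc/systemd/system/mastodon-web.service (the path is our assumption, not specified in the article):

sudo systemctl daemon-reload
sudo systemctl enable --now mastodon-web
sudo systemctl status mastodon-web   # verify that puma is running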

For the relational database, Redis, and the load balancer, we decided to use their AWS managed service counterparts. That way, we could quickly prepare a redundant multi-AZ data store. Since ALB supports WebSocket, we could easily distribute streaming traffic as well. We're also utilizing S3 as our CDN/uploaded file store.

Utilizing AWS' managed services, we were able to launch Pawoo as fast as we could, but this is where we began to run into problems.

Tuning nginx

At launch, we had stuck with the default settings for nginx provided by the distro, but it didn't take long before we started seeing HTTP errors returned, so I decided to tweak the config a bit. The important settings to increase are worker_rlimit_nofile and worker_connections.

user www-data;
worker_processes 4;
pid /run/nginx.pid;
worker_rlimit_nofile 65535;

events {
  worker_connections 8192;
}

http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;

  sendfile on;
  tcp_nopush on;
  keepalive_timeout 15;
  server_tokens off;

  log_format global 'time_iso8601:$time_iso8601\t'
                  'http_host:$host\t'
                  'server_name:$server_name\t'
                  'server_port:$server_port\t'
                  'status:$status\t'
                  'request_time:$request_time\t'
                  'remote_addr:$remote_addr\t'
                  'upstream_addr:$upstream_addr\t'
                  'upstream_response_time:$upstream_response_time\t'
                  'request_method:$request_method\t'
                  'request_uri:$request_uri\t'
                  'server_protocol:$server_protocol\t'
                  'body_bytes_sent:$body_bytes_sent\t'
                  'http_referer:$http_referer\t'
                  'http_user_agent:$http_user_agent\t';

  access_log /var/log/nginx/global-access.log global;
  error_log /var/log/nginx/error.log warn;

  include /etc/nginx/conf.d/*.conf;
  include /etc/nginx/sites-enabled/*;
}

After that, nginx worked pretty well without many further changes. This and other ways to optimize nginx are covered in my book, "nginx実践入門" (A Practical Introduction to nginx).

Configure Connection Pooling

PostgreSQL, which Mastodon uses, forks a new process for every connection made to it, so establishing a new connection is an expensive operation. This is the biggest difference Postgres has from MySQL.

Rails, Sidekiq, and the Node.js Streaming API all provide the ability to use a connection pool. These should be set to an appropriate value for the environment, keeping in mind the number of instances. If you suddenly increase the number of application instances, e.g. to handle high load, you can cripple the database server (or should I say, we did cripple it). For Pawoo, we're using AWS CloudWatch to monitor the number of connections to RDS.
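
As a rough sketch of how this can look (the numbers are illustrative, and the assumption that the web and streaming processes also read their pool size from DB_POOL is ours, not stated in the article), the pool size can be passed through each systemd unit, keeping the combined total across all processes and instances below the database's max_connections:

# web (puma): each worker process keeps its own pool
Environment="WEB_CONCURRENCY=8"
Environment="DB_POOL=16"

# sidekiq: the pool should match the -c concurrency flag
Environment="DB_POOL=40"

# streaming API (Node.js)
Environment="DB_POOL=16"

# rough total = instances x processes per instance x DB_POOL, kept below max_connections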

As the number of connections increased, our RDS instance became more and more backed up, but it was easy to bring it back to stability just by scaling the instance size upwards. In the graph below, you can see how CPU usage dropped off sharply after each maintenance event:

RDS Graph

Increasing Process Count for Sidekiq

Mastodon uses Sidekiq, which was originally designed as a job queue, to pass messages around. Every time someone toots, quite a few tasks are enqueued. The processing delay introduced by Sidekiq has been a big problem since launch, so finding a way to deal with it is probably the most important part of operating a large Mastodon instance.

Mastodon uses 4 queues by default (we're using a modified version with 5 queues for Pawoo - see issue):

  • default: for processing toots for display when submitted/received, etc
  • mail: for sending mail
  • push: for sending updates to other Mastodon instances
  • pull: for pulling updates from other Mastodon instances

For the push/pull queues, the service needs to contact the APIs of other Mastodon instances, so when another instance is slow or unresponsive, these queues can become backlogged, which in turn backlogs the default queue if they share the same Sidekiq process. To prevent this, run a separate Sidekiq instance for each queue.

Sidekiq provides a CLI flag that lets you specify what queue to process, so we use this to run multiple instances of Sidekiq on a single server. For example, one unit file looks like this:

[Unit]
Description=mastodon-sidekiq-default
After=network.target

[Service]
Type=simple
User=mastodon
WorkingDirectory=/home/mastodon/live
Environment="RAILS_ENV=production"
Environment="DB_POOL=40"
# Process only the default queue
ExecStart=/usr/local/rbenv/shims/bundle exec sidekiq -c 40 -q default
TimeoutSec=15
Restart=always

[Install]
WantedBy=multi-user.target

The most congested queue is the default queue. Whenever a user with a lot of followers toots, a ginormous number of tasks are dropped into the queue, and if you can't process them immediately, the queue becomes backlogged and everyone notices a delay in their timeline. We're using 720 threads to process the default queue on Pawoo, but this remains a big area for performance improvements.

Changing the Instance Type

We weren't quite sure what kind of load to expect at launch, so we decided to use a standard instance type and change it after figuring out how Mastodon uses its resources. We started out with instances from the t family, then switched to the c4 family after determining that heavy load occurred every time an instance's CPU credits ran out. We're probably going to move to spot instances in the near future to cut down on costs.

Contributing to Mastodon

So far we've mainly been trying to improve Mastodon's performance by changing the infrastructure behind it, but modifying the software itself is the more effective way of achieving better performance. To that end, several engineers here at Pixiv have been working to improve Mastodon and have submitted PRs upstream.

A list of submitted Pull Requests:

We actually even have a PR contributed by someone who's just joined the company this month fresh out of college! It's difficult to showcase all of the improvements that our engineers have made within this article, but we expect to continue to submit further improvements upstream.

Summary

We've only just begun, but we expect Pawoo to keep growing as a service. Upstream development has been moving with great momentum, so we expect there will be changes to the application infrastructure in order to keep up.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

The translator of this article can be found on Mastodon at lae@kirakiratter.